WO2022227188A1

WO2022227188A1 - Intelligent customer service staff answering method and apparatus for speech, and computer device

Info

Publication number: WO2022227188A1
Application number: PCT/CN2021/096981
Authority: WO
Inventors: 孙奥兰; 王健宗; 程宁
Original assignee: 平安科技（深圳）有限公司
Priority date: 2021-04-27
Filing date: 2021-05-28
Publication date: 2022-11-03
Also published as: CN112951215B; CN112951215A

Abstract

An intelligent speech customer service staff answering method and apparatus, and a computer device. The method comprises: acquiring a speech segment of a customer that includes a question (S1); inputting the speech segment into a speech encoder, so as to obtain an encoded first speech code (S2); performing timbre standardization processing on the first speech code, so as to obtain a second speech code (S3); inputting the second speech code into a speech decoder, so as to obtain an answer speech (S4); and sending the answer speech to the customer (S5). By means of synchronously training a speech encoder and a speech decoder on the basis of sample data, which is composed of a first speech segment of a customer raising a question, and a second speech segment corresponding to artificial customer service staff answering the question, during an artificial customer service, the corresponding answer speech can be obtained simply by acquiring the speech segment of the customer, without the need to convert the speech segment into text, thereby improving the accuracy and the calculation efficiency, and further improving customer satisfaction.

Description

Voice-based intelligent customer service answering method, device and computer equipment

This application claims the priority of the Chinese patent application filed on April 27, 2021 with the application number 202110462426.9 and the invention titled "Voice-based intelligent customer service answering method, device and computer equipment", the entire contents of which are incorporated by reference in this application.

technical field

The present application relates to the field of artificial intelligence, and in particular, to a voice-based intelligent customer service answering method, device, and computer equipment.

Background technique

The traditional intelligent customer service Q&A system can be roughly divided into three independent parts. First, the voice recognition technology is used to identify the questioner's speech content and convert it into text, and then the text-level Q&A system is used to automatically generate a proposed answer based on the text of the question. The text is finally converted into voice output through the speech synthesis system. The inventor realized that this type of system relies on intermediate texts and requires multiple models to be used in a superimposed manner, and its accuracy will be affected by the superposition of multiple models, resulting in a low accuracy rate, and the process of computing through multiple models is cumbersome. It also leads to low efficiency.

technical problem

The main purpose of this application is to provide a voice intelligent customer service answering method, device and computer equipment, which aims to solve the problem that the traditional intelligent customer service question answering system relies on intermediate text and requires multiple models to be superimposed and used, resulting in low efficiency.

technical solutions

This application provides a voice intelligent customer service answering method, including:

Get the customer's voice snippet that contains the question;

Inputting the speech fragment into a speech encoder to obtain the encoded first speech code;

Carrying out the timbre standardization process with the described first speech code, obtains the second speech code;

Input the second voice code into the voice decoder to obtain the answer voice; wherein, the voice encoder and the voice decoder are obtained through synchronous training, and the synchronous training method is to use the manual customer service service, the customer proposed The first voice fragment of the question is input into the voice encoder to be trained, and the timbre standardization process is performed to obtain the voice code corresponding to the first voice fragment, and the corresponding voice code and the second voice corresponding to the manual customer service answering question are obtained. The speech segment is synchronously input into the speech decoder to be trained for training;

The answering voice is sent to the client.

The application also provides a voice intelligent customer service answering device, including:

An acquisition unit, which is used to acquire the voice segment that the customer contains in question;

a first input unit for inputting the speech fragment into a speech encoder to obtain the encoded first speech code;

a processing unit, configured to perform timbre standardization processing on the first speech code to obtain a second speech code;

The second input unit is configured to input the second speech code into the speech decoder to obtain the answer speech; wherein, the speech encoder and the speech decoder are obtained through synchronous training, and the synchronous training method is to In the manual customer service, the first voice fragment of the question raised by the customer is input into the voice encoder to be trained, and the timbre standardization is performed to obtain the voice code corresponding to the first voice fragment, and the corresponding voice code and artificial voice code are obtained. The second voice segment corresponding to the question answered by the customer service is synchronously input into the voice decoder to be trained for training;

a sending unit, configured to send the answering voice to the client.

The application also provides a computer device, including a memory and a processor, the memory stores a computer program, and the processor implements the steps of the intelligent customer service answering method by voice when the processor executes the computer program:

Get the customer's voice snippet that contains the question;

The answering voice is sent to the client.

The present application also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the steps of the intelligent customer service answering method for voice are realized:

Get the customer's voice snippet that contains the question;

The answering voice is sent to the client.

beneficial effect

In the manual customer service service, the synchronous speech encoder and the speech decoder are trained based on the sample data composed of the first voice fragment of the customer's question and the second voice fragment corresponding to the manual customer service's answer, so that only the customer needs to obtain the problem that the customer contains the problem. It can realize the corresponding answer voice, realize the realization of voice-to-speech, simplify the intelligent customer service question and answer system, and do not need to convert the voice fragment into text, thus improving the accuracy and calculation efficiency, and further improving the customer satisfaction. In addition, the pre-trained voiceprint model is used to supervise the training of the answer voice, so that the generated timbre is unified, so that the customer experience effect is better.

Description of drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description serve to explain the principles of the application.

1 is a schematic flowchart of a voice-based intelligent customer service answering method according to an embodiment of the present application;

2 is a schematic structural block diagram of a voice intelligent customer service answering device according to an embodiment of the present application;

FIG. 3 is a schematic structural block diagram of a computer device according to an embodiment of the present application.

Best Mode for Carrying Out the Application

1, the present application proposes a voice-based intelligent customer service answering method, including:

S1: Get the voice segment that the customer contains in question;

S2: inputting the speech fragment into the speech encoder to obtain the encoded first speech code;

S3: performing timbre standardization processing on the first speech code to obtain a second speech code;

S4: Input the second voice code into the voice decoder to obtain the answer voice; wherein, the voice encoder and the voice decoder are obtained through synchronous training, and the synchronous training method is to use the manual customer service service, The first voice fragment of the question raised by the customer is input into the voice encoder to be trained, and the timbre standardization process is carried out to obtain the voice code corresponding to the first voice fragment, and the corresponding voice code corresponding to the question answered by the human customer service. The second speech segment is synchronously input into the speech decoder to be trained for training;

S5: Send the answering voice to the client.

As described in the above step S1, the voice segment containing the problem is obtained from the client. Among them, the voice clip occurs during the conversation between the intelligent customer service and the customer, that is, during the user's questioning process. The voice can be obtained by obtaining voice data transmitted from the mobile phone terminal. Specifically, after the mobile phone microphone collects the voice sent by the customer, the voice is sent to the terminal or server where the intelligent customer service is located.

As described in the above step S2, the voice segment is input into the voice encoder to obtain the encoded first voice code. The speech encoder can be any one of a waveform encoder, a vocoder, and a hybrid encoder, and can implement the encoding process of the speech segment. The answer to the voice fragment is not simply to restore the voice fragment. Therefore, the encoding needs to cooperate with the subsequent voice decoder, preferably the first recurrent neural network is used for encoding. Repeat.

As described in the above step S3, the timbre standardization process is performed on the first speech code to obtain the second speech code. Since there are many customers and customer service members in the sample data participating in the training, it is easy to cause the final generated answer voice to have incomplete timbre. Specifically, a pre-trained voiceprint model can be set up to supervise the answer voice generation process. That is, the pre-trained voiceprint model acts as a speaker encoder, constantly correcting the timbre in the answer voice, so that the final answer voice aligns with the speaker encoder, so as to complete the unification of the answer voice timbre. The function of concatenating strings into a string, that is, the pre-trained voiceprint model has trained voiceprint features, and the voiceprint features in the voiceprint model are generally expressed as strings, and the first speech code is directly String, of course, if the voiceprint feature is not a character string, you can digitize the voiceprint feature, that is, convert it into a corresponding number according to the size of the voiceprint, and then convert the voice into a character corresponding to the voiceprint feature Then, based on the concat function, the character string corresponding to the voiceprint feature and the first voice encoded character string are combined into one character string, wherein the concat function is used to connect the two character strings to form a single character string. That is, the second voice code contains both the character string corresponding to the voiceprint model and the character string corresponding to the first voice code. In the subsequent calculation process, there is no need to analyze the voiceprint features, ignoring the timbre information of the person, and only need to pay attention to Only the user's voice information can be used, and the generation of the answer voice can be focused on.

As described in the above step S4, the second speech code is input into the speech decoder to obtain the answer speech. Wherein, the speech encoder and the speech decoder are trained based on sample data consisting of a first speech segment in which a customer asks a question and a second speech segment corresponding to a question answered by the human customer service in the manual customer service service. The training method is to input the voice of the customer in the manual customer service service into the voice encoder, and perform timbre standardization processing to obtain the voice code corresponding to the first voice fragment, and input the corresponding voice code into the voice decoder, And input the answer answered by the corresponding human customer service into the voice decoder as an output correction, train the corresponding answer voice, and continuously adjust the parameters in the voice encoder and voice decoder, so that the answer voice is infinitely close to or equal to the answer answered by the human customer service. answer, thereby realizing the training of the voice decoder and the voice encoder, so that the corresponding answer voice can be obtained only by inputting the corresponding second voice code in the voice decoder.

As described in the above step S5, the answering voice is sent to the client. That is, send the answer voice to the customer to answer the customer's voice fragment, without the complicated process of speech recognition-intent recognition-speech synthesis, etc. For customers, it reduces the waiting time and has a better experience effect. For the server, the amount of computation is reduced and more computation space can be released.

In one embodiment, before the step S3 of performing timbre standardization processing on the first speech encoding to obtain the second speech encoding, the method further includes:

S201: Extract the first voiceprint feature in the voice segment;

S202: Calculate the similarity between the second voiceprint feature corresponding to each voiceprint model in the voiceprint model library and the first voiceprint feature;

S203: Screen out the voiceprint model with the greatest similarity as a pre-trained voiceprint model according to the calculation result, so as to preprocess the first speech coding.

As described in the above steps S201-S203, the selection of the voiceprint model is realized. In order to adapt to customers in different regions and make customers feel close, a voiceprint model similar to the customer's timbre can be found. Specifically, the first voiceprint feature in the voice segment is first extracted, that is, the customer is first collected through the microphone cluster. The voiceprint of the customer is extracted from the voiceprint of the customer to obtain the first voiceprint feature. The extraction method can be Linear Prediction Coefficients (LPC), Perceptual Linear Predictive Coefficient (PLP), Tandem features and Any one of the Bottleneck features, calculate the similarity between the second voiceprint feature corresponding to each voiceprint model and the first voiceprint feature according to the similarity calculation formula, where the similarity calculation formula can be

in

represents the second voiceprint feature,

represents the first voiceprint feature,

Indicates the similarity between the first voiceprint feature and the second voiceprint feature, and then selects the voiceprint model with the greatest similarity as the pre-trained voiceprint model according to the calculation result, where the voiceprint model with the greatest similarity is the same as Using the voiceprint model that is most similar to the customer's voice as a pre-trained voiceprint model can improve the customer's favor and improve customer satisfaction. In addition, different voiceprint models are trained with different training data, such as dialects in different places, or languages in different age groups. In other embodiments, the calculation method of the similarity may also be Pearson Correlation Coefficient, Jaccard Coefficient, Tanimoto Coefficient (Generalized Jaccard Similarity Coefficient), log-likelihood similarity/logarithm Likelihood similarity, etc.

In one embodiment, the step S2 of inputting the speech segment into a speech encoder to obtain the encoded first speech encoding includes:

S211: in the speech encoder, preprocess the speech segment to obtain a speech signal; wherein the speech signal is a one-dimensional signal formed in time sequence;

S212: Perform compressed sensing processing on the one-dimensional signal according to the first predetermined formula to obtain a target feature signal;

S213: Input the target feature signal into a first recurrent neural network to obtain the first speech code.

As described in the above steps S211-S213, the acquisition of the first speech code is realized. That is to preprocess the shortening of the speech, wherein the preprocessing method is Linear Prediction Analysis (Linear Prediction Coefficients, LPC), Perceptual Linear Predictive Coefficient (Perceptual Linear Predictive, PLP), any one of Tandem features and Bottleneck features to obtain the corresponding A digital signal of a speech segment, i.e. a one-dimensional signal. Then, the compression is performed according to a first predetermined formula, wherein the first predetermined formula is t _i =p _i s _i , where t _i represents the compression value of the t-th signal point, and s _i represents the value of the i-th signal point in the speech segment , pi represents the compression value corresponding to the _ith signal point, which is related to s _i , that is, _{pi =f(s i} ₎ . The target feature signal is obtained, and then the target feature signal is input into the first recurrent neural network for processing to obtain the first speech code. The processing method will be described later, and will not be repeated here.

In one embodiment, the step S213 of inputting the target feature signal into the first recurrent neural network to obtain the first speech encoding includes:

S2131: In the hidden layer of the first recurrent neural network, encode each feature signal point of the target feature signal according to a second predetermined formula; wherein, the second predetermined formula is h(i)= σ[z(i)]=σ(Uz(i)+Wh(i-1)+b), σ is the activation function of the first recurrent neural network; b is the first linear offset coefficient; U is the the first linear relationship coefficient of the first recurrent neural network, W is the second linear relationship coefficient of the first recurrent neural network, z(i) represents the i-th feature signal point of the target feature signal, h (i) represents the coded value corresponding to the i-th feature signal point;

S2132: Sort the codes corresponding to each of the characteristic signal points according to the sequence of each of the characteristic signal points in the target characteristic signal to obtain the first speech code.

As described in the above steps S2131-S2132, that is, in the hidden layer of the first recurrent neural network, each feature point of the target feature signal is encoded according to the second predetermined formula, so that it is related to the value of the corresponding signal point, That is, the formula h(i)=σ[z(i)]=σ(Uz(i)+Wh(i-1)+b) is used for encoding, and h(i) represents the i-th characteristic signal point corresponding to the The coding value, h(i-1) represents the coding value corresponding to the i-1 th feature signal point, and is sorted according to the sequence of each feature signal point to obtain the first speech code. It should be noted that the second predetermined formula fully considers the value of the previous encoding, and uses the convolution method for encoding, so that the obtained data of the first speech encoding is more comprehensive, and the result of the calculation based on the first speech encoding is also It will be better. Specifically, the corresponding answer voice can refer to more parameters, and the obtained result will be more accurate.

In a real-time example, the step S4 of inputting the second speech code into the speech decoder to obtain the answering speech includes:

S401: Obtain the speech coding sequence in the second speech coding;

S402: Decode the speech coding sequence based on the second recurrent neural network to obtain a decoded intermediate feature signal;

S403: Obtain the answer voice according to a preset correspondence between the intermediate feature signal and the answer voice; wherein the preset correspondence is obtained by training with corresponding sample data.

As described in the above steps S401-S403, the parsing of the second speech code is realized, that is, the speech code sequence of the second speech code is obtained, mainly to obtain the first code in the second speech code. The voiceprint model is actually a way to control the timbre after the voice is generated, that is, it is first decoded through the second recurrent neural network. After decoding, the voice information of the corresponding voice fragment, that is, the intermediate feature signal, can be obtained. The speech decoders are all trained by the corresponding sample data, that is, input the corresponding question speech from the speech decoder to get the corresponding answer speech. The speech decoder also decodes the speech and converts it into the corresponding intermediate feature signal, in addition, there is a preset correspondence between the answer voice and the intermediate feature signal in the speech decoder, and the preset correspondence can be

Among them, a _i represents the i-th speech of the answering speech, b _ij represents the value corresponding to the j-th syllable of the i-th speech, and c _ij represents the weight corresponding to the j-th syllable of the i-th speech,

l represents the length of the voice, so as to obtain the corresponding answer voice.

In one embodiment, before the step S5 of sending the answer voice to the client, the method further includes:

S411: Extract the first voiceprint feature in the voice segment and the third voiceprint feature in the answer voice;

S412: Detect the similarity between the first voiceprint feature and the third voiceprint feature, and determine whether the similarity is greater than a similarity threshold;

S413: If it is greater than the similarity threshold, execute the step of sending the answer voice to the customer.

As described in the above steps S411-S413, the detection of the answer voice is realized, that is, the first voiceprint feature in the voice segment and the third voiceprint feature in the answer voice are first extracted. The extraction method is described above. The method of detecting similarity can still be calculated by the similarity calculation formula, and judge whether the similarity value is greater than the similarity threshold. The correction has played a role and can be sent to the customer. If it is less than or equal to the similarity threshold, it means that the corresponding effect has not been played. The timbre in the answer voice is quite different from the customer's timbre. At this time, you can choose whether to send it or not. Give the customer, or statistics, retrain the pretrained model so that the timbre of the answering speech can be similar to the timbre of the customer.

Beneficial effects of the present application: in the manual customer service service, the synchronous voice encoder and the voice decoder are trained based on the sample data composed of the first voice segment of the question raised by the customer and the second voice segment corresponding to the question answered by the manual customer service, so that only It is necessary to obtain the voice fragments containing questions from the customer, and then the corresponding answer voice can be obtained, which realizes the realization of voice-to-voice, and simplifies the intelligent customer service Q&A system without converting the voice fragments into text, thereby improving the accuracy and calculation. efficiency, thereby improving customer satisfaction. In addition, the pre-trained voiceprint model is used to supervise the training of the answer voice, so that the generated timbre is unified, so that the customer experience effect is better.

Referring to FIG. 2, the present application also provides a voice intelligent customer service answering device, including:

an acquisition unit 10, configured to acquire the voice segment that the customer contains in question;

The first input unit 20 is used for inputting the speech fragment into the speech encoder to obtain the encoded first speech code;

a processing unit 30, configured to perform timbre standardization processing on the first speech code to obtain a second speech code;

The second input unit 40 is configured to input the second speech code into the speech decoder to obtain the answer speech; wherein, the speech encoder and the speech decoder are obtained through synchronous training, wherein the synchronous training method In order to input the first voice segment of the question raised by the customer in the manual customer service service into the voice encoder to be trained, and perform timbre standardization processing to obtain the voice coding corresponding to the first voice fragment, the corresponding voice coding is performed. Synchronously input the second voice segment corresponding to the question answered by the human customer service into the voice decoder to be trained for training;

The sending unit 50 is configured to send the answer voice to the client.

In one embodiment, the voice intelligent customer service answering device further includes:

a voiceprint feature extraction unit, configured to extract the first voiceprint feature in the voice segment;

a computing unit, configured to calculate the similarity between the second voiceprint feature corresponding to each voiceprint model in the voiceprint model library and the first voiceprint feature;

The screening unit is configured to screen out the voiceprint model with the greatest similarity as a pre-trained voiceprint model according to the calculation result, so as to perform timbre standardization processing on the first speech code.

In one embodiment, the first input unit 20 includes:

a preprocessing subunit, configured to preprocess the speech segment in the speech encoder to obtain a speech signal; wherein the speech signal is a one-dimensional signal formed in time sequence;

a compressed sensing processing subunit, configured to perform compressed sensing processing on the one-dimensional signal according to a first predetermined formula to obtain a target characteristic signal;

The feature signal input subunit is used for inputting the target feature signal into the first recurrent neural network to obtain the first speech code.

In one embodiment, the characteristic signal input subunit includes:

The encoding module is used to encode each feature signal point of the target feature signal in the hidden layer of the first recurrent neural network according to a second predetermined formula; wherein, the second predetermined formula is h( i)=σ[z(i)]=σ(Uz(i)+Wh(i-1)+b), σ is the activation function of the first recurrent neural network; b is the first linear offset coefficient; U is the first linear relationship coefficient of the first RNN, W is the second linear relationship coefficient of the first RNN, z(i) represents the i-th feature signal of the target feature signal point, h(i) represents the coded value corresponding to the i-th feature signal point;

The sorting module is configured to sort the codes corresponding to each of the characteristic signal points according to the order of each of the characteristic signal points in the target characteristic signal to obtain the first speech code.

In one embodiment, the second input unit 40 includes:

A coding sequence acquisition subunit, used for acquiring the speech coding sequence in the second speech coding;

a decoding subunit, configured to decode the speech coding sequence based on the second recurrent neural network to obtain a decoded intermediate feature signal;

The answer voice obtaining subunit is configured to obtain the answer voice according to the preset correspondence between the intermediate feature signal and the answer voice; wherein, the preset correspondence is obtained by training corresponding sample data.

In one embodiment, the intelligent customer service answering device for voice further includes:

A third voiceprint feature extraction unit, configured to extract the first voiceprint feature in the voice segment and the third voiceprint feature in the answer voice;

a similarity detection unit, configured to detect the similarity between the first voiceprint feature and the third voiceprint feature, and determine whether the similarity is greater than a similarity threshold;

An execution unit, configured to execute the step of sending the answer voice to the client if the similarity is greater than the similarity threshold.

Referring to FIG. 3 , an embodiment of the present application further provides a computer device. The computer device may be a server, and its internal structure may be as shown in FIG. 3 . The computer device includes a processor, memory, a network interface, and a database connected by a system bus. Among them, the processor of the computer design is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium, an internal memory. The nonvolatile storage medium stores an operating system, a computer program, and a database. The memory provides an environment for the execution of the operating system and computer programs in the non-volatile storage medium. The database of the computer equipment is used to store various voice data and the like. The network interface of the computer device is used to communicate with an external terminal through a network connection. When the computer program is executed by the processor, the voice-based intelligent customer service answering method described in any of the above embodiments can be implemented.

Those skilled in the art can understand that the structure shown in FIG. 3 is only a block diagram of a partial structure related to the solution of the present application, and does not constitute a limitation on the computer equipment to which the solution of the present application is applied.

Embodiments of the present application further provide a computer-readable storage medium on which a computer program is stored, and the computer-readable storage medium may be non-volatile or volatile. When the computer program is executed by the processor, the voice-based intelligent customer service answering method described in any of the above embodiments can be implemented.

Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through a computer program, and the computer program can be stored in a non-volatile computer-readable storage In the medium, when the computer program is executed, it may include the processes of the above-mentioned method embodiments. Wherein, any reference to memory, storage, database or other medium provided in this application and used in the embodiments may include non-volatile and/or volatile memory. Nonvolatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double-rate SDRAM (SSRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), etc.

It should be noted that, herein, the terms "comprising", "comprising" or any other variation thereof are intended to encompass non-exclusive inclusion, such that a process, device, article or method comprising a series of elements includes not only those elements, It also includes other elements not expressly listed or inherent to such a process, apparatus, article or method. Without further limitation, an element qualified by the phrase "comprising a..." does not preclude the presence of additional identical elements in the process, apparatus, article, or method that includes the element.

Blockchain is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm. Blockchain, essentially a decentralized database, is a series of data blocks associated with cryptographic methods. Each data block contains a batch of network transaction information to verify its Validity of information (anti-counterfeiting) and generation of the next block. The blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

The underlying platform of the blockchain can include processing modules such as user management, basic services, smart contracts, and operation monitoring. Among them, the user management module is responsible for the identity information management of all blockchain participants, including maintenance of public and private key generation (account management), key management, and maintenance of the corresponding relationship between the user's real identity and blockchain address (authority management), etc. When authorized, supervise and audit the transactions of some real identities, and provide rule configuration for risk control (risk control audit); the basic service module is deployed on all blockchain node devices to verify the validity of business requests, After completing the consensus on valid requests, record them in the storage. For a new business request, the basic service first adapts the interface for analysis and authentication processing (interface adaptation), and then encrypts the business information through the consensus algorithm (consensus management), After encryption, it is completely and consistently transmitted to the shared ledger (network communication), and records are stored; the smart contract module is responsible for the registration and issuance of contracts, as well as contract triggering and contract execution. Developers can define contract logic through a programming language and publish to On the blockchain (contract registration), according to the logic of the contract terms, call the key or other events to trigger execution, complete the contract logic, and also provide the function of contract upgrade and cancellation; the operation monitoring module is mainly responsible for the deployment in the product release process , configuration modification, contract settings, cloud adaptation, and visual output of real-time status during product operation, such as: alarms, monitoring network conditions, monitoring node equipment health status, etc.

The above descriptions are only preferred embodiments of the present application, and are not intended to limit the present application. For those skilled in the art, the present application may have various modifications and changes. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of this application shall be included within the scope of the claims of this application.

Claims

A voice-based intelligent customer service answering method, comprising:

Get the customer's voice snippet that contains the question;

Inputting the speech fragment into a speech encoder to obtain the encoded first speech code;

Carrying out the timbre standardization process with the described first speech code, obtains the second speech code;

Input the second voice code into the voice decoder to obtain the answer voice; wherein, the voice encoder and the voice decoder are obtained through synchronous training, and the synchronous training method is to use the manual customer service service, the customer proposed The first voice fragment of the question is input into the voice encoder to be trained, and the timbre standardization process is performed to obtain the voice code corresponding to the first voice fragment, and the corresponding voice code and the second voice corresponding to the manual customer service answering question are obtained. The speech segment is synchronously input into the speech decoder to be trained for training;

The answering voice is sent to the client.
The intelligent customer service answering method of voice according to claim 1, wherein, before the step of performing timbre standardization processing on the first voice coding to obtain the second voice coding, further comprising:

extracting the first voiceprint feature in the speech segment;

Calculate the similarity between the second voiceprint feature corresponding to each voiceprint model in the voiceprint model library and the first voiceprint feature;

According to the calculation result, a voiceprint model with the greatest similarity is selected as a pre-trained voiceprint model, so as to perform timbre standardization processing on the first speech code.
The intelligent customer service answering method of voice as claimed in claim 1, wherein the step of inputting the voice fragment into a voice encoder to obtain the encoded first voice code comprises:

In the speech encoder, the speech segment is preprocessed to obtain a speech signal; wherein, the speech signal is a one-dimensional signal formed in time sequence;

Perform compressed sensing processing on the one-dimensional signal according to the first predetermined formula to obtain a target characteristic signal;

The target feature signal is input into the first recurrent neural network to obtain the first speech code.
The voice-based intelligent customer service answering method according to claim 3, wherein the step of inputting the target feature signal into the first recurrent neural network to obtain the first voice coding comprises:

In the hidden layer of the first recurrent neural network, each feature signal point of the target feature signal is encoded according to a second predetermined formula; wherein, the second predetermined formula is h(i)=σ[ z(i)]=σ(Uz(i)+Wh(i-1)+b), σ is the activation function of the first recurrent neural network; b is the first linear offset coefficient; U is the first A first linear relationship coefficient of a recurrent neural network, W is the second linear relationship coefficient of the first recurrent neural network, z(i) represents the i-th feature signal point of the target feature signal, h(i ) represents the coded value corresponding to the i-th feature signal point;

According to the sequence of each of the characteristic signal points in the target characteristic signal, the codes corresponding to each of the characteristic signal points are sorted to obtain the first speech code.
The voice-based intelligent customer service answering method according to claim 1, wherein the step of inputting the second voice code into a voice decoder to obtain the answering voice comprises:

obtaining the speech coding sequence in the second speech coding;

Decode the speech coding sequence based on the second recurrent neural network to obtain a decoded intermediate feature signal;

The answer voice is obtained according to the preset correspondence between the intermediate feature signal and the answer voice; wherein the preset correspondence is obtained by training with corresponding sample data.
The voice-based intelligent customer service answering method according to claim 1, wherein before the step of sending the answering voice to the customer, the method further comprises:

extracting the first voiceprint feature in the voice segment and the third voiceprint feature in the answering voice;

Detecting the similarity between the first voiceprint feature and the third voiceprint feature, and judging whether the similarity is greater than a similarity threshold;

If it is greater than the similarity threshold, the step of sending the answer voice to the customer is performed.
A voice intelligent customer service answering device, comprising:

An acquisition unit, which is used to acquire the voice segment that the customer contains in question;

a first input unit for inputting the speech fragment into a speech encoder to obtain the encoded first speech code;

a processing unit, configured to perform timbre standardization processing on the first speech code to obtain a second speech code;

The second input unit is configured to input the second speech code into the speech decoder to obtain the answer speech; wherein, the speech encoder and the speech decoder are obtained through synchronous training, and the synchronous training method is to In the manual customer service, the first voice fragment of the question raised by the customer is input into the voice encoder to be trained, and the timbre standardization is performed to obtain the voice code corresponding to the first voice fragment, and the corresponding voice code and artificial voice code are obtained. The second voice segment corresponding to the question answered by the customer service is synchronously input into the voice decoder to be trained for training;

a sending unit, configured to send the answering voice to the client.
The voice intelligent customer service answering device according to claim 7, wherein the voice intelligent customer service answering device further comprises:

a voiceprint feature extraction unit, configured to extract the first voiceprint feature in the voice segment;

a computing unit, configured to calculate the similarity between the second voiceprint feature corresponding to each voiceprint model in the voiceprint model library and the first voiceprint feature;

The screening unit is configured to screen out the voiceprint model with the greatest similarity as a pre-trained voiceprint model according to the calculation result, so as to perform timbre standardization processing on the first speech code.
A computer device, comprising a memory and a processor, wherein the memory stores a computer program, wherein, when the processor executes the computer program, the steps of a voice-based intelligent customer service answering method are implemented:

Get the customer's voice snippet that contains the question;

Inputting the speech fragment into a speech encoder to obtain the encoded first speech code;

Carrying out the timbre standardization process with the described first speech code, obtains the second speech code;

Input the second voice code into the voice decoder to obtain the answer voice; wherein, the voice encoder and the voice decoder are obtained through synchronous training, and the synchronous training method is to use the manual customer service service, the customer proposed The first voice fragment of the question is input into the voice encoder to be trained, and the timbre standardization process is performed to obtain the voice code corresponding to the first voice fragment, and the corresponding voice code and the second voice corresponding to the manual customer service answering question are obtained. The speech segment is synchronously input into the speech decoder to be trained for training;

The answering voice is sent to the client.
The intelligent customer service answering method of voice as claimed in claim 9, wherein, before the step of performing timbre standardization processing on the first voice code to obtain the second voice code, the method further comprises:

extracting the first voiceprint feature in the speech segment;

Calculate the similarity between the second voiceprint feature corresponding to each voiceprint model in the voiceprint model library and the first voiceprint feature;

According to the calculation result, a voiceprint model with the greatest similarity is selected as a pre-trained voiceprint model, so as to perform timbre standardization processing on the first speech code.
The intelligent customer service answering method of voice according to claim 9, wherein the step of inputting the voice fragment into a voice encoder to obtain the encoded first voice code comprises:

In the speech encoder, the speech segment is preprocessed to obtain a speech signal; wherein, the speech signal is a one-dimensional signal formed in time sequence;

Perform compressed sensing processing on the one-dimensional signal according to the first predetermined formula to obtain a target characteristic signal;

The target feature signal is input into the first recurrent neural network to obtain the first speech code.
The voice-based intelligent customer service answering method according to claim 11, wherein the step of inputting the target feature signal into the first recurrent neural network to obtain the first voice coding comprises:

In the hidden layer of the first recurrent neural network, each feature signal point of the target feature signal is encoded according to a second predetermined formula; wherein, the second predetermined formula is h(i)=σ[ z(i)]=σ(Uz(i)+Wh(i-1)+b), σ is the activation function of the first recurrent neural network; b is the first linear offset coefficient; U is the first A first linear relationship coefficient of a recurrent neural network, W is the second linear relationship coefficient of the first recurrent neural network, z(i) represents the i-th feature signal point of the target feature signal, h(i ) represents the coded value corresponding to the i-th feature signal point;

According to the sequence of each of the characteristic signal points in the target characteristic signal, the codes corresponding to each of the characteristic signal points are sorted to obtain the first speech code.
The intelligent customer service answering method of voice as claimed in claim 9, wherein the step of inputting the second voice code into a voice decoder to obtain the answering voice comprises:

obtaining the speech coding sequence in the second speech coding;

Decode the speech coding sequence based on the second recurrent neural network to obtain a decoded intermediate feature signal;

The answer voice is obtained according to the preset correspondence between the intermediate feature signal and the answer voice; wherein the preset correspondence is obtained by training with corresponding sample data.
The voice-based intelligent customer service answering method according to claim 9, wherein before the step of sending the answering voice to the customer, the method further comprises:

extracting the first voiceprint feature in the voice segment and the third voiceprint feature in the answering voice;

Detecting the similarity between the first voiceprint feature and the third voiceprint feature, and judging whether the similarity is greater than a similarity threshold;

If it is greater than the similarity threshold, the step of sending the answer voice to the customer is performed.
A computer-readable storage medium on which a computer program is stored, wherein when the computer program is executed by a processor, the steps of a voice-based intelligent customer service answering method are implemented:

Get the customer's voice snippet that contains the question;

Inputting the speech fragment into a speech encoder to obtain the encoded first speech code;

Carrying out the timbre standardization process with the described first speech code, obtains the second speech code;

Input the second voice code into the voice decoder to obtain the answer voice; wherein, the voice encoder and the voice decoder are obtained through synchronous training, and the synchronous training method is to use the manual customer service service, the customer proposed The first voice fragment of the question is input into the voice encoder to be trained, and the timbre standardization process is performed to obtain the voice code corresponding to the first voice fragment, and the corresponding voice code and the second voice corresponding to the manual customer service answering question are obtained. The speech segment is synchronously input into the speech decoder to be trained for training;

The answering voice is sent to the client.
The intelligent customer service answering method of voice according to claim 15, wherein before the step of performing timbre standardization processing on the first voice code to obtain the second voice code, the method further comprises:

extracting the first voiceprint feature in the speech segment;

Calculate the similarity between the second voiceprint feature corresponding to each voiceprint model in the voiceprint model library and the first voiceprint feature;

According to the calculation result, a voiceprint model with the greatest similarity is selected as a pre-trained voiceprint model, so as to perform timbre standardization processing on the first speech code.
The voice-based intelligent customer service answering method according to claim 15, wherein the step of inputting the voice segment into a voice encoder to obtain the encoded first voice code comprises:

In the speech encoder, the speech segment is preprocessed to obtain a speech signal; wherein, the speech signal is a one-dimensional signal formed in time sequence;

Perform compressed sensing processing on the one-dimensional signal according to the first predetermined formula to obtain a target characteristic signal;

The target feature signal is input into the first recurrent neural network to obtain the first speech code.
The voice-based intelligent customer service answering method according to claim 17, wherein the step of inputting the target feature signal into the first recurrent neural network to obtain the first voice coding comprises:

In the hidden layer of the first recurrent neural network, each feature signal point of the target feature signal is encoded according to a second predetermined formula; wherein, the second predetermined formula is h(i)=σ[ z(i)]=σ(Uz(i)+Wh(i-1)+b), σ is the activation function of the first recurrent neural network; b is the first linear offset coefficient; U is the first A first linear relationship coefficient of a recurrent neural network, W is the second linear relationship coefficient of the first recurrent neural network, z(i) represents the i-th feature signal point of the target feature signal, h(i ) represents the coded value corresponding to the i-th feature signal point;

According to the sequence of each of the characteristic signal points in the target characteristic signal, the codes corresponding to each of the characteristic signal points are sorted to obtain the first speech code.
The voice-based intelligent customer service answering method according to claim 15, wherein the step of inputting the second voice code into a voice decoder to obtain the answering voice comprises:

obtaining the speech coding sequence in the second speech coding;

Decode the speech coding sequence based on the second recurrent neural network to obtain a decoded intermediate feature signal;

The answer voice is obtained according to the preset correspondence between the intermediate feature signal and the answer voice; wherein the preset correspondence is obtained by training with corresponding sample data.
The voice-based intelligent customer service answering method according to claim 15, wherein before the step of sending the answering voice to the customer, the method further comprises:

extracting the first voiceprint feature in the voice segment and the third voiceprint feature in the answering voice;

Detecting the similarity between the first voiceprint feature and the third voiceprint feature, and judging whether the similarity is greater than a similarity threshold;

If it is greater than the similarity threshold, the step of sending the answer voice to the customer is performed.