WO2022178969A1 - Voice conversation data processing method and apparatus, and computer device and storage medium - Google Patents

Voice conversation data processing method and apparatus, and computer device and storage medium

Info

Publication number
WO2022178969A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
dialogue
call
user
machine
Prior art date
Application number
PCT/CN2021/090173
Other languages
French (fr)
Chinese (zh)
Inventor
申定潜
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2022178969A1 publication Critical patent/WO2022178969A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3343 Query execution using phonetics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G06F 40/35 Discourse or dialogue representation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/18 Artificial neural networks; Connectionist approaches
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/22 Interactive procedures; Man-machine interfaces
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular, to a method, device, computer equipment and storage medium for processing voice dialogue data.
  • Human-machine dialogue is an important part of the field of artificial intelligence and has rich application scenarios.
  • for example, in debt-collection scenarios, artificial intelligence can be introduced for AI voice collection calls, which reduces labor costs.
  • however, current human-machine dialogue technology lacks processing of the speech data itself: machine speech is drawn from a fixed speech library.
  • the voice library is usually recorded by professional announcers, with a voice that aims to sound sincere and proper.
  • such a voice library is relatively rigid and sounds the same regardless of the user and usage scenario, which makes the user experience poor and the human-machine voice dialogue interaction insufficiently intelligent.
  • the purpose of the embodiments of the present application is to provide a voice dialogue data processing method, apparatus, computer device and storage medium, so as to solve the problem that human-machine voice dialogue interaction is not intelligent enough.
  • the embodiments of the present application provide a method for processing voice dialogue data, which adopts the following technical solution:
  • according to a triggered voice dialogue data processing instruction, acquiring the call voice information of the current call and the user tag of the user in the current call; converting the call voice information and the user tag into a vector matrix with weights; inputting the vector matrix with weights into an emotion judgment model to obtain a machine dialogue emotion parameter;
  • performing voice adjustment on a pre-recorded standard dialogue voice according to the machine dialogue emotion parameter to obtain an adapted dialogue voice, wherein the voice adjustment includes acoustic adjustment and modal particle adjustment; and
  • conducting a human-machine dialogue based on the adapted dialogue voice.
  • the embodiments of the present application also provide a voice dialogue data processing apparatus, which adopts the following technical solution:
  • an acquisition module configured to acquire the call voice information of the current call and the user tag of the user in the current call according to a triggered voice dialogue data processing instruction;
  • a conversion module configured to convert the call voice information and the user tag into a vector matrix with weights;
  • a matrix input module configured to input the weighted vector matrix into an emotion judgment model to obtain a machine dialogue emotion parameter;
  • a voice adjustment module configured to perform voice adjustment on a pre-recorded standard dialogue voice according to the machine dialogue emotion parameter to obtain an adapted dialogue voice, wherein the voice adjustment includes acoustic adjustment and modal particle adjustment; and
  • a human-machine dialogue module configured to conduct a human-machine dialogue based on the adapted dialogue voice.
  • an embodiment of the present application further provides a computer device, including a memory and a processor, wherein the memory stores computer-readable instructions, and the processor implements the following steps when executing the computer-readable instructions:
  • according to a triggered voice dialogue data processing instruction, acquiring the call voice information of the current call and the user tag of the user in the current call; converting the call voice information and the user tag into a vector matrix with weights; inputting the vector matrix with weights into an emotion judgment model to obtain a machine dialogue emotion parameter;
  • performing voice adjustment on a pre-recorded standard dialogue voice according to the machine dialogue emotion parameter to obtain an adapted dialogue voice, wherein the voice adjustment includes acoustic adjustment and modal particle adjustment; and
  • conducting a human-machine dialogue based on the adapted dialogue voice.
  • the embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by a processor, the following steps are implemented:
  • according to a triggered voice dialogue data processing instruction, acquiring the call voice information of the current call and the user tag of the user in the current call; converting the call voice information and the user tag into a vector matrix with weights; inputting the vector matrix with weights into an emotion judgment model to obtain a machine dialogue emotion parameter;
  • performing voice adjustment on a pre-recorded standard dialogue voice according to the machine dialogue emotion parameter to obtain an adapted dialogue voice, wherein the voice adjustment includes acoustic adjustment and modal particle adjustment; and
  • conducting a human-machine dialogue based on the adapted dialogue voice.
  • the embodiments of the present application mainly have the following beneficial effects: after the voice dialogue data processing instruction is received, the call voice information of the current call and the user tag of the user in the current call are obtained, and the user tag can represent the user's personal information; the call voice information and user tag are then converted into a vector matrix with weights.
  • the vector matrix integrates the user's voice characteristics during the call and the user's personal information.
  • the emotion judgment model processes the vector matrix and maps it to the machine dialogue emotion parameter.
  • the machine dialogue emotion parameter represents the emotion category and intensity that the machine should adopt.
  • according to the machine dialogue emotion parameter, the standard dialogue speech undergoes acoustic adjustment and modal particle adjustment to obtain the adapted dialogue speech; in this way, the dialogue emotion is selected in a targeted manner according to the user's dialogue emotion and personal information during the human-machine dialogue, which improves the intelligence of human-machine voice dialogue interaction.
  • FIG. 1 is an exemplary system architecture diagram to which the present application can be applied;
  • FIG. 2 is a flowchart of an embodiment of a method for processing voice dialogue data according to the present application;
  • FIG. 3 is a schematic structural diagram of an embodiment of a voice dialogue data processing apparatus according to the present application.
  • FIG. 4 is a schematic structural diagram of an embodiment of a computer device according to the present application.
  • the system architecture 100 may include terminal devices 101 , 102 , and 103 , a network 104 and a server 105 .
  • the network 104 is a medium used to provide a communication link between the terminal devices 101 , 102 , 103 and the server 105 .
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
  • the user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like.
  • Various communication client applications may be installed on the terminal devices 101 , 102 and 103 , such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, social platform software, and the like.
  • the terminal devices 101, 102, and 103 can be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.
  • the server 105 may be a server that provides various services, such as a background server that provides support for the pages displayed on the terminal devices 101 , 102 , and 103 .
  • the voice dialogue data processing method provided by the embodiment of the present application is generally executed by a server, and accordingly, the voice dialogue data processing apparatus is generally set in the server.
  • the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative; there can be any number of terminal devices, networks, and servers according to implementation needs.
  • the described voice dialogue data processing method comprises the following steps:
  • Step S201: according to a triggered voice dialogue data processing instruction, acquire the call voice information of the current call and the user tag of the user in the current call.
  • the electronic device on which the voice dialogue data processing method runs (for example, the server shown in FIG. 1) may communicate with the terminal through a wired or wireless connection.
  • the above wireless connection methods may include, but are not limited to, 3G/4G connections, WiFi connections, Bluetooth connections, WiMAX connections, Zigbee connections, UWB (ultra wideband) connections, and other wireless connection methods currently known or developed in the future.
  • the voice dialogue data processing instruction may be an instruction instructing the server to perform data processing on the call voice information.
  • User tags can be derived from pre-established user portraits, which record many tags for the user and describe the user's basic information. In a debt-collection scenario, the user's credit evaluation score can also be obtained and used as a user tag.
  • after collecting the live call voice information, the terminal generates a voice dialogue data processing instruction and sends it to the server, and the server obtains the call voice information of the current call according to the instruction.
  • a man-machine dialogue system is set in the terminal, which can realize man-machine dialogue under the control of the server.
  • when the man-machine dialogue is started, the server also obtains the user's user ID and queries the user tag from the database according to it. While acquiring the call voice information, the server can also acquire the user tag, and then process the voice dialogue data according to both the call voice information and the user tag.
  • Step S202: convert the call voice information and the user tag into a vector matrix with weights.
  • the server may extract speech feature parameters from the voice information of the call to obtain a feature parameter matrix.
  • Speech feature parameters are parameters extracted from speech and are used to analyze its tone and emotion. To imitate a real human voice during human-machine dialogue, the speech feature parameters of the training corpus must be obtained.
  • the speech feature parameters reflect the prosodic features of the speech; prosody determines where the speech should pause and for how long, which words should be stressed and which read lightly, and so on, producing rising and falling tones and a natural cadence.
  • the voice information of the call can be preprocessed first.
  • specifically, voice activity detection (VAD) is performed on the call voice information to identify and remove long silences from the audio stream; the de-silenced call voice is then framed, dividing the sound into short segments called frames. Framing can be implemented by sliding a window function, and adjacent frames may overlap, as in the sketch below.
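  • As a concrete illustration of framing (not the source's own implementation), the Python sketch below splits a signal into overlapping frames with a sliding Hamming window; the 25 ms frame length and 10 ms hop are assumed values.

```python
# Split de-silenced call audio into short overlapping, windowed frames.
import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int,
                 frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    if len(signal) < frame_len:                       # pad very short input
        signal = np.pad(signal, (0, frame_len - len(signal)))
    window = np.hamming(frame_len)                    # smooth frame edges
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    return np.stack([signal[i * hop_len:i * hop_len + frame_len] * window
                     for i in range(n_frames)])       # (n_frames, frame_len)
```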
  • the feature parameters include linear prediction cepstral coefficients (LPCC) and Mel-frequency cepstral coefficients (MFCC).
  • the purpose of extracting feature parameters is to convert each frame of call speech into a multi-dimensional vector.
  • the server may extract either the linear prediction cepstral coefficients or the Mel cepstral coefficients and use them as the speech feature parameters, for example as sketched below.
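  • A hedged sketch of this extraction using the open-source librosa library (the source does not name a toolkit): long silences are removed first, then one MFCC vector is produced per frame.

```python
# Crude VAD (drop long silences) followed by per-frame MFCC extraction.
import numpy as np
import librosa

def extract_mfcc(path: str, n_mfcc: int = 13) -> np.ndarray:
    y, sr = librosa.load(path, sr=None)                # keep native rate
    intervals = librosa.effects.split(y, top_db=30)    # non-silent regions
    voiced = (np.concatenate([y[s:e] for s, e in intervals])
              if len(intervals) else y)
    # (n_frames, n_mfcc): one multi-dimensional vector per frame
    return librosa.feature.mfcc(y=voiced, sr=sr, n_mfcc=n_mfcc).T
```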
  • next, weights can be assigned to the feature parameter matrix and the user tag matrix.
  • the proportion of the weight distribution can be preset and flexibly adjusted according to actual needs.
  • the weighted feature parameter matrix and user tag matrix together form the vector matrix with weights; a minimal sketch follows.
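  • The sketch below shows one way such a weighted matrix could be assembled; the 0.7/0.3 split and the per-frame tiling of the tag vector are illustrative assumptions, since the source states only that the proportions are preset and adjustable.

```python
# Weight the speech-feature block and the user-tag block, then concatenate.
import numpy as np

def build_weighted_matrix(feature_matrix: np.ndarray,
                          user_tags: np.ndarray,
                          w_speech: float = 0.7,
                          w_tags: float = 0.3) -> np.ndarray:
    tag_rows = np.tile(user_tags, (feature_matrix.shape[0], 1))
    return np.hstack([w_speech * feature_matrix, w_tags * tag_rows])
```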
  • Step S203: input the vector matrix with weights into the emotion judgment model to obtain the machine dialogue emotion parameter.
  • the emotion determination model is used to determine the emotion and its intensity that should be adopted by the human-machine dialogue system during the human-computer dialogue.
  • the machine dialogue emotion parameter is the quantitative evaluation value of the speech emotion that the human-machine dialogue system should adopt during the human-machine dialogue.
  • the emotion judgment model needs to be trained in advance; it convolves and pools the vector matrix and maps it to the machine dialogue emotion parameter. That is, given the call voice information and user tags, the emotion judgment model outputs the machine dialogue emotion parameter.
  • the machine dialogue emotion parameter is a quantitative evaluation value of the speech emotion that the human-machine dialogue system should adopt, and it can be a numerical value.
  • the entire value range of the dialogue emotion parameter is divided into intervals, each corresponding to a dialogue emotion such as mild, cautious, or aggressive.
  • each emotion can be further divided into multiple sub-intervals, each corresponding to an intensity of that emotion; the sketch below illustrates this interval decoding.
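  • In the illustration below, the [0, 1) parameter range, the three category names, and the three intensity levels are invented for the example.

```python
# Map a scalar machine dialogue emotion parameter to (category, intensity).
def decode_emotion(param: float) -> tuple[str, int]:
    categories = ["mild", "cautious", "aggressive"]  # one interval per emotion
    idx = min(int(param * len(categories)), len(categories) - 1)
    within = param * len(categories) - idx           # position inside interval
    intensity = min(int(within * 3) + 1, 3)          # 3 intensity levels
    return categories[idx], intensity

print(decode_emotion(0.95))  # -> ('aggressive', 3)
```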
  • Step S204: perform voice adjustment on the pre-recorded standard dialogue voice according to the machine dialogue emotion parameter to obtain an adapted dialogue voice, wherein the voice adjustment includes acoustic adjustment and modal particle adjustment.
  • the standard collection voice may be a collection voice without emotion.
  • a standard dialogue voice is pre-recorded in the server, and the standard dialogue voice can be recorded from a real voice without emotion.
  • the server performs voice adjustment on the standard dialogue voice according to the machine dialogue emotional parameters, thereby changing the emotional tendency of the standard dialogue voice, and obtaining the adapted dialogue voice.
  • voice adjustment includes acoustic adjustment and modal particle adjustment. Acoustic adjustment changes the acoustic characteristics of the standard dialogue speech; modal particle adjustment splices audio containing modal particles into the standard dialogue voice, since modal particles can also shift the emotional tendency of the speech to a certain extent.
  • for example, when the user shows strong resistance, a dialogue emotion parameter with a strongly aggressive emotion is output, and after voice adjustment an adapted dialogue voice with an aggressive emotion is obtained, achieving dialogue effects such as warning the user.
  • the above-mentioned standard dialogue voice can also be stored in a node of a blockchain.
  • the server can obtain standard conversational speech from the nodes of the blockchain.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each block contains a batch of network transaction information used to verify the validity (anti-counterfeiting) of the information and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • Step S205: conduct a man-machine dialogue based on the adapted dialogue voice.
  • the server sends the adapted dialogue voice to the terminal, and the terminal plays the adapted dialogue voice to realize the man-machine dialogue.
  • the adapted dialogue voice is generated according to the user's dialogue emotion and personal information during the man-machine dialogue, so its voice emotion is highly targeted, which improves the intelligence of man-machine voice dialogue interaction.
  • the call voice information of the current call and the user tag of the user in the current call are obtained, and the user tag can represent the user's personal information;
  • the call voice information and user tag are converted into a vector matrix with weights; the vector matrix combines the user's voice characteristics and personal information.
  • the emotion judgment model processes the vector matrix and maps it to the machine dialogue emotion parameter.
  • the machine dialogue emotion parameters represent the emotion category that the machine should adopt.
  • the standard dialogue speech is acoustically adjusted and the modal particle is adjusted according to the machine dialogue emotion parameters, and the adapted dialogue speech is obtained, so that the dialogue emotion can be selected according to the user's dialogue emotion and personal information during the human-machine dialogue.
  • before step S201, the method may further include: obtaining the user identifier from a received man-machine dialogue start instruction; obtaining the user tag corresponding to the user identifier and converting the user tag into an initial vector matrix; inputting the initial vector matrix into the emotion judgment model to obtain initial dialogue emotion parameters; performing voice adjustment on a pre-recorded initial standard dialogue voice according to the initial dialogue emotion parameters to obtain an initial adapted dialogue voice; and conducting a man-machine dialogue based on the initial adapted dialogue voice while monitoring the dialogue voice to obtain the call voice information of the current call.
  • the man-machine dialogue initiation instruction may be an instruction instructing the server to start the man-machine dialogue.
  • at this time, the user has not yet started speaking and there is no call voice information containing the user's voice, so the server can take the lead in starting the man-machine dialogue.
  • the server starts the man-machine dialogue according to the received man-machine dialogue start instruction.
  • the user ID may be included in the man-machine dialogue initiation instruction.
  • the server extracts the user ID, and obtains the user label of the user according to the user ID in the database.
  • the server converts the obtained user label into a user label matrix. Since there is no voice information of the call, the characteristic parameter matrix can be set to zero, thereby obtaining an initial vector matrix.
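  • Continuing the earlier weighted-matrix sketch, the cold-start construction might look as follows; the dimensions and the weight split are illustrative assumptions.

```python
# A sketch of the cold-start case: with no call audio yet, the speech
# feature block is zeroed and only the user-tag block carries signal.
# Dimensions and weights are example values, not the patent's.
import numpy as np

def build_initial_matrix(user_tags: np.ndarray,
                         n_frames: int = 1,
                         n_speech_features: int = 13,
                         w_speech: float = 0.7,
                         w_tags: float = 0.3) -> np.ndarray:
    zero_features = np.zeros((n_frames, n_speech_features))  # no speech yet
    tag_rows = np.tile(user_tags, (n_frames, 1))             # replicate tags
    return np.hstack([w_speech * zero_features, w_tags * tag_rows])
```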
  • the server inputs the initial vector matrix into the emotion judgment model, and the emotion judgment model generates initial dialogue emotion parameters according to the initial vector matrix.
  • the server obtains the initial standard dialogue voice, which may be an emotionless voice that the machine can play when the man-machine dialogue starts.
  • the server performs voice adjustment on the initial standard dialogue voice according to the initial dialogue emotion parameter to obtain the initial adapted dialogue voice.
  • the server sends the initial adapted dialogue voice to the terminal; the terminal plays it to start the man-machine dialogue, and voice monitoring is performed after the dialogue starts to obtain the call voice information of the current call.
  • the initially adapted dialogue voice is an emotionally adapted voice obtained according to the user's personal information in the absence of call voice information.
  • the server may also obtain the initial standard dialogue voice, and conduct the man-machine dialogue directly according to the initial standard dialogue voice.
  • the machine dialogue emotional parameters are calculated in real time according to the call voice information and the user tag.
  • in this embodiment, when there is no call voice information, the initial dialogue emotion parameter can be obtained from the user tag alone, and the initial standard dialogue voice can be adjusted according to it to obtain the initial adapted dialogue voice for the man-machine dialogue, so that emotional tendencies can be added to the man-machine dialogue even before any call voice is available.
  • in some embodiments, the method may further include: acquiring training corpora, each including user tags, historical dialogue material, and dialogue emotion parameters; extracting the speech feature parameters of the historical dialogue material; assigning weights to the speech feature parameters and user tags to generate a vector matrix with weights; and training the initial emotion judgment model with the weighted vector matrix as model input and the dialogue emotion parameters as model output, to obtain the emotion judgment model.
  • the historical dialogue material can be obtained by manually filtering stored dialogue data, and it includes a first historical voice and a second historical voice, where the first historical voice may be the voice of a first user or of the man-machine dialogue system, and the second historical voice may be the voice of the second user in the conversation.
  • emotionally, the first historical voice matches well with the second user's information and the second historical voice.
  • the dialogue emotion parameter measures the emotion category and intensity of the first historical voice.
  • training corpora can be obtained from a pre-built corpus; each training corpus includes user tags, historical dialogue material, and dialogue emotion parameters, and within each corpus these three are matched with one another.
  • as with the call voice information, voice endpoint detection can first be performed on the historical dialogue material, which is then framed; the speech feature parameters, including linear prediction cepstral coefficients (LPCC) and Mel cepstral coefficients (MFCC), are then extracted from the framed speech data.
  • the server may extract any one of linear prediction cepstral coefficients and Mel cepstral coefficients.
  • the voice feature parameters extracted by the server include those of the first historical voice and those of the second historical voice. Since the present application determines the voice emotion and intensity required for the dialogue with the user, the feature parameters of the second historical voice deserve more consideration and can therefore receive a larger weight. The user tag also needs to be assigned a weight; that is, the total weight is shared among the feature parameters of the first historical voice, the feature parameters of the second historical voice, and the user tag, and the assigned weights can be flexibly adjusted according to actual needs.
  • the weighted speech feature parameters and user labels can form a weighted vector matrix, and the weighted vector matrix is input into the initial emotion judgment model, and the dialogue emotion parameters are used as the expected output of the initial emotion judgment model.
  • the vector matrix with weights is processed by the initial emotion judgment model, and the predicted label is output.
  • the prediction label is a quantitative evaluation value used in the training phase, which is used to quantitatively evaluate the emotion and intensity that a human or machine should take when talking to a user.
  • the server calculates the model loss from the predicted labels and the dialogue emotion parameters, adjusts the model parameters of the initial emotion judgment model with the goal of reducing the model loss, and re-inputs the vector matrix into the adjusted model for further iterations until the model meets the training requirement.
  • the server then stops iterating and obtains the emotion judgment model; a minimal training-loop sketch follows.
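  • The PyTorch sketch below shows one plausible shape of this loop; the layer sizes, Adam optimizer, and MSE loss are assumptions, since the source specifies only predicted labels, a model loss, and iterative parameter adjustment.

```python
# A minimal training sketch under the stated assumptions: a small fully
# connected network regresses the dialogue emotion parameter from a
# flattened weighted vector (input width 128 is an example value).
import torch
import torch.nn as nn

model = nn.Sequential(              # input -> hidden -> output, fully connected
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(x: torch.Tensor, target: torch.Tensor) -> float:
    """x: (batch, 128) weighted vectors; target: (batch, 1) emotion params."""
    optimizer.zero_grad()
    pred = model(x)                 # the 'predicted label' of the training phase
    loss = loss_fn(pred, target)    # model loss vs. the dialogue emotion param
    loss.backward()
    optimizer.step()                # adjust parameters to reduce the loss
    return loss.item()
```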
  • the speech feature parameters are extracted from the historical dialogue data of the training corpus, and weights are assigned to the speech feature parameters and user labels, so as to differentiate the contributions of the speech feature parameters and user labels to the dialog emotion parameters;
  • the vector matrix with weights is used as the model input, and the dialogue emotion parameters are used as the model output to train the initial emotion judgment model, and an emotion judgment model that can accurately select emotions can be obtained.
  • in some embodiments, before the step of obtaining the user identifier from the received man-machine dialogue start instruction, the method further includes: in the Gpipe library, training the initial emotion judgment model on the training corpora based on a genetic algorithm, to obtain the emotion judgment model.
  • the initial emotion determination model may be a deep neural network (Deep Neural Networks, DNN).
  • the neural network layers inside a DNN can be divided into three categories: the input layer, hidden layers, and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and all layers in between are hidden layers; adjacent layers are fully connected.
  • the initial emotion judgment model can be trained through the training corpus based on the evolutionary algorithm in the Gpipe library.
  • Gpipe is a distributed machine learning library for scalable pipeline parallelism that can train giant deep neural networks.
  • Gpipe trains using synchronous stochastic gradient descent and pipeline parallelism, and is suitable for any DNN consisting of multiple consecutive layers.
  • Gpipe trains larger models by deploying more accelerators and partitioning the model across them. Specifically, the model is split across different accelerators, and mini-batches are automatically divided into smaller micro-batches that are pipelined efficiently across the accelerators; gradients are accumulated consistently across micro-batches, so the number of partitions does not affect model quality.
  • Gpipe supports deploying more accelerators to train larger models, and without adjusting hyperparameters, the model output results are more accurate and performance is improved.
  • evolutionary algorithms are a general class of search algorithms that simulate biological evolution mechanisms such as natural selection and genetics; the genetic algorithm is one of them. All evolutionary algorithms are iterative in nature and share the concepts of population, individual, and encoding: (1) a population can be understood as a set of models; (2) an individual is a particular model; (3) encoding describes an object in computer language, for example expressing a network structure as a fixed-length binary string.
  • producing each new generation requires three steps: selection, crossover, and mutation.
  • the selection step picks better individuals from the population, for example models with higher accuracy.
  • the fitness function, which measures the accuracy of a model's results, can be a loss function. A toy sketch of one generation follows.
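  • The toy generation step below makes the three operators concrete; treating individuals as flat parameter vectors, keeping the fitter half at selection, and Gaussian mutation are illustrative choices, not the source's prescription.

```python
# One generation of a toy genetic algorithm: selection, crossover, mutation.
# Fitness would come from the fitness (loss) function described above.
import numpy as np

rng = np.random.default_rng(0)

def evolve(population: np.ndarray, fitness: np.ndarray,
           mutation_scale: float = 0.02) -> np.ndarray:
    n, dim = population.shape
    survivors = population[np.argsort(fitness)[-(n // 2):]]  # selection
    children = []
    while len(survivors) + len(children) < n:
        a, b = survivors[rng.integers(len(survivors), size=2)]
        mask = rng.random(dim) < 0.5                   # crossover: mix parents
        child = np.where(mask, a, b)
        child += rng.normal(0.0, mutation_scale, dim)  # mutation: small noise
        children.append(child)
    return np.vstack([survivors, *children])
```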
  • the initial emotion determination model is trained based on the genetic algorithm, which ensures the accuracy of the emotion determination model obtained by training.
  • in some embodiments, step S204 may include: performing semantic analysis on the call voice information to obtain a semantic analysis result; selecting, from the pre-recorded standard dialogue voices, the standard dialogue voice corresponding to the semantic analysis result; querying, based on the machine dialogue emotion parameter, the voice adjustment method for the standard dialogue voice, the voice adjustment method including an acoustic adjustment method and a modal particle adjustment method; and adjusting the standard dialogue voice according to the voice adjustment method to obtain the adapted dialogue voice.
  • the server performs semantic analysis on the call voice information to obtain a semantic analysis result. For example, the call voice can be converted into call text, the similarity between the call text and each template text calculated, and the template text with the highest similarity, provided it exceeds a preset similarity threshold, taken as the semantic analysis result; a sketch follows.
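  • A minimal stand-in for this template matching, using bag-of-words cosine similarity over the ASR transcript; the whitespace tokenization, the similarity measure, and the 0.6 threshold are all assumptions.

```python
# Pick the template text most similar to the call text, if similar enough.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def match_template(call_text: str, templates: list[str],
                   threshold: float = 0.6) -> str | None:
    call_vec = Counter(call_text.split())
    scored = [(cosine(call_vec, Counter(t.split())), t) for t in templates]
    best_score, best_template = max(scored)
    return best_template if best_score > threshold else None
```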
  • the standard dialogue voice that matches the semantic analysis result can be selected from multiple pre-recorded standard dialogue voices.
  • the voice adjustment method refers to the method of adjusting the standard dialogue voice, including the acoustic adjustment method and the modal particle adjustment method.
  • the acoustic adjustment method specifies how acoustic feature information is adjusted, covering the energy concentration region, formant frequency, formant intensity, and bandwidth that characterize timbre, as well as the duration, fundamental frequency, and average voice power that characterize the prosody of speech.
  • Modal particle adjustment mode specifies the way in which modal particles are added to standard dialogue speech.
  • the server performs voice adjustment on the pre-recorded standard dialogue voice according to the voice adjustment method, so as to change the emotional tendency of the standard dialogue voice, and obtain the adapted dialogue voice.
  • for example, the emotional tendency of the standard dialogue speech can be adjusted toward pleasant through voice adjustment: acoustically, the pitch can be raised and the average voice power increased; for the modal particles, words conveying that emotion can be spliced in. An illustrative acoustic adjustment is sketched below.
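  • In the librosa-based sketch below, a deployed system would more likely drive a parametric synthesizer; the +2 semitones and 1.2x gain are arbitrary example values.

```python
# Nudge a recorded waveform toward a brighter, more pleasant tendency
# by raising pitch and average power (illustrative values only).
import numpy as np
import librosa

def brighten(y: np.ndarray, sr: int) -> np.ndarray:
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # raise pitch
    louder = 1.2 * shifted                                      # raise power
    return np.clip(louder, -1.0, 1.0)                           # avoid clipping
```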
  • note that during emotion judgment at run time, the evolutionary algorithm no longer needs to be used, so the emotion of the machine dialogue can be adjusted immediately.
  • in this embodiment, semantic analysis is performed on the call voice information to select a semantically matching standard dialogue voice, ensuring the semantic soundness of the human-machine dialogue; the voice adjustment method corresponding to the machine dialogue emotion parameter is queried, and acoustic adjustment and modal particle adjustment are applied to the standard dialogue voice accordingly, yielding an adapted dialogue voice with emotion.
  • after step S205, the method may further include: importing the call voice information of the current call into a pre-established intent recognition model to obtain a user intent recognition result; determining, according to the intent recognition result, whether the current call requires manual intervention; and, when the current call requires manual intervention, transferring the current call to the terminal logged in with a manual agent account.
  • the intent recognition model may be a model for identifying user intent.
  • the server can also detect and monitor the user's intent during the call, and identify the user's intent through a pre-trained intent recognition model.
  • the server imports the call voice information of the current call into the pre-established intent recognition model.
  • the intent recognition model can convert the call voice information into call text, perform semantic analysis on the call text, and output the intent recognition result.
  • the current call is transferred to the terminal logged in with the manual agent account, so that the manual agent can communicate with the user through the terminal.
  • in most cases, a voice matched to the user's emotion is selected for the human-machine dialogue. However, when the user clearly shows a willingness to resist repayment during the dialogue, it can be considered that manual intervention is required and the human-machine dialogue should be transferred so that a manual agent can intervene; likewise, when the human-machine dialogue system cannot effectively answer the user's questions, the dialogue is transferred to the terminal logged in with the manual agent account to provide better dialogue service.
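  • Schematically, the hand-off decision could look like the following; the intent labels and the confidence threshold are placeholders, and the intent recognition model itself is assumed to exist upstream.

```python
# Decide whether a call should be handed off to a manual agent, given the
# output of an (assumed) upstream intent recognition model.
HANDOFF_INTENTS = {"refuse_repayment", "unanswerable_question"}  # placeholders

def needs_manual_agent(intent_result: dict) -> bool:
    """intent_result example: {'intent': 'refuse_repayment', 'confidence': 0.91}"""
    return (intent_result["intent"] in HANDOFF_INTENTS
            and intent_result["confidence"] >= 0.8)
```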
  • in this embodiment, intent detection is performed during the man-machine dialogue; when the detection result indicates that the current call requires manual intervention, the call is transferred to the terminal logged in with the manual agent's account, introducing the manual agent into the man-machine dialogue in time and improving the intelligence of human-machine dialogue interaction.
  • in some embodiments, the step of transferring the current call to the terminal logged in with the manual agent account may include: when the current call requires manual intervention, acquiring the call voice information of the current call and the user tag of the user in the current call; converting the call voice information into call text; and transferring the current call to the terminal logged in with the manual agent account while sending the call text and user tag to the terminal for display.
  • when the server determines that the current call needs manual intervention, it converts the call voice information into call text and obtains the user's tag; when transferring the call to the terminal logged in with the manual agent account, the call text and user tag are sent to the terminal, so the manual agent can instantly grasp the dialogue context and the user's basic information without re-communicating, improving the efficiency and intelligence of dialogue interaction.
  • in this way, when the call is transferred to the terminal logged in with the manual agent account, the dialogue text and the user tag are sent to the terminal together, so that the dialogue can continue from where it left off without re-communication, which improves the efficiency and intelligence of dialogue interaction.
  • the voice dialogue data processing method in this application relates to neural networks, machine learning and voice processing in the field of artificial intelligence; in addition, it can also relate to smart life in the field of smart cities.
  • the computer-readable instructions can be stored in a computer-readable storage medium; when executed, they may include the processes of the above-mentioned method embodiments.
  • the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM) or the like.
  • as an implementation of the method shown in FIG. 2, the present application provides an embodiment of a voice dialogue data processing apparatus; the apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus can be applied to various electronic devices.
  • the voice dialogue data processing device 300 in this embodiment includes: an acquisition module 301, a conversion module 302, a matrix input module 303, a voice adjustment module 304, and a human-machine dialogue module 305, wherein:
  • the obtaining module 301 is configured to obtain the call voice information of the current call and the user tag of the user in the current call according to the triggered voice dialogue data processing instruction.
  • the conversion module 302 is configured to convert the call voice information and the user label into a vector matrix with weights.
  • the matrix input module 303 is used for inputting the vector matrix with weights into the emotion judgment model to obtain machine dialogue emotion parameters.
  • the speech adjustment module 304 is configured to perform speech adjustment on the pre-recorded standard dialogue speech according to the machine dialogue emotion parameter to obtain the adapted dialogue speech, wherein the speech adjustment includes acoustic adjustment and modal particle adjustment.
  • the man-machine dialogue module 305 is used to conduct man-machine dialogue based on the adapted dialogue voice.
  • the call voice information of the current call and the user tag of the user in the current call are obtained, and the user tag can represent the user's personal information;
  • the call voice information and user tag are converted into a vector matrix with weights; the vector matrix combines the user's voice characteristics and personal information.
  • the emotion judgment model processes the vector matrix and maps it to the machine dialogue emotion parameter.
  • the machine dialogue emotion parameters represent the emotion category that the machine should adopt.
  • the standard dialogue speech is acoustically adjusted and the modal particle is adjusted according to the machine dialogue emotion parameters, and the adapted dialogue speech is obtained, so that the dialogue emotion can be selected according to the user's dialogue emotion and personal information during the human-machine dialogue.
  • the voice dialogue data processing apparatus 300 may further include: an identification acquisition module, a label acquisition module, an initial input module, an initial adjustment module, and an initial dialogue module, wherein:
  • the identification obtaining module is used for obtaining the user identification in the man-machine dialogue starting instruction according to the received man-machine dialogue starting instruction.
  • the label obtaining module is used to obtain the user label corresponding to the user ID, and convert the user label into an initial vector matrix.
  • the initial input module is used to input the initial vector matrix into the emotion judgment model to obtain the initial dialogue emotion parameters.
  • the initial adjustment module is used to adjust the voice of the pre-recorded initial standard dialogue voice according to the initial dialogue emotion parameter to obtain the initial adapted dialogue voice.
  • the initial dialogue module is used to conduct a man-machine dialogue based on the initial adaptation dialogue voice, and monitor the voice of the man-machine dialogue to obtain the call voice information of the current call.
  • in this way, when there is no call voice information, the initial dialogue emotion parameter can be obtained from the user tag alone, and the initial standard dialogue voice can be adjusted according to it to obtain the initial adapted dialogue voice for the man-machine dialogue, so that emotional tendencies can be added to the man-machine dialogue even before any call voice is available.
  • the apparatus 300 for processing voice dialogue data may further include: a training acquisition module, a parameter extraction module, a weight allocation module, and an initial training module, wherein:
  • the training acquisition module is used to acquire training corpus, and the training corpus includes user labels, historical dialogue materials and dialogue emotion parameters.
  • the parameter extraction module is used to extract the speech feature parameters of the historical dialogue material.
  • the weight assignment module is used to assign weights to speech feature parameters and user labels to generate a vector matrix with weights.
  • the initial training module is used to use the vector matrix with weights as the model input, and the dialogue emotion parameters as the model output to train the initial emotion judgment model to obtain the emotion judgment model.
  • the speech feature parameters are extracted from the historical dialogue data of the training corpus, and weights are assigned to the speech feature parameters and user labels, so as to differentiate the contributions of the speech feature parameters and user labels to the dialog emotion parameters;
  • the vector matrix with weights is used as the model input, and the dialogue emotion parameters are used as the model output to train the initial emotion judgment model, and an emotion judgment model that can accurately select emotions can be obtained.
  • in some embodiments, the voice dialogue data processing apparatus 300 may further include a model training module configured to: in the Gpipe library, train the initial emotion judgment model on the training corpora based on a genetic algorithm, to obtain the emotion judgment model.
  • the initial emotion determination model is trained based on the genetic algorithm, which ensures the accuracy of the emotion determination model obtained by training.
  • the speech adjustment module 304 may include: a semantic parsing submodule, a standard selection submodule, a mode query submodule, and a speech adjustment submodule, wherein:
  • the semantic parsing sub-module is used to perform semantic parsing on the voice information of the call to obtain the semantic parsing result.
  • the standard selection sub-module is used to select the standard dialogue speech corresponding to the semantic analysis result from the pre-recorded standard dialogue speech.
  • the mode query sub-module is used to query the voice adjustment mode of the standard dialogue voice based on the machine dialogue emotion parameter, and the voice adjustment mode includes the acoustic adjustment mode and the modal particle adjustment mode.
  • the voice adjustment sub-module is used to adjust the standard dialogue voice according to the voice adjustment method to obtain the adapted dialogue voice.
  • semantic analysis is performed on the call voice information to select a semantically matching standard dialogue voice, ensuring the semantic soundness of the human-machine dialogue; the voice adjustment method corresponding to the machine dialogue emotion parameter is queried, and acoustic adjustment and modal particle adjustment are applied to the standard dialogue voice accordingly, yielding an adapted dialogue voice with emotion.
  • the apparatus 300 for processing voice dialogue data may further include: an information import module, a call determination module, and a call transfer module, wherein:
  • the information import module is used to import the voice information of the current call into the pre-established intent recognition model to obtain the user intent recognition result.
  • the call determination module is used to determine whether the current call requires manual intervention according to the intention recognition result.
  • the call transfer module is used to transfer the current call to the terminal logged in with the manual agent account when the current call requires manual intervention.
  • intent detection is performed during the man-machine dialogue; when the detection result indicates that the current call requires manual intervention, the call is transferred to the terminal logged in with the manual agent's account, introducing the manual agent into the man-machine dialogue in time and improving the intelligence of human-machine dialogue interaction.
  • the call transfer module may include: an acquisition submodule, an information conversion submodule, and a call transfer submodule, wherein:
  • the acquisition sub-module is used to acquire the call voice information of the current call and the user tag of the user in the current call when the current call requires manual intervention.
  • the information conversion submodule is used to convert the voice information of the call into the text of the call.
  • the call transfer sub-module is used to transfer the current call to the terminal logged in with the manual agent account, and send the call text and user label to the terminal for display.
  • in this way, when the call is transferred to the terminal logged in with the manual agent account, the dialogue text and the user tag are sent to the terminal together, so that the dialogue can continue from where it left off without re-communication, which improves the efficiency and intelligence of dialogue interaction.
  • FIG. 4 is a block diagram of a basic structure of a computer device according to this embodiment.
  • the computer device 4 includes a memory 41, a processor 42, and a network interface 43 that communicate with each other through a system bus. It should be noted that the figure only shows the computer device 4 with components 41-43, but it should be understood that not all of the shown components must be implemented; more or fewer components may be implemented instead. Those skilled in the art will understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP), embedded devices, and the like.
  • the computer equipment may be a desktop computer, a notebook computer, a palmtop computer, a cloud server and other computing equipment.
  • the computer device can interact with the user through a keyboard, mouse, remote control, touch pad, or voice-activated device or the like for human-machine voice dialogue interaction.
  • the memory 41 includes at least one type of computer-readable storage medium.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the computer-readable storage medium includes flash memory, hard disks, multimedia cards, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, and the like.
  • the memory 41 may be an internal storage unit of the computer device 4 , such as a hard disk or a memory of the computer device 4 .
  • the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc.
  • the memory 41 may also include both the internal storage unit of the computer device 4 and its external storage device.
  • the memory 41 is generally used to store the operating system and various application software installed in the computer device 4 , such as computer-readable instructions of a method for processing voice dialogue data.
  • the memory 41 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 42 may, in some embodiments, be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute computer-readable instructions or process data stored in the memory 41, for example, computer-readable instructions for executing the voice dialogue data processing method.
  • the network interface 43 may include a wireless network interface or a wired network interface, and the network interface 43 is generally used to establish a communication connection between the computer device 4 and other electronic devices.
  • the computer device provided in this embodiment can execute the above-mentioned voice dialogue data processing method.
  • the voice dialogue data processing method here may be the voice dialogue data processing methods of the above embodiments.
  • the call voice information of the current call and the user tag of the user in the current call are obtained, and the user tag can represent the user's personal information;
  • the call voice information and user tag are converted into a vector matrix with weights; the vector matrix combines the user's voice characteristics and personal information.
  • the emotion judgment model processes the vector matrix and maps it to the machine dialogue emotion parameter.
  • the machine dialogue emotion parameters represent the emotion category that the machine should adopt.
  • the standard dialogue speech is acoustically adjusted and the modal particle is adjusted according to the machine dialogue emotion parameters, and the adapted dialogue speech is obtained, so that the dialogue emotion can be selected according to the user's dialogue emotion and personal information during the human-machine dialogue.
  • the present application also provides another embodiment, namely a computer-readable storage medium storing computer-readable instructions, where the computer-readable instructions can be executed by at least one processor to cause the at least one processor to perform the steps of the voice dialogue data processing method described above.
  • the call voice information of the current call and the user tag of the user in the current call are obtained, and the user tag can represent the user's personal information;
  • the call voice information and user tag are converted into a vector matrix with weights; the vector matrix combines the user's voice characteristics and personal information.
  • the emotion judgment model processes the vector matrix and maps it to the machine dialogue emotion parameter.
  • the machine dialogue emotion parameters represent the emotion category that the machine should adopt.
  • the standard dialogue speech is acoustically adjusted and the modal particle is adjusted according to the machine dialogue emotion parameters, and the adapted dialogue speech is obtained, so that the dialogue emotion can be selected according to the user's dialogue emotion and personal information during the human-machine dialogue.
  • the method of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course can also be implemented by hardware, but in many cases the former is the better implementation.
  • the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product, which is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or a CD-ROM) and includes several instructions to make a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) execute the methods described in the various embodiments of this application.

Abstract

The embodiments of the present application belong to the field of artificial intelligence, and relate to a voice conversation data processing method and apparatus, and a computer device and a storage medium. The method comprises: according to a triggered voice conversation data processing instruction, acquiring call voice information of the current call and a user label of a user in the current call; converting the call voice information and the user label into vector matrices having weights; inputting the vector matrices having weights into an emotion determination model to obtain a machine conversation emotion parameter; according to the machine conversation emotion parameter, carrying out voice adjustment on a pre-recorded standard conversation voice to obtain an adapted conversation voice, wherein voice adjustment comprises acoustic adjustment and modal particle adjustment; and carrying out a man-machine conversation on the basis of the adapted conversation voice. In addition, the present application further relates to blockchain technology, and a standard conversation voice can be stored in a blockchain. By means of the present application, the intelligence of man-machine voice conversation interaction is improved.

Description

Voice dialogue data processing method, apparatus, computer device and storage medium
This application claims priority to the Chinese patent application No. 202110218920.0, filed with the China Patent Office on February 26, 2021 and entitled "Voice dialogue data processing method, apparatus, computer device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of artificial intelligence, and in particular to a voice dialogue data processing method and apparatus, a computer device, and a storage medium.
Background
With the development of computer technology, artificial intelligence (AI) is being applied ever more widely. Human-machine dialogue is an important part of the artificial intelligence field and has rich application scenarios; for example, in the debt-collection field, artificial intelligence can be introduced for AI voice collection, reducing labor costs.
However, current human-machine dialogue technology lacks processing of voice data: machine speech always uses a single fixed voice library. Such a library is usually recorded by professional announcers, and the speech aims for clear, well-rounded and formal articulation. The inventor realized, however, that this kind of voice library is rather rigid: it sounds exactly the same for different users and usage scenarios, resulting in a poor user experience and insufficiently intelligent human-machine voice dialogue interaction.
Summary of the Invention
The purpose of the embodiments of this application is to provide a voice dialogue data processing method, apparatus, computer device and storage medium, so as to solve the problem that human-machine voice dialogue interaction is not intelligent enough.
To solve the above technical problem, an embodiment of this application provides a voice dialogue data processing method, adopting the following technical solution:
acquiring, according to a triggered voice dialogue data processing instruction, the call voice information of the current call and the user tag of the user in the current call;
converting the call voice information and the user tag into a vector matrix with weights;
inputting the vector matrix with weights into an emotion judgment model to obtain a machine dialogue emotion parameter;
performing voice adjustment on a pre-recorded standard dialogue voice according to the machine dialogue emotion parameter to obtain an adapted dialogue voice, where the voice adjustment includes acoustic adjustment and modal-particle adjustment;
conducting a human-machine dialogue based on the adapted dialogue voice.
To solve the above technical problem, an embodiment of this application further provides a voice dialogue data processing apparatus, adopting the following technical solution:
an acquisition module, configured to acquire, according to a triggered voice dialogue data processing instruction, the call voice information of the current call and the user tag of the user in the current call;
a conversion module, configured to convert the call voice information and the user tag into a vector matrix with weights;
a matrix input module, configured to input the vector matrix with weights into an emotion judgment model to obtain a machine dialogue emotion parameter;
a voice adjustment module, configured to perform voice adjustment on a pre-recorded standard dialogue voice according to the machine dialogue emotion parameter to obtain an adapted dialogue voice, where the voice adjustment includes acoustic adjustment and modal-particle adjustment;
a human-machine dialogue module, configured to conduct a human-machine dialogue based on the adapted dialogue voice.
To solve the above technical problem, an embodiment of this application further provides a computer device, including a memory and a processor, where the memory stores computer-readable instructions and the processor implements the following steps when executing the computer-readable instructions:
acquiring, according to a triggered voice dialogue data processing instruction, the call voice information of the current call and the user tag of the user in the current call;
converting the call voice information and the user tag into a vector matrix with weights;
inputting the vector matrix with weights into an emotion judgment model to obtain a machine dialogue emotion parameter;
performing voice adjustment on a pre-recorded standard dialogue voice according to the machine dialogue emotion parameter to obtain an adapted dialogue voice, where the voice adjustment includes acoustic adjustment and modal-particle adjustment;
conducting a human-machine dialogue based on the adapted dialogue voice.
To solve the above technical problem, an embodiment of this application further provides a computer-readable storage medium storing computer-readable instructions, where the computer-readable instructions, when executed by a processor, implement the following steps:
acquiring, according to a triggered voice dialogue data processing instruction, the call voice information of the current call and the user tag of the user in the current call;
converting the call voice information and the user tag into a vector matrix with weights;
inputting the vector matrix with weights into an emotion judgment model to obtain a machine dialogue emotion parameter;
performing voice adjustment on a pre-recorded standard dialogue voice according to the machine dialogue emotion parameter to obtain an adapted dialogue voice, where the voice adjustment includes acoustic adjustment and modal-particle adjustment;
conducting a human-machine dialogue based on the adapted dialogue voice.
Compared with the prior art, the embodiments of this application mainly have the following beneficial effects: after a voice dialogue data processing instruction is received, the call voice information of the current call and the user tag of the user in the current call are acquired, where the user tag can represent the user's personal information; the call voice information and the user tag are converted into a vector matrix with weights, which fuses the user's speech features during the call with the user's personal information; the emotion judgment model processes the vector matrix and maps it to a machine dialogue emotion parameter, which represents the emotion category and intensity the machine should adopt; acoustic adjustment and modal-particle adjustment are performed on the standard dialogue voice according to the machine dialogue emotion parameter to obtain the adapted dialogue voice. In this way, the dialogue emotion is selected in a targeted manner according to the user's dialogue emotion and personal information during the human-machine dialogue, improving the intelligence of human-machine voice dialogue interaction.
Brief Description of the Drawings
To explain the solutions in this application more clearly, the following briefly introduces the drawings used in the description of the embodiments of this application. Obviously, the drawings described below illustrate only some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is an exemplary system architecture diagram to which this application can be applied;
FIG. 2 is a flowchart of one embodiment of a voice dialogue data processing method according to this application;
FIG. 3 is a schematic structural diagram of one embodiment of a voice dialogue data processing apparatus according to this application;
FIG. 4 is a schematic structural diagram of one embodiment of a computer device according to this application.
Detailed Description
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of this application. The terms used in the specification are only for describing specific embodiments and are not intended to limit the application. The terms "including" and "having", and any variations thereof, in the specification, claims, and above description of the drawings are intended to cover non-exclusive inclusion. The terms "first", "second", and the like in the specification, claims, or above drawings are used to distinguish different objects, not to describe a specific order.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of this application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to a separate or alternative embodiment mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
To enable those skilled in the art to better understand the solutions of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings.
As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102 and 103, a network 104, and a server 105. The network 104 is the medium providing communication links between the terminal devices 101, 102, 103 and the server 105, and may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
Users may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102 and 103, such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, and social platform software.
The terminal devices 101, 102 and 103 may be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.
The server 105 may be a server providing various services, for example a background server that supports the pages displayed on the terminal devices 101, 102 and 103.
It should be noted that the voice dialogue data processing method provided by the embodiments of this application is generally executed by the server; accordingly, the voice dialogue data processing apparatus is generally arranged in the server.
It should be understood that the numbers of terminal devices, networks and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks and servers according to implementation needs.
Continuing to refer to FIG. 2, a flowchart of one embodiment of a voice dialogue data processing method according to this application is shown. The voice dialogue data processing method includes the following steps:
Step S201: acquire, according to a triggered voice dialogue data processing instruction, the call voice information of the current call and the user tag of the user in the current call.
In this embodiment, the electronic device on which the voice dialogue data processing method runs (for example, the server shown in FIG. 1) may communicate with the terminal through a wired or wireless connection. It should be noted that the wireless connection may include, but is not limited to, 3G/4G, WiFi, Bluetooth, WiMAX, Zigbee, UWB (ultra wideband), and other wireless connection methods now known or developed in the future.
The voice dialogue data processing instruction may be an instruction that instructs the server to perform data processing on the call voice information. The user tag may come from a pre-established user profile, which records numerous tags of the user and depicts the user's basic information. In a collection scenario, the user's credit evaluation score may also be acquired and used as another kind of user tag.
Specifically, during the human-machine dialogue, after the terminal collects the real-time call voice information, it generates a voice dialogue data processing instruction and sends it to the server, and the server acquires the call voice information of the current call according to the instruction. A human-machine dialogue system is deployed on the terminal and can conduct the human-machine dialogue under the control of the server.
When the human-machine dialogue starts, the server also acquires the user's user identifier and queries the user tag from the database according to the user identifier. While acquiring the call voice information, the server can also acquire the user tag, and then performs voice dialogue data processing according to both.
Step S202: convert the call voice information and the user tag into a vector matrix with weights.
Specifically, the server may extract speech feature parameters from the call voice information to obtain a feature parameter matrix.
A speech feature parameter is a parameter extracted from speech and used to analyze the tone and emotion of the speech. To imitate a real human voice during human-machine dialogue, the speech feature parameters of the training corpus must be acquired. Speech feature parameters can reflect the prosodic features of speech, and prosodic features determine where the speech should pause and for how long, which characters or words should be stressed, which should be read lightly, and so on, producing the rise and fall and cadence of a natural voice.
The call voice information may first be preprocessed: voice activity detection (VAD) is performed to identify and remove long silences from the audio signal stream, and the silence-removed call voice information is then split into frames, dividing the sound into short segments, each called a frame. The segmentation can be implemented by a moving window function, and adjacent frames may overlap.
Feature parameters are then extracted from the preprocessed call voice information. The feature parameters include linear prediction cepstral coefficients (LPCC) and Mel-frequency cepstral coefficients (MFCC); the purpose of feature extraction is to convert each frame of call voice information into a multi-dimensional vector. The server may extract either the linear prediction cepstral coefficients or the Mel cepstral coefficients and use them as the speech feature parameters.
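As a rough illustration of this preprocessing and feature-extraction stage, the sketch below uses the open-source librosa library to trim long silences (a crude stand-in for VAD), frame the waveform with an overlapping moving window, and extract MFCC features. The sampling rate, frame sizes, and coefficient count are illustrative assumptions, not values fixed by this application.

```python
import librosa

def extract_mfcc_features(wav_path, n_mfcc=13):
    """Per-frame MFCC feature matrix from a call recording (sketch only)."""
    y, sr = librosa.load(wav_path, sr=16000)      # load the call audio at 16 kHz
    y, _ = librosa.effects.trim(y, top_db=30)     # remove leading/trailing silence
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(0.025 * sr),                    # 25 ms frames...
        hop_length=int(0.010 * sr),               # ...with a 10 ms hop, so frames overlap
    )
    return mfcc.T                                 # shape: (num_frames, n_mfcc)
```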
When processing the user tags, the user tags need to be quantized according to a preset quantization rule to obtain a user tag matrix.
Since voice dialogue data processing is performed according to both the call voice information and the user tags, weights can be assigned to the feature parameter matrix and the user tag matrix. The weight allocation ratio can be preset and flexibly adjusted according to actual needs. The weighted feature parameter matrix and the weighted user tag matrix together form the vector matrix.
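One plausible way to realize this weighted fusion is to scale the feature parameter matrix and the quantized user-tag vector by preset weights and stack them into a single matrix; the 0.7/0.3 split and the padding scheme below are assumptions for illustration only.

```python
import numpy as np

def build_weighted_matrix(feature_matrix, tag_vector,
                          speech_weight=0.7, tag_weight=0.3):
    """Fuse per-frame speech features with quantized user tags (sketch only)."""
    dim = feature_matrix.shape[1]
    tags = np.zeros(dim)                          # pad/truncate tags to one extra row
    n = min(dim, len(tag_vector))
    tags[:n] = np.asarray(tag_vector)[:n]
    return np.vstack([speech_weight * feature_matrix,
                      tag_weight * tags])
```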
Step S203: input the vector matrix with weights into an emotion judgment model to obtain a machine dialogue emotion parameter.
The emotion judgment model is used to determine the emotion, and its intensity, that the human-machine dialogue system should adopt during the dialogue. The machine dialogue emotion parameter is a quantitative evaluation value of the speech emotion the system should adopt.
Specifically, the emotion judgment model needs to be obtained through model training in advance. It can convolve and pool the vector matrix and map it to a machine dialogue emotion parameter; that is, given the user's voice information in the call and the user tag, the model outputs the machine dialogue emotion parameter.
The machine dialogue emotion parameter may be a single numerical value: its full value range is divided into intervals, with each interval corresponding to one dialogue emotion, such as mild, cautious, or aggressive. Each emotion can in turn be divided into several sub-intervals, each corresponding to a level of emotional intensity.
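Decoding such a scalar parameter can then be a simple interval lookup. The interval boundaries and category names below are invented for illustration; the application only requires that the range be partitioned per emotion and sub-partitioned per intensity.

```python
# Hypothetical interval table over a parameter range of [0, 1].
EMOTION_INTERVALS = [
    (0.0, 0.4, "mild"),
    (0.4, 0.7, "cautious"),
    (0.7, 1.0, "aggressive"),
]

def decode_emotion(param):
    """Map a scalar emotion parameter to an (emotion category, intensity) pair."""
    for low, high, category in EMOTION_INTERVALS:
        if low <= param < high or param == high == 1.0:
            intensity = (param - low) / (high - low)   # position inside the interval
            return category, intensity
    raise ValueError(f"parameter {param} outside the expected range [0, 1]")
```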
Step S204: perform voice adjustment on the pre-recorded standard dialogue voice according to the machine dialogue emotion parameter to obtain an adapted dialogue voice, where the voice adjustment includes acoustic adjustment and modal-particle adjustment.
Here, in the collection scenario, the standard collection voice may be a collection voice that carries no emotion.
Specifically, a standard dialogue voice is pre-recorded on the server; it may be recorded from a real human voice and carries no emotion. The server performs voice adjustment on the standard dialogue voice according to the machine dialogue emotion parameter, thereby changing its emotional tendency and obtaining the adapted dialogue voice. The voice adjustment includes acoustic adjustment and modal-particle adjustment: acoustic adjustment changes the acoustic characteristics of the standard dialogue voice, while modal-particle adjustment may splice speech containing modal particles into the standard dialogue voice; modal particles can also change the emotional tendency of the speech to a certain extent.
For example, in a voice collection scenario, when the user's personal credit status is poor and the user's attitude during the dialogue is poor, a dialogue emotion parameter with a fairly strong aggressive emotion is output, and after voice adjustment an adapted dialogue voice with an aggressive emotion is obtained, achieving dialogue effects such as warning the user.
It should be emphasized that, to further ensure the privacy and security of the standard dialogue voice, it may also be stored in a node of a blockchain, and the server may acquire the standard dialogue voice from the blockchain node.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association using cryptographic methods, where each data block contains information on a batch of network transactions, used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
Step S205: conduct a human-machine dialogue based on the adapted dialogue voice.
Specifically, the server sends the adapted dialogue voice to the terminal, and the terminal plays it to realize the human-machine dialogue. Because the adapted dialogue voice is generated according to the user's dialogue emotion and personal information during the dialogue, its emotional coloring is highly targeted, which improves the intelligence of human-machine voice dialogue interaction.
In this embodiment, after the voice dialogue data processing instruction is received, the call voice information of the current call and the user tag of the user in the current call are acquired, where the user tag can represent the user's personal information; the call voice information and the user tag are converted into a vector matrix with weights, which fuses the user's speech features during the call with the user's personal information; the emotion judgment model processes the vector matrix and maps it to a machine dialogue emotion parameter, which represents the emotion category and intensity the machine should adopt; acoustic adjustment and modal-particle adjustment are performed on the standard dialogue voice according to the machine dialogue emotion parameter to obtain the adapted dialogue voice. In this way, the dialogue emotion is selected in a targeted manner according to the user's dialogue emotion and personal information during the human-machine dialogue, improving the intelligence of human-machine voice dialogue interaction.
Further, before step S201 above, the method may further include: acquiring, according to a received human-machine dialogue start instruction, the user identifier in the instruction; acquiring the user tag corresponding to the user identifier and converting the user tag into an initial vector matrix; inputting the initial vector matrix into the emotion judgment model to obtain an initial dialogue emotion parameter; performing voice adjustment on a pre-recorded initial standard dialogue voice according to the initial dialogue emotion parameter to obtain an initial adapted dialogue voice; and conducting a human-machine dialogue based on the initial adapted dialogue voice while monitoring the dialogue voice to obtain the call voice information of the current call.
The human-machine dialogue start instruction may be an instruction instructing the server to start the human-machine dialogue. At the very beginning of the dialogue, the user has not yet begun speaking and there is no call voice information containing the user's voice, so the server can be the first to speak in the dialogue.
Specifically, the server starts the human-machine dialogue according to the received start instruction, which may carry the user identifier. The server extracts the user identifier and acquires the user's user tag from the database according to it.
The server converts the acquired user tag into a user tag matrix; since there is no call voice information, the feature parameter matrix can be set to zero, yielding the initial vector matrix. The server inputs the initial vector matrix into the emotion judgment model, which generates the initial dialogue emotion parameter from it.
The server acquires the initial standard dialogue voice, which may be the voice the machine plays when the dialogue starts, carrying no emotion. The server performs voice adjustment on the initial standard dialogue voice according to the initial dialogue emotion parameter to obtain the initial adapted dialogue voice.
The server sends the initial adapted dialogue voice to the terminal; the terminal plays it to start the human-machine dialogue, and after the dialogue starts, voice monitoring is performed to obtain the call voice information of the current call. It can be understood that the initial adapted dialogue voice is an emotion-adapted voice obtained from the user's personal information alone, in the absence of call voice information.
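For this cold start, the same fusion step can be reused with the speech portion zeroed out, since no call audio exists yet; the sketch below assumes the hypothetical build_weighted_matrix helper from the earlier sketch and invents the dimensions.

```python
import numpy as np

def build_initial_matrix(tag_vector, feature_dim=13, num_frames=1):
    """Initial vector matrix at dialogue start: user tags only, zeroed speech rows."""
    empty_speech = np.zeros((num_frames, feature_dim))   # no call voice information yet
    return build_weighted_matrix(empty_speech, tag_vector)
```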
In one embodiment, after receiving the human-machine dialogue start instruction, the server may also acquire the initial standard dialogue voice and conduct the dialogue directly with it, and only after the call voice information is obtained compute the machine dialogue emotion parameter in real time according to the call voice information and the user tag.
In this embodiment, at the very start of the human-machine dialogue, the initial dialogue emotion parameter can be obtained from the user tag alone, and the initial standard dialogue voice is adjusted according to it to obtain the initial adapted dialogue voice, so that an emotional tendency can be added to the dialogue even when there is no call voice information.
Further, before the step of acquiring the user identifier in the human-machine dialogue start instruction, the method may further include: acquiring a training corpus, which includes user tags, historical dialogue corpora, and dialogue emotion parameters; extracting speech feature parameters from the historical dialogue corpora; assigning weights to the speech feature parameters and the user tags to generate a vector matrix with weights; and training an initial emotion judgment model with the weighted vector matrix as the model input and the dialogue emotion parameter as the model output, to obtain the emotion judgment model.
The historical dialogue corpora may be obtained by manually screening stored dialogue corpora. Each historical dialogue corpus includes a first historical voice and a second historical voice: the first may be the voice of a first user or of the human-machine dialogue system, and the second may be the voice of a second user in the dialogue. In the screened corpora, the first historical voice matches well, emotionally, with the second user's user information and the second historical voice. The dialogue emotion parameter measures the emotion category of the first historical voice and the intensity of the emotion.
Specifically, training corpora can be acquired from a training corpus database; each training corpus includes a user tag, a historical dialogue corpus, and a dialogue emotion parameter, and within each corpus the three are matched.
Voice activity detection may first be performed on the historical dialogue corpora, followed by framing. Speech feature parameters are then extracted from the framed speech data; the feature parameters include linear prediction cepstral coefficients (LPCC) and Mel cepstral coefficients (MFCC), and the server may extract either of the two.
The speech feature parameters extracted by the server include those of the first historical voice and those of the second historical voice. Since this application determines the speech emotion, and its intensity, required when talking with the user, the feature parameters from the second historical voice deserve particular consideration and may therefore be given a larger weight. Meanwhile, the user tags also need to be assigned a weight; that is, the weight is shared among the feature parameters of the first historical voice, the feature parameters of the second historical voice, and the user tags. The allocated weights can be adjusted flexibly according to actual needs.
The weighted speech feature parameters and user tags form a vector matrix with weights, which is input into the initial emotion judgment model, with the dialogue emotion parameter as the model's expected output. The initial emotion judgment model processes the weighted vector matrix and outputs a predicted label. The predicted label is a quantitative evaluation value used in the training stage to quantify the emotion, and its intensity, that a person or machine should adopt when talking with the user.
The server computes the model loss from the predicted label and the dialogue emotion parameter and, with the goal of reducing the loss, adjusts the model parameters of the initial emotion judgment model; after adjustment the vector matrix is fed back into the model for another iteration, until the resulting loss is smaller than a preset loss threshold, at which point the server stops iterating and obtains the emotion judgment model.
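This train-until-the-loss-threshold loop might look roughly like the following PyTorch sketch, with the weighted matrix flattened into the model input and the dialogue emotion parameter as the regression target; the layer sizes, learning rate, and threshold are placeholder assumptions.

```python
import torch
import torch.nn as nn

class EmotionModel(nn.Module):
    """A small fully connected DNN mapping an input vector to a scalar in [0, 1]."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid(),   # bound the emotion parameter to [0, 1]
        )

    def forward(self, x):
        return self.net(x)

def train_until_threshold(model, inputs, targets,
                          loss_threshold=1e-3, lr=1e-3, max_epochs=1000):
    """Iterate until the model loss drops below the preset loss threshold."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    for _ in range(max_epochs):
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        if loss.item() < loss_threshold:      # stop once the threshold is met
            break
    return model
```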
In this embodiment, after the training corpora are acquired, speech feature parameters are extracted from the historical dialogue corpora, and weights are assigned to the speech feature parameters and user tags to differentiate their contributions to the dialogue emotion parameter; training the initial emotion judgment model with the weighted vector matrix as input and the dialogue emotion parameter as output yields an emotion judgment model capable of accurate emotion selection.
Further, in one embodiment, before the step of acquiring the user identifier in the human-machine dialogue start instruction, the method may further include: training the initial emotion judgment model with the training corpora in the Gpipe library, based on a genetic algorithm, to obtain the emotion judgment model.
Specifically, the initial emotion judgment model may be a deep neural network (DNN). The neural network layers inside a DNN fall into three categories: the input layer, hidden layers, and the output layer. In general, the first layer is the input layer, the last is the output layer, the layers in between are all hidden layers, and adjacent layers are fully connected.
To ensure accurate training of the initial emotion judgment model, it can be trained with the training corpora in the Gpipe library based on an evolutionary algorithm. Gpipe is a distributed machine learning, scalable pipeline-parallelism library that can train giant deep neural networks. It trains with synchronous stochastic gradient descent and pipeline parallelism and is applicable to any DNN composed of multiple consecutive layers. Gpipe trains larger models by deploying more accelerators: it allows a model to be partitioned across accelerators, splitting the model and assigning the parts to different accelerators, and automatically splits mini-batches into smaller micro-batches, enabling efficient training across multiple accelerators; because gradients are accumulated consistently across micro-batches, the number of partitions does not affect model quality. Gpipe therefore supports deploying more accelerators to train larger models and, without tuning hyperparameters, makes the model output more accurate, improving performance.
Evolutionary algorithms are a family of search algorithms that simulate biological evolution mechanisms such as natural selection and inheritance; genetic algorithms are one class of them. All kinds of evolutionary algorithms are iterative in nature and share the concepts of population, individual, and encoding, where: (1) a population can be understood as a set of models; (2) an individual can be understood as a particular model; (3) encoding means describing an object in a computer-processable form, for example representing a network structure as a fixed-length binary string.
In an evolutionary algorithm, producing each next generation takes three steps: selection, crossover, and mutation:
(1) The selection process picks better objects from the population, such as models with higher accuracy.
(2) The crossover process exchanges information between different good objects, such as swapping modules between two good models.
(3) The mutation process makes a small change to an individual; compared with crossover, it introduces more randomness and helps escape local optima.
After mutation, the models are evaluated with a fitness function and the better ones are kept, until the final optimal model is obtained. The fitness function may be a loss function, used to measure the accuracy of the model's computation results.
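A hedged sketch of the selection/crossover/mutation loop follows, applied to flat lists of floats (e.g. hyperparameters) rather than full networks for brevity; the fitness function, population sizes, and mutation scale are illustrative, and the Gpipe-based partitioning across accelerators is omitted entirely.

```python
import random

def evolve(fitness, population, generations=20, keep=4, sigma=0.1):
    """Generic genetic search: select the fittest, cross them over, mutate."""
    population = [list(ind) for ind in population]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)     # selection: keep the best
        parents = population[:keep]
        children = []
        while len(children) < len(population) - keep:
            a, b = random.sample(parents, 2)
            child = [random.choice(pair) for pair in zip(a, b)]  # crossover
            i = random.randrange(len(child))
            child[i] += random.gauss(0.0, sigma)       # mutation adds randomness
            children.append(child)
        population = parents + children
    return max(population, key=fitness)
```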
In this embodiment, the initial emotion judgment model is trained in the Gpipe library based on a genetic algorithm, ensuring the accuracy of the resulting emotion judgment model.
Further, step S205 above may include: performing semantic parsing on the call voice information to obtain a semantic parsing result; selecting, from the pre-recorded standard dialogue voices, the standard dialogue voice corresponding to the semantic parsing result; querying, based on the machine dialogue emotion parameter, the voice adjustment mode for the standard dialogue voice, the voice adjustment mode including an acoustic adjustment mode and a modal-particle adjustment mode; and performing voice adjustment on the standard dialogue voice according to the voice adjustment mode to obtain the adapted dialogue voice.
Specifically, the server performs semantic parsing on the call voice information to obtain a semantic parsing result. The call voice information may first be converted into call text, and a pre-trained intent recognition model performs intent recognition on the call text to obtain the user intent, which serves as the semantic parsing result; alternatively, the similarity between the call text and each pre-stored template text may be computed, and the template text with the highest similarity, provided it exceeds a preset similarity threshold, is taken as the semantic parsing result.
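For the template-matching variant, a TF-IDF cosine-similarity lookup is one straightforward realization; the threshold value here is an assumption for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def match_template(call_text, templates, threshold=0.6):
    """Return the stored template text closest to the call text, or None."""
    vectors = TfidfVectorizer().fit_transform(templates + [call_text])
    sims = cosine_similarity(vectors[-1], vectors[:-1])[0]
    best = sims.argmax()
    # Accept the match only if it clears the preset similarity threshold.
    return templates[best] if sims[best] >= threshold else None
```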
There may be multiple pre-recorded standard dialogue voices with different semantic meanings, and the standard dialogue voice matching the semantic parsing result can be selected from among them.
Each machine dialogue emotion parameter has a preset voice adjustment mode. The voice adjustment mode specifies how the standard dialogue voice is adjusted and includes an acoustic adjustment mode and a modal-particle adjustment mode. The acoustic adjustment mode specifies how the acoustic feature information is adjusted, including the energy concentration region, formant frequency, formant intensity and bandwidth that characterize timbre, as well as the duration, fundamental frequency, average speech power and other quantities that characterize prosody. The modal-particle adjustment mode specifies how modal particles are added to the standard dialogue voice.
The server performs voice adjustment on the pre-recorded standard dialogue voice according to the voice adjustment mode, thereby changing its emotional tendency and obtaining the adapted dialogue voice. For example, the emotional tendency of the standard dialogue voice can be adjusted toward pleasant: in the acoustic adjustment mode, the pitch and the average speech power can be raised; in the modal-particle adjustment mode, a particle such as "haha" can be appended to the end of the standard dialogue voice.
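The acoustic side of such an adjustment might be approximated with off-the-shelf pitch shifting plus a gain factor, and the modal-particle side with simple waveform concatenation; the parameter values and the pre-recorded particle clip are assumptions, not the application's prescribed implementation.

```python
import numpy as np
import librosa

def adapt_speech(standard_wav, particle_wav=None, sr=16000,
                 pitch_steps=2.0, gain=1.2):
    """Raise pitch and average power, then append a modal-particle clip (sketch)."""
    adjusted = librosa.effects.pitch_shift(standard_wav, sr=sr, n_steps=pitch_steps)
    adjusted = np.clip(adjusted * gain, -1.0, 1.0)     # raise average speech power
    if particle_wav is not None:                       # e.g. a pre-recorded "haha"
        adjusted = np.concatenate([adjusted, particle_wav])
    return adjusted
```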
At inference time, because timeliness requirements must be met, the evolutionary algorithm can be dropped from the emotion judgment, so that the machine's dialogue emotion can be adjusted instantly.
In this embodiment, semantic parsing is performed on the call voice information so that a semantically matching standard dialogue voice is selected, ensuring that the human-machine dialogue is semantically reasonable; the voice adjustment mode corresponding to the machine dialogue emotion parameter is queried, and acoustic adjustment and modal-particle adjustment are performed on the standard dialogue voice accordingly, yielding an adapted dialogue voice that carries emotion.
Further, after step S205 above, the method may further include: importing the call voice information of the current call into a pre-established intent recognition model to obtain a user intent recognition result; determining, according to the intent recognition result, whether the current call requires human intervention; and when it does, transferring the current call to the terminal logged in with a human agent account.
The intent recognition model may be a model that recognizes the user's intent.
Specifically, the server may also detect and monitor the user's intent during the call, recognizing it with the pre-trained intent recognition model. The server imports the call voice information of the current call into the model, which can convert the call voice information into call text, perform semantic analysis on the text, and output the intent recognition result.
When the intent recognition result indicates that the current call requires human intervention, the current call is transferred to the terminal logged in with the human agent account, so that the human agent can talk with the user through the terminal.
For example, in an AI collection scenario, a voice matching the user's emotion is selected for the human-machine dialogue; when the user clearly shows obvious resistance to repayment in the dialogue, human intervention can be deemed necessary and the dialogue is transferred to the terminal logged in with the human agent account so that a human agent takes over; or, when the human-machine dialogue system cannot effectively answer the user's questions, the dialogue is transferred to that terminal so as to provide a better dialogue service.
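Routing on the intent result can then be a simple guard in the dialogue loop; the intent labels and the transfer callback below are hypothetical placeholders.

```python
# Hypothetical intent labels that indicate human intervention is needed.
HANDOFF_INTENTS = {"strong_repayment_refusal", "unanswerable_question"}

def route_call(intent, call_id, transfer_to_agent):
    """Hand the call to a human agent's terminal if the recognized intent requires it."""
    if intent in HANDOFF_INTENTS:
        transfer_to_agent(call_id)   # transfer to the terminal logged in by the agent
        return "agent"
    return "bot"
```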
In this embodiment, intent detection is performed during the human-machine dialogue, and when the detection result indicates that the current call requires human intervention, the call is transferred to the terminal logged in with the human agent account, bringing a human agent into the dialogue in time and improving the intelligence of the dialogue interaction.
Further, the step of transferring the current call to the terminal logged in with the human agent account when the current call requires human intervention may include: when the current call requires human intervention, acquiring the call voice information of the current call and the user tag of the user in the current call; converting the call voice information into call text; and transferring the current call to the terminal logged in with the human agent account while sending the call text and the user tag to the terminal for display.
Specifically, when the server determines that the current call requires human intervention, it converts the call voice information of the current call into call text and acquires the user's user tag; when transferring the current call to the terminal logged in with the human agent account, it sends the call text and the user tag to the terminal, so that the human agent can immediately grasp the context of the dialogue and the user's basic information without having to communicate from scratch, improving the efficiency and intelligence of the dialogue interaction.
In this embodiment, when the call is transferred to the terminal logged in with the human agent account, the dialogue text and the user tag are sent to the terminal together, so that the dialogue can continue from where it left off rather than starting over, improving the efficiency and intelligence of the dialogue interaction.
The voice dialogue data processing method in this application involves neural networks, machine learning, and speech processing in the field of artificial intelligence; it may also relate to smart living in the field of smart cities.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by computer-readable instructions directing the relevant hardware. The computer-readable instructions can be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, or a read-only memory (ROM), or a random access memory (RAM), or the like.
It should be understood that although the steps in the flowchart of the accompanying drawings are displayed sequentially as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in the flowchart may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different moments; their execution order is also not necessarily sequential, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
进一步参考图3,作为对上述图2所示方法的实现,本申请提供了一种语音对话数据处理装置的一个实施例,该装置实施例与图2所示的方法实施例相对应,该装置具体可以应用于各种电子设备中。With further reference to FIG. 3 , as an implementation of the method shown in FIG. 2 above, the present application provides an embodiment of an apparatus for processing voice dialogue data. The apparatus embodiment corresponds to the method embodiment shown in FIG. 2 . The apparatus Specifically, it can be applied to various electronic devices.
如图3所示,本实施例所述的语音对话数据处理装置300包括:获取模块301、转换模块302、矩阵输入模块303、语音调整模块304以及人机对话模块305,其中:As shown in FIG. 3 , the voice dialogue data processing device 300 in this embodiment includes: an acquisition module 301, a conversion module 302, a matrix input module 303, a voice adjustment module 304, and a human-machine dialogue module 305, wherein:
The acquisition module 301 is configured to obtain, according to a triggered voice dialogue data processing instruction, the call voice information of the current call and the user tag of the user in the current call.
The conversion module 302 is configured to convert the call voice information and the user tag into a vector matrix with weights.
The matrix input module 303 is configured to input the weighted vector matrix into an emotion judgment model to obtain machine dialogue emotion parameters.
The voice adjustment module 304 is configured to perform voice adjustment on pre-recorded standard dialogue voice according to the machine dialogue emotion parameters to obtain adapted dialogue voice, wherein the voice adjustment includes acoustic adjustment and modal-particle adjustment.
The human-machine dialogue module 305 is configured to conduct human-machine dialogue based on the adapted dialogue voice.
In this embodiment, after the voice dialogue data processing instruction is received, the call voice information of the current call and the user tag of the user in the current call are obtained; the user tag can characterize the user's personal information. The call voice information and the user tag are converted into a vector matrix with weights, which fuses the voice features of the user's call with the user's personal information. The emotion judgment model processes and maps the vector matrix to obtain machine dialogue emotion parameters, which characterize the emotion category and intensity the machine should adopt. According to the machine dialogue emotion parameters, acoustic adjustment and modal-particle adjustment are performed on the standard dialogue voice to obtain the adapted dialogue voice. In this way, the dialogue emotion is selected in a targeted manner according to the user's dialogue emotion and personal information during the human-machine dialogue, improving the intelligence of the human-machine voice interaction.
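A compact sketch of this pipeline, under the assumption that the speech-feature block and the tag embedding share one column dimension, might look as follows; every helper (`featurize`, `embed_tags`, `emotion_model`, and so on) and the 0.7/0.3 weight split are illustrative stand-ins rather than details prescribed by the embodiment:

```python
import numpy as np

def build_weighted_matrix(speech_features: np.ndarray,
                          tag_vector: np.ndarray,
                          w_speech: float = 0.7,
                          w_tags: float = 0.3) -> np.ndarray:
    # Stack the weighted speech-feature block on top of the weighted
    # user-tag block (both assumed to share the same column dimension).
    return np.vstack([w_speech * np.atleast_2d(speech_features),
                      w_tags * np.atleast_2d(tag_vector)])

def adapted_reply(call_audio, user_tags, featurize, embed_tags,
                  emotion_model, select_standard_voice, adjust_voice):
    features = featurize(call_audio)      # voice features of the live call
    tag_vec = embed_tags(user_tags)       # embedding of personal information
    matrix = build_weighted_matrix(features, tag_vec)
    # The model maps the fused matrix to an emotion category and intensity.
    emotion_params = emotion_model.predict(matrix)
    # Select a semantically matching pre-recorded reply, then apply the
    # acoustic and modal-particle adjustments the parameters call for.
    standard = select_standard_voice(call_audio)
    return adjust_voice(standard, emotion_params)
```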
In some optional implementations of this embodiment, the voice dialogue data processing apparatus 300 may further comprise: an identification acquisition module, a tag acquisition module, an initial input module, an initial adjustment module, and an initial dialogue module, wherein:
The identification acquisition module is configured to obtain, according to a received human-machine dialogue start instruction, the user identifier in the instruction.
The tag acquisition module is configured to obtain the user tag corresponding to the user identifier and convert the user tag into an initial vector matrix.
The initial input module is configured to input the initial vector matrix into the emotion judgment model to obtain initial dialogue emotion parameters.
The initial adjustment module is configured to perform voice adjustment on pre-recorded initial standard dialogue voice according to the initial dialogue emotion parameters to obtain initial adapted dialogue voice.
The initial dialogue module is configured to conduct human-machine dialogue based on the initial adapted dialogue voice and to perform voice monitoring on the dialogue, thereby obtaining the call voice information of the current call.
In this embodiment, at the very start of the human-machine dialogue, the initial dialogue emotion parameters can be obtained from the user tag alone, and the initial standard dialogue voice is adjusted according to these parameters to obtain the initial adapted dialogue voice. This allows an emotional tendency to be added to the dialogue even before any call voice information is available.
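The cold-start path can be pictured as the same pipeline with the speech block left out; again, all names below are hypothetical assumptions rather than a prescribed interface:

```python
import numpy as np

def initial_reply(user_id, tag_store, embed_tags, emotion_model,
                  adjust_voice, initial_standard_voice):
    # No call audio exists yet: the initial vector matrix is built from
    # the user tags alone.
    tags = tag_store.lookup(user_id)
    initial_matrix = np.atleast_2d(embed_tags(tags))
    initial_params = emotion_model.predict(initial_matrix)
    # Adjust the pre-recorded opening line with the initial parameters.
    return adjust_voice(initial_standard_voice, initial_params)
```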
In some optional implementations of this embodiment, the voice dialogue data processing apparatus 300 may further comprise: a training acquisition module, a parameter extraction module, a weight allocation module, and an initial training module, wherein:
The training acquisition module is configured to obtain a training corpus comprising user tags, historical dialogue material, and dialogue emotion parameters.
The parameter extraction module is configured to extract speech feature parameters from the historical dialogue material.
The weight allocation module is configured to assign weights to the speech feature parameters and the user tags to generate a vector matrix with weights.
The initial training module is configured to take the weighted vector matrix as model input and the dialogue emotion parameters as model output, and to train an initial emotion judgment model to obtain the emotion judgment model.
In this embodiment, after the training corpus is obtained, speech feature parameters are extracted from its historical dialogue material, and weights are assigned to the speech feature parameters and the user tags so as to differentiate their respective contributions to the dialogue emotion parameters. The weighted vector matrix is then used as model input and the dialogue emotion parameters as model output to train the initial emotion judgment model, yielding an emotion judgment model that can select emotions accurately.
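A hedged sketch of this supervised setup is given below; `corpus` is assumed to be a list of (historical audio, user tags, emotion parameters) records, and `model.fit_step` stands in for whatever single update the chosen model exposes:

```python
import numpy as np

def train_emotion_model(corpus, extract_features, embed_tags, model,
                        w_speech=0.7, w_tags=0.3, epochs=10):
    # One record = (historical dialogue audio, user tags, emotion params).
    for _ in range(epochs):
        for history_audio, user_tags, emotion_params in corpus:
            # Weighting the two blocks differently encodes how much each
            # source should contribute to the predicted emotion.
            x = np.vstack(
                [w_speech * np.atleast_2d(extract_features(history_audio)),
                 w_tags * np.atleast_2d(embed_tags(user_tags))])
            model.fit_step(x, emotion_params)   # single supervised update
    return model
```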
In some optional implementations of this embodiment, the voice dialogue data processing apparatus 300 may further comprise a model training module configured to train, in the Gpipe library and based on a genetic algorithm, the initial emotion judgment model on the training corpus to obtain the emotion judgment model.
In this embodiment, the initial emotion judgment model is trained in the Gpipe library based on a genetic algorithm, which ensures the accuracy of the resulting emotion judgment model.
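The genetic-algorithm portion might be sketched as follows; the Gpipe pipeline-parallel wrapping mentioned here is not shown, and `fitness`, `crossover`, and `mutate` are assumed helpers (for example, `fitness` could score a candidate parameter set by validation accuracy):

```python
import random

def genetic_search(init_population, fitness, crossover, mutate,
                   generations=50):
    # Evolve candidate parameter sets for the emotion judgment model.
    population = list(init_population)
    for _ in range(generations):
        # Rank candidates from fittest to least fit.
        population.sort(key=fitness, reverse=True)
        # Keep the better half as survivors (at least two, for crossover).
        survivors = population[:max(2, len(population) // 2)]
        # Refill the population with mutated crossovers of survivor pairs.
        children = [mutate(crossover(*random.sample(survivors, 2)))
                    for _ in range(len(population) - len(survivors))]
        population = survivors + children
    return max(population, key=fitness)
```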
In some optional implementations of this embodiment, the voice adjustment module 304 may comprise: a semantic parsing submodule, a standard selection submodule, a mode query submodule, and a voice adjustment submodule, wherein:
The semantic parsing submodule is configured to perform semantic parsing on the call voice information to obtain a semantic parsing result.
The standard selection submodule is configured to select, from pre-recorded standard dialogue voices, the standard dialogue voice corresponding to the semantic parsing result.
The mode query submodule is configured to query, based on the machine dialogue emotion parameters, the voice adjustment mode for the standard dialogue voice, the voice adjustment mode including an acoustic adjustment mode and a modal-particle adjustment mode.
The voice adjustment submodule is configured to perform voice adjustment on the standard dialogue voice according to the voice adjustment mode to obtain the adapted dialogue voice.
In this embodiment, semantic parsing is performed on the call voice information so that a semantically matching standard dialogue voice can be selected, ensuring that the human-machine dialogue remains semantically coherent. The voice adjustment mode corresponding to the machine dialogue emotion parameters is then queried, and acoustic adjustment and modal-particle adjustment are applied to the standard dialogue voice accordingly, yielding an adapted dialogue voice that carries emotion.
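One way to picture the two adjustment kinds together is the sketch below, in which `adjustment_table`, `voice_bank`, and `audio_engine` are hypothetical interfaces and the prosody fields are illustrative:

```python
def adapt_standard_voice(call_audio, emotion_params, parse_semantics,
                         voice_bank, adjustment_table, audio_engine):
    # Semantic parsing picks the semantically matching standard reply.
    meaning = parse_semantics(call_audio)
    standard = voice_bank.select(meaning)
    # Look up the adjustment mode registered for this emotion category.
    mode = adjustment_table[emotion_params.category]
    # Acoustic adjustment: scale prosody settings by the emotion intensity.
    clip = audio_engine.modify(standard,
                               pitch=mode.pitch * emotion_params.intensity,
                               speed=mode.speed,
                               volume=mode.volume)
    # Modal-particle adjustment: e.g. append a softening particle such as "呢".
    if mode.particle:
        clip = audio_engine.append_particle(clip, mode.particle)
    return clip
```

Keeping the emotion-to-adjustment mapping in a lookup table, as sketched here, is one plausible way to realize the "query" step without regenerating speech from scratch.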
In some optional implementations of this embodiment, the voice dialogue data processing apparatus 300 may further comprise: an information import module, a call determination module, and a call transfer module, wherein:
The information import module is configured to import the call voice information of the current call into a pre-established intent recognition model to obtain a user intent recognition result.
The call determination module is configured to determine, according to the intent recognition result, whether the current call requires manual intervention.
The call transfer module is configured to transfer, when the current call requires manual intervention, the current call to the terminal logged in with the manual agent account.
In this embodiment, intent detection is performed during the human-machine dialogue. When the intent detection result indicates that the current call requires manual intervention, the call is transferred to the terminal logged in with the manual agent account, bringing a human agent into the dialogue in good time and thereby improving the intelligence of the human-machine interaction.
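As a rough illustration of this routing step (the intent model's interface and the transfer mechanism are assumptions, not prescribed by the embodiment):

```python
def route_call(call_id, call_audio, intent_model, transfer_to_agent,
               continue_dialogue):
    # Run the pre-established intent recognition model on the call audio.
    result = intent_model.predict(call_audio)
    if result.needs_human:   # e.g. an intent the bot is not equipped to handle
        # Hand the live call to the terminal where a manual agent account
        # is logged in; transcript and tags are attached by the handoff step.
        transfer_to_agent(call_id)
    else:
        continue_dialogue(call_id, call_audio)
    return result.needs_human
```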
In some optional implementations of this embodiment, the call transfer module may comprise: an acquisition submodule, an information conversion submodule, and a call transfer submodule, wherein:
The acquisition submodule is configured to obtain, when the current call requires manual intervention, the call voice information of the current call and the user tag of the user in the current call.
The information conversion submodule is configured to convert the call voice information into call text.
The call transfer submodule is configured to transfer the current call to the terminal logged in with the manual agent account, and to send the call text and the user tag to the terminal for display.
In this embodiment, when the call is transferred to the terminal logged in with the manual agent account, the dialogue text and the user tag are sent to the terminal together, so that the dialogue can continue from where it left off rather than starting over, which improves the efficiency and intelligence of the dialogue interaction.
To solve the above technical problems, an embodiment of the present application further provides a computer device. Refer to FIG. 4 for details, which is a block diagram of the basic structure of the computer device of this embodiment.
The computer device 4 comprises a memory 41, a processor 42, and a network interface 43 that communicate with one another via a system bus. It should be noted that the figure only shows the computer device 4 with components 41-43, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead. Those skilled in the art will understand that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
The computer device may be a desktop computer, a notebook, a palmtop computer, a cloud server, or another computing device. The computer device can carry out human-machine voice dialogue interaction with the user by means of a keyboard, a mouse, a remote control, a touch pad, a voice-activated device, or the like.
The memory 41 includes at least one type of computer-readable storage medium, which may be non-volatile or volatile and includes flash memory, hard disks, multimedia cards, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical discs, and the like. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as its hard disk or internal memory. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the computer device 4. Of course, the memory 41 may also include both the internal storage unit of the computer device 4 and its external storage device. In this embodiment, the memory 41 is generally used to store the operating system and various application software installed on the computer device 4, such as computer-readable instructions of the voice dialogue data processing method, and may also be used to temporarily store various kinds of data that have been or will be output.
In some embodiments, the processor 42 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to run the computer-readable instructions stored in the memory 41 or to process data, for example to run the computer-readable instructions of the voice dialogue data processing method.
The network interface 43 may include a wireless network interface or a wired network interface and is typically used to establish communication connections between the computer device 4 and other electronic devices.
The computer device provided in this embodiment can execute the above voice dialogue data processing method, which may be the voice dialogue data processing method of any of the above embodiments.
In this embodiment, after the voice dialogue data processing instruction is received, the call voice information of the current call and the user tag of the user in the current call are obtained; the user tag can characterize the user's personal information. The call voice information and the user tag are converted into a vector matrix with weights, which fuses the voice features of the user's call with the user's personal information. The emotion judgment model processes and maps the vector matrix to obtain machine dialogue emotion parameters, which characterize the emotion category and intensity the machine should adopt. According to the machine dialogue emotion parameters, acoustic adjustment and modal-particle adjustment are performed on the standard dialogue voice to obtain the adapted dialogue voice. In this way, the dialogue emotion is selected in a targeted manner according to the user's dialogue emotion and personal information during the human-machine dialogue, improving the intelligence of the human-machine voice interaction.
The present application also provides another embodiment, namely a computer-readable storage medium storing computer-readable instructions executable by at least one processor so as to cause the at least one processor to perform the steps of the voice dialogue data processing method described above.
In this embodiment, after the voice dialogue data processing instruction is received, the call voice information of the current call and the user tag of the user in the current call are obtained; the user tag can characterize the user's personal information. The call voice information and the user tag are converted into a vector matrix with weights, which fuses the voice features of the user's call with the user's personal information. The emotion judgment model processes and maps the vector matrix to obtain machine dialogue emotion parameters, which characterize the emotion category and intensity the machine should adopt. According to the machine dialogue emotion parameters, acoustic adjustment and modal-particle adjustment are performed on the standard dialogue voice to obtain the adapted dialogue voice. In this way, the dialogue emotion is selected in a targeted manner according to the user's dialogue emotion and personal information during the human-machine dialogue, improving the intelligence of the human-machine voice interaction.
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, though in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the various embodiments of this application.
Obviously, the embodiments described above are only some of the embodiments of the present application rather than all of them. The accompanying drawings show preferred embodiments of the present application but do not limit its patent scope. The present application may be embodied in many different forms; rather, these embodiments are provided so that the disclosure of the present application will be thorough and complete. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing specific embodiments or make equivalent substitutions for some of their technical features. Any equivalent structure made using the contents of the specification and drawings of the present application and used directly or indirectly in other related technical fields likewise falls within the scope of patent protection of the present application.

Claims (20)

  1. A voice dialogue data processing method, comprising the following steps:
    obtaining, according to a triggered voice dialogue data processing instruction, call voice information of a current call and a user tag of a user in the current call;
    converting the call voice information and the user tag into a vector matrix with weights;
    inputting the vector matrix with weights into an emotion judgment model to obtain machine dialogue emotion parameters;
    performing voice adjustment on pre-recorded standard dialogue voice according to the machine dialogue emotion parameters to obtain adapted dialogue voice, wherein the voice adjustment comprises acoustic adjustment and modal-particle adjustment; and
    conducting human-machine dialogue based on the adapted dialogue voice.
  2. The voice dialogue data processing method according to claim 1, wherein before the step of obtaining, according to the triggered voice dialogue data processing instruction, the call voice information of the current call and the user tag of the user in the current call, the method further comprises:
    obtaining, according to a received human-machine dialogue start instruction, a user identifier in the human-machine dialogue start instruction;
    obtaining a user tag corresponding to the user identifier, and converting the user tag into an initial vector matrix;
    inputting the initial vector matrix into the emotion judgment model to obtain initial dialogue emotion parameters;
    performing voice adjustment on pre-recorded initial standard dialogue voice according to the initial dialogue emotion parameters to obtain initial adapted dialogue voice; and
    conducting human-machine dialogue based on the initial adapted dialogue voice, and performing voice monitoring on the human-machine dialogue to obtain the call voice information of the current call.
  3. The voice dialogue data processing method according to claim 2, wherein before the step of obtaining, according to the received human-machine dialogue start instruction, the user identifier in the human-machine dialogue start instruction, the method further comprises:
    obtaining a training corpus, the training corpus comprising user tags, historical dialogue material, and dialogue emotion parameters;
    extracting speech feature parameters from the historical dialogue material;
    assigning weights to the speech feature parameters and the user tags to generate a vector matrix with weights; and
    taking the vector matrix with weights as model input and the dialogue emotion parameters as model output, and training an initial emotion judgment model to obtain the emotion judgment model.
  4. The voice dialogue data processing method according to claim 2, wherein before the step of obtaining, according to the received human-machine dialogue start instruction, the user identifier in the human-machine dialogue start instruction, the method further comprises:
    training, in the Gpipe library and based on a genetic algorithm, an initial emotion judgment model on a training corpus to obtain the emotion judgment model.
  5. The voice dialogue data processing method according to claim 1, wherein the step of performing voice adjustment on the pre-recorded standard dialogue voice according to the machine dialogue emotion parameters to obtain the adapted dialogue voice comprises:
    performing semantic parsing on the call voice information to obtain a semantic parsing result;
    selecting, from pre-recorded standard dialogue voices, a standard dialogue voice corresponding to the semantic parsing result;
    querying, based on the machine dialogue emotion parameters, a voice adjustment mode for the standard dialogue voice, the voice adjustment mode comprising an acoustic adjustment mode and a modal-particle adjustment mode; and
    performing voice adjustment on the standard dialogue voice according to the voice adjustment mode to obtain the adapted dialogue voice.
  6. The voice dialogue data processing method according to claim 1, wherein after the step of conducting human-machine dialogue based on the adapted dialogue voice, the method further comprises:
    importing the call voice information of the current call into a pre-established intent recognition model to obtain a user intent recognition result;
    determining, according to the intent recognition result, whether the current call requires manual intervention; and
    transferring, when the current call requires manual intervention, the current call to a terminal logged in with a manual agent account.
  7. The voice dialogue data processing method according to claim 6, wherein the step of transferring, when the current call requires manual intervention, the current call to the terminal logged in with the manual agent account comprises:
    obtaining, when the current call requires manual intervention, the call voice information of the current call and the user tag of the user in the current call;
    converting the call voice information into call text; and
    transferring the current call to the terminal logged in with the manual agent account, and sending the call text and the user tag to the terminal for display.
  8. A voice dialogue data processing apparatus, comprising:
    an acquisition module, configured to obtain, according to a triggered voice dialogue data processing instruction, call voice information of a current call and a user tag of a user in the current call;
    a conversion module, configured to convert the call voice information and the user tag into a vector matrix with weights;
    a matrix input module, configured to input the vector matrix with weights into an emotion judgment model to obtain machine dialogue emotion parameters;
    a voice adjustment module, configured to perform voice adjustment on pre-recorded standard dialogue voice according to the machine dialogue emotion parameters to obtain adapted dialogue voice, wherein the voice adjustment comprises acoustic adjustment and modal-particle adjustment; and
    a human-machine dialogue module, configured to conduct human-machine dialogue based on the adapted dialogue voice.
  9. A computer device, comprising a memory and a processor, the memory storing computer-readable instructions, wherein the processor, when executing the computer-readable instructions, implements the following steps:
    obtaining, according to a triggered voice dialogue data processing instruction, call voice information of a current call and a user tag of a user in the current call;
    converting the call voice information and the user tag into a vector matrix with weights;
    inputting the vector matrix with weights into an emotion judgment model to obtain machine dialogue emotion parameters;
    performing voice adjustment on pre-recorded standard dialogue voice according to the machine dialogue emotion parameters to obtain adapted dialogue voice, wherein the voice adjustment comprises acoustic adjustment and modal-particle adjustment; and
    conducting human-machine dialogue based on the adapted dialogue voice.
  10. The computer device according to claim 9, wherein before the step of obtaining, according to the triggered voice dialogue data processing instruction, the call voice information of the current call and the user tag of the user in the current call, the processor, when executing the computer-readable instructions, further implements the following steps:
    obtaining, according to a received human-machine dialogue start instruction, a user identifier in the human-machine dialogue start instruction;
    obtaining a user tag corresponding to the user identifier, and converting the user tag into an initial vector matrix;
    inputting the initial vector matrix into the emotion judgment model to obtain initial dialogue emotion parameters;
    performing voice adjustment on pre-recorded initial standard dialogue voice according to the initial dialogue emotion parameters to obtain initial adapted dialogue voice; and
    conducting human-machine dialogue based on the initial adapted dialogue voice, and performing voice monitoring on the human-machine dialogue to obtain the call voice information of the current call.
  11. The computer device according to claim 10, wherein before the step of obtaining, according to the received human-machine dialogue start instruction, the user identifier in the human-machine dialogue start instruction, the processor, when executing the computer-readable instructions, further implements the following steps:
    obtaining a training corpus, the training corpus comprising user tags, historical dialogue material, and dialogue emotion parameters;
    extracting speech feature parameters from the historical dialogue material;
    assigning weights to the speech feature parameters and the user tags to generate a vector matrix with weights; and
    taking the vector matrix with weights as model input and the dialogue emotion parameters as model output, and training an initial emotion judgment model to obtain the emotion judgment model.
  12. The computer device according to claim 9, wherein the step of performing voice adjustment on the pre-recorded standard dialogue voice according to the machine dialogue emotion parameters to obtain the adapted dialogue voice comprises:
    performing semantic parsing on the call voice information to obtain a semantic parsing result;
    selecting, from pre-recorded standard dialogue voices, a standard dialogue voice corresponding to the semantic parsing result;
    querying, based on the machine dialogue emotion parameters, a voice adjustment mode for the standard dialogue voice, the voice adjustment mode comprising an acoustic adjustment mode and a modal-particle adjustment mode; and
    performing voice adjustment on the standard dialogue voice according to the voice adjustment mode to obtain the adapted dialogue voice.
  13. The computer device according to claim 9, wherein after the step of conducting human-machine dialogue based on the adapted dialogue voice, the processor, when executing the computer-readable instructions, further implements the following steps:
    importing the call voice information of the current call into a pre-established intent recognition model to obtain a user intent recognition result;
    determining, according to the intent recognition result, whether the current call requires manual intervention; and
    transferring, when the current call requires manual intervention, the current call to a terminal logged in with a manual agent account.
  14. The computer device according to claim 13, wherein the step of transferring, when the current call requires manual intervention, the current call to the terminal logged in with the manual agent account comprises:
    obtaining, when the current call requires manual intervention, the call voice information of the current call and the user tag of the user in the current call;
    converting the call voice information into call text; and
    transferring the current call to the terminal logged in with the manual agent account, and sending the call text and the user tag to the terminal for display.
  15. A computer-readable storage medium having computer-readable instructions stored thereon, wherein the computer-readable instructions, when executed by a processor, implement the following steps:
    obtaining, according to a triggered voice dialogue data processing instruction, call voice information of a current call and a user tag of a user in the current call;
    converting the call voice information and the user tag into a vector matrix with weights;
    inputting the vector matrix with weights into an emotion judgment model to obtain machine dialogue emotion parameters;
    performing voice adjustment on pre-recorded standard dialogue voice according to the machine dialogue emotion parameters to obtain adapted dialogue voice, wherein the voice adjustment comprises acoustic adjustment and modal-particle adjustment; and
    conducting human-machine dialogue based on the adapted dialogue voice.
  16. The computer-readable storage medium according to claim 15, wherein before the step of obtaining, according to the triggered voice dialogue data processing instruction, the call voice information of the current call and the user tag of the user in the current call, the computer-readable instructions, when executed by the processor, further implement the following steps:
    obtaining, according to a received human-machine dialogue start instruction, a user identifier in the human-machine dialogue start instruction;
    obtaining a user tag corresponding to the user identifier, and converting the user tag into an initial vector matrix;
    inputting the initial vector matrix into the emotion judgment model to obtain initial dialogue emotion parameters;
    performing voice adjustment on pre-recorded initial standard dialogue voice according to the initial dialogue emotion parameters to obtain initial adapted dialogue voice; and
    conducting human-machine dialogue based on the initial adapted dialogue voice, and performing voice monitoring on the human-machine dialogue to obtain the call voice information of the current call.
  17. The computer-readable storage medium according to claim 16, wherein before the step of obtaining, according to the received human-machine dialogue start instruction, the user identifier in the human-machine dialogue start instruction, the computer-readable instructions, when executed by the processor, further implement the following steps:
    obtaining a training corpus, the training corpus comprising user tags, historical dialogue material, and dialogue emotion parameters;
    extracting speech feature parameters from the historical dialogue material;
    assigning weights to the speech feature parameters and the user tags to generate a vector matrix with weights; and
    taking the vector matrix with weights as model input and the dialogue emotion parameters as model output, and training an initial emotion judgment model to obtain the emotion judgment model.
  18. The computer-readable storage medium according to claim 15, wherein the step of performing voice adjustment on the pre-recorded standard dialogue voice according to the machine dialogue emotion parameters to obtain the adapted dialogue voice comprises:
    performing semantic parsing on the call voice information to obtain a semantic parsing result;
    selecting, from pre-recorded standard dialogue voices, a standard dialogue voice corresponding to the semantic parsing result;
    querying, based on the machine dialogue emotion parameters, a voice adjustment mode for the standard dialogue voice, the voice adjustment mode comprising an acoustic adjustment mode and a modal-particle adjustment mode; and
    performing voice adjustment on the standard dialogue voice according to the voice adjustment mode to obtain the adapted dialogue voice.
  19. The computer-readable storage medium according to claim 15, wherein after the step of conducting human-machine dialogue based on the adapted dialogue voice, the computer-readable instructions, when executed by the processor, further implement the following steps:
    importing the call voice information of the current call into a pre-established intent recognition model to obtain a user intent recognition result;
    determining, according to the intent recognition result, whether the current call requires manual intervention; and
    transferring, when the current call requires manual intervention, the current call to a terminal logged in with a manual agent account.
  20. The computer-readable storage medium according to claim 19, wherein the step of transferring, when the current call requires manual intervention, the current call to the terminal logged in with the manual agent account comprises:
    obtaining, when the current call requires manual intervention, the call voice information of the current call and the user tag of the user in the current call;
    converting the call voice information into call text; and
    transferring the current call to the terminal logged in with the manual agent account, and sending the call text and the user tag to the terminal for display.
PCT/CN2021/090173 2021-02-26 2021-04-27 Voice conversation data processing method and apparatus, and computer device and storage medium WO2022178969A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110218920.0A CN112967725A (en) 2021-02-26 2021-02-26 Voice conversation data processing method and device, computer equipment and storage medium
CN202110218920.0 2021-02-26

Publications (1)

Publication Number Publication Date
WO2022178969A1 (en)

Family

ID=76276097

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/090173 WO2022178969A1 (en) 2021-02-26 2021-04-27 Voice conversation data processing method and apparatus, and computer device and storage medium

Country Status (2)

Country Link
CN (1) CN112967725A (en)
WO (1) WO2022178969A1 (en)


Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113676602A (en) * 2021-07-23 2021-11-19 上海原圈网络科技有限公司 Method and device for processing manual transfer in automatic response
CN114218424B (en) * 2022-02-22 2022-05-13 杭州一知智能科技有限公司 Voice interaction method and system for tone word insertion based on wav2vec
CN115134655B (en) * 2022-06-28 2023-08-11 中国平安人寿保险股份有限公司 Video generation method and device, electronic equipment and computer readable storage medium
CN115460323A (en) * 2022-09-06 2022-12-09 上海浦东发展银行股份有限公司 Method, device, equipment and storage medium for intelligent external call transfer
CN117711399B (en) * 2024-02-06 2024-05-03 深圳市瑞得信息科技有限公司 Interactive AI intelligent robot control method and intelligent robot


Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10347244B2 (en) * 2017-04-21 2019-07-09 Go-Vivace Inc. Dialogue system incorporating unique speech to text conversion method for meaningful dialogue response
CN109036405A (en) * 2018-07-27 2018-12-18 百度在线网络技术(北京)有限公司 Voice interactive method, device, equipment and storage medium
CN111368609B (en) * 2018-12-26 2023-10-17 深圳Tcl新技术有限公司 Speech interaction method based on emotion engine technology, intelligent terminal and storage medium
CN110021308B (en) * 2019-05-16 2021-05-18 北京百度网讯科技有限公司 Speech emotion recognition method and device, computer equipment and storage medium
CN110570295A (en) * 2019-07-25 2019-12-13 深圳壹账通智能科技有限公司 Resource collection method and device, computer equipment and storage medium
CN110990543A (en) * 2019-10-18 2020-04-10 平安科技(深圳)有限公司 Intelligent conversation generation method and device, computer equipment and computer storage medium
CN111028827B (en) * 2019-12-10 2023-01-24 深圳追一科技有限公司 Interaction processing method, device, equipment and storage medium based on emotion recognition
CN111193834B (en) * 2019-12-16 2022-04-15 北京淇瑀信息科技有限公司 Man-machine interaction method and device based on user sound characteristic analysis and electronic equipment
CN111241822A (en) * 2020-01-03 2020-06-05 北京搜狗科技发展有限公司 Emotion discovery and dispersion method and device under input scene
CN111246027B (en) * 2020-04-28 2021-02-12 南京硅基智能科技有限公司 Voice communication system and method for realizing man-machine cooperation
CN111564202B (en) * 2020-04-30 2021-05-28 深圳市镜象科技有限公司 Psychological counseling method based on man-machine conversation, psychological counseling terminal and storage medium
CN111739516A (en) * 2020-06-19 2020-10-02 中国—东盟信息港股份有限公司 Speech recognition system for intelligent customer service call
CN111696556B (en) * 2020-07-13 2023-05-16 上海茂声智能科技有限公司 Method, system, equipment and storage medium for analyzing user dialogue emotion
CN111916111B (en) * 2020-07-20 2023-02-03 中国建设银行股份有限公司 Intelligent voice outbound method and device with emotion, server and storage medium
CN111885273B (en) * 2020-07-24 2021-10-15 南京易米云通网络科技有限公司 Man-machine cooperation controllable intelligent voice outbound method and intelligent outbound robot platform
CN112185389A (en) * 2020-09-22 2021-01-05 北京小米松果电子有限公司 Voice generation method and device, storage medium and electronic equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180136615A1 (en) * 2016-11-15 2018-05-17 Roborus Co., Ltd. Concierge robot system, concierge service method, and concierge robot
WO2018093806A1 (en) * 2016-11-15 2018-05-24 JIBO, Inc. Embodied dialog and embodied speech authoring tools for use with an expressive social robot
CN106570496A (en) * 2016-11-22 2017-04-19 上海智臻智能网络科技股份有限公司 Emotion recognition method and device and intelligent interaction method and device
CN110648691A (en) * 2019-09-30 2020-01-03 北京淇瑀信息科技有限公司 Emotion recognition method, device and system based on energy value of voice
CN110931006A (en) * 2019-11-26 2020-03-27 深圳壹账通智能科技有限公司 Intelligent question-answering method based on emotion analysis and related equipment

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116153330A (en) * 2023-04-04 2023-05-23 杭州度言软件有限公司 Intelligent telephone voice robot control method
CN116153330B (en) * 2023-04-04 2023-06-23 杭州度言软件有限公司 Intelligent telephone voice robot control method
CN116849659A (en) * 2023-09-04 2023-10-10 深圳市昊岳科技有限公司 Intelligent emotion bracelet for monitoring driver state and monitoring method thereof
CN116849659B (en) * 2023-09-04 2023-11-17 深圳市昊岳科技有限公司 Intelligent emotion bracelet for monitoring driver state and monitoring method thereof

Also Published As

Publication number Publication date
CN112967725A (en) 2021-06-15


Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21927403; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 21927403; Country of ref document: EP; Kind code of ref document: A1)