WO2022178969A1 - Voice conversation data processing method and apparatus, and computer device and storage medium - Google Patents

Voice conversation data processing method and apparatus, and computer device and storage medium

Info

Publication number
WO2022178969A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
dialogue
call
user
machine
Prior art date
Application number
PCT/CN2021/090173
Other languages
French (fr)
Chinese (zh)
Inventor
申定潜
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2022178969A1 publication Critical patent/WO2022178969A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3343 Query execution using phonetics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G06F 40/35 Discourse or dialogue representation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/18 Artificial neural networks; Connectionist approaches
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/22 Interactive procedures; Man-machine interfaces
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular, to a method, device, computer equipment and storage medium for processing voice dialogue data.
  • Human-machine dialogue is an important part of the field of artificial intelligence and has rich application scenarios.
  • for example, in debt-collection scenarios, artificial intelligence can be introduced for AI voice collection calls, which reduces labor costs.
  • however, current human-machine dialogue technology lacks processing of the speech data itself: machine speech is drawn from a fixed speech library.
  • the voice library is usually recorded by professional announcers, with a voice that aims to sound sincere and proper.
  • such a voice library is relatively rigid and sounds the same regardless of the user and usage scenario, which makes the user experience poor and the human-machine voice dialogue interaction insufficiently intelligent.
  • the purpose of the embodiments of the present application is to provide a voice dialogue data processing method, apparatus, computer device and storage medium, so as to solve the problem that human-machine voice dialogue interaction is not intelligent enough.
  • the embodiments of the present application provide a method for processing voice dialogue data, which adopts the following technical solution:
  • according to a triggered voice dialogue data processing instruction, acquiring the call voice information of the current call and the user tag of the user in the current call; converting the call voice information and the user tag into a vector matrix with weights; inputting the vector matrix with weights into an emotion judgment model to obtain a machine dialogue emotion parameter;
  • performing voice adjustment on a pre-recorded standard dialogue voice according to the machine dialogue emotion parameter to obtain an adapted dialogue voice, wherein the voice adjustment includes acoustic adjustment and modal particle adjustment; and
  • conducting a human-machine dialogue based on the adapted dialogue voice.
  • the embodiments of the present application also provide a voice dialogue data processing apparatus, which adopts the following technical solution:
  • an acquisition module configured to acquire the call voice information of the current call and the user tag of the user in the current call according to a triggered voice dialogue data processing instruction;
  • a conversion module configured to convert the call voice information and the user tag into a vector matrix with weights;
  • a matrix input module configured to input the weighted vector matrix into an emotion judgment model to obtain a machine dialogue emotion parameter;
  • a voice adjustment module configured to perform voice adjustment on a pre-recorded standard dialogue voice according to the machine dialogue emotion parameter to obtain an adapted dialogue voice, wherein the voice adjustment includes acoustic adjustment and modal particle adjustment; and
  • a human-machine dialogue module configured to conduct a human-machine dialogue based on the adapted dialogue voice.
  • an embodiment of the present application further provides a computer device, including a memory and a processor, wherein the memory stores computer-readable instructions, and the processor implements the following steps when executing the computer-readable instructions:
  • according to a triggered voice dialogue data processing instruction, acquiring the call voice information of the current call and the user tag of the user in the current call; converting the call voice information and the user tag into a vector matrix with weights; inputting the vector matrix with weights into an emotion judgment model to obtain a machine dialogue emotion parameter;
  • performing voice adjustment on a pre-recorded standard dialogue voice according to the machine dialogue emotion parameter to obtain an adapted dialogue voice, wherein the voice adjustment includes acoustic adjustment and modal particle adjustment; and
  • conducting a human-machine dialogue based on the adapted dialogue voice.
  • the embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by a processor, the following steps are implemented:
  • according to a triggered voice dialogue data processing instruction, acquiring the call voice information of the current call and the user tag of the user in the current call; converting the call voice information and the user tag into a vector matrix with weights; inputting the vector matrix with weights into an emotion judgment model to obtain a machine dialogue emotion parameter;
  • performing voice adjustment on a pre-recorded standard dialogue voice according to the machine dialogue emotion parameter to obtain an adapted dialogue voice, wherein the voice adjustment includes acoustic adjustment and modal particle adjustment; and
  • conducting a human-machine dialogue based on the adapted dialogue voice.
  • the embodiments of the present application mainly have the following beneficial effects: after the voice dialogue data processing instruction is received, the call voice information of the current call and the user tag of the user in the current call are obtained, and the user tag can represent the user's personal information; the call voice information and user tag are then converted into a vector matrix with weights.
  • the vector matrix integrates the user's voice characteristics during the call and the user's personal information.
  • the emotion judgment model processes the vector matrix and maps it to the machine dialogue emotion parameter.
  • the machine dialogue emotion parameter represents the emotion category and intensity that the machine should adopt.
  • according to the machine dialogue emotion parameter, the standard dialogue speech undergoes acoustic adjustment and modal particle adjustment to obtain the adapted dialogue speech; in this way, the dialogue emotion is selected in a targeted manner according to the user's dialogue emotion and personal information during the human-machine dialogue, which improves the intelligence of human-machine voice dialogue interaction.
  • FIG. 1 is an exemplary system architecture diagram to which the present application can be applied;
  • FIG. 2 is a flowchart of an embodiment of a method for processing voice dialogue data according to the present application;
  • FIG. 3 is a schematic structural diagram of an embodiment of a voice dialogue data processing apparatus according to the present application.
  • FIG. 4 is a schematic structural diagram of an embodiment of a computer device according to the present application.
  • the system architecture 100 may include terminal devices 101 , 102 , and 103 , a network 104 and a server 105 .
  • the network 104 is a medium used to provide a communication link between the terminal devices 101 , 102 , 103 and the server 105 .
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
  • the user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like.
  • Various communication client applications may be installed on the terminal devices 101 , 102 and 103 , such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, social platform software, and the like.
  • the terminal devices 101, 102, and 103 can be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.
  • the server 105 may be a server that provides various services, such as a background server that provides support for the pages displayed on the terminal devices 101 , 102 , and 103 .
  • the voice dialogue data processing method provided by the embodiment of the present application is generally executed by a server, and accordingly, the voice dialogue data processing apparatus is generally set in the server.
  • the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative; there can be any number of terminal devices, networks, and servers according to implementation needs.
  • the described voice dialogue data processing method comprises the following steps:
  • Step S201: according to a triggered voice dialogue data processing instruction, acquire the call voice information of the current call and the user tag of the user in the current call.
  • the electronic device on which the voice dialogue data processing method runs (for example, the server shown in FIG. 1) may communicate with the terminal through a wired or wireless connection.
  • the above wireless connection methods may include, but are not limited to, 3G/4G connections, WiFi connections, Bluetooth connections, WiMAX connections, Zigbee connections, UWB (ultra wideband) connections, and other wireless connection methods currently known or developed in the future.
  • the voice dialogue data processing instruction may be an instruction instructing the server to perform data processing on the call voice information.
  • User tags can be derived from pre-established user portraits, which record many tags for the user and describe the user's basic information. In a debt-collection scenario, the user's credit evaluation score can also be obtained and used as a user tag.
  • after collecting the live call voice information, the terminal generates a voice dialogue data processing instruction and sends it to the server, and the server obtains the call voice information of the current call according to the instruction.
  • a man-machine dialogue system is set in the terminal, which can realize man-machine dialogue under the control of the server.
  • when the man-machine dialogue is started, the server also obtains the user's user ID and queries the user tag from the database according to it. While acquiring the call voice information, the server can also acquire the user tag, and then process the voice dialogue data according to both the call voice information and the user tag.
  • Step S202: convert the call voice information and the user tag into a vector matrix with weights.
  • the server may extract speech feature parameters from the voice information of the call to obtain a feature parameter matrix.
  • Speech feature parameters are parameters extracted from speech and are used to analyze its tone and emotion. To imitate a real human voice during human-machine dialogue, the speech feature parameters of the training corpus must be obtained.
  • the speech feature parameters reflect the prosodic features of the speech; prosody determines where the speech should pause and for how long, which words should be stressed and which read lightly, and so on, producing rising and falling tones and a natural cadence.
  • the voice information of the call can be preprocessed first.
  • specifically, voice activity detection (VAD) is performed on the call voice information to identify and remove long silences from the audio stream; the de-silenced call voice is then framed, dividing the sound into short segments called frames. Framing can be implemented by sliding a window function, and adjacent frames may overlap, as in the sketch below.
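  • As a concrete illustration of framing (not the source's own implementation), the Python sketch below splits a signal into overlapping frames with a sliding Hamming window; the 25 ms frame length and 10 ms hop are assumed values.

```python
# Split de-silenced call audio into short overlapping, windowed frames.
import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int,
                 frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    if len(signal) < frame_len:                       # pad very short input
        signal = np.pad(signal, (0, frame_len - len(signal)))
    window = np.hamming(frame_len)                    # smooth frame edges
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    return np.stack([signal[i * hop_len:i * hop_len + frame_len] * window
                     for i in range(n_frames)])       # (n_frames, frame_len)
```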
  • the feature parameters include linear prediction cepstral coefficients (LPCC) and Mel-frequency cepstral coefficients (MFCC).
  • the purpose of extracting feature parameters is to convert each frame of call speech into a multi-dimensional vector.
  • the server may extract either the linear prediction cepstral coefficients or the Mel cepstral coefficients and use them as the speech feature parameters, for example as sketched below.
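  • A hedged sketch of this extraction using the open-source librosa library (the source does not name a toolkit): long silences are removed first, then one MFCC vector is produced per frame.

```python
# Crude VAD (drop long silences) followed by per-frame MFCC extraction.
import numpy as np
import librosa

def extract_mfcc(path: str, n_mfcc: int = 13) -> np.ndarray:
    y, sr = librosa.load(path, sr=None)                # keep native rate
    intervals = librosa.effects.split(y, top_db=30)    # non-silent regions
    voiced = (np.concatenate([y[s:e] for s, e in intervals])
              if len(intervals) else y)
    # (n_frames, n_mfcc): one multi-dimensional vector per frame
    return librosa.feature.mfcc(y=voiced, sr=sr, n_mfcc=n_mfcc).T
```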
  • next, weights can be assigned to the feature parameter matrix and the user tag matrix.
  • the proportion of the weight distribution can be preset and flexibly adjusted according to actual needs.
  • the weighted feature parameter matrix and user tag matrix together form the vector matrix with weights; a minimal sketch follows.
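  • The sketch below shows one way such a weighted matrix could be assembled; the 0.7/0.3 split and the per-frame tiling of the tag vector are illustrative assumptions, since the source states only that the proportions are preset and adjustable.

```python
# Weight the speech-feature block and the user-tag block, then concatenate.
import numpy as np

def build_weighted_matrix(feature_matrix: np.ndarray,
                          user_tags: np.ndarray,
                          w_speech: float = 0.7,
                          w_tags: float = 0.3) -> np.ndarray:
    tag_rows = np.tile(user_tags, (feature_matrix.shape[0], 1))
    return np.hstack([w_speech * feature_matrix, w_tags * tag_rows])
```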
  • Step S203: input the vector matrix with weights into the emotion judgment model to obtain the machine dialogue emotion parameter.
  • the emotion determination model is used to determine the emotion and its intensity that should be adopted by the human-machine dialogue system during the human-computer dialogue.
  • the machine dialogue emotion parameter is the quantitative evaluation value of the speech emotion that the human-machine dialogue system should adopt during the human-machine dialogue.
  • the emotion judgment model needs to be trained in advance; it convolves and pools the vector matrix and maps it to the machine dialogue emotion parameter. That is, given the call voice information and user tags, the emotion judgment model outputs the machine dialogue emotion parameter.
  • the machine dialogue emotion parameter is a quantitative evaluation value of the speech emotion that the human-machine dialogue system should adopt, and it can be a numerical value.
  • the entire value range of the dialogue emotion parameter is divided into intervals, each corresponding to a dialogue emotion such as mild, cautious, or aggressive.
  • each emotion can be further divided into multiple sub-intervals, each corresponding to an intensity of that emotion; the sketch below illustrates this interval decoding.
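  • In the illustration below, the [0, 1) parameter range, the three category names, and the three intensity levels are invented for the example.

```python
# Map a scalar machine dialogue emotion parameter to (category, intensity).
def decode_emotion(param: float) -> tuple[str, int]:
    categories = ["mild", "cautious", "aggressive"]  # one interval per emotion
    idx = min(int(param * len(categories)), len(categories) - 1)
    within = param * len(categories) - idx           # position inside interval
    intensity = min(int(within * 3) + 1, 3)          # 3 intensity levels
    return categories[idx], intensity

print(decode_emotion(0.95))  # -> ('aggressive', 3)
```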
  • Step S204: perform voice adjustment on the pre-recorded standard dialogue voice according to the machine dialogue emotion parameter to obtain an adapted dialogue voice, wherein the voice adjustment includes acoustic adjustment and modal particle adjustment.
  • the standard collection voice may be a collection voice without emotion.
  • a standard dialogue voice is pre-recorded in the server, and the standard dialogue voice can be recorded from a real voice without emotion.
  • the server performs voice adjustment on the standard dialogue voice according to the machine dialogue emotional parameters, thereby changing the emotional tendency of the standard dialogue voice, and obtaining the adapted dialogue voice.
  • voice adjustment includes acoustic adjustment and modal particle adjustment. Acoustic adjustment changes the acoustic characteristics of the standard dialogue speech; modal particle adjustment splices audio containing modal particles into the standard dialogue voice, since modal particles can also shift the emotional tendency of the speech to a certain extent.
  • for example, when the user shows strong resistance, a dialogue emotion parameter with a strongly aggressive emotion is output, and after voice adjustment an adapted dialogue voice with an aggressive emotion is obtained, achieving dialogue effects such as warning the user.
  • the above-mentioned standard dialogue voice can also be stored in a node of a blockchain.
  • the server can obtain standard conversational speech from the nodes of the blockchain.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each block contains a batch of network transaction information used to verify the validity (anti-counterfeiting) of the information and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • Step S205: conduct a man-machine dialogue based on the adapted dialogue voice.
  • the server sends the adapted dialogue voice to the terminal, and the terminal plays the adapted dialogue voice to realize the man-machine dialogue.
  • the adapted dialogue voice is generated according to the user's dialogue emotion and personal information during the man-machine dialogue, so its voice emotion is highly targeted, which improves the intelligence of man-machine voice dialogue interaction.
  • the call voice information of the current call and the user tag of the user in the current call are obtained, and the user tag can represent the user's personal information;
  • the call voice information and user tag are converted into a vector matrix with weights; the vector matrix combines the user's voice characteristics and personal information.
  • the emotion judgment model processes the vector matrix and maps it to the machine dialogue emotion parameter.
  • the machine dialogue emotion parameters represent the emotion category that the machine should adopt.
  • the standard dialogue speech is acoustically adjusted and the modal particle is adjusted according to the machine dialogue emotion parameters, and the adapted dialogue speech is obtained, so that the dialogue emotion can be selected according to the user's dialogue emotion and personal information during the human-machine dialogue.
  • before step S201, the method may further include: obtaining the user identifier from a received man-machine dialogue start instruction; obtaining the user tag corresponding to the user identifier and converting the user tag into an initial vector matrix; inputting the initial vector matrix into the emotion judgment model to obtain initial dialogue emotion parameters; performing voice adjustment on a pre-recorded initial standard dialogue voice according to the initial dialogue emotion parameters to obtain an initial adapted dialogue voice; and conducting a man-machine dialogue based on the initial adapted dialogue voice while monitoring the dialogue voice to obtain the call voice information of the current call.
  • the man-machine dialogue initiation instruction may be an instruction instructing the server to start the man-machine dialogue.
  • at this time, the user has not yet started speaking and there is no call voice information containing the user's voice, so the server can take the lead in starting the man-machine dialogue.
  • the server starts the man-machine dialogue according to the received man-machine dialogue start instruction.
  • the user ID may be included in the man-machine dialogue initiation instruction.
  • the server extracts the user ID, and obtains the user label of the user according to the user ID in the database.
  • the server converts the obtained user label into a user label matrix. Since there is no voice information of the call, the characteristic parameter matrix can be set to zero, thereby obtaining an initial vector matrix.
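  • Continuing the earlier weighted-matrix sketch, the cold-start construction might look as follows; the dimensions and the weight split are illustrative assumptions.

```python
# A sketch of the cold-start case: with no call audio yet, the speech
# feature block is zeroed and only the user-tag block carries signal.
# Dimensions and weights are example values, not the patent's.
import numpy as np

def build_initial_matrix(user_tags: np.ndarray,
                         n_frames: int = 1,
                         n_speech_features: int = 13,
                         w_speech: float = 0.7,
                         w_tags: float = 0.3) -> np.ndarray:
    zero_features = np.zeros((n_frames, n_speech_features))  # no speech yet
    tag_rows = np.tile(user_tags, (n_frames, 1))             # replicate tags
    return np.hstack([w_speech * zero_features, w_tags * tag_rows])
```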
  • the server inputs the initial vector matrix into the emotion judgment model, and the emotion judgment model generates initial dialogue emotion parameters according to the initial vector matrix.
  • the server obtains the initial standard dialogue voice, which may be an emotionless voice that the machine can play when the man-machine dialogue starts.
  • the server performs voice adjustment on the initial standard dialogue voice according to the initial dialogue emotion parameter to obtain the initial adapted dialogue voice.
  • the server sends the initial adapted dialogue voice to the terminal; the terminal plays it to start the man-machine dialogue, and voice monitoring is performed after the dialogue starts to obtain the call voice information of the current call.
  • the initially adapted dialogue voice is an emotionally adapted voice obtained according to the user's personal information in the absence of call voice information.
  • the server may also obtain the initial standard dialogue voice, and conduct the man-machine dialogue directly according to the initial standard dialogue voice.
  • the machine dialogue emotional parameters are calculated in real time according to the call voice information and the user tag.
  • in this embodiment, when there is no call voice information, the initial dialogue emotion parameter can be obtained from the user tag alone, and the initial standard dialogue voice can be adjusted according to it to obtain the initial adapted dialogue voice for the man-machine dialogue, so that emotional tendencies can be added to the man-machine dialogue even before any call voice is available.
  • in some embodiments, the method may further include: acquiring training corpora, each including user tags, historical dialogue material, and dialogue emotion parameters; extracting the speech feature parameters of the historical dialogue material; assigning weights to the speech feature parameters and user tags to generate a vector matrix with weights; and training the initial emotion judgment model with the weighted vector matrix as model input and the dialogue emotion parameters as model output, to obtain the emotion judgment model.
  • the historical dialogue material can be obtained by manually filtering stored dialogue data, and it includes a first historical voice and a second historical voice, where the first historical voice may be the voice of a first user or of the man-machine dialogue system, and the second historical voice may be the voice of the second user in the conversation.
  • emotionally, the first historical voice matches well with the second user's information and the second historical voice.
  • the dialogue emotion parameter measures the emotion category and intensity of the first historical voice.
  • training corpora can be obtained from a pre-built corpus; each training corpus includes user tags, historical dialogue material, and dialogue emotion parameters, and within each corpus these three are matched with one another.
  • as with the call voice information, voice endpoint detection can first be performed on the historical dialogue material, which is then framed; the speech feature parameters, including linear prediction cepstral coefficients (LPCC) and Mel cepstral coefficients (MFCC), are then extracted from the framed speech data.
  • the server may extract any one of linear prediction cepstral coefficients and Mel cepstral coefficients.
  • the voice feature parameters extracted by the server include those of the first historical voice and those of the second historical voice. Since the present application determines the voice emotion and intensity required for the dialogue with the user, the feature parameters of the second historical voice deserve more consideration and can therefore receive a larger weight. The user tag also needs to be assigned a weight; that is, the total weight is shared among the feature parameters of the first historical voice, the feature parameters of the second historical voice, and the user tag, and the assigned weights can be flexibly adjusted according to actual needs.
  • the weighted speech feature parameters and user labels can form a weighted vector matrix, and the weighted vector matrix is input into the initial emotion judgment model, and the dialogue emotion parameters are used as the expected output of the initial emotion judgment model.
  • the vector matrix with weights is processed by the initial emotion judgment model, and the predicted label is output.
  • the prediction label is a quantitative evaluation value used in the training phase, which is used to quantitatively evaluate the emotion and intensity that a human or machine should take when talking to a user.
  • the server calculates the model loss from the predicted labels and the dialogue emotion parameters, adjusts the model parameters of the initial emotion judgment model with the goal of reducing the model loss, and re-inputs the vector matrix into the adjusted model for further iterations until the model meets the training requirement.
  • the server then stops iterating and obtains the emotion judgment model; a minimal training-loop sketch follows.
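  • The PyTorch sketch below shows one plausible shape of this loop; the layer sizes, Adam optimizer, and MSE loss are assumptions, since the source specifies only predicted labels, a model loss, and iterative parameter adjustment.

```python
# A minimal training sketch under the stated assumptions: a small fully
# connected network regresses the dialogue emotion parameter from a
# flattened weighted vector (input width 128 is an example value).
import torch
import torch.nn as nn

model = nn.Sequential(              # input -> hidden -> output, fully connected
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(x: torch.Tensor, target: torch.Tensor) -> float:
    """x: (batch, 128) weighted vectors; target: (batch, 1) emotion params."""
    optimizer.zero_grad()
    pred = model(x)                 # the 'predicted label' of the training phase
    loss = loss_fn(pred, target)    # model loss vs. the dialogue emotion param
    loss.backward()
    optimizer.step()                # adjust parameters to reduce the loss
    return loss.item()
```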
  • the speech feature parameters are extracted from the historical dialogue data of the training corpus, and weights are assigned to the speech feature parameters and user labels, so as to differentiate the contributions of the speech feature parameters and user labels to the dialog emotion parameters;
  • the vector matrix with weights is used as the model input, and the dialogue emotion parameters are used as the model output to train the initial emotion judgment model, and an emotion judgment model that can accurately select emotions can be obtained.
  • in some embodiments, before the step of obtaining the user identifier from the received man-machine dialogue start instruction, the method further includes: in the Gpipe library, training the initial emotion judgment model on the training corpora based on a genetic algorithm, to obtain the emotion judgment model.
  • the initial emotion determination model may be a deep neural network (Deep Neural Networks, DNN).
  • the neural network layers inside a DNN can be divided into three categories: the input layer, hidden layers, and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and all layers in between are hidden layers; adjacent layers are fully connected.
  • the initial emotion judgment model can be trained through the training corpus based on the evolutionary algorithm in the Gpipe library.
  • Gpipe is a distributed machine learning library for scalable pipeline parallelism that can train giant deep neural networks.
  • Gpipe trains using synchronous stochastic gradient descent and pipeline parallelism, and is suitable for any DNN consisting of multiple consecutive layers.
  • Gpipe trains larger models by deploying more accelerators and partitioning the model across them. Specifically, the model is split across different accelerators, and mini-batches are automatically divided into smaller micro-batches that are pipelined efficiently across the accelerators; gradients are accumulated consistently across micro-batches, so the number of partitions does not affect model quality.
  • Gpipe supports deploying more accelerators to train larger models, and without adjusting hyperparameters, the model output results are more accurate and performance is improved.
  • evolutionary algorithms are a general class of search algorithms that simulate biological evolution mechanisms such as natural selection and genetics; the genetic algorithm is one of them. All evolutionary algorithms are iterative in nature and share the concepts of population, individual, and encoding: (1) a population can be understood as a set of models; (2) an individual is a particular model; (3) encoding describes an object in computer language, for example expressing a network structure as a fixed-length binary string.
  • producing each new generation requires three steps: selection, crossover, and mutation.
  • the selection step picks better individuals from the population, for example models with higher accuracy.
  • the fitness function, which measures the accuracy of a model's results, can be a loss function. A toy sketch of one generation follows.
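  • The toy generation step below makes the three operators concrete; treating individuals as flat parameter vectors, keeping the fitter half at selection, and Gaussian mutation are illustrative choices, not the source's prescription.

```python
# One generation of a toy genetic algorithm: selection, crossover, mutation.
# Fitness would come from the fitness (loss) function described above.
import numpy as np

rng = np.random.default_rng(0)

def evolve(population: np.ndarray, fitness: np.ndarray,
           mutation_scale: float = 0.02) -> np.ndarray:
    n, dim = population.shape
    survivors = population[np.argsort(fitness)[-(n // 2):]]  # selection
    children = []
    while len(survivors) + len(children) < n:
        a, b = survivors[rng.integers(len(survivors), size=2)]
        mask = rng.random(dim) < 0.5                   # crossover: mix parents
        child = np.where(mask, a, b)
        child += rng.normal(0.0, mutation_scale, dim)  # mutation: small noise
        children.append(child)
    return np.vstack([survivors, *children])
```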
  • the initial emotion determination model is trained based on the genetic algorithm, which ensures the accuracy of the emotion determination model obtained by training.
  • in some embodiments, step S204 may include: performing semantic analysis on the call voice information to obtain a semantic analysis result; selecting, from the pre-recorded standard dialogue voices, the standard dialogue voice corresponding to the semantic analysis result; querying, based on the machine dialogue emotion parameter, the voice adjustment method for the standard dialogue voice, the voice adjustment method including an acoustic adjustment method and a modal particle adjustment method; and adjusting the standard dialogue voice according to the voice adjustment method to obtain the adapted dialogue voice.
  • the server performs semantic analysis on the call voice information to obtain a semantic analysis result. For example, the call voice can be converted into call text, the similarity between the call text and each template text calculated, and the template text with the highest similarity, provided it exceeds a preset similarity threshold, taken as the semantic analysis result; a sketch follows.
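  • A minimal stand-in for this template matching, using bag-of-words cosine similarity over the ASR transcript; the whitespace tokenization, the similarity measure, and the 0.6 threshold are all assumptions.

```python
# Pick the template text most similar to the call text, if similar enough.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def match_template(call_text: str, templates: list[str],
                   threshold: float = 0.6) -> str | None:
    call_vec = Counter(call_text.split())
    scored = [(cosine(call_vec, Counter(t.split())), t) for t in templates]
    best_score, best_template = max(scored)
    return best_template if best_score > threshold else None
```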
  • the standard dialogue voice that matches the semantic analysis result can be selected from multiple pre-recorded standard dialogue voices.
  • the voice adjustment method refers to the method of adjusting the standard dialogue voice, including the acoustic adjustment method and the modal particle adjustment method.
  • the acoustic adjustment method specifies how acoustic feature information is adjusted, covering the energy concentration region, formant frequency, formant intensity, and bandwidth that characterize timbre, as well as the duration, fundamental frequency, and average voice power that characterize the prosody of speech.
  • Modal particle adjustment mode specifies the way in which modal particles are added to standard dialogue speech.
  • the server performs voice adjustment on the pre-recorded standard dialogue voice according to the voice adjustment method, so as to change the emotional tendency of the standard dialogue voice, and obtain the adapted dialogue voice.
  • for example, the emotional tendency of the standard dialogue speech can be adjusted toward pleasant through voice adjustment: acoustically, the pitch can be raised and the average voice power increased; for the modal particles, words conveying that emotion can be spliced in. An illustrative acoustic adjustment is sketched below.
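  • In the librosa-based sketch below, a deployed system would more likely drive a parametric synthesizer; the +2 semitones and 1.2x gain are arbitrary example values.

```python
# Nudge a recorded waveform toward a brighter, more pleasant tendency
# by raising pitch and average power (illustrative values only).
import numpy as np
import librosa

def brighten(y: np.ndarray, sr: int) -> np.ndarray:
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # raise pitch
    louder = 1.2 * shifted                                      # raise power
    return np.clip(louder, -1.0, 1.0)                           # avoid clipping
```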
  • note that during emotion judgment at run time, the evolutionary algorithm no longer needs to be used, so the emotion of the machine dialogue can be adjusted immediately.
  • in this embodiment, semantic analysis is performed on the call voice information to select a semantically matching standard dialogue voice, ensuring the semantic soundness of the human-machine dialogue; the voice adjustment method corresponding to the machine dialogue emotion parameter is queried, and acoustic adjustment and modal particle adjustment are applied to the standard dialogue voice accordingly, yielding an adapted dialogue voice with emotion.
  • after step S205, the method may further include: importing the call voice information of the current call into a pre-established intent recognition model to obtain a user intent recognition result; determining, according to the intent recognition result, whether the current call requires manual intervention; and, when the current call requires manual intervention, transferring the current call to the terminal logged in with a manual agent account.
  • the intent recognition model may be a model for identifying user intent.
  • the server can also detect and monitor the user's intent during the call, and identify the user's intent through a pre-trained intent recognition model.
  • the server imports the call voice information of the current call into the pre-established intent recognition model.
  • the intent recognition model can convert the call voice information into call text, perform semantic analysis on the call text, and output the intent recognition result.
  • the current call is transferred to the terminal logged in with the manual agent account, so that the manual agent can communicate with the user through the terminal.
  • in most cases, a voice matched to the user's emotion is selected for the human-machine dialogue. However, when the user clearly shows a willingness to resist repayment during the dialogue, it can be considered that manual intervention is required and the human-machine dialogue should be transferred so that a manual agent can intervene; likewise, when the human-machine dialogue system cannot effectively answer the user's questions, the dialogue is transferred to the terminal logged in with the manual agent account to provide better dialogue service.
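  • Schematically, the hand-off decision could look like the following; the intent labels and the confidence threshold are placeholders, and the intent recognition model itself is assumed to exist upstream.

```python
# Decide whether a call should be handed off to a manual agent, given the
# output of an (assumed) upstream intent recognition model.
HANDOFF_INTENTS = {"refuse_repayment", "unanswerable_question"}  # placeholders

def needs_manual_agent(intent_result: dict) -> bool:
    """intent_result example: {'intent': 'refuse_repayment', 'confidence': 0.91}"""
    return (intent_result["intent"] in HANDOFF_INTENTS
            and intent_result["confidence"] >= 0.8)
```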
  • in this embodiment, intent detection is performed during the man-machine dialogue; when the detection result indicates that the current call requires manual intervention, the call is transferred to the terminal logged in with the manual agent's account, introducing the manual agent into the man-machine dialogue in time and improving the intelligence of human-machine dialogue interaction.
  • in some embodiments, the step of transferring the current call to the terminal logged in with the manual agent account may include: when the current call requires manual intervention, acquiring the call voice information of the current call and the user tag of the user in the current call; converting the call voice information into call text; and transferring the current call to the terminal logged in with the manual agent account while sending the call text and user tag to the terminal for display.
  • when the server determines that the current call needs manual intervention, it converts the call voice information into call text and obtains the user's tag; when transferring the call to the terminal logged in with the manual agent account, the call text and user tag are sent to the terminal, so the manual agent can instantly grasp the dialogue context and the user's basic information without re-communicating, improving the efficiency and intelligence of dialogue interaction.
  • in this way, when the call is transferred to the terminal logged in with the manual agent account, the dialogue text and the user tag are sent to the terminal together, so that the dialogue can continue from where it left off without re-communication, which improves the efficiency and intelligence of dialogue interaction.
  • the voice dialogue data processing method in this application relates to neural networks, machine learning and voice processing in the field of artificial intelligence; in addition, it can also relate to smart life in the field of smart cities.
  • the computer-readable instructions can be stored in a computer-readable storage medium; when executed, they may include the processes of the above-mentioned method embodiments.
  • the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM) or the like.
  • as an implementation of the method shown in FIG. 2, the present application provides an embodiment of a voice dialogue data processing apparatus; the apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus can be applied to various electronic devices.
  • the voice dialogue data processing device 300 in this embodiment includes: an acquisition module 301, a conversion module 302, a matrix input module 303, a voice adjustment module 304, and a human-machine dialogue module 305, wherein:
  • the obtaining module 301 is configured to obtain the call voice information of the current call and the user tag of the user in the current call according to the triggered voice dialogue data processing instruction.
  • the conversion module 302 is configured to convert the call voice information and the user label into a vector matrix with weights.
  • the matrix input module 303 is used for inputting the vector matrix with weights into the emotion judgment model to obtain machine dialogue emotion parameters.
  • the speech adjustment module 304 is configured to perform speech adjustment on the pre-recorded standard dialogue speech according to the machine dialogue emotion parameter to obtain the adapted dialogue speech, wherein the speech adjustment includes acoustic adjustment and modal particle adjustment.
  • the man-machine dialogue module 305 is used to conduct man-machine dialogue based on the adapted dialogue voice.
  • the call voice information of the current call and the user tag of the user in the current call are obtained, and the user tag can represent the user's personal information;
  • the call voice information and user tag are converted into a vector matrix with weights; the vector matrix combines the user's voice characteristics and personal information.
  • the emotion judgment model processes the vector matrix and maps it to the machine dialogue emotion parameter.
  • the machine dialogue emotion parameters represent the emotion category that the machine should adopt.
  • the standard dialogue speech is acoustically adjusted and the modal particle is adjusted according to the machine dialogue emotion parameters, and the adapted dialogue speech is obtained, so that the dialogue emotion can be selected according to the user's dialogue emotion and personal information during the human-machine dialogue.
  • the voice dialogue data processing apparatus 300 may further include: an identification acquisition module, a label acquisition module, an initial input module, an initial adjustment module, and an initial dialogue module, wherein:
  • the identification obtaining module is used for obtaining the user identification in the man-machine dialogue starting instruction according to the received man-machine dialogue starting instruction.
  • the label obtaining module is used to obtain the user label corresponding to the user ID, and convert the user label into an initial vector matrix.
  • the initial input module is used to input the initial vector matrix into the emotion judgment model to obtain the initial dialogue emotion parameters.
  • the initial adjustment module is used to adjust the voice of the pre-recorded initial standard dialogue voice according to the initial dialogue emotion parameter to obtain the initial adapted dialogue voice.
  • the initial dialogue module is used to conduct a man-machine dialogue based on the initial adaptation dialogue voice, and monitor the voice of the man-machine dialogue to obtain the call voice information of the current call.
  • in this way, when there is no call voice information, the initial dialogue emotion parameter can be obtained from the user tag alone, and the initial standard dialogue voice can be adjusted according to it to obtain the initial adapted dialogue voice for the man-machine dialogue, so that emotional tendencies can be added to the man-machine dialogue even before any call voice is available.
  • the apparatus 300 for processing voice dialogue data may further include: a training acquisition module, a parameter extraction module, a weight allocation module, and an initial training module, wherein:
  • the training acquisition module is used to acquire training corpus, and the training corpus includes user labels, historical dialogue materials and dialogue emotion parameters.
  • the parameter extraction module is used to extract the speech feature parameters of the historical dialogue material.
  • the weight assignment module is used to assign weights to speech feature parameters and user labels to generate a vector matrix with weights.
  • the initial training module is used to use the vector matrix with weights as the model input, and the dialogue emotion parameters as the model output to train the initial emotion judgment model to obtain the emotion judgment model.
  • the speech feature parameters are extracted from the historical dialogue data of the training corpus, and weights are assigned to the speech feature parameters and user labels, so as to differentiate the contributions of the speech feature parameters and user labels to the dialog emotion parameters;
  • the vector matrix with weights is used as the model input, and the dialogue emotion parameters are used as the model output to train the initial emotion judgment model, and an emotion judgment model that can accurately select emotions can be obtained.
  • in some embodiments, the voice dialogue data processing apparatus 300 may further include a model training module configured to: in the Gpipe library, train the initial emotion judgment model on the training corpora based on a genetic algorithm, to obtain the emotion judgment model.
  • the initial emotion determination model is trained based on the genetic algorithm, which ensures the accuracy of the emotion determination model obtained by training.
  • the speech adjustment module 304 may include: a semantic parsing submodule, a standard selection submodule, a mode query submodule, and a speech adjustment submodule, wherein:
  • the semantic parsing sub-module is used to perform semantic parsing on the voice information of the call to obtain the semantic parsing result.
  • the standard selection sub-module is used to select the standard dialogue speech corresponding to the semantic analysis result from the pre-recorded standard dialogue speech.
  • the mode query sub-module is used to query the voice adjustment mode of the standard dialogue voice based on the machine dialogue emotion parameter, and the voice adjustment mode includes the acoustic adjustment mode and the modal particle adjustment mode.
  • the voice adjustment sub-module is used to adjust the standard dialogue voice according to the voice adjustment method to obtain the adapted dialogue voice.
  • semantic analysis is performed on the call voice information to select a semantically matching standard dialogue voice, ensuring the semantic soundness of the human-machine dialogue; the voice adjustment method corresponding to the machine dialogue emotion parameter is queried, and acoustic adjustment and modal particle adjustment are applied to the standard dialogue voice accordingly, yielding an adapted dialogue voice with emotion.
  • the apparatus 300 for processing voice dialogue data may further include: an information import module, a call determination module, and a call transfer module, wherein:
  • the information import module is used to import the voice information of the current call into the pre-established intent recognition model to obtain the user intent recognition result.
  • the call determination module is used to determine whether the current call requires manual intervention according to the intention recognition result.
  • the call transfer module is used to transfer the current call to the terminal logged in with the manual agent account when the current call requires manual intervention.
  • intent detection is performed during the man-machine dialogue; when the detection result indicates that the current call requires manual intervention, the call is transferred to the terminal logged in with the manual agent's account, introducing the manual agent into the man-machine dialogue in time and improving the intelligence of human-machine dialogue interaction.
  • the call transfer module may include: an acquisition submodule, an information conversion submodule, and a call transfer submodule, wherein:
  • the acquisition sub-module is used to acquire the call voice information of the current call and the user tag of the user in the current call when the current call requires manual intervention.
  • the information conversion submodule is used to convert the voice information of the call into the text of the call.
  • the call transfer sub-module is used to transfer the current call to the terminal logged in with the manual agent account, and send the call text and user label to the terminal for display.
  • in this way, when the call is transferred to the terminal logged in with the manual agent account, the dialogue text and the user tag are sent to the terminal together, so that the dialogue can continue from where it left off without re-communication, which improves the efficiency and intelligence of dialogue interaction.
  • FIG. 4 is a block diagram of a basic structure of a computer device according to this embodiment.
  • the computer device 4 includes a memory 41, a processor 42, and a network interface 43 that communicate with each other through a system bus. It should be noted that the figure only shows the computer device 4 with components 41-43, but it should be understood that not all of the shown components must be implemented; more or fewer components may be implemented instead. Those skilled in the art will understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP), embedded devices, and the like.
  • the computer equipment may be a desktop computer, a notebook computer, a palmtop computer, a cloud server and other computing equipment.
  • the computer device can interact with the user through a keyboard, mouse, remote control, touch pad, or voice-activated device or the like for human-machine voice dialogue interaction.
  • the memory 41 includes at least one type of computer-readable storage medium.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the computer-readable storage medium includes flash memory, hard disks, multimedia cards, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, and the like.
  • the memory 41 may be an internal storage unit of the computer device 4 , such as a hard disk or a memory of the computer device 4 .
  • the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc.
  • the memory 41 may also include both the internal storage unit of the computer device 4 and its external storage device.
  • the memory 41 is generally used to store the operating system and various application software installed in the computer device 4 , such as computer-readable instructions of a method for processing voice dialogue data.
  • the memory 41 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 42 may, in some embodiments, be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute computer-readable instructions or process data stored in the memory 41, for example, computer-readable instructions for executing the voice dialogue data processing method.
  • the network interface 43 may include a wireless network interface or a wired network interface, and the network interface 43 is generally used to establish a communication connection between the computer device 4 and other electronic devices.
  • the computer device provided in this embodiment can execute the above-mentioned voice dialogue data processing method.
  • the voice dialogue data processing method here may be the voice dialogue data processing methods of the above embodiments.
  • the call voice information of the current call and the user tag of the user in the current call are obtained, and the user tag can represent the user's personal information;
  • the call voice information and user tag are converted into a vector matrix with weights; the vector matrix combines the user's voice characteristics and personal information.
  • the emotion judgment model processes the vector matrix and maps it to the machine dialogue emotion parameter.
  • the machine dialogue emotion parameters represent the emotion category that the machine should adopt.
  • the standard dialogue speech is acoustically adjusted and the modal particle is adjusted according to the machine dialogue emotion parameters, and the adapted dialogue speech is obtained, so that the dialogue emotion can be selected according to the user's dialogue emotion and personal information during the human-machine dialogue.
  • the present application also provides another embodiment, namely a computer-readable storage medium storing computer-readable instructions, where the computer-readable instructions can be executed by at least one processor to cause the at least one processor to perform the steps of the voice dialogue data processing method described above.
  • the call voice information of the current call and the user tag of the user in the current call are obtained, and the user tag can represent the user's personal information;
  • the call voice information and user tag are converted into a vector matrix with weights; the vector matrix combines the user's voice characteristics and personal information.
  • the emotion judgment model processes the vector matrix and maps it to the machine dialogue emotion parameter.
  • the machine dialogue emotion parameters represent the emotion category that the machine should adopt.
  • the standard dialogue speech is acoustically adjusted and the modal particle is adjusted according to the machine dialogue emotion parameters, and the adapted dialogue speech is obtained, so that the dialogue emotion can be selected according to the user's dialogue emotion and personal information during the human-machine dialogue.
  • the method of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course can also be implemented by hardware, but in many cases the former is the better implementation.
  • the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product, which is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or a CD-ROM) and includes several instructions to make a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) execute the methods described in the various embodiments of this application.

Abstract

The embodiments of the present application belong to the field of artificial intelligence, and relate to a voice conversation data processing method and apparatus, and a computer device and a storage medium. The method comprises: according to a triggered voice conversation data processing instruction, acquiring call voice information of the current call and a user label of a user in the current call; converting the call voice information and the user label into vector matrices having weights; inputting the vector matrices having weights into an emotion determination model to obtain a machine conversation emotion parameter; according to the machine conversation emotion parameter, carrying out voice adjustment on a pre-recorded standard conversation voice to obtain an adapted conversation voice, wherein voice adjustment comprises acoustic adjustment and modal particle adjustment; and carrying out a man-machine conversation on the basis of the adapted conversation voice. In addition, the present application further relates to blockchain technology, and a standard conversation voice can be stored in a blockchain. By means of the present application, the intelligence of man-machine voice conversation interaction is improved.

Description

Voice dialogue data processing method, apparatus, computer device and storage medium
This application claims priority to the Chinese patent application No. 202110218920.0, filed with the China Patent Office on February 26, 2021 and entitled "Voice dialogue data processing method, apparatus, computer device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of artificial intelligence, and in particular to a voice dialogue data processing method and apparatus, a computer device, and a storage medium.
Background
With the development of computer technology, artificial intelligence (AI) is being applied ever more widely. Human-machine dialogue is an important part of the artificial intelligence field and has rich application scenarios; for example, in the debt-collection field, artificial intelligence can be introduced for AI voice collection, reducing labor costs.
However, current human-machine dialogue technology lacks processing of voice data: machine speech always uses a single fixed voice library. Such a library is usually recorded by professional announcers, and the speech aims for clear, well-rounded and formal articulation. The inventor realized, however, that this kind of voice library is rather rigid: it sounds exactly the same for different users and usage scenarios, resulting in a poor user experience and insufficiently intelligent human-machine voice dialogue interaction.
Summary of the Invention
The purpose of the embodiments of this application is to provide a voice dialogue data processing method, apparatus, computer device and storage medium, so as to solve the problem that human-machine voice dialogue interaction is not intelligent enough.
To solve the above technical problem, an embodiment of this application provides a voice dialogue data processing method, adopting the following technical solution:
acquiring, according to a triggered voice dialogue data processing instruction, the call voice information of the current call and the user tag of the user in the current call;
converting the call voice information and the user tag into a vector matrix with weights;
inputting the vector matrix with weights into an emotion judgment model to obtain a machine dialogue emotion parameter;
performing voice adjustment on a pre-recorded standard dialogue voice according to the machine dialogue emotion parameter to obtain an adapted dialogue voice, where the voice adjustment includes acoustic adjustment and modal-particle adjustment;
conducting a human-machine dialogue based on the adapted dialogue voice.
To solve the above technical problem, an embodiment of this application further provides a voice dialogue data processing apparatus, adopting the following technical solution:
an acquisition module, configured to acquire, according to a triggered voice dialogue data processing instruction, the call voice information of the current call and the user tag of the user in the current call;
a conversion module, configured to convert the call voice information and the user tag into a vector matrix with weights;
a matrix input module, configured to input the vector matrix with weights into an emotion judgment model to obtain a machine dialogue emotion parameter;
a voice adjustment module, configured to perform voice adjustment on a pre-recorded standard dialogue voice according to the machine dialogue emotion parameter to obtain an adapted dialogue voice, where the voice adjustment includes acoustic adjustment and modal-particle adjustment;
a human-machine dialogue module, configured to conduct a human-machine dialogue based on the adapted dialogue voice.
To solve the above technical problem, an embodiment of this application further provides a computer device, including a memory and a processor, where the memory stores computer-readable instructions and the processor implements the following steps when executing the computer-readable instructions:
acquiring, according to a triggered voice dialogue data processing instruction, the call voice information of the current call and the user tag of the user in the current call;
converting the call voice information and the user tag into a vector matrix with weights;
inputting the vector matrix with weights into an emotion judgment model to obtain a machine dialogue emotion parameter;
performing voice adjustment on a pre-recorded standard dialogue voice according to the machine dialogue emotion parameter to obtain an adapted dialogue voice, where the voice adjustment includes acoustic adjustment and modal-particle adjustment;
conducting a human-machine dialogue based on the adapted dialogue voice.
To solve the above technical problem, an embodiment of this application further provides a computer-readable storage medium storing computer-readable instructions, where the computer-readable instructions, when executed by a processor, implement the following steps:
acquiring, according to a triggered voice dialogue data processing instruction, the call voice information of the current call and the user tag of the user in the current call;
converting the call voice information and the user tag into a vector matrix with weights;
inputting the vector matrix with weights into an emotion judgment model to obtain a machine dialogue emotion parameter;
performing voice adjustment on a pre-recorded standard dialogue voice according to the machine dialogue emotion parameter to obtain an adapted dialogue voice, where the voice adjustment includes acoustic adjustment and modal-particle adjustment;
conducting a human-machine dialogue based on the adapted dialogue voice.
Compared with the prior art, the embodiments of this application mainly have the following beneficial effects: after a voice dialogue data processing instruction is received, the call voice information of the current call and the user tag of the user in the current call are acquired, where the user tag can represent the user's personal information; the call voice information and the user tag are converted into a vector matrix with weights, which fuses the user's speech features during the call with the user's personal information; the emotion judgment model processes the vector matrix and maps it to a machine dialogue emotion parameter, which represents the emotion category and intensity the machine should adopt; acoustic adjustment and modal-particle adjustment are performed on the standard dialogue voice according to the machine dialogue emotion parameter to obtain the adapted dialogue voice. In this way, the dialogue emotion is selected in a targeted manner according to the user's dialogue emotion and personal information during the human-machine dialogue, improving the intelligence of human-machine voice dialogue interaction.
Brief Description of the Drawings
To explain the solutions in this application more clearly, the following briefly introduces the drawings used in the description of the embodiments of this application. Obviously, the drawings described below illustrate only some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is an exemplary system architecture diagram to which this application can be applied;
FIG. 2 is a flowchart of one embodiment of a voice dialogue data processing method according to this application;
FIG. 3 is a schematic structural diagram of one embodiment of a voice dialogue data processing apparatus according to this application;
FIG. 4 is a schematic structural diagram of one embodiment of a computer device according to this application.
Detailed Description
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of this application. The terms used in the specification are only for describing specific embodiments and are not intended to limit the application. The terms "including" and "having", and any variations thereof, in the specification, claims, and above description of the drawings are intended to cover non-exclusive inclusion. The terms "first", "second", and the like in the specification, claims, or above drawings are used to distinguish different objects, not to describe a specific order.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of this application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to a separate or alternative embodiment mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
To enable those skilled in the art to better understand the solutions of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings.
As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102 and 103, a network 104, and a server 105. The network 104 is the medium providing communication links between the terminal devices 101, 102, 103 and the server 105, and may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
Users may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102 and 103, such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, and social platform software.
The terminal devices 101, 102 and 103 may be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.
The server 105 may be a server providing various services, for example a background server that supports the pages displayed on the terminal devices 101, 102 and 103.
It should be noted that the voice dialogue data processing method provided by the embodiments of this application is generally executed by the server; accordingly, the voice dialogue data processing apparatus is generally arranged in the server.
It should be understood that the numbers of terminal devices, networks and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks and servers according to implementation needs.
Continuing to refer to FIG. 2, a flowchart of one embodiment of a voice dialogue data processing method according to this application is shown. The voice dialogue data processing method includes the following steps:
Step S201: acquire, according to a triggered voice dialogue data processing instruction, the call voice information of the current call and the user tag of the user in the current call.
In this embodiment, the electronic device on which the voice dialogue data processing method runs (for example, the server shown in FIG. 1) may communicate with the terminal through a wired or wireless connection. It should be noted that the wireless connection may include, but is not limited to, 3G/4G, WiFi, Bluetooth, WiMAX, Zigbee, UWB (ultra wideband), and other wireless connection methods now known or developed in the future.
The voice dialogue data processing instruction may be an instruction that instructs the server to perform data processing on the call voice information. The user tag may come from a pre-established user profile, which records numerous tags of the user and depicts the user's basic information. In a collection scenario, the user's credit evaluation score may also be acquired and used as another kind of user tag.
Specifically, during the human-machine dialogue, after the terminal collects the real-time call voice information, it generates a voice dialogue data processing instruction and sends it to the server, and the server acquires the call voice information of the current call according to the instruction. A human-machine dialogue system is deployed on the terminal and can conduct the human-machine dialogue under the control of the server.
When the human-machine dialogue starts, the server also acquires the user's user identifier and queries the user tag from the database according to the user identifier. While acquiring the call voice information, the server can also acquire the user tag, and then performs voice dialogue data processing according to both.
Step S202: convert the call voice information and the user tag into a vector matrix with weights.
Specifically, the server may extract speech feature parameters from the call voice information to obtain a feature parameter matrix.
A speech feature parameter is a parameter extracted from speech and used to analyze the tone and emotion of the speech. To imitate a real human voice during human-machine dialogue, the speech feature parameters of the training corpus must be acquired. Speech feature parameters can reflect the prosodic features of speech, and prosodic features determine where the speech should pause and for how long, which characters or words should be stressed, which should be read lightly, and so on, producing the rise and fall and cadence of a natural voice.
The call voice information may first be preprocessed: voice activity detection (VAD) is performed to identify and remove long silences from the audio signal stream, and the silence-removed call voice information is then split into frames, dividing the sound into short segments, each called a frame. The segmentation can be implemented by a moving window function, and adjacent frames may overlap.
Feature parameters are then extracted from the preprocessed call voice information. The feature parameters include linear prediction cepstral coefficients (LPCC) and Mel-frequency cepstral coefficients (MFCC); the purpose of feature extraction is to convert each frame of call voice information into a multi-dimensional vector. The server may extract either the linear prediction cepstral coefficients or the Mel cepstral coefficients and use them as the speech feature parameters.
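As a rough illustration of this preprocessing and feature-extraction stage, the sketch below uses the open-source librosa library to trim long silences (a crude stand-in for VAD), frame the waveform with an overlapping moving window, and extract MFCC features. The sampling rate, frame sizes, and coefficient count are illustrative assumptions, not values fixed by this application.

```python
import librosa

def extract_mfcc_features(wav_path, n_mfcc=13):
    """Per-frame MFCC feature matrix from a call recording (sketch only)."""
    y, sr = librosa.load(wav_path, sr=16000)      # load the call audio at 16 kHz
    y, _ = librosa.effects.trim(y, top_db=30)     # remove leading/trailing silence
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(0.025 * sr),                    # 25 ms frames...
        hop_length=int(0.010 * sr),               # ...with a 10 ms hop, so frames overlap
    )
    return mfcc.T                                 # shape: (num_frames, n_mfcc)
```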
When processing the user tags, the user tags need to be quantized according to a preset quantization rule to obtain a user tag matrix.
Since voice dialogue data processing is performed according to both the call voice information and the user tags, weights can be assigned to the feature parameter matrix and the user tag matrix. The weight allocation ratio can be preset and flexibly adjusted according to actual needs. The weighted feature parameter matrix and the weighted user tag matrix together form the vector matrix.
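One plausible way to realize this weighted fusion is to scale the feature parameter matrix and the quantized user-tag vector by preset weights and stack them into a single matrix; the 0.7/0.3 split and the padding scheme below are assumptions for illustration only.

```python
import numpy as np

def build_weighted_matrix(feature_matrix, tag_vector,
                          speech_weight=0.7, tag_weight=0.3):
    """Fuse per-frame speech features with quantized user tags (sketch only)."""
    dim = feature_matrix.shape[1]
    tags = np.zeros(dim)                          # pad/truncate tags to one extra row
    n = min(dim, len(tag_vector))
    tags[:n] = np.asarray(tag_vector)[:n]
    return np.vstack([speech_weight * feature_matrix,
                      tag_weight * tags])
```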
Step S203: input the vector matrix with weights into an emotion judgment model to obtain a machine dialogue emotion parameter.
The emotion judgment model is used to determine the emotion, and its intensity, that the human-machine dialogue system should adopt during the dialogue. The machine dialogue emotion parameter is a quantitative evaluation value of the speech emotion the system should adopt.
Specifically, the emotion judgment model needs to be obtained through model training in advance. It can convolve and pool the vector matrix and map it to a machine dialogue emotion parameter; that is, given the user's voice information in the call and the user tag, the model outputs the machine dialogue emotion parameter.
The machine dialogue emotion parameter may be a single numerical value: its full value range is divided into intervals, with each interval corresponding to one dialogue emotion, such as mild, cautious, or aggressive. Each emotion can in turn be divided into several sub-intervals, each corresponding to a level of emotional intensity.
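Decoding such a scalar parameter can then be a simple interval lookup. The interval boundaries and category names below are invented for illustration; the application only requires that the range be partitioned per emotion and sub-partitioned per intensity.

```python
# Hypothetical interval table over a parameter range of [0, 1].
EMOTION_INTERVALS = [
    (0.0, 0.4, "mild"),
    (0.4, 0.7, "cautious"),
    (0.7, 1.0, "aggressive"),
]

def decode_emotion(param):
    """Map a scalar emotion parameter to an (emotion category, intensity) pair."""
    for low, high, category in EMOTION_INTERVALS:
        if low <= param < high or param == high == 1.0:
            intensity = (param - low) / (high - low)   # position inside the interval
            return category, intensity
    raise ValueError(f"parameter {param} outside the expected range [0, 1]")
```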
Step S204: perform voice adjustment on the pre-recorded standard dialogue voice according to the machine dialogue emotion parameter to obtain an adapted dialogue voice, where the voice adjustment includes acoustic adjustment and modal-particle adjustment.
Here, in the collection scenario, the standard collection voice may be a collection voice that carries no emotion.
Specifically, a standard dialogue voice is pre-recorded on the server; it may be recorded from a real human voice and carries no emotion. The server performs voice adjustment on the standard dialogue voice according to the machine dialogue emotion parameter, thereby changing its emotional tendency and obtaining the adapted dialogue voice. The voice adjustment includes acoustic adjustment and modal-particle adjustment: acoustic adjustment changes the acoustic characteristics of the standard dialogue voice, while modal-particle adjustment may splice speech containing modal particles into the standard dialogue voice; modal particles can also change the emotional tendency of the speech to a certain extent.
For example, in a voice collection scenario, when the user's personal credit status is poor and the user's attitude during the dialogue is poor, a dialogue emotion parameter with a fairly strong aggressive emotion is output, and after voice adjustment an adapted dialogue voice with an aggressive emotion is obtained, achieving dialogue effects such as warning the user.
It should be emphasized that, to further ensure the privacy and security of the standard dialogue voice, it may also be stored in a node of a blockchain, and the server may acquire the standard dialogue voice from the blockchain node.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association using cryptographic methods, where each data block contains information on a batch of network transactions, used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
Step S205: conduct a human-machine dialogue based on the adapted dialogue voice.
Specifically, the server sends the adapted dialogue voice to the terminal, and the terminal plays it to realize the human-machine dialogue. Because the adapted dialogue voice is generated according to the user's dialogue emotion and personal information during the dialogue, its emotional coloring is highly targeted, which improves the intelligence of human-machine voice dialogue interaction.
In this embodiment, after the voice dialogue data processing instruction is received, the call voice information of the current call and the user tag of the user in the current call are acquired, where the user tag can represent the user's personal information; the call voice information and the user tag are converted into a vector matrix with weights, which fuses the user's speech features during the call with the user's personal information; the emotion judgment model processes the vector matrix and maps it to a machine dialogue emotion parameter, which represents the emotion category and intensity the machine should adopt; acoustic adjustment and modal-particle adjustment are performed on the standard dialogue voice according to the machine dialogue emotion parameter to obtain the adapted dialogue voice. In this way, the dialogue emotion is selected in a targeted manner according to the user's dialogue emotion and personal information during the human-machine dialogue, improving the intelligence of human-machine voice dialogue interaction.
Further, before step S201 above, the method may further include: acquiring, according to a received human-machine dialogue start instruction, the user identifier in the instruction; acquiring the user tag corresponding to the user identifier and converting the user tag into an initial vector matrix; inputting the initial vector matrix into the emotion judgment model to obtain an initial dialogue emotion parameter; performing voice adjustment on a pre-recorded initial standard dialogue voice according to the initial dialogue emotion parameter to obtain an initial adapted dialogue voice; and conducting a human-machine dialogue based on the initial adapted dialogue voice while monitoring the dialogue voice to obtain the call voice information of the current call.
The human-machine dialogue start instruction may be an instruction instructing the server to start the human-machine dialogue. At the very beginning of the dialogue, the user has not yet begun speaking and there is no call voice information containing the user's voice, so the server can be the first to speak in the dialogue.
Specifically, the server starts the human-machine dialogue according to the received start instruction, which may carry the user identifier. The server extracts the user identifier and acquires the user's user tag from the database according to it.
The server converts the acquired user tag into a user tag matrix; since there is no call voice information, the feature parameter matrix can be set to zero, yielding the initial vector matrix. The server inputs the initial vector matrix into the emotion judgment model, which generates the initial dialogue emotion parameter from it.
The server acquires the initial standard dialogue voice, which may be the voice the machine plays when the dialogue starts, carrying no emotion. The server performs voice adjustment on the initial standard dialogue voice according to the initial dialogue emotion parameter to obtain the initial adapted dialogue voice.
The server sends the initial adapted dialogue voice to the terminal; the terminal plays it to start the human-machine dialogue, and after the dialogue starts, voice monitoring is performed to obtain the call voice information of the current call. It can be understood that the initial adapted dialogue voice is an emotion-adapted voice obtained from the user's personal information alone, in the absence of call voice information.
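For this cold start, the same fusion step can be reused with the speech portion zeroed out, since no call audio exists yet; the sketch below assumes the hypothetical build_weighted_matrix helper from the earlier sketch and invents the dimensions.

```python
import numpy as np

def build_initial_matrix(tag_vector, feature_dim=13, num_frames=1):
    """Initial vector matrix at dialogue start: user tags only, zeroed speech rows."""
    empty_speech = np.zeros((num_frames, feature_dim))   # no call voice information yet
    return build_weighted_matrix(empty_speech, tag_vector)
```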
In one embodiment, after receiving the human-machine dialogue start instruction, the server may also acquire the initial standard dialogue voice and conduct the dialogue directly with it, and only after the call voice information is obtained compute the machine dialogue emotion parameter in real time according to the call voice information and the user tag.
In this embodiment, at the very start of the human-machine dialogue, the initial dialogue emotion parameter can be obtained from the user tag alone, and the initial standard dialogue voice is adjusted according to it to obtain the initial adapted dialogue voice, so that an emotional tendency can be added to the dialogue even when there is no call voice information.
Further, before the step of acquiring the user identifier in the human-machine dialogue start instruction, the method may further include: acquiring a training corpus, which includes user tags, historical dialogue corpora, and dialogue emotion parameters; extracting speech feature parameters from the historical dialogue corpora; assigning weights to the speech feature parameters and the user tags to generate a vector matrix with weights; and training an initial emotion judgment model with the weighted vector matrix as the model input and the dialogue emotion parameter as the model output, to obtain the emotion judgment model.
The historical dialogue corpora may be obtained by manually screening stored dialogue corpora. Each historical dialogue corpus includes a first historical voice and a second historical voice: the first may be the voice of a first user or of the human-machine dialogue system, and the second may be the voice of a second user in the dialogue. In the screened corpora, the first historical voice matches well, emotionally, with the second user's user information and the second historical voice. The dialogue emotion parameter measures the emotion category of the first historical voice and the intensity of the emotion.
Specifically, training corpora can be acquired from a training corpus database; each training corpus includes a user tag, a historical dialogue corpus, and a dialogue emotion parameter, and within each corpus the three are matched.
Voice activity detection may first be performed on the historical dialogue corpora, followed by framing. Speech feature parameters are then extracted from the framed speech data; the feature parameters include linear prediction cepstral coefficients (LPCC) and Mel cepstral coefficients (MFCC), and the server may extract either of the two.
The speech feature parameters extracted by the server include those of the first historical voice and those of the second historical voice. Since this application determines the speech emotion, and its intensity, required when talking with the user, the feature parameters from the second historical voice deserve particular consideration and may therefore be given a larger weight. Meanwhile, the user tags also need to be assigned a weight; that is, the weight is shared among the feature parameters of the first historical voice, the feature parameters of the second historical voice, and the user tags. The allocated weights can be adjusted flexibly according to actual needs.
The weighted speech feature parameters and user tags form a vector matrix with weights, which is input into the initial emotion judgment model, with the dialogue emotion parameter as the model's expected output. The initial emotion judgment model processes the weighted vector matrix and outputs a predicted label. The predicted label is a quantitative evaluation value used in the training stage to quantify the emotion, and its intensity, that a person or machine should adopt when talking with the user.
The server computes the model loss from the predicted label and the dialogue emotion parameter and, with the goal of reducing the loss, adjusts the model parameters of the initial emotion judgment model; after adjustment the vector matrix is fed back into the model for another iteration, until the resulting loss is smaller than a preset loss threshold, at which point the server stops iterating and obtains the emotion judgment model.
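This train-until-the-loss-threshold loop might look roughly like the following PyTorch sketch, with the weighted matrix flattened into the model input and the dialogue emotion parameter as the regression target; the layer sizes, learning rate, and threshold are placeholder assumptions.

```python
import torch
import torch.nn as nn

class EmotionModel(nn.Module):
    """A small fully connected DNN mapping an input vector to a scalar in [0, 1]."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid(),   # bound the emotion parameter to [0, 1]
        )

    def forward(self, x):
        return self.net(x)

def train_until_threshold(model, inputs, targets,
                          loss_threshold=1e-3, lr=1e-3, max_epochs=1000):
    """Iterate until the model loss drops below the preset loss threshold."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    for _ in range(max_epochs):
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        if loss.item() < loss_threshold:      # stop once the threshold is met
            break
    return model
```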
In this embodiment, after the training corpora are acquired, speech feature parameters are extracted from the historical dialogue corpora, and weights are assigned to the speech feature parameters and user tags to differentiate their contributions to the dialogue emotion parameter; training the initial emotion judgment model with the weighted vector matrix as input and the dialogue emotion parameter as output yields an emotion judgment model capable of accurate emotion selection.
Further, in one embodiment, before the step of acquiring the user identifier in the human-machine dialogue start instruction, the method may further include: training the initial emotion judgment model with the training corpora in the Gpipe library, based on a genetic algorithm, to obtain the emotion judgment model.
Specifically, the initial emotion judgment model may be a deep neural network (DNN). The neural network layers inside a DNN fall into three categories: the input layer, hidden layers, and the output layer. In general, the first layer is the input layer, the last is the output layer, the layers in between are all hidden layers, and adjacent layers are fully connected.
To ensure accurate training of the initial emotion judgment model, it can be trained with the training corpora in the Gpipe library based on an evolutionary algorithm. Gpipe is a distributed machine learning, scalable pipeline-parallelism library that can train giant deep neural networks. It trains with synchronous stochastic gradient descent and pipeline parallelism and is applicable to any DNN composed of multiple consecutive layers. Gpipe trains larger models by deploying more accelerators: it allows a model to be partitioned across accelerators, splitting the model and assigning the parts to different accelerators, and automatically splits mini-batches into smaller micro-batches, enabling efficient training across multiple accelerators; because gradients are accumulated consistently across micro-batches, the number of partitions does not affect model quality. Gpipe therefore supports deploying more accelerators to train larger models and, without tuning hyperparameters, makes the model output more accurate, improving performance.
Evolutionary algorithms are a family of search algorithms that simulate biological evolution mechanisms such as natural selection and inheritance; genetic algorithms are one class of them. All kinds of evolutionary algorithms are iterative in nature and share the concepts of population, individual, and encoding, where: (1) a population can be understood as a set of models; (2) an individual can be understood as a particular model; (3) encoding means describing an object in a computer-processable form, for example representing a network structure as a fixed-length binary string.
In an evolutionary algorithm, producing each next generation takes three steps: selection, crossover, and mutation:
(1) The selection process picks better objects from the population, such as models with higher accuracy.
(2) The crossover process exchanges information between different good objects, such as swapping modules between two good models.
(3) The mutation process makes a small change to an individual; compared with crossover, it introduces more randomness and helps escape local optima.
After mutation, the models are evaluated with a fitness function and the better ones are kept, until the final optimal model is obtained. The fitness function may be a loss function, used to measure the accuracy of the model's computation results.
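A hedged sketch of the selection/crossover/mutation loop follows, applied to flat lists of floats (e.g. hyperparameters) rather than full networks for brevity; the fitness function, population sizes, and mutation scale are illustrative, and the Gpipe-based partitioning across accelerators is omitted entirely.

```python
import random

def evolve(fitness, population, generations=20, keep=4, sigma=0.1):
    """Generic genetic search: select the fittest, cross them over, mutate."""
    population = [list(ind) for ind in population]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)     # selection: keep the best
        parents = population[:keep]
        children = []
        while len(children) < len(population) - keep:
            a, b = random.sample(parents, 2)
            child = [random.choice(pair) for pair in zip(a, b)]  # crossover
            i = random.randrange(len(child))
            child[i] += random.gauss(0.0, sigma)       # mutation adds randomness
            children.append(child)
        population = parents + children
    return max(population, key=fitness)
```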
In this embodiment, the initial emotion judgment model is trained in the Gpipe library based on a genetic algorithm, ensuring the accuracy of the resulting emotion judgment model.
Further, step S205 above may include: performing semantic parsing on the call voice information to obtain a semantic parsing result; selecting, from the pre-recorded standard dialogue voices, the standard dialogue voice corresponding to the semantic parsing result; querying, based on the machine dialogue emotion parameter, the voice adjustment mode for the standard dialogue voice, the voice adjustment mode including an acoustic adjustment mode and a modal-particle adjustment mode; and performing voice adjustment on the standard dialogue voice according to the voice adjustment mode to obtain the adapted dialogue voice.
Specifically, the server performs semantic parsing on the call voice information to obtain a semantic parsing result. The call voice information may first be converted into call text, and a pre-trained intent recognition model performs intent recognition on the call text to obtain the user intent, which serves as the semantic parsing result; alternatively, the similarity between the call text and each pre-stored template text may be computed, and the template text with the highest similarity, provided it exceeds a preset similarity threshold, is taken as the semantic parsing result.
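For the template-matching variant, a TF-IDF cosine-similarity lookup is one straightforward realization; the threshold value here is an assumption for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def match_template(call_text, templates, threshold=0.6):
    """Return the stored template text closest to the call text, or None."""
    vectors = TfidfVectorizer().fit_transform(templates + [call_text])
    sims = cosine_similarity(vectors[-1], vectors[:-1])[0]
    best = sims.argmax()
    # Accept the match only if it clears the preset similarity threshold.
    return templates[best] if sims[best] >= threshold else None
```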
There may be multiple pre-recorded standard dialogue voices with different semantic meanings, and the standard dialogue voice matching the semantic parsing result can be selected from among them.
Each machine dialogue emotion parameter has a preset voice adjustment mode. The voice adjustment mode specifies how the standard dialogue voice is adjusted and includes an acoustic adjustment mode and a modal-particle adjustment mode. The acoustic adjustment mode specifies how the acoustic feature information is adjusted, including the energy concentration region, formant frequency, formant intensity and bandwidth that characterize timbre, as well as the duration, fundamental frequency, average speech power and other quantities that characterize prosody. The modal-particle adjustment mode specifies how modal particles are added to the standard dialogue voice.
The server performs voice adjustment on the pre-recorded standard dialogue voice according to the voice adjustment mode, thereby changing its emotional tendency and obtaining the adapted dialogue voice. For example, the emotional tendency of the standard dialogue voice can be adjusted toward pleasant: in the acoustic adjustment mode, the pitch and the average speech power can be raised; in the modal-particle adjustment mode, a particle such as "haha" can be appended to the end of the standard dialogue voice.
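The acoustic side of such an adjustment might be approximated with off-the-shelf pitch shifting plus a gain factor, and the modal-particle side with simple waveform concatenation; the parameter values and the pre-recorded particle clip are assumptions, not the application's prescribed implementation.

```python
import numpy as np
import librosa

def adapt_speech(standard_wav, particle_wav=None, sr=16000,
                 pitch_steps=2.0, gain=1.2):
    """Raise pitch and average power, then append a modal-particle clip (sketch)."""
    adjusted = librosa.effects.pitch_shift(standard_wav, sr=sr, n_steps=pitch_steps)
    adjusted = np.clip(adjusted * gain, -1.0, 1.0)     # raise average speech power
    if particle_wav is not None:                       # e.g. a pre-recorded "haha"
        adjusted = np.concatenate([adjusted, particle_wav])
    return adjusted
```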
At inference time, because timeliness requirements must be met, the evolutionary algorithm can be dropped from the emotion judgment, so that the machine's dialogue emotion can be adjusted instantly.
In this embodiment, semantic parsing is performed on the call voice information so that a semantically matching standard dialogue voice is selected, ensuring that the human-machine dialogue is semantically reasonable; the voice adjustment mode corresponding to the machine dialogue emotion parameter is queried, and acoustic adjustment and modal-particle adjustment are performed on the standard dialogue voice accordingly, yielding an adapted dialogue voice that carries emotion.
Further, after step S205 above, the method may further include: importing the call voice information of the current call into a pre-established intent recognition model to obtain a user intent recognition result; determining, according to the intent recognition result, whether the current call requires human intervention; and when it does, transferring the current call to the terminal logged in with a human agent account.
The intent recognition model may be a model that recognizes the user's intent.
Specifically, the server may also detect and monitor the user's intent during the call, recognizing it with the pre-trained intent recognition model. The server imports the call voice information of the current call into the model, which can convert the call voice information into call text, perform semantic analysis on the text, and output the intent recognition result.
When the intent recognition result indicates that the current call requires human intervention, the current call is transferred to the terminal logged in with the human agent account, so that the human agent can talk with the user through the terminal.
For example, in an AI collection scenario, a voice matching the user's emotion is selected for the human-machine dialogue; when the user clearly shows obvious resistance to repayment in the dialogue, human intervention can be deemed necessary and the dialogue is transferred to the terminal logged in with the human agent account so that a human agent takes over; or, when the human-machine dialogue system cannot effectively answer the user's questions, the dialogue is transferred to that terminal so as to provide a better dialogue service.
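Routing on the intent result can then be a simple guard in the dialogue loop; the intent labels and the transfer callback below are hypothetical placeholders.

```python
# Hypothetical intent labels that indicate human intervention is needed.
HANDOFF_INTENTS = {"strong_repayment_refusal", "unanswerable_question"}

def route_call(intent, call_id, transfer_to_agent):
    """Hand the call to a human agent's terminal if the recognized intent requires it."""
    if intent in HANDOFF_INTENTS:
        transfer_to_agent(call_id)   # transfer to the terminal logged in by the agent
        return "agent"
    return "bot"
```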
In this embodiment, intent detection is performed during the human-machine dialogue, and when the detection result indicates that the current call requires human intervention, the call is transferred to the terminal logged in with the human agent account, bringing a human agent into the dialogue in time and improving the intelligence of the dialogue interaction.
Further, the step of transferring the current call to the terminal logged in with the human agent account when the current call requires human intervention may include: when the current call requires human intervention, acquiring the call voice information of the current call and the user tag of the user in the current call; converting the call voice information into call text; and transferring the current call to the terminal logged in with the human agent account while sending the call text and the user tag to the terminal for display.
Specifically, when the server determines that the current call requires human intervention, it converts the call voice information of the current call into call text and acquires the user's user tag; when transferring the current call to the terminal logged in with the human agent account, it sends the call text and the user tag to the terminal, so that the human agent can immediately grasp the context of the dialogue and the user's basic information without having to communicate from scratch, improving the efficiency and intelligence of the dialogue interaction.
In this embodiment, when the call is transferred to the terminal logged in with the human agent account, the dialogue text and the user tag are sent to the terminal together, so that the dialogue can continue from where it left off rather than starting over, improving the efficiency and intelligence of the dialogue interaction.
The voice dialogue data processing method in this application involves neural networks, machine learning, and speech processing in the field of artificial intelligence; it may also relate to smart living in the field of smart cities.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by computer-readable instructions directing the relevant hardware. The computer-readable instructions can be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, or a read-only memory (ROM), or a random access memory (RAM), or the like.
It should be understood that although the steps in the flowchart of the accompanying drawings are displayed sequentially as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in the flowchart may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different moments; their execution order is also not necessarily sequential, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
进一步参考图3,作为对上述图2所示方法的实现,本申请提供了一种语音对话数据处理装置的一个实施例,该装置实施例与图2所示的方法实施例相对应,该装置具体可以应用于各种电子设备中。With further reference to FIG. 3 , as an implementation of the method shown in FIG. 2 above, the present application provides an embodiment of an apparatus for processing voice dialogue data. The apparatus embodiment corresponds to the method embodiment shown in FIG. 2 . The apparatus Specifically, it can be applied to various electronic devices.
如图3所示,本实施例所述的语音对话数据处理装置300包括:获取模块301、转换模块302、矩阵输入模块303、语音调整模块304以及人机对话模块305,其中:As shown in FIG. 3 , the voice dialogue data processing device 300 in this embodiment includes: an acquisition module 301, a conversion module 302, a matrix input module 303, a voice adjustment module 304, and a human-machine dialogue module 305, wherein:
The acquisition module 301 is configured to obtain, according to a triggered voice dialogue data processing instruction, the call voice information of the current call and the user tag of the user in the current call.
The conversion module 302 is configured to convert the call voice information and the user tag into a vector matrix with weights.
The matrix input module 303 is configured to input the weighted vector matrix into an emotion judgment model to obtain machine dialogue emotion parameters.
The voice adjustment module 304 is configured to perform voice adjustment on pre-recorded standard dialogue voice according to the machine dialogue emotion parameters to obtain adapted dialogue voice, wherein the voice adjustment includes acoustic adjustment and modal-particle adjustment.
The human-machine dialogue module 305 is configured to conduct human-machine dialogue based on the adapted dialogue voice.
In this embodiment, after the voice dialogue data processing instruction is received, the call voice information of the current call and the user tag of the user in the current call are obtained; the user tag can characterize the user's personal information. The call voice information and the user tag are converted into a vector matrix with weights, which fuses the voice features of the user's call with the user's personal information. The emotion judgment model processes and maps the vector matrix to obtain machine dialogue emotion parameters, which characterize the emotion category and intensity the machine should adopt. According to the machine dialogue emotion parameters, acoustic adjustment and modal-particle adjustment are performed on the standard dialogue voice to obtain the adapted dialogue voice. In this way, the dialogue emotion is selected in a targeted manner according to the user's dialogue emotion and personal information during the human-machine dialogue, improving the intelligence of the human-machine voice interaction.
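A compact sketch of this pipeline, under the assumption that the speech-feature block and the tag embedding share one column dimension, might look as follows; every helper (`featurize`, `embed_tags`, `emotion_model`, and so on) and the 0.7/0.3 weight split are illustrative stand-ins rather than details prescribed by the embodiment:

```python
import numpy as np

def build_weighted_matrix(speech_features: np.ndarray,
                          tag_vector: np.ndarray,
                          w_speech: float = 0.7,
                          w_tags: float = 0.3) -> np.ndarray:
    # Stack the weighted speech-feature block on top of the weighted
    # user-tag block (both assumed to share the same column dimension).
    return np.vstack([w_speech * np.atleast_2d(speech_features),
                      w_tags * np.atleast_2d(tag_vector)])

def adapted_reply(call_audio, user_tags, featurize, embed_tags,
                  emotion_model, select_standard_voice, adjust_voice):
    features = featurize(call_audio)      # voice features of the live call
    tag_vec = embed_tags(user_tags)       # embedding of personal information
    matrix = build_weighted_matrix(features, tag_vec)
    # The model maps the fused matrix to an emotion category and intensity.
    emotion_params = emotion_model.predict(matrix)
    # Select a semantically matching pre-recorded reply, then apply the
    # acoustic and modal-particle adjustments the parameters call for.
    standard = select_standard_voice(call_audio)
    return adjust_voice(standard, emotion_params)
```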
In some optional implementations of this embodiment, the voice dialogue data processing apparatus 300 may further comprise: an identification acquisition module, a tag acquisition module, an initial input module, an initial adjustment module, and an initial dialogue module, wherein:
The identification acquisition module is configured to obtain, according to a received human-machine dialogue start instruction, the user identifier in the instruction.
The tag acquisition module is configured to obtain the user tag corresponding to the user identifier and convert the user tag into an initial vector matrix.
The initial input module is configured to input the initial vector matrix into the emotion judgment model to obtain initial dialogue emotion parameters.
The initial adjustment module is configured to perform voice adjustment on pre-recorded initial standard dialogue voice according to the initial dialogue emotion parameters to obtain initial adapted dialogue voice.
The initial dialogue module is configured to conduct human-machine dialogue based on the initial adapted dialogue voice and to perform voice monitoring on the dialogue, thereby obtaining the call voice information of the current call.
In this embodiment, at the very start of the human-machine dialogue, the initial dialogue emotion parameters can be obtained from the user tag alone, and the initial standard dialogue voice is adjusted according to these parameters to obtain the initial adapted dialogue voice. This allows an emotional tendency to be added to the dialogue even before any call voice information is available.
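The cold-start path can be pictured as the same pipeline with the speech block left out; again, all names below are hypothetical assumptions rather than a prescribed interface:

```python
import numpy as np

def initial_reply(user_id, tag_store, embed_tags, emotion_model,
                  adjust_voice, initial_standard_voice):
    # No call audio exists yet: the initial vector matrix is built from
    # the user tags alone.
    tags = tag_store.lookup(user_id)
    initial_matrix = np.atleast_2d(embed_tags(tags))
    initial_params = emotion_model.predict(initial_matrix)
    # Adjust the pre-recorded opening line with the initial parameters.
    return adjust_voice(initial_standard_voice, initial_params)
```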
In some optional implementations of this embodiment, the voice dialogue data processing apparatus 300 may further comprise: a training acquisition module, a parameter extraction module, a weight allocation module, and an initial training module, wherein:
The training acquisition module is configured to obtain a training corpus comprising user tags, historical dialogue material, and dialogue emotion parameters.
The parameter extraction module is configured to extract speech feature parameters from the historical dialogue material.
The weight allocation module is configured to assign weights to the speech feature parameters and the user tags to generate a vector matrix with weights.
The initial training module is configured to take the weighted vector matrix as model input and the dialogue emotion parameters as model output, and to train an initial emotion judgment model to obtain the emotion judgment model.
In this embodiment, after the training corpus is obtained, speech feature parameters are extracted from its historical dialogue material, and weights are assigned to the speech feature parameters and the user tags so as to differentiate their respective contributions to the dialogue emotion parameters. The weighted vector matrix is then used as model input and the dialogue emotion parameters as model output to train the initial emotion judgment model, yielding an emotion judgment model that can select emotions accurately.
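A hedged sketch of this supervised setup is given below; `corpus` is assumed to be a list of (historical audio, user tags, emotion parameters) records, and `model.fit_step` stands in for whatever single update the chosen model exposes:

```python
import numpy as np

def train_emotion_model(corpus, extract_features, embed_tags, model,
                        w_speech=0.7, w_tags=0.3, epochs=10):
    # One record = (historical dialogue audio, user tags, emotion params).
    for _ in range(epochs):
        for history_audio, user_tags, emotion_params in corpus:
            # Weighting the two blocks differently encodes how much each
            # source should contribute to the predicted emotion.
            x = np.vstack(
                [w_speech * np.atleast_2d(extract_features(history_audio)),
                 w_tags * np.atleast_2d(embed_tags(user_tags))])
            model.fit_step(x, emotion_params)   # single supervised update
    return model
```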
In some optional implementations of this embodiment, the voice dialogue data processing apparatus 300 may further comprise a model training module configured to train, in the Gpipe library and based on a genetic algorithm, the initial emotion judgment model on the training corpus to obtain the emotion judgment model.
In this embodiment, the initial emotion judgment model is trained in the Gpipe library based on a genetic algorithm, which ensures the accuracy of the resulting emotion judgment model.
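The genetic-algorithm portion might be sketched as follows; the Gpipe pipeline-parallel wrapping mentioned here is not shown, and `fitness`, `crossover`, and `mutate` are assumed helpers (for example, `fitness` could score a candidate parameter set by validation accuracy):

```python
import random

def genetic_search(init_population, fitness, crossover, mutate,
                   generations=50):
    # Evolve candidate parameter sets for the emotion judgment model.
    population = list(init_population)
    for _ in range(generations):
        # Rank candidates from fittest to least fit.
        population.sort(key=fitness, reverse=True)
        # Keep the better half as survivors (at least two, for crossover).
        survivors = population[:max(2, len(population) // 2)]
        # Refill the population with mutated crossovers of survivor pairs.
        children = [mutate(crossover(*random.sample(survivors, 2)))
                    for _ in range(len(population) - len(survivors))]
        population = survivors + children
    return max(population, key=fitness)
```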
In some optional implementations of this embodiment, the voice adjustment module 304 may comprise: a semantic parsing submodule, a standard selection submodule, a mode query submodule, and a voice adjustment submodule, wherein:
The semantic parsing submodule is configured to perform semantic parsing on the call voice information to obtain a semantic parsing result.
The standard selection submodule is configured to select, from pre-recorded standard dialogue voices, the standard dialogue voice corresponding to the semantic parsing result.
The mode query submodule is configured to query, based on the machine dialogue emotion parameters, the voice adjustment mode for the standard dialogue voice, the voice adjustment mode including an acoustic adjustment mode and a modal-particle adjustment mode.
The voice adjustment submodule is configured to perform voice adjustment on the standard dialogue voice according to the voice adjustment mode to obtain the adapted dialogue voice.
In this embodiment, semantic parsing is performed on the call voice information so that a semantically matching standard dialogue voice can be selected, ensuring that the human-machine dialogue remains semantically coherent. The voice adjustment mode corresponding to the machine dialogue emotion parameters is then queried, and acoustic adjustment and modal-particle adjustment are applied to the standard dialogue voice accordingly, yielding an adapted dialogue voice that carries emotion.
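One way to picture the two adjustment kinds together is the sketch below, in which `adjustment_table`, `voice_bank`, and `audio_engine` are hypothetical interfaces and the prosody fields are illustrative:

```python
def adapt_standard_voice(call_audio, emotion_params, parse_semantics,
                         voice_bank, adjustment_table, audio_engine):
    # Semantic parsing picks the semantically matching standard reply.
    meaning = parse_semantics(call_audio)
    standard = voice_bank.select(meaning)
    # Look up the adjustment mode registered for this emotion category.
    mode = adjustment_table[emotion_params.category]
    # Acoustic adjustment: scale prosody settings by the emotion intensity.
    clip = audio_engine.modify(standard,
                               pitch=mode.pitch * emotion_params.intensity,
                               speed=mode.speed,
                               volume=mode.volume)
    # Modal-particle adjustment: e.g. append a softening particle such as "呢".
    if mode.particle:
        clip = audio_engine.append_particle(clip, mode.particle)
    return clip
```

Keeping the emotion-to-adjustment mapping in a lookup table, as sketched here, is one plausible way to realize the "query" step without regenerating speech from scratch.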
In some optional implementations of this embodiment, the voice dialogue data processing apparatus 300 may further comprise: an information import module, a call determination module, and a call transfer module, wherein:
The information import module is configured to import the call voice information of the current call into a pre-established intent recognition model to obtain a user intent recognition result.
The call determination module is configured to determine, according to the intent recognition result, whether the current call requires manual intervention.
The call transfer module is configured to transfer, when the current call requires manual intervention, the current call to the terminal logged in with the manual agent account.
In this embodiment, intent detection is performed during the human-machine dialogue. When the intent detection result indicates that the current call requires manual intervention, the call is transferred to the terminal logged in with the manual agent account, bringing a human agent into the dialogue in good time and thereby improving the intelligence of the human-machine interaction.
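As a rough illustration of this routing step (the intent model's interface and the transfer mechanism are assumptions, not prescribed by the embodiment):

```python
def route_call(call_id, call_audio, intent_model, transfer_to_agent,
               continue_dialogue):
    # Run the pre-established intent recognition model on the call audio.
    result = intent_model.predict(call_audio)
    if result.needs_human:   # e.g. an intent the bot is not equipped to handle
        # Hand the live call to the terminal where a manual agent account
        # is logged in; transcript and tags are attached by the handoff step.
        transfer_to_agent(call_id)
    else:
        continue_dialogue(call_id, call_audio)
    return result.needs_human
```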
In some optional implementations of this embodiment, the call transfer module may comprise: an acquisition submodule, an information conversion submodule, and a call transfer submodule, wherein:
The acquisition submodule is configured to obtain, when the current call requires manual intervention, the call voice information of the current call and the user tag of the user in the current call.
The information conversion submodule is configured to convert the call voice information into call text.
The call transfer submodule is configured to transfer the current call to the terminal logged in with the manual agent account, and to send the call text and the user tag to the terminal for display.
In this embodiment, when the call is transferred to the terminal logged in with the manual agent account, the dialogue text and the user tag are sent to the terminal together, so that the dialogue can continue from where it left off rather than starting over, which improves the efficiency and intelligence of the dialogue interaction.
To solve the above technical problems, an embodiment of the present application further provides a computer device. Refer to FIG. 4 for details, which is a block diagram of the basic structure of the computer device of this embodiment.
The computer device 4 comprises a memory 41, a processor 42, and a network interface 43 that communicate with one another via a system bus. It should be noted that the figure only shows the computer device 4 with components 41-43, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead. Those skilled in the art will understand that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
The computer device may be a desktop computer, a notebook, a palmtop computer, a cloud server, or another computing device. The computer device can carry out human-machine voice dialogue interaction with the user by means of a keyboard, a mouse, a remote control, a touch pad, a voice-activated device, or the like.
The memory 41 includes at least one type of computer-readable storage medium, which may be non-volatile or volatile and includes flash memory, hard disks, multimedia cards, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical discs, and the like. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as its hard disk or internal memory. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the computer device 4. Of course, the memory 41 may also include both the internal storage unit of the computer device 4 and its external storage device. In this embodiment, the memory 41 is generally used to store the operating system and various application software installed on the computer device 4, such as computer-readable instructions of the voice dialogue data processing method, and may also be used to temporarily store various kinds of data that have been or will be output.
In some embodiments, the processor 42 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to run the computer-readable instructions stored in the memory 41 or to process data, for example to run the computer-readable instructions of the voice dialogue data processing method.
The network interface 43 may include a wireless network interface or a wired network interface and is typically used to establish communication connections between the computer device 4 and other electronic devices.
The computer device provided in this embodiment can execute the above voice dialogue data processing method, which may be the voice dialogue data processing method of any of the above embodiments.
In this embodiment, after the voice dialogue data processing instruction is received, the call voice information of the current call and the user tag of the user in the current call are obtained; the user tag can characterize the user's personal information. The call voice information and the user tag are converted into a vector matrix with weights, which fuses the voice features of the user's call with the user's personal information. The emotion judgment model processes and maps the vector matrix to obtain machine dialogue emotion parameters, which characterize the emotion category and intensity the machine should adopt. According to the machine dialogue emotion parameters, acoustic adjustment and modal-particle adjustment are performed on the standard dialogue voice to obtain the adapted dialogue voice. In this way, the dialogue emotion is selected in a targeted manner according to the user's dialogue emotion and personal information during the human-machine dialogue, improving the intelligence of the human-machine voice interaction.
The present application also provides another embodiment, namely a computer-readable storage medium storing computer-readable instructions executable by at least one processor so as to cause the at least one processor to perform the steps of the voice dialogue data processing method described above.
In this embodiment, after the voice dialogue data processing instruction is received, the call voice information of the current call and the user tag of the user in the current call are obtained; the user tag can characterize the user's personal information. The call voice information and the user tag are converted into a vector matrix with weights, which fuses the voice features of the user's call with the user's personal information. The emotion judgment model processes and maps the vector matrix to obtain machine dialogue emotion parameters, which characterize the emotion category and intensity the machine should adopt. According to the machine dialogue emotion parameters, acoustic adjustment and modal-particle adjustment are performed on the standard dialogue voice to obtain the adapted dialogue voice. In this way, the dialogue emotion is selected in a targeted manner according to the user's dialogue emotion and personal information during the human-machine dialogue, improving the intelligence of the human-machine voice interaction.
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, though in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the various embodiments of this application.
Obviously, the embodiments described above are only some of the embodiments of the present application rather than all of them. The accompanying drawings show preferred embodiments of the present application but do not limit its patent scope. The present application may be embodied in many different forms; rather, these embodiments are provided so that the disclosure of the present application will be thorough and complete. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing specific embodiments or make equivalent substitutions for some of their technical features. Any equivalent structure made using the contents of the specification and drawings of the present application and used directly or indirectly in other related technical fields likewise falls within the scope of patent protection of the present application.

Claims (20)

  1. A voice dialogue data processing method, comprising the following steps:
    obtaining, according to a triggered voice dialogue data processing instruction, call voice information of a current call and a user tag of a user in the current call;
    converting the call voice information and the user tag into a vector matrix with weights;
    inputting the vector matrix with weights into an emotion judgment model to obtain machine dialogue emotion parameters;
    performing voice adjustment on pre-recorded standard dialogue voice according to the machine dialogue emotion parameters to obtain adapted dialogue voice, wherein the voice adjustment comprises acoustic adjustment and modal-particle adjustment; and
    conducting human-machine dialogue based on the adapted dialogue voice.
  2. The voice dialogue data processing method according to claim 1, wherein before the step of obtaining, according to the triggered voice dialogue data processing instruction, the call voice information of the current call and the user tag of the user in the current call, the method further comprises:
    obtaining, according to a received human-machine dialogue start instruction, a user identifier in the human-machine dialogue start instruction;
    obtaining a user tag corresponding to the user identifier, and converting the user tag into an initial vector matrix;
    inputting the initial vector matrix into the emotion judgment model to obtain initial dialogue emotion parameters;
    performing voice adjustment on pre-recorded initial standard dialogue voice according to the initial dialogue emotion parameters to obtain initial adapted dialogue voice; and
    conducting human-machine dialogue based on the initial adapted dialogue voice, and performing voice monitoring on the human-machine dialogue to obtain the call voice information of the current call.
  3. The voice dialogue data processing method according to claim 2, wherein before the step of obtaining, according to the received human-machine dialogue start instruction, the user identifier in the human-machine dialogue start instruction, the method further comprises:
    obtaining a training corpus, the training corpus comprising user tags, historical dialogue material, and dialogue emotion parameters;
    extracting speech feature parameters from the historical dialogue material;
    assigning weights to the speech feature parameters and the user tags to generate a vector matrix with weights; and
    taking the vector matrix with weights as model input and the dialogue emotion parameters as model output, and training an initial emotion judgment model to obtain the emotion judgment model.
  4. The voice dialogue data processing method according to claim 2, wherein before the step of obtaining, according to the received human-machine dialogue start instruction, the user identifier in the human-machine dialogue start instruction, the method further comprises:
    training, in the Gpipe library and based on a genetic algorithm, an initial emotion judgment model on a training corpus to obtain the emotion judgment model.
  5. The voice dialogue data processing method according to claim 1, wherein the step of performing voice adjustment on the pre-recorded standard dialogue voice according to the machine dialogue emotion parameters to obtain the adapted dialogue voice comprises:
    performing semantic parsing on the call voice information to obtain a semantic parsing result;
    selecting, from pre-recorded standard dialogue voices, a standard dialogue voice corresponding to the semantic parsing result;
    querying, based on the machine dialogue emotion parameters, a voice adjustment mode for the standard dialogue voice, the voice adjustment mode comprising an acoustic adjustment mode and a modal-particle adjustment mode; and
    performing voice adjustment on the standard dialogue voice according to the voice adjustment mode to obtain the adapted dialogue voice.
  6. The voice dialogue data processing method according to claim 1, wherein after the step of conducting human-machine dialogue based on the adapted dialogue voice, the method further comprises:
    importing the call voice information of the current call into a pre-established intent recognition model to obtain a user intent recognition result;
    determining, according to the intent recognition result, whether the current call requires manual intervention; and
    transferring, when the current call requires manual intervention, the current call to a terminal logged in with a manual agent account.
  7. The voice dialogue data processing method according to claim 6, wherein the step of transferring, when the current call requires manual intervention, the current call to the terminal logged in with the manual agent account comprises:
    obtaining, when the current call requires manual intervention, the call voice information of the current call and the user tag of the user in the current call;
    converting the call voice information into call text; and
    transferring the current call to the terminal logged in with the manual agent account, and sending the call text and the user tag to the terminal for display.
  8. A voice dialogue data processing apparatus, comprising:
    an acquisition module, configured to obtain, according to a triggered voice dialogue data processing instruction, call voice information of a current call and a user tag of a user in the current call;
    a conversion module, configured to convert the call voice information and the user tag into a vector matrix with weights;
    a matrix input module, configured to input the vector matrix with weights into an emotion judgment model to obtain machine dialogue emotion parameters;
    a voice adjustment module, configured to perform voice adjustment on pre-recorded standard dialogue voice according to the machine dialogue emotion parameters to obtain adapted dialogue voice, wherein the voice adjustment comprises acoustic adjustment and modal-particle adjustment; and
    a human-machine dialogue module, configured to conduct human-machine dialogue based on the adapted dialogue voice.
  9. A computer device, comprising a memory and a processor, the memory storing computer-readable instructions, wherein the processor, when executing the computer-readable instructions, implements the following steps:
    obtaining, according to a triggered voice dialogue data processing instruction, call voice information of a current call and a user tag of a user in the current call;
    converting the call voice information and the user tag into a vector matrix with weights;
    inputting the vector matrix with weights into an emotion judgment model to obtain machine dialogue emotion parameters;
    performing voice adjustment on pre-recorded standard dialogue voice according to the machine dialogue emotion parameters to obtain adapted dialogue voice, wherein the voice adjustment comprises acoustic adjustment and modal-particle adjustment; and
    conducting human-machine dialogue based on the adapted dialogue voice.
  10. The computer device according to claim 9, wherein before the step of obtaining, according to the triggered voice dialogue data processing instruction, the call voice information of the current call and the user tag of the user in the current call, the processor, when executing the computer-readable instructions, further implements the following steps:
    obtaining, according to a received human-machine dialogue start instruction, a user identifier in the human-machine dialogue start instruction;
    obtaining a user tag corresponding to the user identifier, and converting the user tag into an initial vector matrix;
    inputting the initial vector matrix into the emotion judgment model to obtain initial dialogue emotion parameters;
    performing voice adjustment on pre-recorded initial standard dialogue voice according to the initial dialogue emotion parameters to obtain initial adapted dialogue voice; and
    conducting human-machine dialogue based on the initial adapted dialogue voice, and performing voice monitoring on the human-machine dialogue to obtain the call voice information of the current call.
  11. The computer device according to claim 10, wherein before the step of obtaining, according to the received human-machine dialogue start instruction, the user identifier in the human-machine dialogue start instruction, the processor, when executing the computer-readable instructions, further implements the following steps:
    obtaining a training corpus, the training corpus comprising user tags, historical dialogue material, and dialogue emotion parameters;
    extracting speech feature parameters from the historical dialogue material;
    assigning weights to the speech feature parameters and the user tags to generate a vector matrix with weights; and
    taking the vector matrix with weights as model input and the dialogue emotion parameters as model output, and training an initial emotion judgment model to obtain the emotion judgment model.
  12. The computer device according to claim 9, wherein the step of performing voice adjustment on the pre-recorded standard dialogue voice according to the machine dialogue emotion parameters to obtain the adapted dialogue voice comprises:
    performing semantic parsing on the call voice information to obtain a semantic parsing result;
    selecting, from pre-recorded standard dialogue voices, a standard dialogue voice corresponding to the semantic parsing result;
    querying, based on the machine dialogue emotion parameters, a voice adjustment mode for the standard dialogue voice, the voice adjustment mode comprising an acoustic adjustment mode and a modal-particle adjustment mode; and
    performing voice adjustment on the standard dialogue voice according to the voice adjustment mode to obtain the adapted dialogue voice.
  13. The computer device according to claim 9, wherein after the step of conducting human-machine dialogue based on the adapted dialogue voice, the processor, when executing the computer-readable instructions, further implements the following steps:
    importing the call voice information of the current call into a pre-established intent recognition model to obtain a user intent recognition result;
    determining, according to the intent recognition result, whether the current call requires manual intervention; and
    transferring, when the current call requires manual intervention, the current call to a terminal logged in with a manual agent account.
  14. The computer device according to claim 13, wherein the step of transferring, when the current call requires manual intervention, the current call to the terminal logged in with the manual agent account comprises:
    obtaining, when the current call requires manual intervention, the call voice information of the current call and the user tag of the user in the current call;
    converting the call voice information into call text; and
    transferring the current call to the terminal logged in with the manual agent account, and sending the call text and the user tag to the terminal for display.
  15. A computer-readable storage medium having computer-readable instructions stored thereon, wherein the computer-readable instructions, when executed by a processor, implement the following steps:
    obtaining, according to a triggered voice dialogue data processing instruction, call voice information of a current call and a user tag of a user in the current call;
    converting the call voice information and the user tag into a vector matrix with weights;
    inputting the vector matrix with weights into an emotion judgment model to obtain machine dialogue emotion parameters;
    performing voice adjustment on pre-recorded standard dialogue voice according to the machine dialogue emotion parameters to obtain adapted dialogue voice, wherein the voice adjustment comprises acoustic adjustment and modal-particle adjustment; and
    conducting human-machine dialogue based on the adapted dialogue voice.
  16. The computer-readable storage medium according to claim 15, wherein before the step of obtaining, according to the triggered voice dialogue data processing instruction, the call voice information of the current call and the user tag of the user in the current call, the computer-readable instructions, when executed by the processor, further implement the following steps:
    obtaining, according to a received human-machine dialogue start instruction, a user identifier in the human-machine dialogue start instruction;
    obtaining a user tag corresponding to the user identifier, and converting the user tag into an initial vector matrix;
    inputting the initial vector matrix into the emotion judgment model to obtain initial dialogue emotion parameters;
    performing voice adjustment on pre-recorded initial standard dialogue voice according to the initial dialogue emotion parameters to obtain initial adapted dialogue voice; and
    conducting human-machine dialogue based on the initial adapted dialogue voice, and performing voice monitoring on the human-machine dialogue to obtain the call voice information of the current call.
  17. The computer-readable storage medium according to claim 16, wherein before the step of obtaining, according to the received human-machine dialogue start instruction, the user identifier in the human-machine dialogue start instruction, the computer-readable instructions, when executed by the processor, further implement the following steps:
    obtaining a training corpus, the training corpus comprising user tags, historical dialogue material, and dialogue emotion parameters;
    extracting speech feature parameters from the historical dialogue material;
    assigning weights to the speech feature parameters and the user tags to generate a vector matrix with weights; and
    taking the vector matrix with weights as model input and the dialogue emotion parameters as model output, and training an initial emotion judgment model to obtain the emotion judgment model.
  18. The computer-readable storage medium according to claim 15, wherein the step of performing voice adjustment on the pre-recorded standard dialogue voice according to the machine dialogue emotion parameters to obtain the adapted dialogue voice comprises:
    performing semantic parsing on the call voice information to obtain a semantic parsing result;
    selecting, from pre-recorded standard dialogue voices, a standard dialogue voice corresponding to the semantic parsing result;
    querying, based on the machine dialogue emotion parameters, a voice adjustment mode for the standard dialogue voice, the voice adjustment mode comprising an acoustic adjustment mode and a modal-particle adjustment mode; and
    performing voice adjustment on the standard dialogue voice according to the voice adjustment mode to obtain the adapted dialogue voice.
  19. The computer-readable storage medium according to claim 15, wherein after the step of conducting human-machine dialogue based on the adapted dialogue voice, the computer-readable instructions, when executed by the processor, further implement the following steps:
    importing the call voice information of the current call into a pre-established intent recognition model to obtain a user intent recognition result;
    determining, according to the intent recognition result, whether the current call requires manual intervention; and
    transferring, when the current call requires manual intervention, the current call to a terminal logged in with a manual agent account.
  20. The computer-readable storage medium according to claim 19, wherein the step of transferring, when the current call requires manual intervention, the current call to the terminal logged in with the manual agent account comprises:
    obtaining, when the current call requires manual intervention, the call voice information of the current call and the user tag of the user in the current call;
    converting the call voice information into call text; and
    transferring the current call to the terminal logged in with the manual agent account, and sending the call text and the user tag to the terminal for display.
PCT/CN2021/090173 2021-02-26 2021-04-27 Voice conversation data processing method and apparatus, and computer device and storage medium WO2022178969A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110218920.0A CN112967725A (en) 2021-02-26 2021-02-26 Voice conversation data processing method and device, computer equipment and storage medium
CN202110218920.0 2021-02-26

Publications (1)

Publication Number Publication Date
WO2022178969A1 (en)

Family

ID=76276097

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/090173 WO2022178969A1 (en) 2021-02-26 2021-04-27 Voice conversation data processing method and apparatus, and computer device and storage medium

Country Status (2)

Country Link
CN (1) CN112967725A (en)
WO (1) WO2022178969A1 (en)


Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113676602A (en) * 2021-07-23 2021-11-19 上海原圈网络科技有限公司 Method and device for processing manual transfer in automatic response
CN114218424B (en) * 2022-02-22 2022-05-13 杭州一知智能科技有限公司 Voice interaction method and system for tone word insertion based on wav2vec
CN115134655B (en) * 2022-06-28 2023-08-11 中国平安人寿保险股份有限公司 Video generation method and device, electronic equipment and computer readable storage medium
CN115460323A (en) * 2022-09-06 2022-12-09 上海浦东发展银行股份有限公司 Method, device, equipment and storage medium for intelligent external call transfer
CN117711399B (en) * 2024-02-06 2024-05-03 深圳市瑞得信息科技有限公司 Interactive AI intelligent robot control method and intelligent robot


Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10347244B2 (en) * 2017-04-21 2019-07-09 Go-Vivace Inc. Dialogue system incorporating unique speech to text conversion method for meaningful dialogue response
CN109036405A (en) * 2018-07-27 2018-12-18 百度在线网络技术(北京)有限公司 Voice interactive method, device, equipment and storage medium
CN111368609B (en) * 2018-12-26 2023-10-17 深圳Tcl新技术有限公司 Speech interaction method based on emotion engine technology, intelligent terminal and storage medium
CN110021308B (en) * 2019-05-16 2021-05-18 北京百度网讯科技有限公司 Speech emotion recognition method and device, computer equipment and storage medium
CN110570295A (en) * 2019-07-25 2019-12-13 深圳壹账通智能科技有限公司 Resource collection method and device, computer equipment and storage medium
CN110990543A (en) * 2019-10-18 2020-04-10 平安科技(深圳)有限公司 Intelligent conversation generation method and device, computer equipment and computer storage medium
CN111028827B (en) * 2019-12-10 2023-01-24 深圳追一科技有限公司 Interaction processing method, device, equipment and storage medium based on emotion recognition
CN111193834B (en) * 2019-12-16 2022-04-15 北京淇瑀信息科技有限公司 Man-machine interaction method and device based on user sound characteristic analysis and electronic equipment
CN111241822A (en) * 2020-01-03 2020-06-05 北京搜狗科技发展有限公司 Emotion discovery and dispersion method and device under input scene
CN111246027B (en) * 2020-04-28 2021-02-12 南京硅基智能科技有限公司 Voice communication system and method for realizing man-machine cooperation
CN111564202B (en) * 2020-04-30 2021-05-28 深圳市镜象科技有限公司 Psychological counseling method based on man-machine conversation, psychological counseling terminal and storage medium
CN111739516A (en) * 2020-06-19 2020-10-02 中国—东盟信息港股份有限公司 Speech recognition system for intelligent customer service call
CN111696556B (en) * 2020-07-13 2023-05-16 上海茂声智能科技有限公司 Method, system, equipment and storage medium for analyzing user dialogue emotion
CN111916111B (en) * 2020-07-20 2023-02-03 中国建设银行股份有限公司 Intelligent voice outbound method and device with emotion, server and storage medium
CN111885273B (en) * 2020-07-24 2021-10-15 南京易米云通网络科技有限公司 Man-machine cooperation controllable intelligent voice outbound method and intelligent outbound robot platform
CN112185389A (en) * 2020-09-22 2021-01-05 北京小米松果电子有限公司 Voice generation method and device, storage medium and electronic equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180136615A1 (en) * 2016-11-15 2018-05-17 Roborus Co., Ltd. Concierge robot system, concierge service method, and concierge robot
WO2018093806A1 (en) * 2016-11-15 2018-05-24 JIBO, Inc. Embodied dialog and embodied speech authoring tools for use with an expressive social robot
CN106570496A (en) * 2016-11-22 2017-04-19 上海智臻智能网络科技股份有限公司 Emotion recognition method and device and intelligent interaction method and device
CN110648691A (en) * 2019-09-30 2020-01-03 北京淇瑀信息科技有限公司 Emotion recognition method, device and system based on energy value of voice
CN110931006A (en) * 2019-11-26 2020-03-27 深圳壹账通智能科技有限公司 Intelligent question-answering method based on emotion analysis and related equipment

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116153330A (en) * 2023-04-04 2023-05-23 杭州度言软件有限公司 Intelligent telephone voice robot control method
CN116153330B (en) * 2023-04-04 2023-06-23 杭州度言软件有限公司 Intelligent telephone voice robot control method
CN116849659A (en) * 2023-09-04 2023-10-10 深圳市昊岳科技有限公司 Intelligent emotion bracelet for monitoring driver state and monitoring method thereof
CN116849659B (en) * 2023-09-04 2023-11-17 深圳市昊岳科技有限公司 Intelligent emotion bracelet for monitoring driver state and monitoring method thereof

Also Published As

Publication number Publication date
CN112967725A (en) 2021-06-15


Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21927403; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 21927403; Country of ref document: EP; Kind code of ref document: A1)