WO2021139737A1 - Method and system for man-machine interaction
- Publication number
- WO2021139737A1 (PCT/CN2021/070720)
- Authority
- WO
- WIPO (PCT)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
Description
- This specification relates to the field of computer technology, in particular to a human-computer interaction method and system.
- Terminals can provide automatic responses to users to realize human-computer interaction.
- The terminal can determine the response content corresponding to the voice or text sent by the user, and output the response content to the user.
- However, the terminal generally responds with a default tone, intonation, or content, which cannot meet the user's personalized interaction needs.
- The embodiments of this specification propose a human-computer interaction method and system to realize personalized interaction for different languages.
- One of the embodiments of this specification provides a human-computer interaction method. The method includes: extracting an associated feature of a target object based on an interaction instruction directed to the target object, where the associated feature is related to at least one of the interaction instruction and historical data of the target object; the associated feature includes a first feature corresponding to the interaction instruction; if the interaction instruction does not include voice data, the first feature includes the text feature of the text data corresponding to the interaction instruction; if the interaction instruction includes voice data, the first feature includes at least one of the voice feature of the voice data and the text feature of the text data corresponding to the interaction instruction; determining a response strategy for the target object based on processing the associated feature, the response strategy being related to at least one of response content, response style, and response emotion; and, based on the response strategy, determining response words for the target object.
- One of the embodiments of the present specification provides a human-computer interaction system. The system includes: an extraction module for extracting an associated feature of a target object based on an interaction instruction directed to the target object, where the associated feature is related to at least one of the interaction instruction and historical data of the target object; the associated feature includes a first feature corresponding to the interaction instruction; if the interaction instruction does not include voice data, the first feature includes the text feature of the text data corresponding to the interaction instruction; if the interaction instruction includes voice data, the first feature includes at least one of the voice feature of the voice data and the text feature of the text data corresponding to the interaction instruction; a determining module for determining a response strategy for the target object based on processing the associated feature, the response strategy being related to at least one of response content, response style, and response emotion; and a response module for determining, based on the response strategy, the response words for the target object.
- One of the embodiments of the present application provides a computer-readable storage medium that stores computer instructions. After the computer reads the computer instructions in the storage medium, the computer executes the aforementioned human-computer interaction method.
- Fig. 1 is a schematic diagram of an application scenario of a human-computer interaction system according to some embodiments of the present application
- Fig. 2 is a schematic diagram of exemplary hardware components and/or software components of an exemplary computing device according to some embodiments of the present application;
- Fig. 3 is a schematic diagram of exemplary hardware components and/or software components of an exemplary mobile device according to some embodiments of the present application;
- Fig. 4 is a block diagram of an exemplary processing device according to some embodiments of the present application.
- Fig. 5 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- Fig. 6 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- Fig. 7 is an exemplary flow chart of training a first classification model according to some embodiments of the present application.
- Fig. 8 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- Fig. 9 is an exemplary flowchart of training a second classification model according to some embodiments of the present application.
- Fig. 10 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- Fig. 11 is an exemplary flowchart of training a third classification model according to some embodiments of the present application.
- FIG. 12 is a schematic diagram of human-computer interaction according to some embodiments of the present application.
- FIG. 13 is a schematic diagram of human-computer interaction according to some embodiments of the present application.
- FIG. 14 is a schematic diagram of human-computer interaction according to some embodiments of the present application.
- FIG. 15 is a schematic diagram of human-computer interaction according to some embodiments of the present application.
- FIG. 16 is a schematic diagram of human-computer interaction according to some embodiments of the present application.
- Fig. 17 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- Figure 18 is a block diagram of a terminal according to some embodiments of the present application.
- Fig. 19 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- FIG. 20 is an exemplary flowchart of a method for determining a responded emotion according to some embodiments of the present application.
- FIG. 21 is an exemplary flowchart of a method for determining a response emotion according to some embodiments of the present application.
- Figure 22 is a block diagram of a terminal according to some embodiments of the present application.
- FIG. 23 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- FIG. 24 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- Fig. 25 is a block diagram of a terminal according to some embodiments of the present application.
- The word "system" used herein is a way of distinguishing different components, elements, parts, or assemblies at different levels.
- The word can be replaced by other expressions.
- Fig. 1 is a schematic diagram of an application scenario of a human-computer interaction system according to some embodiments of the present application.
- the human-computer interaction system 100 may include a server 110, a network 120, a first client 130, a second client 140 and a storage 150.
- the server 110 may process data and/or information obtained from at least one component of the system 100 (for example, the first client 130, the second client 140, and the storage 150) or an external data source (for example, a cloud data center). For example, the server 110 may obtain the interaction instruction from the first user terminal 130 (for example, the passenger terminal). For another example, the server 110 may also obtain historical data from the storage 150.
- the server 110 may include a processing device 112.
- the processing device 112 may process information and/or data related to the human-computer interaction system to perform one or more functions described in this specification. For example, the processing device 112 may determine response speech based on interactive instructions and/or historical data.
- the processing device 112 may include at least one processing unit (for example, a single-core processing engine or a multi-core processing engine). In some embodiments, the processing device 112 may be a part of the first client 130 and/or the second client 140.
- the network 120 may provide channels for information exchange.
- the network 120 may include one or more network access points.
- One or more components of the system 100 may be connected to the network 120 through an access point to exchange data and/or information.
- at least one component in the system 100 can access data or instructions stored in the memory 150 via the network 120.
- the owner of the first user terminal 130 may be the user himself or someone other than the user himself.
- the owner A of the first client 130 may use the first client 130 to send a service request for the user B.
- the first client 130 may include various types of devices with information receiving and/or sending functions.
- the first client 130 can process information and/or data.
- the first client 130 may be a device with a positioning function.
- the first client 130 may be a device with a display function, so that the response words fed back by the server 110 to the first client 130 may be displayed as an interface, pop-up window, floating window, small window, text, etc.
- the first client 130 may be a device with a voice function, so that the response words fed back by the server 110 to the first client 130 can be played.
- the second client 140 can communicate with the first client 130.
- the first client 130 and the second client 140 may communicate through a short-range communication device.
- the type of the second client 140 may be the same as or different from that of the first client 130.
- the first client 130 and the second client 140 may include, but are not limited to, a tablet computer, a notebook computer, a mobile device, a desktop computer, etc., or any combination thereof.
- the memory 150 may store data and/or instructions that can be executed or used by the processing device 112 to complete the exemplary methods described in this specification.
- the memory 150 may store historical data, a model used to determine the response speech, an audio file of the response speech, a text file, and the like.
- the storage 150 may be directly connected to the server 110 as a back-end storage.
- the storage 150 may be a part of the server 110, the first client 130 and/or the second client 140.
- Fig. 2 is a schematic diagram of exemplary hardware components and/or software components of an exemplary computing device according to some embodiments of the present application.
- the computing device 200 may include a processor 210, a memory 220, an input/output 230, and a communication port 240.
- the processor 210 can execute calculation instructions (program code) and perform the functions of the human-computer interaction system 100 described in the present invention.
- the calculation instructions may include programs, objects, components, data structures, procedures, modules, and functions (the functions refer to the specific functions described in the present invention).
- the processor 210 may process image or text data obtained from any other components of the human-computer interaction system 100.
- the computing device 200 in FIG. 2 only describes one processor, but it should be noted that the computing device 200 in the present invention may also include multiple processors.
- the memory 220 may store data/information obtained from any other components of the human-computer interaction system 100.
- the input/output 230 may be used to input or output signals, data or information.
- the input/output 230 may enable the user to communicate with the human-computer interaction system 100.
- the input/output 230 may include an input device and an output device.
- the communication port 240 may be connected to a network for data communication.
- the communication port 240 may be a standardized port or a specially designed port.
- Fig. 3 is a schematic diagram of exemplary hardware components and/or software components of an exemplary mobile device according to some embodiments of the present application.
- the terminal may be implemented by the mobile device 300.
- the mobile device 300 may include a communication platform 310, a display 320, a graphics processing unit (GPU) 330, a central processing unit (CPU) 340, an input/output 350, a memory 360, and a storage 390.
- the mobile device 300 may also include any other suitable components, including but not limited to a system bus or a controller (not shown in the figure).
- the mobile operating system 370 and one or more application programs 380 may be loaded from the storage 390 into the memory 360 so as to be executed by the central processing unit 340.
- the application program 380 may include a browser or any other suitable mobile application program for receiving and presenting prompt information or other related information from the server 110.
- the user interaction of the information flow can be implemented through the input/output 350 and provided to the server 110 and/or other components of the human-computer interaction system 100 through the network 120.
- Fig. 4 is a block diagram of an exemplary processing device according to some embodiments of the present application. As shown in FIG. 4, the system may include: an extraction module 410, a determination module 420, a response module 430, and an acquisition module 440.
- the extraction module 410 may be used to extract the associated features of the target object based on the interactive instruction for the target object.
- the associated feature includes a first feature corresponding to an interactive instruction and/or a second feature corresponding to historical data.
- the first feature includes at least one of a voice feature of the voice data in the interactive instruction and/or a text feature of the text data corresponding to the interactive instruction.
- the extraction module 410 may be used to preprocess the interaction instructions before extracting the associated features. See Figure 5 and related descriptions for more details.
- the determining module 420 may be used to determine a response strategy for the target object based on processing the associated features. In some embodiments, the determining module 420 may be used to process the associated features based on the model and determine a response strategy.
- the model may be the first classification model, the second classification model, or the third classification model. See Figures 5, 6, 8 and 10 and related descriptions for more details.
- the response module 430 may be used to determine and output response words for the target object based on the response strategy.
- the response words are output to the target object as response text and/or response speech.
- the obtaining module 440 may be used to obtain interactive instructions and models. For example, the obtaining module 440 may obtain the model through a training process. The obtaining module 440 may be used to obtain training samples.
- the training sample includes a first training sample, a second training sample, and a third training sample. See Figures 7, 9 and 11 and related descriptions for more details.
- Fig. 5 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- the process 500 may be executed by a processing device (for example, the processing device 112, the processor 210, or the central processing unit 340).
- Step 510 Extract the associated feature of the target object based on the interaction instruction for the target object, where the associated feature is related to at least one of the interaction instruction and historical data of the target object.
- step 510 may be performed by the extraction module 410.
- The target object refers to a person who can exchange information with a system or device (for example, a mobile device, a terminal device, etc.).
- the target object refers to the object that the system or device needs to respond to.
- Target objects may include device-associated objects (for example, users or communicators), debuggers, testers, implementation personnel, maintenance personnel, customer service personnel, and so on.
- the target object is a user to which the current terminal belongs.
- For example, the passenger user belongs to the passenger terminal, the driver user belongs to the driver terminal, and so on.
- the current terminal may refer to a device or system that performs human-computer interaction, for example, it may be a terminal that receives interactive instructions. It is understandable that when the target object is the user to which the current terminal belongs, the target object is actually the user who initiated the interactive instruction.
- the target object is a counterpart user in communication with the current terminal, for example, a driver user who communicates with the passenger terminal, or a passenger user who communicates with the driver terminal.
- Interactive instructions are instructions sent to the device.
- An interactive instruction for the target object may refer to an instruction sent to a system or device so that it can determine how to give a response to the target object.
- it may be an interaction instruction sent to the device by an associated object of the device (such as a user of the device).
- the associated objects can convey specific intentions to the device through interactive instructions (for example, praising the driver, praising oneself, complaining, etc.), so that the device or system can give a corresponding response.
- the interactive instruction can be obtained from the current terminal.
- the interactive instruction may be in the form of voice, text, video, image, facial motion, gesture, touch screen operation, etc., and any combination thereof.
- the voice can be one or any combination of Chinese, English, French, Japanese, etc.
- the associated feature may refer to the feature related to the target object, the sender of the interactive instruction, and/or the interactive instruction.
- the association feature may be related to at least one of the interaction instruction and the historical data of the target object.
- the associated feature includes the first feature corresponding to the interactive instruction. It can be understood that the first feature is a feature obtained based on an interactive instruction.
- the first feature may include at least one of the voice feature of the voice data in the interaction instruction, the text feature of the text data corresponding to the interaction instruction, and the image feature corresponding to the image data in the interaction instruction.
- The text data corresponding to the interactive instruction (from which the text feature is extracted) can be the text data contained in the interactive instruction itself, or the text data obtained by recognizing the voice data and/or other data (for example, image data) in the interactive instruction. For example, the voice data can be recognized by a speech decoder.
- the voice data can be multilingual voice data.
- The speech decoder can have a one-to-many or one-to-one relationship with languages; that is, one speech decoder may convert the voice data of multiple languages into text data, or one speech decoder may only decode the voice data of a certain language.
- the interactive instruction may include voice data
- the first feature may include a voice feature corresponding to the voice data and/or a text feature of text data corresponding to the interactive instruction.
- the voice may be multilingual
- the voice data may be data corresponding to multiple languages
- the voice feature may be the voice feature corresponding to the multilingual voice data.
- the interaction instruction may not include voice data
- the first feature may include the text feature of the text data corresponding to the interaction instruction
- the first feature may also include features corresponding to other data in the interactive instruction.
- the interactive instruction includes image data
- the first feature includes image features corresponding to the image data.
- the interaction instruction includes gestures, postures, and other actions
- the first feature may include screen gesture features, facial features, fingerprint features, and the like.
- Screen gesture features represent screen operation information in interactive instructions, such as operations such as sliding, turning pages, and touching.
- The facial features represent the facial information of the user in the interactive instructions.
- the processing device can obtain different interactive instructions according to different facial features.
- the facial features can also include pupil features, iris features, and the like.
- the fingerprint feature represents the fingerprint information of the user's finger.
- the processing device can obtain different interaction commands according to different fingerprint features.
- the voice feature includes one or a combination of audio features and energy features of the voice data.
- The audio feature of the voice data refers to the feature of the voice data in terms of its audio properties.
- the audio feature may include at least one of a fundamental frequency feature, a short-term energy feature, a short-term amplitude feature, a short-term zero-crossing rate feature, and the like.
- the fundamental frequency characteristic refers to the characteristic of the sound frequency in the speech data.
- the fundamental frequency corresponds to the frequency of vocal cord vibration and represents the pitch of the sound. The faster the vocal cord vibration, the higher the fundamental frequency.
- the fundamental frequency characteristics of speech data can be used to detect speech noise, special sound detection, gender discrimination, speaker recognition, parameter adaptation, and so on.
- the short-term energy feature refers to the average energy gathered by the sampling point signal in a short-term audio frame.
- For example, a continuous audio signal stream x is sampled into K sampling points, which can be divided into M short-term frames; the size of each short-term frame and of the window function is assumed to be N.
- The short-term energy of the m-th short-term frame x_m can then be calculated as E_m = \sum_{n=1}^{N} x_m^2(n).
- The short-time zero-crossing rate feature refers to the number of times the signal crosses the zero value within each frame. For a continuous speech signal with time on the horizontal axis, one can observe how often the time-domain waveform of the speech crosses the horizontal axis. For a discrete-time speech signal, a zero-crossing is said to occur when adjacent samples have different algebraic signs, so the number of zero-crossings, i.e., the zero-crossing rate, can be counted.
- the zero-crossing rate can reflect the frequency information of the signal to a certain extent.
- the short-time zero-crossing rate can be used to judge the unvoiced and voiced speech data. A high zero-crossing rate means unvoiced sound, and a low zero-crossing rate means voiced sound.
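- As an illustrative sketch (not part of the original disclosure), the short-term energy and short-time zero-crossing rate described above could be computed per frame as follows; the frame length, frame shift, and the random test signal are example assumptions:

```python
import numpy as np

def short_time_energy_and_zcr(signal, frame_len=400, hop=160):
    """Compute per-frame short-term energy E_m = sum(x_m(n)^2) and zero-crossing rate."""
    energies, zcrs = [], []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Short-term energy: sum of squared samples within the frame
        energies.append(np.sum(frame.astype(np.float64) ** 2))
        # Zero-crossing rate: fraction of adjacent sample pairs with different algebraic signs
        signs = np.sign(frame)
        zcrs.append(np.mean(signs[1:] != signs[:-1]))
    return np.array(energies), np.array(zcrs)

# Example: 1 second of 16 kHz audio (random noise used as a stand-in for voice data)
x = np.random.randn(16000)
energy, zcr = short_time_energy_and_zcr(x)
print(energy.shape, zcr.shape)
```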
- the energy feature of voice data refers to the energy distribution in the frequency domain of the voice data, and different energy distributions represent different voice features.
- the frequency domain is a coordinate system used to describe the frequency characteristics of the speech signal.
- the frequency domain diagram may display the energy value of the speech signal in each given frequency band within a frequency range.
- the energy feature includes at least one of Fbank feature and mel-frequency cepstrum (MFCC) feature.
- the MFCC feature refers to the feature of the voice signal obtained by the MFCC method. MFCC features have a good degree of discrimination and are used to identify different sounds. In some embodiments, MFCC features are commonly used for automatic speech and speaker recognition. Regarding the Fbank feature and its extraction, please refer to FIG. 17 and its related description for details, which will not be repeated here.
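- As an illustrative sketch, the Fbank (log mel filter-bank) and MFCC features could be extracted with an off-the-shelf audio library such as librosa; the library choice, file path, sampling rate, and filter counts are assumptions rather than part of the original disclosure:

```python
import librosa

# Load voice data (the path and the 16 kHz sampling rate are example assumptions)
y, sr = librosa.load("utterance.wav", sr=16000)

# Fbank feature: log mel filter-bank energies
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40)
fbank = librosa.power_to_db(mel)                     # shape: (40, num_frames)

# MFCC feature: cepstral coefficients derived from the mel spectrum
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)

print(fbank.shape, mfcc.shape)
```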
- In some embodiments, the voice features may further include Linear Prediction Coefficients (LPC), Perceptual Linear Predictive (PLP) coefficients, Tandem features, Bottleneck features, Linear Predictive Cepstral Coefficients (LPCC), formant features, Bark spectrum features, and so on.
- speech features can be extracted through algorithms or models.
- the algorithm corresponding to the voice feature type is used to extract the voice feature of the corresponding type.
- the MFCC feature is extracted through the triangular band-pass filter algorithm.
- In some embodiments, before extracting the voice features, the voice data may be pre-processed; the pre-processing includes at least one of framing processing, pre-emphasis processing, windowing processing, and noise processing.
- Framing processing is used to divide voice data into multiple voice segments, reducing the amount of data processed each time.
- In the framing process, the voice data can be divided according to a predetermined value or a predetermined range (for example, 10 ms to 30 ms per frame).
- A frame shift (offset) can be applied during framing, so that two adjacent frames overlap.
- If the voice data is a short sentence, framing may not be needed in some scenarios.
- Pre-emphasis processing is used to enhance the high-frequency part. Pre-emphasis can be achieved by passing the voice data through a high-pass filter.
- Windowing is used to eliminate signal discontinuity that may be caused at both ends of each frame. For example, multiply each frame by the Hamming window to increase the continuity between the left and right ends of the frame.
- The noise processing may involve adding random noise, which can mitigate processing errors and omissions for synthesized audio.
- Noise processing may also include noise reduction processing.
- the noise reduction processing can be achieved by noise reduction algorithms, which can include adaptive filters, spectral subtraction, Wiener filtering, and so on.
- If the voice data is collected in real time, noise processing may not be required in some scenarios.
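- A minimal sketch of the pre-processing steps described above (pre-emphasis, framing with overlap, and Hamming windowing); the pre-emphasis coefficient, frame length, and frame shift are example assumptions:

```python
import numpy as np

def preprocess(signal, sr=16000, frame_ms=25, hop_ms=10, alpha=0.97):
    """Pre-emphasis -> framing with overlap -> Hamming windowing."""
    # Pre-emphasis: boost the high-frequency part (a simple first-order high-pass filter)
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # Framing: e.g. 25 ms frames with a 10 ms shift so that adjacent frames overlap
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    num_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop:i * hop + frame_len] for i in range(num_frames)])

    # Windowing: multiply each frame by a Hamming window to smooth discontinuities at frame edges
    return frames * np.hamming(frame_len)

windowed_frames = preprocess(np.random.randn(16000))
print(windowed_frames.shape)   # (num_frames, frame_len)
```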
- the text features represent relevant information of the text data, including but not limited to: keyword features, semantic features, word frequency features, etc.
- the text features may be extracted through algorithms or models, for example, through LSTM, BERT, one-hot encoding, bag-of-words models, term frequency–inverse document frequency (TF-IDF) models, vocabulary models, etc.
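- For example, a TF-IDF text feature could be extracted with scikit-learn roughly as follows; the example corpus is hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical text data recognized from interactive instructions
corpus = [
    "please praise the driver",
    "praise me",
    "the driver service is great",
]

vectorizer = TfidfVectorizer()                       # bag-of-words weighted by TF-IDF
text_features = vectorizer.fit_transform(corpus)     # sparse matrix, one row per text

print(text_features.shape)
print(vectorizer.get_feature_names_out())
```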
- the historical data of the target object refers to the data generated by the target object in the past period of time, for example, the last week, the last month, or the last three days.
- the historical data may include, but is not limited to: one or more of online voice data, personal account data, user behavior data, or offline record data.
- Online voice data comes from the online voice of the target object, that is, the voice produced by the target object online, for example, online voice made in the past period of time.
- it may be a voice interaction instruction in which the target object requested the device to give a response in the past.
- it may be the voice data of the target object communicating with the communication user in the past.
- online voice data can be converted into text data as historical data. For example, after acquiring the historical online voice of the target object and performing voice recognition on it, the corresponding text is obtained, and the text is used as a kind of historical data.
- the personal account data comes from the account information of the target object to which the terminal belongs.
- the personal account data may include, but is not limited to: personality, occupation, the number of orders placed using the taxi application, reputation score, gender, account age, the number of historical orders, the pick-up and drop-off points in historical orders, the taxi time information in historical orders, etc.
- the target object behavior data may be data generated by historical operations or feedback of the target object. For example, it may be to obtain the evaluation feedback data of the target object's response to the historical push. For another example, it may be the evaluation information of the target object (for example, the driver or the passenger) to the correspondent (for example, the passenger or the driver). For another example, it may include the user's evaluation of historical orders, chat records with customer service, evaluation of customer service, evaluation of the system, evaluation of information pushed by the system, and other information.
- the offline recorded data may be data recorded by the terminal.
- it can be data recorded offline by the terminal.
- the historical data may also include other information, including but not limited to the user's historical input information, the user's geographic information, and the user's identity information.
- the user's historical input information includes the user's historical query information, video input information, voice input information, text input information, screen gesture operation information, unlocking information, and so on.
- the user's geographic information includes the user's home address, work address, activity range, etc.
- the user's identity information includes the user's age, work, hometown, height, weight, income, etc.
- the second feature may be a feature determined based on historical data, including a feature determined based on the historical data of the target object and a feature determined based on the historical data of the sender of the interactive instruction.
- the type of the second feature can be determined according to the type of historical data. For example, if the historical data contains voice data, the second feature contains the voice feature of the historical data.
- the extraction method of the second feature is similar to that of the first feature and will not be repeated here.
- time series behavior characteristics can also be extracted based on historical data
- the historical data of the target object or the sender of the interactive instruction can be converted into sequence characteristics in the order of time
- one or more specific behaviors in the historical data can be converted into sequence characteristics.
- Each specific behavior is represented by a vector in the sequence feature.
- Specific behaviors include service-related behaviors such as the number of orders placed and the degree of praise received. Therefore, when determining the response strategy based on the second feature, not only the actual behavior but also the time factor can be considered, so that the evaluation of the target object or the sender of the interactive instruction is more accurate, and, further, the determination of the response strategy is also more accurate.
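- A minimal sketch of turning historical behaviors into a time-ordered sequence feature with a recency weight, as described above; the record fields and the decay rate are hypothetical assumptions:

```python
import numpy as np

# Hypothetical behavior records: (days_ago, number_of_orders, praise_score)
history = [(30, 2, 0.6), (14, 5, 0.8), (3, 1, 0.9)]

def behavior_sequence(records, decay=0.05):
    """Convert behaviors into a time-ordered sequence of vectors, weighted by recency."""
    records = sorted(records, key=lambda r: -r[0])         # oldest first
    seq = []
    for days_ago, orders, praise in records:
        weight = np.exp(-decay * days_ago)                 # more recent -> larger weight
        seq.append(weight * np.array([orders, praise], dtype=np.float32))
    return np.stack(seq)                                    # shape: (num_behaviors, 2)

print(behavior_sequence(history))
```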
- the associated features can be represented by vectors.
- post-processing may be performed on the associated features, for example, normalization processing, etc. For more details about the normalization process, please refer to Figure 20 and its related descriptions.
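- As an illustrative sketch, a simple z-score normalization of the associated feature vector could look as follows (the feature dimensions are arbitrary example values):

```python
import numpy as np

def normalize(feature_vector, eps=1e-8):
    """Z-score normalization of an associated feature vector."""
    v = np.asarray(feature_vector, dtype=np.float32)
    return (v - v.mean()) / (v.std() + eps)

# Hypothetical associated feature: voice features concatenated with text/history features
associated = np.concatenate([np.random.randn(40), np.random.randn(16)])
print(normalize(associated)[:5])
```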
- Step 520 Determine a response strategy for the target object based on processing the associated features, where the response strategy is related to at least one of response content, response style, and response emotion. In some embodiments, step 520 may be performed by the determining module 420.
- the response strategy refers to a method and/or criterion for responding to interactive instructions, and the response strategy is related to at least one of response content, response style, and response emotion.
- the response content refers to the semantic content of the response to the interactive command.
- the same response content can be expressed in different language content.
- the response content is "the driver's service is very good”, which can be expressed as “the driver's service is very good”, “the driver's service is great”, “the driver's service is really good”, etc.
- the response style refers to the degree of response to interactive commands. For example, when complimenting the other party, there can be: strong praise (strong praise, high praise), normal praise, slight praise (slight praise, low praise), etc.
- Responding emotions refers to the emotions or moods associated with responding to interactive instructions. Response emotions can include loss, calmness, enthusiasm, joy, excitement, sadness, anger, irritability, pain, or passion.
- the associated features can be processed through the model to determine the response strategy for the target object.
- In some embodiments, the model may be composed of a multi-layer residual network and a multi-layer fully connected network, where the multi-layer residual network is built from convolutional neural networks; alternatively, the model may be composed of a multi-layer convolutional neural network and a multi-layer fully connected network.
- the models may be the first classification model, the second classification model, and the third classification model. For details, see the following text.
- the determination module 420 may process the associated features and determine the response content corresponding to the designated speech. For example, the determining module 420 may process the voice features based on the first classification model and determine whether the voice data contains a designated speech in any one of the multiple languages; if the voice data contains a designated speech, it determines the response content corresponding to that designated speech. For details, refer to Fig. 6 and its description, which will not be repeated here.
- the determining module 420 may determine the response emotion based on processing the associated features. For example, the determining module 420 may process at least one of the voice feature and the text feature based on the second classification model to determine the emotion of the voice data (the responded emotion); and based on the responded emotion, determine the response emotion. For details, refer to FIG. 13 and its description, which will not be repeated here.
- the determining module 420 may determine the response style based on processing the associated features. For example, the determining module may process at least one of the first feature and the second feature based on the third classification model to determine the response style. Refer to Figure 15 and its description for details, and will not be repeated here.
- the determination module 420 may determine a response strategy based on processing interaction instructions and historical data. For example, the determining module 420 may process at least one of the interaction instruction and historical data based on the response strategy model to determine the response strategy.
- the response strategy model may be a separate model, or any combination of the first classification model, the second classification model, and the third classification model.
- the determining module 420 can obtain information such as current weather and real-time road conditions. In some embodiments, the determining module 420 may adjust the response strategy according to information such as current weather and real-time road conditions. For example, when the weather is bad, the weight of "soothing" emotion in the response strategy (for example, responding to emotion) can be appropriately increased. For another example, when the weather is fine and the road conditions are smooth, the weight of the "happy" emotion in the response strategy (for example, responding to emotion) can be appropriately increased.
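- A minimal rule-based sketch of adjusting response-emotion weights according to weather and road conditions, as described above; the emotion labels, weight increments, and re-normalization step are assumptions:

```python
def adjust_emotion_weights(weights, weather, road_smooth):
    """Nudge response-emotion weights according to current conditions (illustrative rules)."""
    adjusted = dict(weights)
    if weather == "bad":
        adjusted["soothing"] = adjusted.get("soothing", 0.0) + 0.2   # soothe in bad weather
    elif weather == "fine" and road_smooth:
        adjusted["happy"] = adjusted.get("happy", 0.0) + 0.2         # happier when roads are smooth
    total = sum(adjusted.values())
    return {k: v / total for k, v in adjusted.items()}               # re-normalize to sum to 1

print(adjust_emotion_weights({"soothing": 0.3, "happy": 0.3, "calm": 0.4}, "bad", False))
```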
- the determining module 420 may compose a feature sequence from the target object's interactive instructions, the target object's historical data, the current weather, and the real-time road conditions, and input the combined feature sequence into an RNN-based embedding model to obtain an instruction representation vector. Further, features such as the instruction representation vector and information dimension data of the service platform are input into the response strategy prediction model to obtain the response strategy output by the response strategy prediction model.
- the embedding model and the response strategy prediction model can be obtained through joint training.
- the combined feature sequence is composed of feature combination values at several time points.
- the characteristic combination value at each time point is formed by the combination of the target object's interactive instruction, the target object's historical data, weather data, and real-time road condition data at that time point, and is multiplied by the time weight coefficient.
- the time weight coefficient can be different depending on the distance of the time point, and the weight coefficient of the time point closer to the current time can be larger.
- When forming the feature combination value, a transformation can be applied to the real-time road conditions according to the current weather, so that the feature value reflecting road congestion under extreme weather is reduced, thereby reducing the influence of such special data.
- the aforementioned transformation may be: reducing the weight of the real-time road condition to 0.01.
- the influence of different factors on the forecast results can be better reflected, especially the interrelationship between these factors.
- the impact of weather is related to the preceding and following time points, and the impact of real-time road conditions is likewise related.
- RNN-based processing can capture the relationship between preceding and following time points and make the response strategy more accurate.
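- A minimal PyTorch sketch of the RNN-based embedding step described above: per-time-point combined features are multiplied by time weight coefficients and fed to a GRU whose final hidden state serves as the instruction representation vector. The feature dimensions, weights, and the GRU choice are assumptions; the joint training with the response strategy prediction model is omitted:

```python
import torch
import torch.nn as nn

class InstructionEmbedder(nn.Module):
    """RNN-based embedding model: time-weighted feature sequence -> representation vector."""
    def __init__(self, feat_dim=32, hidden_dim=64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)

    def forward(self, feature_seq, time_weights):
        # feature_seq: (batch, T, feat_dim); time_weights: (batch, T), larger for recent points
        weighted = feature_seq * time_weights.unsqueeze(-1)
        _, h_n = self.rnn(weighted)            # final hidden state as the representation
        return h_n.squeeze(0)                  # (batch, hidden_dim)

# Hypothetical batch: 4 samples, 10 time points, 32-dim combined features per point
features = torch.randn(4, 10, 32)
weights = torch.linspace(0.1, 1.0, 10).expand(4, 10)   # weight grows toward the present
embedder = InstructionEmbedder()
instruction_vec = embedder(features, weights)
print(instruction_vec.shape)                   # torch.Size([4, 64])
```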
- Step 530: Based on the response strategy, determine the response words for the target object.
- step 530 may be performed by the response module 430.
- Response speech (response words) is the language information that the device or system outputs to the target object in response to the interactive instruction. It is understandable that response speech is a specific language output, and the language may contain information such as response emotion, response style, and/or response content. For example, a response phrase "The driver is great!" may be played full of passion (the response emotion), in an exaggerated manner (the response style), boasting that the driver's service is very good (the response content). It is understandable that one or more phrasings can express the same response strategy (for example, the same response content, response style, or response emotion); that is, there can be a one-to-one or one-to-many relationship between response strategies and response words.
- the response module 430 may determine response words based on the response strategy. In some embodiments, the response module 430 may determine the content that needs to be expressed in the response words according to the determined response content, and then determine the specific expression of the response content according to the response emotion and/or response style, including whether to add modal particles, degree words, or other words that embody emotions and styles, or the intonation of the output voice, etc.
- For example, the output response words are praising language, such as "you are good".
- If the response strategy specifies an exaggerated response style, degree words such as "very" are added to the response content, for example, "you are very good".
- If the response emotion is joyful, the output response words can carry some modal particles expressing joy, for example, "You are very good!" and so on.
- the database may store preset words for the same or different response content, response emotion, and/or response style, so that, according to the response strategy (response content, response emotion, and/or response style), the response words corresponding to the response strategy can be obtained from the database.
- the stored preset words can be customized and recorded in advance by the user (for example, the user on the driver side or the user on the passenger side, etc.), or can be preset by the developer in advance.
- the language or words corresponding to the response content, response emotion and/or response style can also be extracted from a public platform (for example, Wikipedia, etc.), and response words can be generated.
- the response words can also be generated through models or algorithms, for example, transformers, Bert models, and so on. For example, input the dictionary and designated words into the model, and output the response words.
- one or more preset words may be obtained from the storage device, or one or more preset words may be generated.
- the terminal or the response module 430 may automatically select a preset speech as the response speech according to a preset rule among a plurality of preset response speeches, and output the response speech.
- the terminal can randomly select a preset speech as the response speech.
- the terminal may use the user or user group's most frequently used preset speech as the response speech.
- the user group can be all users, all passengers, all drivers, and all users in the user's area (for example, a city, district, or a custom area, such as a circular area within 5 kilometers, etc.).
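- A minimal sketch of selecting a preset response speech from a database according to the response strategy, preferring the phrasing most frequently used by a user group; the keys, preset words, and usage log are hypothetical:

```python
from collections import Counter

# Hypothetical preset words keyed by (response content, response style, response emotion)
PRESET_WORDS = {
    ("praise_driver", "strong", "passion"): ["The driver is great!", "Best driver ever!"],
    ("praise_driver", "normal", "calm"): ["The driver's service is very good."],
}

# Hypothetical usage log for a user group (e.g., all passengers in one city)
USAGE_LOG = ["The driver is great!", "The driver is great!", "Best driver ever!"]

def select_response(content, style, emotion):
    candidates = PRESET_WORDS.get((content, style, emotion), [])
    if not candidates:
        return None
    counts = Counter(w for w in USAGE_LOG if w in candidates)
    # Prefer the most frequently used preset speech; fall back to the first preset
    return counts.most_common(1)[0][0] if counts else candidates[0]

print(select_response("praise_driver", "strong", "passion"))
```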
- the response module 430 may output response words to the target object as response text and/or response voice.
- the text corresponding to the response words is converted into speech and output to the target object.
- Whether to output the response voice or the response text may be determined according to the actual scene. For example, when outputting response words, if the current terminal is the driver's end and the driver's end is currently in a vehicle driving state, only the response voice may be output; in this case, outputting the response text is avoided so as not to distract the driver and cause driving safety problems. In addition, in this scenario, the response voice and the response text can also be output at the same time.
- When outputting response words, it can be detected whether the terminal is in an audio or video playback state; if so, the response text can be output; otherwise, one or more of the response text and the response voice can be output.
- The response words can be displayed in a preset display interface, and can also be displayed in the status bar or notification bar. For example, if the driver is in the vehicle driving state, the response words can be output on the current display interface; if the current terminal is in the audio or video playback state, the response words can be displayed in a small window in the status bar or notification bar.
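- A minimal rule-based sketch of choosing the output modality (response voice and/or response text) from the terminal state, following the rules described above; the state flags are hypothetical:

```python
def choose_output_modality(is_driver_terminal, is_driving, is_media_playing):
    """Decide how to present the response words (illustrative rules from the description)."""
    if is_driver_terminal and is_driving:
        return {"voice"}                  # avoid on-screen text that could distract the driver
    if is_media_playing:
        return {"text"}                   # do not interrupt audio/video playback
    return {"voice", "text"}              # otherwise either or both may be output

print(choose_output_modality(is_driver_terminal=True, is_driving=True, is_media_playing=False))
```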
- the semantics of the response voice and the response text may be the same or different, and may be specifically set according to the scene. For example, for the same response content "praising the driver for good service", the response voice can be "the driver is the most sunny" and the response text can also be "the driver is the most sunny", in which case the semantics of the two are the same; alternatively, the response text can be different, for example, "In the wind and rain, thank you for your hard work."
- the response module 430 may also output the response words to the target object in other ways.
- an image or video can be used as the response language to be output to the target object, for example, a video or picture expressing the actual content of the response language can be produced as the output.
- the target object can be the user or communication user of the current terminal.
- the interactive instruction or voice data may be directed to the user, or to the user of the communication partner.
- For example, the interactive instruction contains the designated words to praise the user himself or the counterpart user.
- For example, if the voice data sent by the driver's end includes a designated speech such as "praise me" or "praise the driver", the designated speech is directed at the driver himself; or, if the voice data from the driver's end contains the designated speech "praise the passenger", the designated speech is directed at the counterpart user of the current communication, that is, at the passenger's end.
- Similarly, if the voice data sent from the passenger's end contains the designated speech "praise the driver", the designated speech is directed at the counterpart user of the current communication, that is, at the driver's end.
- When the output of the response words is executed, the response words may be output to the own terminal (i.e., the current terminal) and/or to the terminal of the counterpart user.
- When the interactive instruction is directed at the user himself, the response words are directly output on the current terminal; or, when the interactive instruction is directed at the counterpart user of the current communication, the response words are output to the counterpart user's terminal, and the response words can also be output on the current terminal.
- Fig. 6 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- the process 600 may be executed by a processing device (for example, the processing device 112, the processor 210, or the central processing unit 340).
- Step 610 Extract the associated features of the target object based on the interactive instruction for the target object.
- step 610 may be performed by the extraction module 410.
- the interaction instruction of the target object includes voice data.
- the associated feature of the target object may include the voice feature of the voice data.
- the voice data can be any one or a combination of multiple languages, and the voice feature can be the voice feature of the multilingual voice data.
- the associated feature of the target object may further include: the text feature of the text data corresponding to the interactive instruction.
- Step 620 Process the voice features based on the first classification model, and determine whether the voice data contains a designated speech in any one of multiple languages. In some embodiments, step 620 may be performed by the determining module 420.
- The designated speech can be speech that contains (but is not limited to) specific words or sentences.
- The designated speech can be used to determine the semantic information of the response words.
- the designated words can be characteristic words or approximate semantic words that need to be included in the response words, or the target of the response words, etc.
- The designated speech can be determined according to actual needs. For example, for the scene of complimenting the service provider, the designated speech can include "praise", "compliment", "encourage", "reward", or "good", etc.; the designated speech can also include "passenger", "driver", "service provider", "service requester", "I", etc.
- The first classification model refers to a computational model implemented by a computing device; the first classification model is a model that determines whether the input contains the designated speech.
- the first classification model may process the voice features and output the classification recognition result to determine whether it contains the specified words.
- the first classification model can process the text features and output the classification recognition results to determine whether the specified words are contained.
- the first classification model can process the voice features and text features to determine whether it contains the specified words. See step 1730 for more details about the classification and recognition results.
- the first classification model can include two sub-classification models, which are respectively a classification model for processing speech features and a classification model for processing text features.
- the first classification model may also be a model for processing text features and voice features, and the voice features and text features can be input into the first classification model in the same form (for example, a vector or a normalized vector).
- the first classification model can be obtained through end-to-end training.
- the first classification model may have a one-to-one or one-to-many relationship with the language.
- all languages can correspond to the same first classification model, and for example, different languages correspond to different first classification models.
- the language of the speech data can be recognized (for example, by a speech decoder, etc.) before the speech features are input into the first classification model; further, the voice features are input into the first classification model corresponding to the recognized language to determine whether the voice data contains the specified words.
- multiple languages can correspond to the same first classification model, that is, the first classification model can process the voice features of multiple languages and determine whether the voice data contains the specified words of any one of the multiple languages.
- the first classification model or sub-classification model is a machine learning model.
- the first classification model or sub-classification model may be a classification model or a regression model.
- the types of the first classification model or sub-classification model include but are not limited to a neural network (NN), a convolutional neural network (CNN), a deep neural network (DNN), a recurrent neural network (RNN), or any combination thereof; for example, the first classification model or sub-classification model may be a model formed by a combination of a convolutional neural network and a deep neural network.
- the first classification model or sub-classification model may be composed of a multi-layer convolutional neural network CNN and a multi-layer fully connected network.
- the first classification model or sub-classification model can be composed of a multi-layer CNN residual network and a multi-layer fully connected network.
- the first classification model may be a 5-layer CNN residual network and a 3-layer fully connected network.
- the first classification model can construct a residual network on the CNN network to extract the hidden layer features of the voice data, then use the multi-layer fully connected network to map the hidden layer features output by the residual network, and obtain the multi-class recognition result through a softmax classification output.
- compared with a single fully connected network, the CNN network structure used in the first classification model can extract features so that, while the recognition accuracy is ensured, the scale of the network parameters is effectively controlled, avoiding the problem that the first classification model becomes too large to be effectively deployed on the terminal side.
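- As a non-limiting illustration, the following minimal PyTorch sketch shows such a structure: five convolutional layers with residual connections extract hidden-layer features from the fbank input, a three-layer fully connected stack maps them to classes, and a softmax produces the classification output. The channel counts, layer sizes, and input shape are assumptions for illustration only, not the exact network of this application.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One convolutional layer with a skip connection over the feature maps."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(x + self.bn(self.conv(x)))

class KeywordClassifier(nn.Module):
    """5 convolutional (residual) layers followed by 3 fully connected layers."""
    def __init__(self, n_classes=2, channels=32):
        super().__init__()
        self.stem = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.blocks = nn.Sequential(*[ResidualBlock(channels) for _ in range(4)])
        self.pool = nn.AdaptiveAvgPool2d(1)          # collapse time/frequency axes
        self.fc = nn.Sequential(
            nn.Linear(channels, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, n_classes),                # softmax applied at inference time
        )

    def forward(self, x):                            # x: (batch, 1, frames, 40)
        h = self.pool(self.blocks(self.stem(x))).flatten(1)
        return self.fc(h)

# logits -> class probabilities ("contains the specified words" vs. not)
probs = torch.softmax(KeywordClassifier()(torch.randn(2, 1, 100, 40)), dim=-1)
```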
- the sub-classification model used to process text features may also be a text processing model, for example, a Bert model.
- the first classification model may be obtained by offline training in advance and deployed on the terminal device.
- the first classification model can be obtained by offline training in advance, and is deployed and stored in a storage device or deployed on the cloud.
- the terminal device has access rights to the storage device or the cloud.
- the first classification model can also be obtained by online training based on current data in real time. For details about the training of the first classification model, refer to the related description in FIG. 7, which will not be repeated here.
- the determining module 420 may also determine whether the voice data contains the specified words of any one of multiple languages in other ways. For example, the text features may be processed through rules to determine whether the specified words are contained. For another example, the result of processing the voice features based on the first classification model and the result of processing the text features based on the rules may be fused (e.g., by weighted summation, weighted averaging, etc.) to finally determine whether the interactive instruction contains the specified words.
- Step 630 in response to the voice data containing the designated speech, determine the response content corresponding to the designated speech. In some embodiments, step 630 may be performed by the response module 430.
- when the recognition result of the first classification model indicates that the voice data contains the specified words, the response content corresponding to the specified words is determined, and the response words are further determined and output based on the response content.
- the response content can be semantic information.
- the response content can be a specified phrase or content similar to the meaning of the specified phrase.
- the same response content can be expressed in multiple languages, that is, corresponding to multiple response words.
- the designated speech technique may correspond to one or more response words, that is, the specified speech technique and its corresponding response words may have a one-to-one or one-to-many relationship.
- the corresponding response words can be the same or different.
- the specified words and their corresponding response words may be stored in a database or a storage device, and the response words corresponding to the specified words may be obtained from the storage device.
- the response emotion or response style can also be determined, so as to output response words in combination with the response content, response emotion or response style.
- for the response emotion and the response style, see FIG. 8, FIG. 10 and related descriptions.
- for more information about obtaining response words, please refer to step 530 and its related descriptions.
- Fig. 7 is an exemplary flowchart of the first classification model training according to some embodiments of the present application.
- the process 700 may be executed by a processing device (for example, the processing device 112, the processor 210, or the central processing unit 340).
- Step 710 Obtain multiple first training samples.
- step 710 may be performed by the acquisition module 440.
- the first classification model may be trained based on a plurality of first training samples.
- Each first training sample may include voice data, that is, first sample voice data.
- the first sample voice data may be multilingual voice data.
- the first sample voice data may be English voice data, Japanese voice data, Chinese voice data, Korean voice data, etc., which are not exhaustive.
- the plurality of first training samples include positive samples and negative samples.
- the positive sample is the first sample of speech data related to the specified speech.
- the positive sample is the first sample speech data carrying the specified speech, or the first sample speech data carrying the semantic similar to the specified speech.
- the negative sample is the first sample of speech data that is not related to the specified speech, for example, does not contain the specified speech or does not contain the same meaning as the specified speech.
- positive and negative samples can be labeled; for example, positive samples are labeled 1 and negative samples are labeled 0.
- the output result of the first classification model obtained by training can be a probability between 0 and 1, or a classification result indicating whether the specified words are contained.
- for the classification recognition results, please refer to FIG. 17 and related descriptions.
- the first training sample can be obtained from a storage device or a database. It is also possible to obtain historical data from the service platform, client, etc. as the first training sample.
- a sample speech recognition result of the first sample speech data may be obtained first, and the sample speech recognition result is used to indicate whether the first sample speech data is related to a specified speech. Further, the first sample voice data can be identified or positive and negative samples can be classified according to the result of sample speech recognition.
- the sample speech recognition result may be a sample text recognition result, or a label obtained by manual annotation, or a combination of the two.
- the first sample voice data can be converted into the first sample text data based on a voice decoder (also called a voice converter); further, the first sample text data can be recognized or analyzed to determine whether the first sample text data is related to the specified words, so as to determine the label of the first sample voice data as a positive or negative sample. For example, whether it is related to the specified words is determined by means of keyword matching or text similarity.
- the text similarity between the first sample text data and the text of the specified words is calculated by text matching (for example, Euclidean distance, etc.); if the text similarity reaches (is greater than, or greater than or equal to) a preset similarity threshold, the first sample text data is a positive sample; otherwise, the first sample text data is a negative sample.
- the similarity threshold is not particularly limited, and may be 80%, for example.
- the character standard of the first sample text data can also be calculated, and the character standard can be used as an evaluation criterion for calculating text similarity.
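- A minimal sketch of this labeling step is shown below; it assumes a simple character-level similarity (difflib) stands in for the text-matching metric and uses an illustrative 0.8 threshold, rather than the exact matching method of this application.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.8   # assumption: the "80%" similarity threshold mentioned above

def label_sample(sample_text: str, specified_phrases: list[str]) -> int:
    """Return 1 (positive sample) if the decoded text matches any specified phrase,
    otherwise 0 (negative sample). Keyword containment is checked first, then a
    character-level similarity stands in for the text-similarity step."""
    for phrase in specified_phrases:
        if phrase in sample_text:                        # keyword matching
            return 1
        similarity = SequenceMatcher(None, sample_text, phrase).ratio()
        if similarity >= SIMILARITY_THRESHOLD:           # text similarity matching
            return 1
    return 0

# e.g. label_sample("please praise the driver", ["praise me", "praise the driver"]) -> 1
```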
- the first sample voice data may be multilingual voice data.
- the first sample voice data may be converted into the first sample text data based on the voice converter corresponding to the language of the first sample voice data, and it is further determined whether the specified words are contained, that is, whether the first sample voice data is a positive sample or a negative sample.
- when the decoding accuracy of the speech decoder cannot meet the preset recognition requirements, or the accuracy of determining positive and negative samples through text similarity or keyword matching is low, manual labeling can also be combined to complete the classification. For example, the text similarity results, the unsuccessfully recognized data, or the first sample speech data with low recognition accuracy can be output on the screen so that users can verify or correct (label) the automatic classification results; the result of manual labeling is then used as the sample label.
- the ratio of positive samples and negative samples can be controlled.
- the ratio of positive samples to negative samples can be controlled to 7:3.
- the positive samples and the negative samples may also be screened so that the ratio of the positive samples to the negative samples is within a preset ratio range.
- the first classification model can handle text features.
- the first training sample may also include first sample text data.
- the positive samples are related to the specified speech, and the negative samples are not related to the specified speech.
- Step 720 Train an initial first classification model based on the multiple first training samples to obtain the first classification model.
- step 720 may be performed by the acquisition module 440.
- the first classification model may be trained through various methods based on the first training sample, and the parameters of the initial first classification model are updated to obtain the trained first classification model.
- Training methods include but are not limited to: the gradient descent method, the least squares method, variable learning rates, cross-entropy loss, stochastic gradient descent, and cross-validation. It is understandable that after the training with the positive samples and the negative samples is completed, the obtained first classification model has the same network structure as the initial first classification model.
- the voice features of the positive and negative samples can be extracted, and then the voice features of the positive and negative samples are used for model training. It should be understood that the manner of performing speech feature extraction for the positive and negative samples is the same as that of the foregoing step 610, and will not be repeated here.
- the positive sample data and the negative sample data can be mixed for training according to a certain ratio, for example, a ratio of 7:3. In some embodiments, this can be achieved by means of whole-sentence training.
- for example, the first sample voice data is the voice data of a complete sentence, and so on.
- when a preset condition is met, the training ends.
- the preset condition may be that the result of the loss function converges or is less than a preset threshold, or that the number of training epochs reaches a threshold.
- the parameters in the initial first classification model can be initially assigned, and then the positive and negative samples are used for training; the parameters of the initial first classification model are adjusted according to the accuracy of the output results for the positive and negative samples, and the training process is repeated multiple times until parameters with a higher classification accuracy are finally obtained, which are used as the parameters of the first classification model.
- a test set can be constructed and used to test the classification result of the first classification model.
- the real performance of the first classification model can be evaluated by calculating the accuracy rate and the false recognition rate of the prediction result, and based on the real performance, the parameters of the first classification model can be adjusted, or further training processing can be determined.
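- The following is a minimal sketch of such an evaluation on a test set, assuming `model` returns 1 when it predicts that the specified words are contained; the metric definitions (accuracy, and false recognition rate as false alarms over negative samples) are illustrative assumptions.

```python
def evaluate(model, test_samples):
    """Compute accuracy and false-recognition (false-alarm) rate on a test set.
    `test_samples` is an iterable of (features, label) pairs with label 1/0."""
    correct = false_alarms = negatives = 0
    total = 0
    for features, label in test_samples:
        pred = model(features)
        total += 1
        correct += int(pred == label)
        if label == 0:
            negatives += 1
            false_alarms += int(pred == 1)
    accuracy = correct / max(total, 1)
    false_recognition_rate = false_alarms / max(negatives, 1)
    return accuracy, false_recognition_rate
```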
- the first classification model can have a one-to-one or many-to-many relationship with a language.
- for example, training is performed with the first training samples of the single language corresponding to the first classification model; as another example, training is performed with the first training samples of the multiple languages corresponding to the first classification model.
- the acquisition module 440 can verify, test, and update the first classification model. For details, refer to steps 920, 1120 and related descriptions.
- Fig. 8 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- the process 800 may be executed by a processing device (for example, the processing device 112, the processor 210, or the central processing unit 340).
- Step 810 Extract the associated features of the target object based on the interactive instruction for the target object.
- step 810 may be performed by the extraction module 410.
- the interaction instruction may include voice data
- the associated feature may include at least one of a voice feature of the voice data and a text feature of the text data corresponding to the interaction instruction.
- Step 820 Process at least one of the voice feature and the text feature based on the second classification model, and determine the respondent emotion of the voice data. In some embodiments, step 820 may be performed by the determining module 420.
- the responded emotion refers to the emotion carried by the interactive instruction.
- the types of emotions that are responded to may include: loss, calm, enthusiasm, passion, joy, sadness, pain, comfort, excitement, etc.
- the responding emotion can be obtained based on the processing of the interactive instruction.
- the second classification model refers to a computing model implemented by a computing device, and the second classification model is a model for determining the response emotion of an interactive command.
- the second classification model can be a binary classification model or a multi-classification model.
- For the type of the second classification model refer to the first classification model, that is, step 620 and related descriptions.
- the first classification model and the second classification model type can be the same or different.
- the second classification model can process speech features to determine the responding emotion.
- the second classification model can process the text features and determine the response sentiment.
- the second classification model can process voice features and text features to determine the responded emotion. It is understandable that, similar to the first classification model, if the second classification model processes both speech features and text features, the second classification model can include two sub-classification models, which are respectively a classification model for processing speech features and a classification model for processing text features.
- the second classification model can be obtained through end-to-end training. For determining the response emotion based on the output result of the second classification model, refer to the related description in FIG. 20 or 21.
- the second classification model can be obtained through training. Regarding the training of the second classification model, refer to the related description in FIG. 9, and details are not repeated here. Regarding the deployment of the second classification model, it may be similar to the first classification model, see step 720 and related descriptions.
- a first emotion may be obtained by processing the voice feature and a second emotion may be obtained by processing the text feature, and the responded emotion may be determined based on the first emotion and the second emotion.
- the first emotion refers to the emotion determined based on the voice feature.
- the first emotion may be obtained based on the processing of the voice feature by the second classification model. It is understandable that when the second classification model determines the first emotion based on the speech features, in addition to the semantic information of the language in the speech data, the intonation information or tone information in the speech data can be combined to make the determined emotion more accurate.
- the second emotion refers to the emotion determined based on the text feature.
- the type of the second emotion may be the same as or different from the type of the first emotion.
- the text features include keyword features.
- the second emotion may be determined based on the keyword features.
- the types of the first emotion and the second emotion may include, but are not limited to, loss, peace, enthusiasm, passion, joy, sadness, pain, comfort, excitement, etc.
- the keyword features include emotion-related words
- the determining module 420 may determine the second emotion based on the emotion-related words.
- Emotion related words refer to words that can indicate emotions in interactive instructions.
- emotion-related words may include, but are not limited to: one or more of modal particles and degree words.
- modal particles can include: “please”, “ba”, “ah”, “Ma”, etc.
- degree words can include, but are not limited to: "very", "extremely", "relentless", etc., which are not exhaustive.
- emotion-related words in the text data can be identified, and the second emotion can be determined according to the emotion-related words. For example, based on preset rules and based on the recognition result of the emotion-related words, the emotion corresponding to the recognition result may be used as the second emotion.
- Different emotion-related words may have a mapping relationship with emotions, and the mapping relationship may be manually preset and stored in a storage device. For example, the emotion corresponding to "ah" may be preset as "joy", the emotion corresponding to "ma" as "sad", and so on.
- taking the exaggeration (praise) scene as an example, the voice data sent by the user is converted into text data; if the content is "Can you praise me?", the second emotion is sad; if the content is "Praise me", the second emotion is joy.
- the emotion corresponding to the co-occurrence of a degree word and "ah" may be "excited", and the emotion corresponding to the co-occurrence of a degree word and "ma" may also be "sad".
- an emotion score may be preset for each emotion-related word, that is, the scores of the emotion-related word corresponding to different emotions. All emotion-related words in the text data can be identified, their emotion scores can be summed or weighted, and the second emotion can be determined based on the calculated value.
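- As an illustration, the following sketch accumulates preset emotion scores of the emotion-related words found in the text and takes the highest-scoring emotion as the second emotion; the word-to-score mapping shown is an assumed example, not a mapping given in this application.

```python
# Assumed, illustrative mapping of emotion-related words to per-emotion scores.
EMOTION_WORD_SCORES = {
    "please": {"sad": 1.0},
    "ah":     {"joy": 1.0},
    "ma":     {"sad": 1.0},
    "very":   {"excited": 0.5},
}

def second_emotion(text_tokens, default="calm"):
    """Sum the scores of all emotion-related words found in the text and return
    the emotion with the highest accumulated score."""
    totals = {}
    for token in text_tokens:
        for emotion, score in EMOTION_WORD_SCORES.get(token, {}).items():
            totals[emotion] = totals.get(emotion, 0.0) + score
    return max(totals, key=totals.get) if totals else default

# second_emotion(["can", "you", "praise", "me", "please"]) -> "sad"
```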
- the second emotion may also be determined in other ways.
- text features can be input into a text processing model (for example, Bert, etc.) to determine the second emotion.
- the first emotion and the second emotion can be expressed numerically, and the two values can then be combined or weighted (for example, by weighted averaging) to obtain a weighted value; the responded emotion is then determined based on the weighted value.
- different emotions can correspond to different numerical values or numerical ranges, for example, excitement is 2, joy is 1, pain is -1, and so on.
- the value can be set in advance according to requirements or rules.
- the second classification model can output the probability value corresponding to the first emotion, and the text processing model can output the probability value corresponding to the second emotion; the probability values can be weighted (for example, multiplied by or added to the weights), and the emotion with the highest weighted probability value is regarded as the responded emotion.
- the first emotion or the second emotion can also be directly regarded as the responded emotion, or the responded emotion can be determined manually.
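- A minimal sketch of the probability-weighting approach described above is shown below, assuming both the voice branch and the text branch output per-emotion probabilities and that the weights are illustrative values.

```python
def fuse_emotions(first_probs, second_probs, w_first=0.6, w_second=0.4):
    """Weight the per-emotion probabilities from the voice branch (first emotion)
    and the text branch (second emotion), then return the emotion with the
    highest weighted score as the responded emotion."""
    emotions = set(first_probs) | set(second_probs)
    scores = {e: w_first * first_probs.get(e, 0.0) + w_second * second_probs.get(e, 0.0)
              for e in emotions}
    return max(scores, key=scores.get)

# fuse_emotions({"joy": 0.7, "sad": 0.3}, {"joy": 0.4, "sad": 0.6}) -> "joy"
```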
- Step 830 Determine the response emotion based on the responded emotion.
- step 830 may be performed by the determining module 420.
- the response emotion and the responded emotion may be similar, the same, or opposite.
- for example, if the responded emotion is joy, the response emotion can be joy or excitement.
- for another example, if the responded emotion is loss, the response emotion can be calm or joy.
- the response emotion may only include positive emotions to appease the user's negative emotions.
- the corresponding relationship between the responded emotion and the response emotion may be preset.
- the responded emotion and the response emotion can have a one-to-one or many-to-many relationship.
- the preset method can be determined based on rules, and can also be determined or optimized based on historical feedback data. For example, according to the user's feedback information on the response words.
- the corresponding relationship between the response emotion and the response emotion may be stored in the storage device in advance. For example, it is stored in the memory of the terminal, or in other storage locations readable by the terminal, such as the cloud, which is not limited. Based on the response emotion, the response emotion can be obtained from the storage device.
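- A minimal sketch of such a preset correspondence table is shown below; the entries are assumed examples, and in practice the table would be configured in advance and stored on the terminal, in another readable storage location, or in the cloud.

```python
# Assumed, illustrative correspondence between the responded emotion and the
# response emotion (one-to-many entries are allowed).
RESPONSE_EMOTION_MAP = {
    "joy":     ["joy", "excitement"],
    "loss":    ["calm", "joy"],
    "sadness": ["comfort"],
}

def pick_response_emotion(responded_emotion: str, default: str = "calm") -> str:
    """Look up the preset table and return one candidate response emotion."""
    return RESPONSE_EMOTION_MAP.get(responded_emotion, [default])[0]
```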
- the response module 430 may determine and output response words based on the response emotion. Refer to step 530 and related descriptions for determining response words based on response emotions.
- Fig. 9 is an exemplary flowchart of training a second classification model according to some embodiments of the present application.
- the process 900 may be executed by a processing device (for example, the processing device 112, the processor 210, or the central processing unit 340).
- Step 910 Obtain multiple second training samples.
- step 910 may be performed by the acquisition module 440.
- the second training sample may include: one or more of the second sample speech data and the second sample text data, and the corresponding emotion label.
- the second training sample may include the second sample speech data and the corresponding emotion label.
- the second training sample may include the second sample speech data, the second sample text data, and the corresponding emotion label.
- the emotion label is the emotion corresponding to the second sample speech data or the second sample text data.
- the emotion label may be a one-hot label.
- the second sample text data is text data corresponding to the second sample voice data.
- the second sample text data is obtained by performing text recognition on the second sample voice data.
- the second sample voice data is obtained based on machine conversion or manual reading of the second sample text data.
- the second sample voice data may come from real online voice data; and/or, it may also come from customized data.
- the second sample voice data is generated by manual reading.
- the second sample text content and the corresponding tone standard can be formulated, and the text is then read aloud manually with emotion to obtain second sample voice data with different tones and emotions.
- the length of the text corresponding to the second sample voice data or the second sample text data is generally not too long (for example, the number of characters or words in a single input is less than a threshold value, which may be 20, 10, etc.), because overly long text makes the speech too long, which may in turn cause greater fluctuations in tone and make the environmental noise more random and complicated.
- the second training sample can be obtained from a storage device or a database. It is also possible to obtain historical data from the service platform, the client, etc. as the second training sample.
- Step 920 Train the initial second classification model based on the multiple second training samples to obtain the second classification model.
- step 920 may be performed by the acquisition module 440.
- the second classification model may be trained through various methods based on the second training sample, and the parameters of the initial second classification model are updated to obtain the trained second classification model.
- the training method may be similar to that of the first classification model, see step 720 and related descriptions. It is understandable that after the training with the second training samples is completed, the obtained second classification model has the same network structure as the initial second classification model.
- feature extraction can be performed on the second sample voice data or/and the second sample text data, and the second classification model can be trained based on the extracted voice features and/or text features.
- if the second classification model is used to process voice features, the second training sample used for training includes the second sample speech data and its emotion label. If the second classification model is used to process text features, the second training sample used for training includes the second sample text data and the emotion label. If the second classification model is used to process both voice features and text features, the second training sample used for training includes the second sample voice data, the second sample text data and the emotion label, and the training is performed in an end-to-end manner.
- a whole-sentence, variable-length training method can be adopted: the features extracted from one sentence are used as the input of the classifier to obtain the emotion output by the second classification model, the parameters of the second classification model are then adjusted based on the difference between the output emotion and the emotion label, and finally a second classification model with higher classification accuracy is obtained.
- model verification can also be performed on the current training model.
- Model verification is divided into two processes: test environment construction and model testing. The test environment is built to check whether the current model can be successfully built and run normally on different terminals, such as different brands of mobile phones. Therefore, the test needs to be conducted offline according to the real scene.
- the test process can include but is not limited to the following two test methods.
- in the first test method, real people test the model in real time multiple times, and the accuracy of the recognition results is then counted.
- the advantage of this test method is that it can better simulate user behavior in real scenarios, and the reliability of the test is higher.
- the second test method is that a real person records a test set in a real scene. One or more test sets can be recorded as needed, which can be reused, with lower cost, and the objective validity of the test can be guaranteed to a certain extent.
- Fig. 10 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- the process 1000 may be executed by a processing device (for example, the processing device 112, the processor 210, or the central processing unit 340).
- Step 1010 Extract the associated features of the target object based on the interactive instruction for the target object.
- step 1010 may be performed by the extraction module 410.
- the associated feature includes at least one of the first feature corresponding to the interactive instruction and the second feature corresponding to the historical data of the target object. Regarding the associated feature and its extraction, refer to step 510, which will not be repeated here.
- Step 1020 Process at least one of the first feature and the second feature based on the third classification model, and determine a response style for the target object. In some embodiments, step 1020 may be performed by the determining module 420.
- the third classification model refers to a calculation model implemented by a computing device, and the third classification model is a model that determines the response style of the target object.
- the third classification model can be a multi-class or two-class model.
- the first classification model can be the same as or different from the third classification model.
- the third classification model can be obtained through training.
- For the training of the third classification model refer to the related description in FIG. 11 for details.
- For the deployment of the third classification model refer to the deployment of the first classification model, and refer to step 620 and related descriptions.
- the processing device may process at least one of the first feature and the second feature based on the third classification model to determine the response style.
- the first feature and the second feature are input to the third classification model, and the response style is output.
- the first feature and the second feature can be weighted respectively (for example, the weight of the first feature is greater than the weight of the second feature).
- the first feature or the second feature is input to the third classification model, and the response style is output.
- the first feature and the second feature refer to step 510 for details.
- the processing device may process the text feature of the text data corresponding to the interactive instruction, and process at least one of the first feature and the second feature based on the third classification model to determine the response style.
- the first style is a style determined based on the first feature and/or the second feature. For example, at least one of the first feature and the second feature is processed based on the third classification model to determine the first style.
- the second style is a style obtained based on text feature processing.
- the second style may be the same as or different from the first style; styles include but are not limited to: exaggerated, normally exaggerated, slightly exaggerated, and the like.
- the text features can be processed based on models, algorithms, or rules to obtain the second style. For example, whether the text contains style-related keywords (such as "very", "extremely", etc.) is identified, and the second style is further determined based on the keywords. For another example, the text features are processed through a text processing model (for example, Bert, DNN, etc.), and the second style is output.
- the response style may be determined based on at least one of the first style and the second style. For example, any one of the first style or the second style may be determined as the response style. For another example, if the first style is different from the second style, the first style can be used as the response style.
- fusion processing may be performed on the first style and the second style to determine the response style.
- different styles can correspond to different scores (for example, exaggerated corresponds to 3, normally exaggerated corresponds to 2, etc.); different weights can be set for the first style and the second style, the weights and scores of the two styles can be merged, and the response style can be determined based on the fusion result.
- the response style can also be determined by other fusion methods, which is not limited in this embodiment.
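- The following sketch illustrates one possible score/weight fusion of the first style and the second style; the style scores and weights are assumptions for illustration, not values fixed by this application.

```python
# Assumed, illustrative score/weight fusion of the two style estimates.
STYLE_SCORES = {"exaggerated": 3, "normally exaggerated": 2, "slightly exaggerated": 1}

def fuse_styles(first_style: str, second_style: str,
                w_first: float = 0.6, w_second: float = 0.4) -> str:
    """Weight the two style scores, then map the fused score back to the
    nearest named style."""
    fused = w_first * STYLE_SCORES[first_style] + w_second * STYLE_SCORES[second_style]
    return min(STYLE_SCORES, key=lambda s: abs(STYLE_SCORES[s] - fused))

# fuse_styles("exaggerated", "slightly exaggerated") -> "normally exaggerated"
```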
- Fig. 11 is an exemplary flowchart of training a third classification model according to some embodiments of the present application.
- the process 1100 may be executed by a processing device (for example, the processing device 112, the processor 210, or the central processing unit 340).
- Step 1110 Obtain multiple third training samples.
- step 1110 may be performed by the obtaining module 440.
- the third training sample can be derived from real data (for example, historical data), or can be derived from formulated data.
- the developer can formulate sample data and input it into the terminal so that the terminal can train the third classification model.
- the third training sample includes sample interaction instructions for the sample target object, sample history data of the sample target object, and corresponding style labels.
- the method and content of the sample historical data and the aforementioned historical data can be the same, and will not be repeated here.
- the style label represents the sample response style for the sample target object.
- the sample style label may be a one-hot label.
- the style label may be determined based on the reputation data and feedback data of the sample target object. Specifically, determining the style label includes: obtaining feedback data of the sample target object on the sample response words, where the sample response words are determined based on the sample interaction instruction; obtaining the reputation data of the sample target object; and determining the style label based on the reputation data and/or the feedback data.
- Feedback data refers to data related to the sample target object's reaction to the sample response words.
- the reaction includes, but is not limited to: positive (or favorable), negative (or unfavorable), a positive evaluation, a negative evaluation, etc.
- the reaction can be used to directly characterize the feedback data. For example, if the sample target object gives no feedback, the feedback data can be set to positive by default; if the feedback information is a positive reaction or a positive evaluation, the feedback data is positive, and so on.
- Reputation data refers to data related to the credit status of the sample target object.
- Reputation data can be determined based on historical data.
- the reputation data can be specifically expressed as a reputation score, and the calculation method of the reputation score is not repeated here. Exemplarily, when the reputation score reaches (is greater than, or greater than or equal to) 80 points, the sample target object is a user with a high reputation, the reliability of the feedback information is high, and the reputation score can also serve as a reference that assists the developer in manually annotating the sample data.
- the reputation data and/or feedback data may be associated with the response style, and based on the corresponding relationship, the style label may be determined.
- the reputation data and feedback data can also be processed through the model to determine the style label, where the model can be a DNN model, a CNN model, an RNN model, etc.
- the style label may also be determined based on manual evaluation of reputation data and feedback data.
- the terminal can obtain feedback data of the sample target object for historical response speech, and then output the reputation data and feedback data of the sample target object, and receive manual evaluation data for the reputation data and feedback data.
- the manual evaluation data is used to indicate style label.
- when the style label is obtained based on the developer's manual labeling, this kind of manual labeling is implemented based on the feedback data and reputation data output by the terminal, which helps the developer complete the labeling quickly and reduces labor time and cost as much as possible.
- the third training sample can be obtained from a storage device or a database. It is also possible to obtain historical data from the service platform, client, etc. as the third training sample.
- Step 1120 Train an initial third classification model based on the multiple third training samples to obtain a third classification model.
- step 1120 may be performed by the acquisition module 440.
- the third classification model may be trained through various methods based on the third training sample to update the parameters of the initial third classification model to obtain a trained third classification model.
- Training methods include but are not limited to: calculating the loss and updating the parameters based on the gradient descent method, the least squares method, cross-entropy loss, cross-validation, variable learning rates, etc. It is understandable that after the training with the third training samples is completed, the obtained third classification model has the same network structure as the initial third classification model.
- the features of the third training sample (for example, voice features, text features) can be extracted, and then the features of the third training sample can be used for model training.
- when a preset condition is met, the training ends.
- the preset condition may be that the result of the loss function converges or is less than a preset threshold, or that the number of training epochs reaches a threshold.
- an end-to-end training method is used, and the feature extracted based on the third training sample is used as input, and the style recognition result is output. Then, the difference between the output style recognition result and the style label is used to adjust the parameters of the initial third classification model, and finally a third classification model with higher classification accuracy is obtained.
- real-time data can also be used to update the third classification model.
- the terminal may also obtain operation information for the response words, so that the third classification model is updated by using the response words and the operation information.
- the operation information may be the operation information that the target object evaluates or feeds back on the response speech, and the operation information may also be used as a third training sample to update the third classification model in real time.
- the third classification model can also be tested and verified, which is similar to the second classification model and will not be repeated.
- Fig. 12 is a schematic diagram of human-computer interaction according to some embodiments of the present application.
- in this scenario, the interactive instruction for the target object can be issued by the driver-side user, and the target object can be the driver.
- the driver-side user can click the function control 1201 in the driver-side display interface of the taxi APP to enter the exaggeration interface, then the terminal can display the interface as shown in FIG. 12B.
- Fig. 12B is a display interface with exaggerated function. On the display interface, the user at the driver end can make a voice.
- the terminal collects real-time voice data, that is, receives an interactive instruction. Later, after the terminal collects the voice data, it can determine whether the collected voice data contains the specified speech. Then, if it is recognized that the real-time voice data from the driver's end user includes one of "praise the driver” or "praise me", the display interface as shown in FIG. 12C can be displayed on the terminal. As shown in Figure 12C, the response word 1203 for "praising me” is displayed on the current display interface, specifically: "in the wind and rain, thank you for your hard work to pick me up".
- the driver-side user can also click the boast control 1202 to trigger the boast function, and then display the interface as shown in FIG. 12C, which will not be described in detail.
- the function control 1201 can also prompt the driver of a newly received boast.
- Fig. 13 is a schematic diagram of human-computer interaction according to some embodiments of the present application.
- a scenario where a user on the driver side brags is taken as an example.
- the driver-side user can click the function control 1301 in the driver-side display interface of the taxi APP to enter the exaggeration interface, and then the terminal can display the interface as shown in FIG. 13B.
- Figure 13B is the display interface of the exaggeration function.
- the driver-side user can make a voice, and accordingly, the terminal collects the real-time voice data "praise me" (or "praise the driver"), that is, an interactive instruction is received.
- after the terminal collects the voice data, it can determine whether the collected voice data contains the specified words. Then, if it is recognized that the real-time voice data from the driver-side user includes one of "praise the driver" or "praise me", the display interface as shown in FIG. 13C can be displayed on the terminal. As shown in FIG. 13C, the response word 1303 for "praise me" is displayed on the current display interface, specifically: "The driver is the sunniest, most enthusiastic, kind, and knows the cold and knows the hot!". Similar to FIG. 12, the driver-side user can also click the boast control 1302 to trigger the boast function, and then display the interface as shown in FIG. 13C.
- after the terminal collects the voice data, it can also be determined that the response style preferred by the driver-side user (target object) is normal exaggeration, and based on this, the response words with this response style can be determined.
- the terminal displays response words 1203 for "praising the driver” or "praising me” on the current display interface, specifically: "in the wind and rain, thank you for your hard work to pick me up”.
- the terminal determines that the response style preferred by the driver-side user (target object) is exaggerated, and accordingly determines the response words with this response style.
- the terminal displays 1303 response words for "praise me” on the current display interface, specifically: "The driver is the sunniest, most enthusiastic, kind, and knows the cold and knows the hot!”.
- the preferred response style of each target object can be obtained, and the terminal can give different degrees of praise (responses) based on the different response styles.
- the user can also have the authority to modify the response speech.
- FIG. 14 is a schematic diagram of human-computer interaction according to some embodiments of the present application.
- Figure 14A shows the communication interface between the user terminal and the driver terminal.
- the passenger terminal user can click on the voice switch control 1401 to trigger the voice input function.
- the terminal displays the interface shown in Figure 14B.
- the terminal can collect real-time voice data, that is, receive interactive instructions. Later, after the terminal collects the voice data, it can determine whether the collected voice data contains the specified speech.
- if it is recognized that the voice data contains the specified words, the display interface shown in FIG. 14C can be displayed on the terminal.
- the user terminal sends response words 1403 to the driver terminal, specifically: "The driver is the most sunny, enthusiastic, kind, and knows the cold and knows the hot!”.
- on the driver side, the user can be prompted that a boast has been received from the passenger side, for example, in the function control 1301 in the interface shown in FIG. 13A, or in the notification bar or status bar.
- the passenger-side user can also click the boast control 1404 on the display interface to trigger the boast function.
- after the user clicks the exaggeration control 1404, the voice collection step can be entered, and the exaggeration can be realized in the manner shown in FIG. 12 or FIG. 13.
- Fig. 15 is a schematic diagram of human-computer interaction according to some embodiments of the present application. As shown in Figure 15, you can also directly enter the boast interface.
- the communication interface shown in FIG. 15A is the same as the communication interface shown in FIG. 14A.
- the user on the passenger terminal can click the boast control 1404.
- the terminal displays the interface shown in FIG. 15B. On this interface, the terminal determines that exaggeration is to be performed, and can directly determine the response words for the driver-side user.
- Fig. 16 is a schematic diagram of human-computer interaction according to some embodiments of the present application.
- the user can also have the authority to modify the response words.
- the display interface shown in FIG. 16A is the same as the display interface shown in FIG. 15B.
- the currently determined response language of the terminal is "The driver is the sunniest, most enthusiastic, kind-hearted, and knows cold and hot!”. If the passenger terminal user is not satisfied with the response language, he can click the language switching control 1601 to switch the response language.
- the terminal displays the controls shown in FIG. 16B.
- the currently determined response language is "the driver is the sunniest and most reliable person". In this way, the switching of response speech is realized.
- the current terminal may also perform statistical processing on historical response words and display the result. In some embodiments, the current terminal can also be used to perform the following steps: obtain the historical response words from other users, determine the total number of outputs of the historical response words, determine one or more speech tags according to the historical response words, and then display the total number of outputs and the speech tags.
- the speech tag can be designed according to actual needs. For example, the scene of a historical response word can be used as a tag, or the scene of a historical response word together with the number of times the historical response word was output in that scene can be used as the speech tag. For another example, the response style or the response emotion of the response words can be used as a tag.
- Figure 2B also shows three linguistic tags, namely: "Rainy day boast 999+", “Late night boast 3" and "Holiday boast 66".
- the language tag in this scene consists of the exaggeration scene and the number of exaggerations in the scene.
- Fig. 17 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- the interactive instruction may include voice data.
- the first classification model can also be called a trained multilingual speech classifier (referred to as "multilingual speech classifier").
- the method of human-computer interaction includes:
- Step 1710 Collect current voice data.
- the terminal can collect voice data sent by the user in real time and perform subsequent processing.
- the collected voice data can be any of Chinese voice, English voice, Japanese voice, Korean voice, etc., and there is no restriction here.
- the terminal can automatically monitor and collect the voice data sent by the user.
- the user can also press the semantic input button on the display interface to trigger and collect voice data.
- Step 1720 Extract voice features in the voice data.
- the voice feature may be a multi-dimensional fbank feature.
- since the human ear's response to the sound spectrum is non-linear, and the fbank feature is obtained by processing audio in a manner similar to the human ear, the fbank feature is beneficial to improving the performance of speech recognition.
- the Fbank feature in the voice data can be extracted through the following steps: signal conversion from time domain to frequency domain is performed on the voice data to obtain the frequency domain voice data; and the energy spectrum of the frequency domain voice data is calculated to obtain the voice feature.
- the voice data collected by the terminal device is a linear time-domain signal, and the (time-domain) voice signal can be transformed into a frequency-domain voice signal through Fourier transform (FFT).
- the voice data can be sampled.
- the energy of each frequency band in the frequency domain signal is different, and the energy spectrum of different phonemes is different. Therefore, the energy spectrum of the frequency domain speech data can be calculated, and the speech features can be obtained.
- the method of calculating the energy spectrum will not be repeated here. For example, if the sampling frequency of the voice data collected in step 1710 is 16 kHz, then a 40-dimensional fbank feature can be extracted in this step.
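- A minimal numpy sketch of such an fbank extraction (framing, windowing, FFT energy spectrum, mel filterbank, log) is shown below; the frame length, hop size, FFT size, and 40 filters are common assumptions rather than values fixed by this application.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def fbank_features(signal, sample_rate=16000, frame_len=0.025, frame_step=0.010,
                   n_fft=512, n_filters=40):
    """Rough log-fbank extraction: pre-emphasis, framing, Hamming window,
    FFT energy spectrum, triangular mel filterbank, log compression."""
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    frame_size = int(round(frame_len * sample_rate))
    step = int(round(frame_step * sample_rate))
    emphasized = np.pad(emphasized, (0, max(0, frame_size - len(emphasized))))
    num_frames = 1 + max(0, (len(emphasized) - frame_size) // step)
    frames = np.stack([emphasized[i * step: i * step + frame_size]
                       for i in range(num_frames)])

    frames = frames * np.hamming(frame_size)
    power = (np.abs(np.fft.rfft(frames, n_fft)) ** 2) / n_fft   # energy spectrum

    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)

    feats = power @ fbank.T
    return np.log(np.maximum(feats, 1e-10))   # shape: (num_frames, 40)
```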
- the voice data may also be preprocessed before the feature extraction step.
- preprocessing see Figure 5 and its related description.
- Step 1730 Use the trained multilingual speech classifier to process the speech features to obtain the classification recognition result.
- the multilingual speech classifier is used to determine whether the speech data contains any one of the multilingual specified words.
- the multilingual speech classifier can classify and recognize speech data in multiple languages.
- the language types that the multilingual speech classifier can recognize are consistent with the language types of the speech samples in the training process of the multilingual speech classifier.
- the classification recognition result may be a multi-classification result, including dual classification results.
- the classification recognition result is used to indicate whether the voice data is a positive sample or a negative sample; or, the classification recognition result is a degree level between a positive sample and a negative sample, and each degree level corresponds to a positive sample or a negative sample. Therefore, when the degree level corresponds to a positive sample, the classification recognition result indicates that the voice data contains the specified words; when the degree level corresponds to a negative sample, the classification recognition result indicates that the voice data does not contain the specified words.
- the classification recognition result can be divided into two types: "Yes" or "No". If the classification recognition result is "Yes", it means that the voice data contains the specified words of one of the multiple languages; on the contrary, if the classification recognition result is "No", it means that the voice data is irrelevant to the specified words in any language, and the specified words are not contained in the voice data.
- the classification recognition result may also have other manifestations.
- the classification recognition result may be one or more of symbols, numbers, and characters (including characters of various languages, such as Chinese characters and English characters).
- the classification recognition result can be "+” or "-”; or, the classification recognition result can also be “positive” or “negative”; or, the classification recognition result can also be "result 1" or “result 2"; or , The classification recognition result can also be "positive sample” or "negative sample”.
- the results indicated by the aforementioned representations can be customized.
- the classification recognition result is "Yes”, it can mean that the voice data has nothing to do with the specified dialects in any language, and the specified dialects are not included in the voice data; the classification recognition result is "No", which can mean that the voice data contains more than one language.
- the indication of the classification recognition result can be confirmed directly according to the dual classification result.
- the classification recognition result may also be n levels, and n is an integer greater than 1.
- the n levels represent the degree to which the speech data is recognized as belonging to a positive sample or a negative sample. For example, the higher the level, the higher the degree to which the voice data is judged to belong to a positive sample; conversely, the lower the level, the lower the degree to which the voice data is judged to belong to a positive sample.
- for example, if the classification recognition result is n, the level is the highest, and the voice data is judged to belong to a positive sample to a high degree; if the classification recognition result is 1, the level is the lowest, and the voice data is judged to belong to a positive sample to a low degree.
- the opposite can also be established, that is, the higher the level, the lower the degree to which the voice data is judged to belong to a positive sample; conversely, the lower the level, the higher the degree to which the voice data is judged to belong to a positive sample.
- the respective levels corresponding to the positive samples and the negative samples can be preset. For example, for 10 classification results (n is 10, 10 levels in total), levels 1 to 5 can correspond to negative samples, and levels 6 to 10 can correspond to positive samples. Then, if the classification level result is 1, the classification recognition result indicates that the speech data does not contain the specified speech; if the classification level result is 8, the classification recognition result indicates that the speech data contains the specified speech.
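- As an illustration, with the assumed 10-level split above (levels 1-5 mapped to negative samples, levels 6-10 to positive samples), the mapping from a level to the classification indication can be as simple as the following sketch.

```python
def contains_specified_words(level: int, n_levels: int = 10) -> bool:
    """Map a degree level to the classification indication: the upper half of the
    levels corresponds to positive samples (specified words contained)."""
    if not 1 <= level <= n_levels:
        raise ValueError("level out of range")
    return level > n_levels // 2

# contains_specified_words(8) -> True; contains_specified_words(1) -> False
```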
- the positive samples and the negative samples are training samples used in the training phase of the multilingual speech classifier, where a positive sample is multilingual speech data carrying the specified words, and a negative sample is multilingual speech data irrelevant to the specified words.
- it should be understood that the positive samples (or negative samples) in the training samples contain voice data in multiple languages, while the positive sample (or negative sample) involved in the classification recognition result refers to voice data in one language that is recognized as a positive sample (or negative sample). See FIG. 7 and related descriptions for the training process of the multilingual speech classifier.
- Step 1740 When the classification recognition result indicates that the voice data contains the specified speech, output the response speech for the specified speech.
- the response speech for the specified speech may be directly output.
- the response words may include, but are not limited to: one or more of response voice and response text. For more details about the output form of response speech, see step 530 and its related description.
- the interactive instruction or voice data can be directed to itself or to the user of the communication partner. For details, refer to step 530 and related descriptions.
- the terminal can collect voice data, perform semantic recognition on the voice data, and output response words after recognizing the semantics of the user.
- the terminal can use a monolingual acoustic model to recognize the semantics of the voice data.
- a monolingual acoustic model cannot meet the voice interaction needs of multilingual users.
- the multilingual speech classifier in some embodiments of this specification can realize the classification processing of multilingual designated words. On the basis of ensuring the classification effect, it can also convert complex speech recognition problems into simple classification problems.
- FIG. 18 is a block diagram of a terminal according to some embodiments of the present application.
- the processes shown in FIGS. 5-7 and FIG. 17 may be executed on a mobile device or terminal (for example, a passenger end or a driver end, etc.), for example, executed by the processor 340 of the mobile device.
- the acquisition module 440 may include the acquisition module 1810.
- the determining module 420 may include a processing module 1820.
- the response module 430 may include an output module 1830.
- the terminal 1800 may include: a collection module 1810, an extraction module 410, a processing module 1820, and an output module 1830.
- the collection module 1810 is used to collect current voice data.
- the extraction module 410 is used to extract voice features in voice data. In some embodiments, the extraction module 410 is also used to preprocess the voice data before extracting voice features. See Figure 17 and related descriptions for details.
- the processing module 1820 is used to process the voice features by using the trained multilingual voice classifier to obtain the classification recognition result.
- the output module 1830 is used for outputting the response words for the specified words when the classification recognition result indicates that the voice data contains the specified words.
- the output module 1830 can be used to: when the specified speech is directed to the user of the current terminal, directly output the response speech for the specified speech; or, when the specified speech is directed to the counterpart user of the current communication, output the response speech to the counterpart user.
- the acquisition module 440 may include a training module (not shown in FIG. 18), which is used to acquire a multilingual speech classifier through training. For details, refer to FIG. 7 and related descriptions.
- the acquisition module 440 in the terminal 1800 can also be used to acquire historical response speeches from other users, determine the total number of outputs of the historical response speeches, and determine one or more speech labels based on the historical response speeches.
- the output module 1830 is also used to display the total number of outputs and the speech labels.
- Fig. 19 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- the second classification model can also be called a trained emotion classifier (referred to as the "emotion classifier" for short).
- the method of human-computer interaction includes:
- Step 1910 Collect current voice data. For details, refer to step 1710, which will not be repeated here.
- Step 1920 Recognize the responded emotion of the voice data; the responded emotion is obtained from one or more of the text emotion recognition result and the voice emotion recognition result.
- Emotion recognition results include speech emotion recognition results and/or text emotion recognition results.
- one or more of text emotion recognition and speech emotion recognition may be performed on the speech data: the text emotion recognition result is obtained based on the text emotion recognition, the speech emotion recognition result is obtained based on the speech emotion recognition, and one or more of the two are then used to determine the emotion of the voice data. See Figure 20 and related descriptions for details.
- Step 1930 Determine the response emotion corresponding to the responded emotion.
- the responded emotion is the emotion of the collected voice data sent by the user, and the response emotion is the emotion used when responding to the voice data, that is, the emotion of the response voice.
- the types of emotions involved may include, but are not limited to: loss, calm, enthusiasm, or passion, etc., and the actual scene can be customized according to needs.
- emotions may also include: joy, sadness, pain, gratification, excitement, etc., and not exhaustively.
- the emotion categories contained in the responded emotion and in the response emotion may be the same or different.
- for example, both the responded emotion and the response emotion are drawn from the four emotions of loss, calm, enthusiasm, and passion.
- alternatively, the responded emotion may include positive emotions (for example, happiness, excitement, joy, etc.) and negative emotions (for example, loss, sadness, pain, etc.), while the response emotion may only include positive emotions, so as to soothe the user's negative emotions.
- the corresponding relationship between the response emotion and the response emotion can also be preset.
- the corresponding relationship may be stored in the terminal in advance, or may be stored in a storage location readable by the terminal, such as the cloud, which is not limited.
- one responded emotion can correspond to one or more response emotions. For example, if the responded emotion is loss, the corresponding response emotion can be joy or comfort.
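- A minimal sketch of such a preset correspondence is shown below; the table contents and the rule of picking the first candidate are illustrative assumptions, not a prescribed mapping.

```python
# Sketch of a preset responded-emotion -> response-emotion correspondence table.
# One responded emotion may map to one or more candidate response emotions.
RESPONSE_EMOTION_TABLE = {
    "loss":       ["joy", "comfort"],   # soothe a negative emotion with positive ones
    "calm":       ["calm"],
    "enthusiasm": ["enthusiasm"],
    "passion":    ["passion"],
}

def select_response_emotion(responded_emotion: str) -> str:
    candidates = RESPONSE_EMOTION_TABLE.get(responded_emotion, ["calm"])
    return candidates[0]                # picking the first candidate is an assumption

print(select_response_emotion("loss"))  # -> "joy"
```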
- Step 1940 Output a response voice for the voice data, the response voice carrying the response emotion.
- the response speech can be output in the form of a response voice.
- the response content for the voice data may be acquired first, and then the response voice may be generated according to the response emotion and the response content, so that the response voice may be output. In this way, the output response voice also has response emotion.
- the response content can be determined in a number of ways, described below.
- the corresponding relationship between keywords and response content can be preset in advance, so that the response content corresponding to the keyword can be obtained by recognizing the keyword carried in the voice data as the response content of the voice data.
- the neural network model can also be used to process voice data, and further, the response content output by the neural network model can be obtained.
- the response content can be determined by the method corresponding to FIG. 5 or FIG. 6.
- the default voice (timbre) or the user-selected timbre can be used to generate the response voice.
- the user may select the timbre of a certain celebrity as the timbre of the response voice, so that the terminal generates the response voice according to the timbre of the celebrity selected by the user.
- the premise of this implementation is that the terminal can obtain the celebrity's timbre and the corresponding authorization, which will not be described in detail here.
- after determining the response emotion, the terminal device only needs to fetch, from the storage location, a candidate voice corresponding to the response emotion and the response content, and output it as the response voice.
- the candidate voice stored in the storage location may also be manually recorded in advance.
- in the voice interaction scenario of the prior art, the terminal generally outputs response data with a default intonation and tone.
- This human-computer interaction method has a single response emotion and cannot meet the user's voice emotion needs in a personalized scenario.
- Some embodiments in this specification can select different response emotions according to the user's emotion in real time, which can effectively improve the matching degree of the response voice and the user's emotion, meet the user's emotional needs in different emotional states, and have a stronger sense of reality and substitution.
- the voice interaction experience is improved, which also solves the problem of low matching degree between the response voice and the user's emotion in the existing voice interaction scene.
- Fig. 20 is an exemplary flowchart of a method for determining a responded emotion according to some embodiments of the present application.
- determining the respondent emotion may include the following steps:
- Step 1922 Extract the voice features of the voice data.
- the audio features of the voice data can be extracted, and then the audio features are normalized to form a feature vector to obtain the voice features of the voice data.
- the dimensions of feature vectors obtained from different voice data may be different.
- the dimension n of the feature vector can be adjusted according to the needs of the actual scene or project, or according to empirical values; there is no restriction on this.
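- The sketch below illustrates one way to turn raw audio samples into a fixed-length, normalized feature vector; the frame length, the per-frame energy and zero-crossing-rate features, and the four retained statistics are illustrative assumptions (real systems typically use richer features such as MFCCs).

```python
# Sketch of extracting a fixed-length, normalized voice feature vector from raw samples.
import numpy as np

def extract_voice_features(samples: np.ndarray, frame_len: int = 400) -> np.ndarray:
    if len(samples) < frame_len:                       # pad very short clips
        samples = np.pad(samples, (0, frame_len - len(samples)))
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.log1p((frames ** 2).sum(axis=1))                        # per-frame log energy
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)   # zero-crossing rate
    feats = np.array([energy.mean(), energy.std(), zcr.mean(), zcr.std()])
    norm = np.linalg.norm(feats)
    return feats / norm if norm > 0 else feats          # normalized n-dimensional vector (n = 4)

print(extract_voice_features(np.random.randn(16000)))    # 1 s of 16 kHz audio -> 4-dim vector
```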
- Step 1924 Use the trained emotion classifier to process the voice features to obtain the emotion recognition result.
- the emotion classifier is used to recognize the emotion of the speech data.
- see FIG. 9 and related descriptions for the training of the emotion classifier.
- for the emotion classification model, please refer to Figure 8 and its related descriptions.
- Step 1926 The emotion indicated by the emotion recognition result is determined as the responded emotion.
- the output of the emotion classifier is the result of emotion recognition, and the emotion indicated by the result of emotion recognition is related to the way of expression of the result of emotion recognition.
- the emotion recognition result can be a multi-classification result. For example, divide emotions into four categories: loss, calm, enthusiasm, and passion.
- the emotion recognition result may be the probability of the voice data in each emotion, and the emotion indicated by the emotion recognition result is the emotion with the highest probability; or, the emotion indicated by the emotion recognition result is the emotion carrying an indication mark; or, the emotion recognition result may be the score of the voice data in each emotion, and the emotion indicated by the emotion recognition result is the emotion corresponding to the score interval in which the score falls.
- the emotion recognition result may be the emotion probability of the speech data, where the emotion (first emotion) indicated by the emotion recognition result is the emotion with the highest probability.
- the emotion recognition result output by the emotion classifier may be: loss 2%, calm 20%, enthusiasm 80%, and passion 60%. Then, the emotion indicated by the emotion recognition result is enthusiasm.
- the emotion classifier can also output a multi-classification result carrying one indication mark.
- in this case, the emotion indicated by the emotion recognition result is the emotion carrying the indication mark.
- the indication mark can be one or more of words, numbers, characters, and so on. For example, if 1 is the indication mark and the emotion recognition result output by the emotion classifier is: loss 1, calm 0, enthusiasm 0, passion 0, then the emotion indicated by the emotion recognition result is loss.
- the emotion recognition result can also be output as an emotion score, and each emotion corresponds to a different score interval. Therefore, the emotion indicated by the emotion recognition result is the emotion corresponding to the score interval into which the emotion score falls.
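- The three forms of emotion recognition result described above can be read as sketched below; the emotion names and score intervals are illustrative assumptions.

```python
# Sketch of reading the emotion indicated by the three result forms described above.
def emotion_from_probabilities(probs: dict) -> str:
    return max(probs, key=probs.get)                    # emotion with the highest probability

def emotion_from_indication_mark(flags: dict, mark=1) -> str:
    return next(e for e, v in flags.items() if v == mark)

def emotion_from_score(score: float,
                       intervals=((0, 25, "loss"), (25, 50, "calm"),
                                  (50, 75, "enthusiasm"), (75, 101, "passion"))) -> str:
    return next(name for lo, hi, name in intervals if lo <= score < hi)

print(emotion_from_probabilities({"loss": 0.02, "calm": 0.20, "enthusiasm": 0.80, "passion": 0.60}))
print(emotion_from_indication_mark({"loss": 1, "calm": 0, "enthusiasm": 0, "passion": 0}))
print(emotion_from_score(62.0))                          # falls in the "enthusiasm" interval
```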
- FIG. 21 is an exemplary flowchart of a method for determining a responded emotion according to some embodiments of the present application.
- the responded emotion can also be determined through the following steps:
- Step 1922 Extract the voice features of the voice data.
- Step 1924 Use the trained emotion classifier to process the voice features to obtain the emotion recognition result.
- steps 1922 to 1924 are the same as before, and will not be described in detail.
- Step 1926 Convert the voice data into text data.
- Steps 1926 to 1928 are used to obtain the emotion analysis result from the perspective of content. It should be understood that there is no necessary execution order between steps 1922 to 1924 and steps 1926 to 1928: apart from steps 1922 and 1924 being executed in sequence, and steps 1926 and 1928 being executed in sequence, some embodiments of this specification place no particular limitation on the execution order of these steps. They can be executed sequentially as shown in FIG. 21, executed simultaneously, or step 1926 may be started after step 1922 has been executed, and so on, which is not exhaustive.
- the voice data can be converted into text data through a voice decoder, which will not be described in detail.
- Step 1928 Perform emotion analysis on the text data to obtain an emotion analysis result.
- emotion-related words in the text data can be identified, and then, based on the emotion-related words, the emotion analysis result of the text data is determined.
- an emotion score can also be preset for each emotion-related word. Therefore, all the emotion-related words in the text data can be identified, the emotion scores of these words can then be weighted (or directly summed or averaged), and the weighted score is used as the emotion analysis result.
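- A minimal sketch of this lexicon-based scoring is shown below; the word list, scores, and the averaging rule are illustrative assumptions.

```python
# Sketch of lexicon-based emotion scoring over text data.
EMOTION_LEXICON = {"great": 2.0, "happy": 1.5, "tired": -1.0, "angry": -2.0}

def text_emotion_score(text: str, weights: dict = None) -> float:
    words = [w.strip(".,!?") for w in text.lower().split()]
    hits = [w for w in words if w in EMOTION_LEXICON]
    if not hits:
        return 0.0
    weights = weights or {}
    total = sum(EMOTION_LEXICON[w] * weights.get(w, 1.0) for w in hits)
    return total / len(hits)                    # averaged; a plain (weighted) sum also works

print(text_emotion_score("I am happy but a little tired"))  # (1.5 - 1.0) / 2 = 0.25
```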
- Step 19210 Determine the responded emotion according to the emotion recognition result and the emotion analysis result.
- the two can be weighted (or summed or averaged), and the emotion corresponding to the interval in which the weighted value falls is taken as the responded emotion. If one or both of them are not in the form of a score, the emotion recognition result (or the emotion analysis result) can first be converted into a score according to a preset algorithm, and the weighting is then performed to determine the responded emotion.
- in some embodiments, the emotion category indicated by both the emotion recognition result and the emotion analysis result is taken as the responded emotion.
- alternatively, the emotion recognition result and the emotion analysis result are weighted, and the emotion category indicated after the weighting is used as the responded emotion (the weighting is performed after conversion into scores, as described above, and will not be repeated).
- by determining the responded emotion from both the voice data and the text data converted from the voice data, the emotional state of the user can be analyzed from the two dimensions of sound and content (text), which is beneficial to improving the accuracy of the recognition result and, in turn, to narrowing the gap between the response voice and the user's emotional needs, making the interaction more humane and realistic.
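- The sketch below illustrates one way to weight a voice-based score and a text-based score and map the result to an emotion interval; the weights and intervals are illustrative assumptions, and either result could first be converted to a score by any preset algorithm.

```python
# Sketch of weighting a voice-based emotion score and a text-based emotion score,
# then mapping the weighted value to an emotion interval.
EMOTION_INTERVALS = ((0, 25, "loss"), (25, 50, "calm"),
                     (50, 75, "enthusiasm"), (75, 101, "passion"))

def fuse_emotion_results(voice_score: float, text_score: float,
                         w_voice: float = 0.6, w_text: float = 0.4) -> str:
    # Both inputs are assumed to have been converted to a 0-100 score beforehand.
    fused = w_voice * voice_score + w_text * text_score
    return next(name for lo, hi, name in EMOTION_INTERVALS if lo <= fused < hi)

print(fuse_emotion_results(voice_score=80.0, text_score=40.0))  # 64 -> "enthusiasm"
```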
- FIG. 22 is a block diagram of a terminal according to some embodiments of the present application.
- the processes shown in FIGS. 5, 8-9, and 19-21 may be executed on a mobile device or terminal (for example, a passenger end or a driver end), for example, executed by the processor 340 of the mobile device.
- the determination module 420 may include an identification module 2210 and a response emotion determination module 2220.
- the terminal 2200 may include: a collection module 1810, an identification module 2210, a response emotion determination module 2220, and an output module 1830.
- the collection module 1810 can also be used to collect current voice data.
- the recognition module 2210 is used to recognize the responded emotion of the voice data.
- the response emotion determination module 2220 is used to determine the response emotion corresponding to the responded emotion.
- the output module 1830 is also used to output a response voice for voice data, and the response voice has a response emotion.
- the recognition module 2210 may be used to extract voice features of voice data.
- the recognition module 2210 can be used to process voice features using a trained emotion classifier to obtain emotion recognition results.
- the recognition module 2210 can be used to convert voice data into text data, and perform emotion analysis on the text data to obtain an emotion analysis result.
- the recognition module 2210 may be used to determine the emotion analysis result according to the emotion related words recognized from the text data.
- the recognition module 2210 may also be used to determine the responded emotion according to the emotion recognition result and the emotion analysis result.
- the acquisition module 440 may include a training module (not shown in FIG. 22), and the training module may be used to acquire a second classification model (also referred to as an emotion classifier) through training.
- the response module 430 may include a generation module (not shown in Fig. 22), and the generation module may be used to generate a response voice according to the response emotion and the response content.
- generally, the terminal determines the response content corresponding to the voice or text sent by the user, and outputs that response content to the user.
- this processing method can only achieve responses to interactive instructions; the human-computer interaction is too monotonous and cannot meet the user's personalized interaction needs. For example, in the aforementioned praise scenario, if the user says "praise me", the terminal outputs default compliment content in response to this human-computer interaction command. When different users say "praise me", the praise content output by the terminal is the same, which obviously makes it difficult to meet the user's personalized interaction needs, and the human-computer interaction experience is also poor.
- Fig. 23 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- the third classification model can also be called a trained style classifier ("style classifier" for short).
- the method of human-computer interaction includes:
- Step 2310 Receive an interactive instruction for the target object.
- receiving the interactive instruction refers to receiving voice data or text data from the user. Taking the aforementioned praise scenario as an example, this may be the text data "praise me" received from the user, or the collected voice data "praise me" sent by the user.
- the user may be the user to whom the terminal applying the method belongs.
- the target object may be the user to whom the current terminal belongs; or, the target object may be a counterpart user who communicates with the current terminal. That is, the interactive instruction can be directed to different target objects; the specific examples are similar to those in which the specified speech is directed to different target objects. For details, refer to step 1710, which will not be repeated here.
- Step 2320 Determine the response style of the target object, and the response style is related to the historical data of the target object.
- Historical data can directly or laterally reflect the response style of the target object's personal preference. Therefore, the response style of the target object can be determined based on the historical data. See Figure 24 and related descriptions for details.
- Step 2330 Determine the response speech according to the response style and the interactive instruction.
- a response speech with the response style can be obtained.
- the content of the response speech is related to the interactive instruction. For example, if the received interactive instruction is "praise the driver", the content of the response speech is praise for the driver; if the received interactive instruction is "praise the passenger", the content of the response speech is praise for the passenger.
- the candidate speeches corresponding to each response style can be preset, so that one response speech can be determined from the multiple candidate speeches corresponding to the determined response style.
- for the manner of determining the response speech from a plurality of candidate speeches, refer to step 530 and related descriptions.
- the priority of the candidate speech corresponding to each response style may be preset, and the candidate speech with a higher priority is preferentially selected as the response speech.
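- A minimal sketch of selecting one response speech from preset, prioritized candidate speeches per response style is given below; the styles, texts, and priorities are illustrative assumptions.

```python
# Sketch of choosing one response speech from preset candidate speeches per response style,
# preferring the candidate with the higher priority (lower number = higher priority).
CANDIDATE_SPEECHES = {
    "strong praise": [(1, "You are absolutely the best driver on the road!"),
                      (2, "Outstanding driving, truly impressive!")],
    "normal praise": [(1, "Nice driving, thank you for the smooth ride.")],
    "slight praise": [(1, "Thanks for the ride.")],
}

def pick_response_speech(response_style: str) -> str:
    candidates = CANDIDATE_SPEECHES.get(response_style, CANDIDATE_SPEECHES["normal praise"])
    return min(candidates)[1]

print(pick_response_speech("strong praise"))
```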
- the interactive instruction may be parsed to obtain the emotional style of the interactive instruction, and then the response style determined from the historical data and the emotional style can be combined to determine the final response style; furthermore, the response speech corresponding to that style is determined.
- the first style is a style determined based on the first feature and/or the second feature.
- the first feature is a feature determined based on the interactive instruction.
- the first style includes the emotional style of the interactive instruction.
- the emotional style of the interactive instruction may include the responding emotion of the interactive instruction. For more details, please refer to FIG. 8 or FIG. 19 and related descriptions. For example, when the recognition result of the interactive instruction contains very positive keywords such as "very good” and "extremely praised", the user's personalized emotional style is more inclined to adopt a very enthusiastic praise style.
- the styles determined from the two sources can be normalized and then weighted (or directly summed or averaged), and the style corresponding to the weighted score is used as the response style corresponding to the interactive instruction.
- alternatively, it may be judged whether the two styles are consistent: if they are the same, the style indicated by both is taken as the response style; if they are inconsistent, the aforementioned weighting method can be used to determine the response style.
- that is, the text content obtained from speech recognition, the behavior of the target object, and historical data such as account information are used as one reference factor for personalizing the response speech; in addition, the interactive instruction is parsed separately and used as another reference factor for personalizing the response. The two reference factors are then comprehensively weighted, and the comprehensive weighted result is used as the final basis for evaluating which type of personality tendency the target object exhibits within a period of time.
- this weighting result is not static; rather, the response style of the target object is regularly updated offline as the target object's usage data is continuously updated, so as to better adapt to the fluctuation of the target object in different periods (the underlying premise is that the personality of the target object is not single but diverse, and fluctuates with changes in the environment). Using this weighting method can better fit the personality tendency of the target object, and thus better provide personalized recommendations for the target object.
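- One possible realization of this weighting, sketched under the assumption that both reference factors are expressed as per-style score distributions, is shown below; the weights, style names, and the refresh policy are illustrative only.

```python
# Sketch of combining the two reference factors as per-style score distributions.
def combine_style_scores(history_scores: dict, instruction_scores: dict,
                         w_history: float = 0.7, w_instruction: float = 0.3) -> str:
    styles = set(history_scores) | set(instruction_scores)
    combined = {s: w_history * history_scores.get(s, 0.0)
                   + w_instruction * instruction_scores.get(s, 0.0) for s in styles}
    return max(combined, key=combined.get)

# Re-run periodically (offline) as the target object's usage data is updated:
print(combine_style_scores({"strong praise": 0.6, "normal praise": 0.4},
                           {"strong praise": 0.2, "normal praise": 0.8}))  # -> "normal praise"
```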
- Step 2340 output response words to the target object.
- based on the determined response speech, it is only necessary to output the response speech to the target object; for the manner of output, refer to step 530 and related descriptions.
- in some embodiments of this specification, the response style that the target object may like can be determined from the historical data of the target object. The response style and the interactive instruction can therefore be combined to determine and output the response speech, so that the response speech is closer to the personalized style of the target object. Even for the same interactive instruction, the response speeches for different target objects may differ, which solves the problem that existing human-computer interaction methods are monotonous and cannot meet the user's personalized interaction needs, and also makes the interaction process more real and interesting.
- Fig. 24 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- Step 2410 Obtain historical data of the target object.
- only the historical data of the target object in the most recent period of time (for example, the most recent week, the most recent month, or the most recent three days) may be acquired, so as to reduce the influence of distant historical data on the response style and make the response style more in line with the user's preferences in the current period.
- accordingly, the response speech output by the terminal may be the same or different over time. For example, if the user's preferences change, the response style determined by the terminal will be different, and the output response speech may also be different.
- Step 2420 Process the historical data to obtain the object characteristics of the target object.
- the first feature is determined based on historical data.
- the first feature can be referred to as an object feature.
- in step 2410, text data may be collected, or voice data may be collected.
- the text data corresponding to the voice data can be obtained by performing semantic recognition on the voice data.
- the extracted features may include, but are not limited to, word frequency features.
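- The sketch below shows one simple way to build word-frequency object features from historical text data (voice data would first be converted to text); the vocabulary is an illustrative assumption.

```python
# Sketch of building normalized word-frequency object features from historical text data.
from collections import Counter

VOCABULARY = ["praise", "thanks", "great", "fast", "safe"]

def object_features(historical_texts: list) -> list:
    counts = Counter(w.strip(".,!?") for text in historical_texts for w in text.lower().split())
    total = sum(counts[w] for w in VOCABULARY) or 1
    return [counts[w] / total for w in VOCABULARY]

print(object_features(["Great driver, thanks!", "Fast and safe, great trip"]))
```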
- Step 2430 Use the trained style classifier to process the object features to obtain the response style of the target object.
- the style classifier can be used to classify historical data.
- the style classifier can be trained and acquired. For the training and deployment of the style classifier, see Figure 11 and its related descriptions.
- the style classifier can be used offline, and it can be embodied as a model with a small parameter amount.
- the type of style classifier can be seen in Figure 5 and its related descriptions.
- the style recognition result output by the style classifier may be a multi-classification result.
- the response styles may be divided into three styles according to the degree of praise: strong praise (a higher degree of praise), normal praise, and slight praise (a lower degree of praise).
- the multi-classification result output by the style classifier can indicate the probability of each response style, so that the style with the highest probability indicated by the style classification result is used as the response style of the target object.
- for example, the style recognition result output by the style classifier may be: strong praise 70%, normal praise 50%, and slight praise 10%. Then, the response style indicated by the style recognition result is strong praise.
- the style classifier can also output a multi-classification result with one indication mark.
- in this case, the style indicated by the style recognition result is the style carrying the indication mark.
- the indication mark can be one or more of words, numbers, characters, etc. For example, if 1 is the indication mark and the style recognition result output by the style classifier is: strong praise 0, normal praise 1, slight praise 0, then the response style indicated by the style recognition result is normal praise.
- the style recognition result can also be output as a style score, and each style corresponds to a different score interval. Therefore, the style indicated by the style recognition result is the style corresponding to the score interval into which the style score falls.
- FIG. 25 is a block diagram of a terminal according to some embodiments of the present application.
- the processes shown in FIGS. 5, 10-11, and 23-24 may be executed on a mobile device or terminal (for example, a passenger end or a driver end), for example, executed by the processor 340 of the mobile device.
- the acquiring module 440 may include a receiving module 2510.
- the determining module 420 may include a response style determining module 2520.
- the response module 430 may include a response speech determination module 2530.
- the terminal 2500 includes: a receiving module 2510, a response style determining module 2520, a response speech determining module 2530, and an output module 1830.
- the receiving module 2510 is used to receive interactive instructions for the target object.
- the response style determination module 2520 is used to determine the response style of the target object, and the response style is related to the historical data of the target object.
- the response style determination module 2520 may be used to process the acquired historical data to obtain the object characteristics of the target object; and use the trained style classifier to process the object characteristics to obtain the response style of the target object.
- the response speech technique determining module 2530 is used to determine the response speech technique according to the response style and the interactive command.
- the output module 1830 is used to output response words to the target object.
- the terminal 2500 further includes a training module (not shown in FIG. 25), which is used to train the style classifier.
- the training module is also used to determine the style label and update the style classifier.
- the terminal (for example, the terminal 1800, 2200, or 2500) in this specification may be implemented as a server or as a terminal device.
- all or part of the modules in the module diagram 400 shown in FIG. 4, the terminal 1800 shown in FIG. 18, the terminal 2200 shown in FIG. 22, and the terminal 2500 shown in FIG. 25 may be integrated into one physical entity, or may be physically separated.
- these modules can all be implemented in the form of software called by processing elements; they can also be implemented in the form of hardware; part of the modules can be implemented in the form of software called by the processing elements, and some of the modules can be implemented in the form of hardware.
- the extraction module 410 may be a separately established processing element, or it may be integrated in the terminal 1800, for example, implemented in a certain chip of the terminal. In addition, it may also be stored in the memory of the terminal 1800 in the form of a program, and a certain processing element of the terminal calls and executes the functions of the above module.
- the implementation of other modules is similar.
- all or part of these modules can be integrated together or implemented independently.
- the processing element described here may be an integrated circuit with signal processing capability.
- each step of the above method or each of the above modules can be completed by an integrated logic circuit of hardware in the processor element or instructions in the form of software.
- the above modules may be one or more integrated circuits configured to implement the above methods, such as one or more application specific integrated circuits (ASIC), one or more digital signal processors (DSP), or one or more field programmable gate arrays (FPGA), etc.
- the processing element may be a general-purpose processor, such as a central processing unit (CPU) or other processors that can call programs.
- these modules can be integrated together and implemented in the form of a system-on-a-chip (SOC).
- numbers describing quantities of components and attributes are used in some places. It should be understood that such numbers used in the description of the embodiments are modified in some examples by the terms "about", "approximately" or "substantially". Unless otherwise stated, "about", "approximately" or "substantially" indicates that the number is allowed to vary by ±20%.
- the numerical parameters used in the specification and claims are approximate values, and the approximate values can be changed according to the required characteristics of individual embodiments. In some embodiments, the numerical parameters should take into account the prescribed significant digits and adopt a general digit-retention method. Although the numerical ranges and parameters used to confirm the breadth of the ranges in some embodiments of this specification are approximate values, in specific embodiments such numerical values are set as accurately as is feasible.
Abstract
Disclosed in an embodiment of the present invention is a method for man-machine interaction. The method for man-machine interaction comprises: extracting an associated feature of a target object on the basis of an interaction instruction directed at the target object, the associated feature being related to at least one of the interaction instruction and historical data of the target object; determining a response policy for the target object by processing the associated feature, the response policy being related to at least one of response content, a response style and a response emotion; and determining a response message for the target object on the basis of the response policy.
Description
Cross Reference
This application claims priority to Chinese Application No. 202010016725.5 filed on January 8, 2020, Chinese Application No. 202010017735.0 filed on January 8, 2020, and Chinese Application No. 202010018047.6 filed on January 8, 2020, the entire contents of which are incorporated herein by reference.
This specification relates to the field of computer technology, and in particular to a human-computer interaction method and system.
With the development of computer technology, terminals can realize automatic responses to users to realize human-computer interaction. Generally, a terminal can determine the response content corresponding to the voice or text sent by the user, and output the response content to the user.
However, in current human-computer interaction, a response may only be given in a single language (for example, Chinese or English). Moreover, when responding, the user's emotional needs or style needs are often not considered. For example, the terminal generally responds with a default intonation, tone, or content, which cannot meet the user's personalized interaction needs.
For this reason, the embodiments of this specification propose a human-computer interaction method and system to realize personalized interaction for different languages.
Summary of the Invention
One of the embodiments of this specification provides a human-computer interaction method. The method includes: extracting an associated feature of a target object based on an interaction instruction directed at the target object, the associated feature being related to at least one of the interaction instruction and historical data of the target object; the associated feature includes a first feature corresponding to the interaction instruction; if the interaction instruction does not include voice data, the first feature includes a text feature of the text data corresponding to the interaction instruction; if the interaction instruction includes voice data, the first feature includes at least one of a voice feature of the voice data and a text feature of the text data corresponding to the interaction instruction; determining a response strategy for the target object based on processing the associated feature, the response strategy being related to at least one of response content, response style, and response emotion; and determining a response speech for the target object based on the response strategy.
One of the embodiments of this specification provides a human-computer interaction system. The system includes: an extraction module, configured to extract an associated feature of a target object based on an interaction instruction directed at the target object, the associated feature being related to at least one of the interaction instruction and historical data of the target object; the associated feature includes a first feature corresponding to the interaction instruction; if the interaction instruction does not include voice data, the first feature includes a text feature of the text data corresponding to the interaction instruction; if the interaction instruction includes voice data, the first feature includes at least one of a voice feature of the voice data and a text feature of the text data corresponding to the interaction instruction; a determination module, configured to determine a response strategy for the target object based on processing the associated feature, the response strategy being related to at least one of response content, response style, and response emotion; and a response module, configured to determine a response speech for the target object based on the response strategy.
One of the embodiments of this application provides a computer-readable storage medium storing computer instructions. After a computer reads the computer instructions in the storage medium, the computer executes the human-computer interaction method described above.
This specification will be further described in the form of exemplary embodiments, which will be described in detail with reference to the accompanying drawings. These embodiments are not restrictive. In these embodiments, the same number represents the same structure, in which:
Fig. 1 is a schematic diagram of an application scenario of a human-computer interaction system according to some embodiments of the present application;
Fig. 2 is a schematic diagram of exemplary hardware components and/or software components of an exemplary computing device according to some embodiments of the present application;
Fig. 3 is a schematic diagram of exemplary hardware components and/or software components of an exemplary mobile device according to some embodiments of the present application;
Fig. 4 is a block diagram of an exemplary processing device according to some embodiments of the present application;
Fig. 5 is an exemplary flowchart of a human-computer interaction method according to some embodiments of the present application;
Fig. 6 is an exemplary flowchart of a human-computer interaction method according to some embodiments of the present application;
Fig. 7 is an exemplary flowchart of training a first classification model according to some embodiments of the present application;
Fig. 8 is an exemplary flowchart of a human-computer interaction method according to some embodiments of the present application;
Fig. 9 is an exemplary flowchart of training a second classification model according to some embodiments of the present application;
Fig. 10 is an exemplary flowchart of a human-computer interaction method according to some embodiments of the present application;
Fig. 11 is an exemplary flowchart of training a third classification model according to some embodiments of the present application;
Fig. 12 is a schematic diagram of human-computer interaction according to some embodiments of the present application;
Fig. 13 is a schematic diagram of human-computer interaction according to some embodiments of the present application;
Fig. 14 is a schematic diagram of human-computer interaction according to some embodiments of the present application;
Fig. 15 is a schematic diagram of human-computer interaction according to some embodiments of the present application;
Fig. 16 is a schematic diagram of human-computer interaction according to some embodiments of the present application;
Fig. 17 is an exemplary flowchart of a human-computer interaction method according to some embodiments of the present application;
Fig. 18 is a block diagram of a terminal according to some embodiments of the present application;
Fig. 19 is an exemplary flowchart of a human-computer interaction method according to some embodiments of the present application;
Fig. 20 is an exemplary flowchart of a method for determining a responded emotion according to some embodiments of the present application;
Fig. 21 is an exemplary flowchart of a method for determining a responded emotion according to some embodiments of the present application;
Fig. 22 is a block diagram of a terminal according to some embodiments of the present application;
Fig. 23 is an exemplary flowchart of a human-computer interaction method according to some embodiments of the present application;
Fig. 24 is an exemplary flowchart of a human-computer interaction method according to some embodiments of the present application;
Fig. 25 is a block diagram of a terminal according to some embodiments of the present application.
In order to more clearly describe the technical solutions of the embodiments of this specification, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some examples or embodiments of this specification. For those of ordinary skill in the art, this specification can also be applied to other similar scenarios according to these drawings without creative work. Unless it is obvious from the context or otherwise stated, the same reference numerals in the figures represent the same structure or operation.
It should be understood that "system", "device", "unit" and/or "module" as used herein is a method for distinguishing different components, elements, parts, sections, or assemblies of different levels. However, if other words can achieve the same purpose, these words may be replaced by other expressions.
As shown in this specification and the claims, unless the context clearly indicates otherwise, the words "a", "an", "one" and/or "the" do not specifically refer to the singular and may also include the plural. Generally speaking, the terms "include" and "comprise" only indicate that the clearly identified steps and elements are included; these steps and elements do not constitute an exclusive list, and the method or device may also include other steps or elements.
Flowcharts are used in this specification to illustrate the operations performed by the system according to the embodiments of this specification. It should be understood that the preceding or following operations are not necessarily performed exactly in order. Instead, the steps may be processed in reverse order or at the same time. Meanwhile, other operations may be added to these processes, or one or more steps may be removed from these processes.
Fig. 1 is a schematic diagram of an application scenario of a human-computer interaction system according to some embodiments of the present application. As shown in Fig. 1, the human-computer interaction system 100 may include a server 110, a network 120, a first client 130, a second client 140, and a storage 150.
The server 110 may process data and/or information obtained from at least one component of the system 100 (for example, the first client 130, the second client 140, and the storage 150) or an external data source (for example, a cloud data center). For example, the server 110 may obtain an interaction instruction from the first client 130 (for example, a passenger end). For another example, the server 110 may also obtain historical data from the storage 150.
In some embodiments, the server 110 may include a processing device 112. The processing device 112 may process information and/or data related to the human-computer interaction system to perform one or more functions described in this specification. For example, the processing device 112 may determine a response speech based on the interaction instruction and/or historical data. In some embodiments, the processing device 112 may include at least one processing unit (for example, a single-core processing engine or a multi-core processing engine). In some embodiments, the processing device 112 may be a part of the first client 130 and/or the second client 140.
The network 120 may provide a channel for information exchange. In some embodiments, the network 120 may include one or more network access points. One or more components of the system 100 may be connected to the network 120 through an access point to exchange data and/or information. In some embodiments, at least one component in the system 100 may access data or instructions stored in the storage 150 via the network 120.
The owner of the first client 130 may be the user himself or someone other than the user. For example, the owner A of the first client 130 may use the first client 130 to send a service request for a user B. In some embodiments, the first client 130 may include various types of devices with information receiving and/or sending functions, and can process information and/or data. In some embodiments, the first client 130 may be a device with a positioning function. The first client 130 may be a device with a display function, which displays the response speech fed back by the server 110 to the first client 130; the display may take the form of an interface, a pop-up window, a floating window, a small window, text, etc. The first client 130 may be a device with a voice function, so that the response speech fed back by the server 110 to the first client 130 can be played.
The second client 140 can communicate with the first client 130. In some embodiments, the first client 130 and the second client 140 may communicate through a short-range communication device. In some embodiments, the type of the second client 140 may be the same as or different from that of the first client 130. For example, the first client 130 and the second client 140 may include, but are not limited to, a tablet computer, a notebook computer, a mobile device, a desktop computer, etc., or any combination thereof.
In some embodiments, the storage 150 may store data and/or instructions that the processing device 112 can execute or use to complete the exemplary methods described in this specification. For example, the storage 150 may store historical data, models used to determine the response speech, and audio files and text files of the response speech. In some embodiments, the storage 150 may be directly connected to the server 110 as a back-end storage. In some embodiments, the storage 150 may be a part of the server 110, the first client 130, and/or the second client 140.
Fig. 2 is a schematic diagram of exemplary hardware components and/or software components of an exemplary computing device according to some embodiments of the present application. As shown in Fig. 2, the computing device 200 may include a processor 210, a memory 220, an input/output 230, and a communication port 240.
The processor 210 can execute computing instructions (program code) and perform the functions of the human-computer interaction system 100 described in this specification. The computing instructions may include programs, objects, components, data structures, procedures, modules, and functions (the functions refer to the specific functions described in this specification). For example, the processor 210 may process image or text data obtained from any other component of the human-computer interaction system 100. For illustration only, the computing device 200 in Fig. 2 describes only one processor, but it should be noted that the computing device 200 may also include multiple processors. The memory 220 may store data/information obtained from any other component of the on-demand service system 100. The input/output 230 may be used to input or output signals, data, or information. In some embodiments, the input/output 230 may enable the user to interact with the human-computer interaction system 100. In some embodiments, the input/output 230 may include an input device and an output device. The communication port 240 may be connected to a network for data communication. In some embodiments, the communication port 240 may be a standardized port or a specially designed port.
Fig. 3 is a schematic diagram of exemplary hardware components and/or software components of an exemplary mobile device according to some embodiments of the present application.
In some embodiments, a terminal (for example, the first client 130 or the second client 140) may be implemented by the mobile device 300. As shown in Fig. 3, the mobile device 300 may include a communication platform 310, a display 320, a graphics processing unit (GPU) 330, a central processing unit (CPU) 340, an input/output 350, a memory 360, and a storage 390. In some embodiments, the mobile device 300 may also include any other suitable components, including but not limited to a system bus or a controller (not shown in the figure). In some embodiments, a mobile operating system 370 and one or more application programs 380 may be loaded from the storage 390 into the memory 360 so as to be executable by the central processing unit 340. The application program 380 may include a browser or any other suitable mobile application for receiving and presenting prompt information or other related information from the server 110. User interaction with the information flow may be implemented through the input/output 350 and provided to the server 110 and/or other components of the human-computer interaction system 100 through the network 120.
Fig. 4 is a block diagram of an exemplary processing device according to some embodiments of the present application. As shown in Fig. 4, the system may include: an extraction module 410, a determination module 420, a response module 430, and an acquisition module 440.
The extraction module 410 may be used to extract an associated feature of the target object based on an interaction instruction directed at the target object. The associated feature includes a first feature corresponding to the interaction instruction and/or a second feature corresponding to historical data, and the first feature includes at least one of a voice feature of the voice data in the interaction instruction and a text feature of the text data corresponding to the interaction instruction. In some embodiments, the extraction module 410 may be used to preprocess the interaction instruction before extracting the associated feature. See Fig. 5 and related descriptions for more details.
The determination module 420 may be used to determine a response strategy for the target object based on processing the associated feature. In some embodiments, the determination module 420 may be used to process the associated feature based on a model to determine the response strategy. The model may be a first classification model, a second classification model, or a third classification model. See Figs. 5, 6, 8, and 10 and related descriptions for more details.
The response module 430 may be used to determine and output a response speech for the target object based on the response strategy. In some embodiments, the response speech is output to the target object in the form of response text and/or response voice.
The acquisition module 440 may be used to acquire interaction instructions and models. For example, the acquisition module 440 may acquire a model through a training process. The acquisition module 440 may be used to acquire training samples. The training samples include a first training sample, a second training sample, and a third training sample. See Figs. 7, 9, and 11 and related descriptions for more details.
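As a non-limiting illustration, the cooperation of the extraction module 410, the determination module 420, and the response module 430 can be sketched as follows; the helper functions are placeholders for the module behavior described above and are assumptions for illustration only.

```python
# Illustrative pipeline: extract associated features, determine a response strategy,
# and determine the response speech. The helpers stand in for modules 410/420/430.
def extract_associated_features(interaction_instruction, historical_data):
    return {"first": interaction_instruction, "second": historical_data}

def determine_response_strategy(features):
    # related to response content, response style and/or response emotion
    return {"content": "praise", "style": "normal praise", "emotion": "enthusiasm"}

def determine_response_speech(strategy):
    return f"[{strategy['emotion']}/{strategy['style']}] {strategy['content']}"

features = extract_associated_features("praise me", ["historical order records"])
strategy = determine_response_strategy(features)
print(determine_response_speech(strategy))
```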
Figure 5 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application. In some embodiments, the process 500 may be executed by a processing device (for example, the processing device 112, the processor 210, or the central processing unit 340).
Step 510: extract associated features of the target object based on an interaction instruction directed at the target object, where the associated features are related to at least one of the interaction instruction and historical data of the target object. In some embodiments, step 510 may be performed by the extraction module 410.
The target object refers to a person who can exchange information with a system or device (for example, a mobile device, a terminal device, etc.), that is, the object to which the system or device needs to respond. Target objects may include device-associated objects (for example, users or communication counterparts), debuggers, testers, implementation personnel, maintenance personnel, customer service personnel, and so on.
In some embodiments, the target object is the user to which the current terminal belongs, for example, the passenger user to which a passenger terminal belongs. The current terminal may refer to the device or system that performs the human-computer interaction, for example, the terminal that receives the interaction instruction. It can be understood that when the target object is the user to which the current terminal belongs, the target object is actually the user who initiated the interaction instruction. In some embodiments, the target object is a counterpart user in communication with the current terminal, for example, a driver user in communication with a passenger terminal, or a passenger user in communication with a driver terminal.
An interaction instruction is an instruction sent to a device. An interaction instruction directed at a target object may refer to an instruction sent to a system or device that enables it to determine how to respond to the target object. In some embodiments, the interaction instruction may be sent to the device by an object associated with the device (such as a user of the device). Through the interaction instruction, the associated object can convey a specific intention to the device (for example, a favorable comment, a like, praising the driver, praising oneself, a complaint, etc.), so that the device or system gives a corresponding response. In some embodiments, the interaction instruction may be obtained from the current terminal.
In some embodiments, the interaction instruction may take the form of voice, text, video, image, facial motion, gesture, touch-screen operation, etc., or any combination thereof. The voice may be in one language or any combination of languages such as Chinese, English, French, and Japanese.
The associated features may refer to features related to the target object, the sender of the interaction instruction, and/or the interaction instruction itself. In some embodiments, the associated features may be related to at least one of the interaction instruction and the historical data of the target object.
In some embodiments, the associated features include a first feature corresponding to the interaction instruction. It can be understood that the first feature is a feature obtained based on the interaction instruction.
As mentioned above, the interaction instruction may take multiple forms, and correspondingly, the features corresponding to the interaction instruction may also be of multiple types. In some embodiments, the first feature may include at least one of a voice feature of voice data in the interaction instruction, a text feature of text data corresponding to the interaction instruction, an image feature corresponding to image data in the interaction instruction, and the like. The text data corresponding to the interaction instruction may be text data contained in the interaction instruction itself, or text data obtained by recognizing voice data and/or other data (for example, image data) in the interaction instruction, for example, by a speech decoder. As mentioned above, the voice data may be multilingual. Correspondingly, a speech decoder may have a one-to-many or one-to-one relationship with languages; that is, one speech decoder may convert voice data of multiple languages into text data, or one speech decoder may only decode voice data of a certain language.
In some embodiments, the interaction instruction may include voice data, and the first feature may include a voice feature corresponding to the voice data and/or a text feature of text data corresponding to the interaction instruction. As mentioned above, the voice may be multilingual, the voice data may be data of multiple languages, and the voice feature may be the voice feature of the multilingual voice data.
In some embodiments, the interaction instruction may not include voice data, and the first feature may include the text feature of the text data corresponding to the interaction instruction.
In some embodiments, the first feature may also include features corresponding to other data in the interaction instruction. For example, if the interaction instruction contains image data, the first feature includes image features corresponding to the image data. As another example, if the interaction instruction contains gestures, postures, or other actions, the first feature may include screen gesture features, facial features, fingerprint features, and the like. Screen gesture features represent screen operation information in the interaction instruction, such as sliding, page-turning, and touching operations. Facial features represent the user's face in the interaction instruction; for example, the processing device may derive different interaction instructions from different facial features, and the facial features may further include pupil features, facial landmark features, iris features, and the like. Fingerprint features represent the fingerprint information of the user's finger; for example, the processing device may derive different interaction instructions from different fingerprint features.
In some embodiments, the voice feature includes one or a combination of audio features and energy features of the voice data.
The audio features of the voice data refer to features of the audio of the voice data. In some embodiments, the audio features may include at least one of a fundamental frequency feature, a short-time energy feature, a short-time amplitude feature, a short-time zero-crossing rate feature, and the like.
The fundamental frequency feature refers to the characteristics of the sound frequency in the voice data. The fundamental frequency corresponds to the frequency of vocal cord vibration and represents the pitch of the sound: the faster the vocal cords vibrate, the higher the fundamental frequency. The fundamental frequency features of voice data can be used for speech noise detection, special sound detection, gender discrimination, speaker recognition, parameter adaptation, and so on.
The short-time energy feature refers to the average energy of the sampled signal within a short-time audio frame. For example, a continuous audio signal stream x yields K sample points, and these K sample points can be divided into M short-time frames; assuming that each short-time frame and the window function have size N, the short-time energy of the m-th short-time frame can be calculated according to the formula E_m = Σ_{n=0}^{N-1} [x_m(n)]², where x_m(n) denotes the n-th windowed sample of the m-th frame.
The short-time zero-crossing rate feature refers to the number of times the signal crosses zero within each frame. For a continuous speech signal with a time axis, the time-domain waveform of the speech can be observed crossing the horizontal axis. In the case of a discrete-time speech signal, a zero-crossing is said to occur when adjacent samples have different algebraic signs, so the number of zero-crossings, i.e., the zero-crossing rate, can be counted. The zero-crossing rate reflects the frequency information of the signal to a certain extent. The short-time zero-crossing rate can be used to distinguish unvoiced and voiced speech: a high zero-crossing rate indicates unvoiced sound, and a low zero-crossing rate indicates voiced sound.
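By way of illustration only, the following minimal sketch computes per-frame short-time energy and zero-crossing rate with NumPy. The frame length, hop length, and windowing choices are assumptions for the example and are not values prescribed by this application.

```python
import numpy as np

def short_time_features(signal, frame_len=400, hop_len=160):
    """Compute per-frame short-time energy and zero-crossing rate.

    `signal` is a 1-D array of audio samples; `frame_len` and `hop_len`
    are in samples (for example 25 ms / 10 ms at 16 kHz).
    """
    window = np.hamming(frame_len)
    energies, zcrs = [], []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len] * window
        # Short-time energy: mean of squared (windowed) samples in the frame.
        energies.append(np.mean(frame ** 2))
        # Zero-crossing rate: fraction of adjacent sample pairs with opposite signs.
        signs = np.sign(frame)
        zcrs.append(np.mean(signs[:-1] * signs[1:] < 0))
    return np.array(energies), np.array(zcrs)

# Example: one second of random audio at 16 kHz as a stand-in for speech.
energy, zcr = short_time_features(np.random.randn(16000))
```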
The energy features of voice data refer to the energy distribution of the voice data in the frequency domain; different energy distributions represent different voice characteristics. The frequency domain is a coordinate system used to describe the frequency characteristics of a speech signal. In some embodiments, a frequency-domain diagram may display the energy value of the speech signal in each given frequency band within a frequency range.
In some embodiments, the energy features include at least one of Fbank features and mel-frequency cepstral coefficient (MFCC) features.
MFCC features refer to features of the speech signal obtained by the MFCC method. MFCC features have good discriminability and are used to distinguish different sounds. In some embodiments, MFCC features are commonly used for automatic speech and speaker recognition. For details about the Fbank feature and its extraction, see Figure 17 and its related description, which will not be repeated here.
In some embodiments, the voice features further include linear prediction coefficients (LPC), perceptual linear predictive (PLP) coefficients, Tandem features, Bottleneck features, linear predictive cepstral coefficients (LPCC), formants, the Bark spectrum, and so on.
In some embodiments, voice features can be extracted through algorithms or models, using the algorithm corresponding to each voice feature type to extract that type of feature. For example, MFCC features are extracted using a triangular band-pass filter bank.
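As one possible illustration, the open-source librosa library (an assumption of this sketch, not part of the present application) can extract MFCC and log-mel (Fbank-style) features; librosa's mel filter bank is itself a set of triangular band-pass filters. The signal, sampling rate, and frame parameters below are placeholders.

```python
import numpy as np
import librosa

sr = 16000
y = np.random.randn(sr).astype(np.float32)   # stand-in for one second of speech
# In practice: y, sr = librosa.load("utterance.wav", sr=16000)

# 13 MFCCs per frame, computed from a triangular mel filter bank.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)

# Log-mel filter-bank energies, often used as "Fbank" features.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40)
fbank = np.log(mel + 1e-6)

print(mfcc.shape, fbank.shape)   # (13, n_frames), (40, n_frames)
```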
In some embodiments, before the voice feature is extracted, the voice data may be preprocessed. The preprocessing includes at least one of framing, pre-emphasis, windowing, and noise processing.
Framing is used to split the voice data into multiple voice segments, reducing the amount of data processed at a time. The data may be split according to a predetermined value or a predetermined range (for example, a frame of 10 ms to 30 ms). To avoid missing information, an offset may be applied during framing so that adjacent frames overlap. In some embodiments, the voice data is a short sentence, and segmentation is unnecessary in some scenarios.
Pre-emphasis is used to strengthen the high-frequency part. Pre-emphasis can be implemented by passing the voice data through a high-pass filter.
Windowing is used to eliminate signal discontinuities that may arise at both ends of each frame. For example, each frame is multiplied by a Hamming window to increase the continuity between the left and right ends of the frame.
Noise processing may be the addition of random noise, which can resolve processing errors and omissions in synthesized audio. Noise processing may also include denoising, which can be implemented by denoising algorithms such as adaptive filters, spectral subtraction, and Wiener filtering. In some embodiments, the voice data is collected in real time, and noise processing is unnecessary in some scenarios.
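A minimal sketch of the pre-emphasis, framing, and windowing steps is given below, assuming a signal at least one frame long; the pre-emphasis coefficient and frame sizes are typical example values rather than values fixed by this application.

```python
import numpy as np

def preprocess(signal, frame_len=400, hop_len=160, alpha=0.97):
    """Pre-emphasis, overlapping framing, and Hamming windowing.

    Returns an array of shape (n_frames, frame_len).
    """
    # Pre-emphasis: first-order high-pass filter y[n] = x[n] - alpha * x[n-1].
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # Framing with overlap (hop_len < frame_len keeps adjacent frames overlapping).
    n_frames = 1 + (len(emphasized) - frame_len) // hop_len
    frames = np.stack([
        emphasized[i * hop_len: i * hop_len + frame_len] for i in range(n_frames)
    ])

    # Windowing: multiply each frame by a Hamming window to smooth frame edges.
    return frames * np.hamming(frame_len)

frames = preprocess(np.random.randn(16000))  # e.g. one second of audio at 16 kHz
```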
文本特征代表文本数据的相关信息,包括但不限于:关键词特征、语义特征、词频特征等。在一些实施例中,可以通过算法或模型提取文本特征,例如,通过LSTM、BERT、one-hot、词袋(Bag-of-words)模型、词频与逆向文件频率(term frequency–inverse document frequency,TF-IDF)模型、词汇表模型等。The text features represent relevant information of the text data, including but not limited to: keyword features, semantic features, word frequency features, etc. In some embodiments, the text features may be extracted through algorithms or models, for example, through LSTM, BERT, one-hot, bag-of-words model, term frequency and inverse document frequency (term frequency—inverse document frequency, TF-IDF) model, vocabulary model, etc.
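As an illustration of one of the listed options, the sketch below extracts TF-IDF word-frequency features with scikit-learn; the example sentences are invented placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A few example utterances transcribed from interaction instructions
# (the sentences are illustrative only).
texts = [
    "please praise the driver",
    "the driver's service is very good",
    "I would like to file a complaint",
]

vectorizer = TfidfVectorizer()          # bag-of-words weighted by TF-IDF
features = vectorizer.fit_transform(texts)

print(features.shape)                   # (3, vocabulary size)
print(vectorizer.get_feature_names_out())
```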
The historical data of the target object refers to data generated by the target object over a past period of time, for example, the last week, the last month, or the last three days. The historical data may include, but is not limited to, one or more of online voice data, personal account data, user behavior data, and offline recorded data.
Online voice data comes from the online voice of the target object, that is, voice uttered by the target object online, for example, online voice uttered over a past period of time. For example, it may be voice interaction instructions with which the target object previously requested the device to respond, or voice data of past communication between the target object and a communication counterpart. In some embodiments, online voice data can be converted into text data and used as historical data; for example, the historical online voice of the target object is acquired and recognized to obtain the corresponding text, and that text is used as a kind of historical data.
Personal account data comes from the account information of the target object to which the terminal belongs. Personal account data may include, but is not limited to, personality, occupation, the number of orders placed with a ride-hailing APP (application), reputation score, gender, account age, number of historical orders, pick-up and drop-off points in historical orders, ride time information in historical orders, and so on.
Target object behavior data may be data generated by historical operations or feedback of the target object. For example, it may be evaluation feedback data of the target object on previously pushed response words, or evaluation information of the target object (for example, a driver or a passenger) about a communication counterpart (for example, a passenger or a driver). As another example, it may include the user's evaluations of historical orders, chat records with customer service, evaluations of customer service, evaluations of the system, evaluations of information pushed by the system, and other information.
Offline recorded data may be data recorded by the terminal, for example, data recorded by the terminal while offline, such as the target object's historical voice recognition results or input text recorded on the offline side.
In some embodiments, the historical data may also include other information, including but not limited to the user's historical input information, the user's geographic information, and the user's identity information. The user's historical input information includes the user's historical query information, video input information, voice input information, text input information, screen gesture operation information, unlocking information, and so on. The user's geographic information includes the user's home address, work address, activity range, etc. The user's identity information includes the user's age, occupation, place of origin, height, weight, income, etc.
It should be understood that, for any target object, its historical data is continuously updated over time.
The second feature may be a feature determined based on historical data, including features determined based on the historical data of the target object and features determined based on the historical data of the sender of the interaction instruction. In some embodiments, the type of the second feature may be determined according to the type of the historical data; for example, if the historical data contains voice data, the second feature contains voice features of the historical data, and correspondingly, the extraction of the second feature is similar to that of the first feature and is not repeated here. In some embodiments, time-series behavior features may also be extracted based on the historical data: the historical data of the target object or of the sender of the interaction instruction is ordered by time, one or more specific behaviors in the historical data are converted into a sequence feature, and each behavior in the sequence feature is represented by a vector. For travel scenarios, the specific behaviors include service-related behaviors such as the number of orders placed and the degree of favorable comments. Thus, when the response strategy is determined based on the second feature, not only the actual behaviors but also the time factor can be considered, so that the assessment of the target object or of the sender of the interaction instruction is more accurate, and further, the determination of the response strategy is also more accurate.
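A minimal sketch of turning historical behavior records into a time-ordered sequence of vectors is shown below. The record fields ("orders", "praise_rate") and the linear time weighting are illustrative assumptions, not elements defined by this application.

```python
import numpy as np

# Hypothetical per-day behavior records for one user, ordered by time.
history = [
    {"orders": 3, "praise_rate": 0.9},
    {"orders": 1, "praise_rate": 1.0},
    {"orders": 5, "praise_rate": 0.8},
]

# Each behavior record becomes one vector in the sequence; more recent
# records can be given larger time weights.
weights = np.linspace(0.5, 1.0, num=len(history))
sequence = np.stack([
    w * np.array([rec["orders"], rec["praise_rate"]])
    for w, rec in zip(weights, history)
])

print(sequence.shape)  # (3 time steps, 2 behavior dimensions)
```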
In some embodiments, the associated features may be represented by vectors. In some embodiments, after the associated features are extracted, post-processing such as normalization may be performed on them. For more details about normalization, see Figure 20 and its related description.
Step 520: determine a response strategy for the target object based on processing of the associated features, where the response strategy is related to at least one of response content, response style, and response emotion. In some embodiments, step 520 may be performed by the determining module 420.
A response strategy refers to the method and/or criterion for responding to the interaction instruction, and the response strategy is related to at least one of response content, response style, and response emotion.
Response content refers to the semantic content of the response to the interaction instruction. In some embodiments, the same response content can be expressed in different wording. For example, the response content "the driver's service is good" can be expressed as "the driver's service is very good", "the driver's service is great", "the driver's service is really good", and so on.
Response style refers to the degree of the response to the interaction instruction. For example, when praising someone, the style may be strong praise (intense praise, a high degree of praise), normal praise, or slight praise (mild praise, a low degree of praise).
Response emotion refers to the emotion or mood carried when responding to the interaction instruction. Response emotions may include loss, calm, enthusiasm, happiness, excitement, joy, exhilaration, sadness, anger, irritability, pain, passion, and so on.
In some embodiments, the associated features can be processed through a model to determine the response strategy for the target object. The model is composed of a multi-layer residual network and a multi-layer fully connected network, where the multi-layer residual network is built from convolutional neural networks; or it is composed of a multi-layer convolutional neural network and a multi-layer fully connected network. The model may be the first classification model, the second classification model, or the third classification model, as described below.
In some embodiments, the determining module 420 may process the associated features to determine the response content corresponding to designated words. For example, the determining module 420 may process the voice features based on the first classification model to determine whether the voice data contains designated words in any one of multiple languages; if the voice data contains designated words, the response content corresponding to the designated words is determined. For details, see Figure 6 and its description, which will not be repeated here.
In some embodiments, the determining module 420 may determine the response emotion based on processing of the associated features. For example, the determining module 420 may process at least one of the voice features and the text features based on the second classification model to determine the emotion to be responded to in the voice data, and then determine the response emotion based on that emotion. For details, see Figure 13 and its description, which will not be repeated here.
In some embodiments, the determining module 420 may determine the response style based on processing of the associated features. For example, the determining module may process at least one of the first feature and the second feature based on the third classification model to determine the response style. For details, see Figure 15 and its description, which will not be repeated here.
In some embodiments, the determining module 420 may determine the response strategy based on processing of the interaction instruction and the historical data. For example, the determining module 420 may process at least one of the interaction instruction and the historical data based on a response strategy model to determine the response strategy. In some embodiments, the response strategy model may be a separate model, or any combination of the first classification model, the second classification model, and the third classification model.
In some embodiments, the determining module 420 may obtain information such as the current weather and real-time road conditions, and may adjust the response strategy according to such information. For example, in bad weather, the weight of the "soothing" emotion in the response strategy (for example, the response emotion) can be appropriately increased. As another example, when the weather is fine and the road conditions are smooth, the weight of the "happy" emotion in the response strategy (for example, the response emotion) can be appropriately increased.
In some embodiments, the determining module 420 may combine the interaction instruction of the target object, the historical data of the target object, the current weather, and the real-time road conditions into a feature sequence, and input the combined feature sequence into an RNN-based embedding model to obtain an instruction representation vector. Further, the instruction representation vector, information dimension data of the service platform, and other features are input into a response strategy prediction model to obtain the response strategy output by the response strategy prediction model. The embedding model and the response strategy prediction model can be obtained through joint training.
The combined feature sequence consists of feature combination values at several time points. The feature combination value at each time point is formed by combining the interaction instruction of the target object, the historical data of the target object, the weather data, and the real-time road condition data at that time point, multiplied by a time weight coefficient. The time weight coefficient may differ according to how far the time point is from the present; the weight coefficient of a time point closer to the present may be larger. In some embodiments, when forming the feature combination value, the real-time road condition may be transformed according to the current weather, so that the feature value reflecting road congestion under extreme weather is reduced, thereby weakening the influence of such special data. In some embodiments, the above transformation may be: reducing the weight of the real-time road condition to 0.01.
Through the combination of the above methods, the influence of different factors on the prediction result, and especially the interrelationships among these factors, can be better reflected. For example, the influence of weather is temporally correlated, and the influence of real-time road conditions is also correlated; RNN-based processing can reflect the relationship between earlier and later time points and make the response strategy more accurate.
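A minimal sketch of this arrangement is given below, assuming PyTorch, a GRU as the RNN, and invented feature dimensions and time-weighting values; it only illustrates the shape of the pipeline (time-weighted sequence into an embedding model, then a strategy predictor), not the models actually trained by this application.

```python
import torch
import torch.nn as nn

class InstructionEmbedder(nn.Module):
    """GRU-based embedding model over a time-weighted feature sequence."""
    def __init__(self, feat_dim=16, hidden=32):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)

    def forward(self, seq):                      # seq: (batch, time, feat_dim)
        steps = seq.size(1)
        # Later time points get larger weights (example weighting scheme).
        w = torch.linspace(0.5, 1.0, steps).view(1, steps, 1)
        _, h = self.rnn(seq * w)
        return h[-1]                             # instruction representation vector

class StrategyPredictor(nn.Module):
    """Maps the instruction vector plus platform features to a strategy class."""
    def __init__(self, embed_dim=32, platform_dim=8, n_strategies=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(embed_dim + platform_dim, 64), nn.ReLU(),
            nn.Linear(64, n_strategies),
        )

    def forward(self, embed, platform_feats):
        return self.fc(torch.cat([embed, platform_feats], dim=-1))

embedder, predictor = InstructionEmbedder(), StrategyPredictor()
vec = embedder(torch.randn(2, 5, 16))            # 2 samples, 5 time points
logits = predictor(vec, torch.randn(2, 8))       # strategy scores per sample
```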
Step 530: determine response words for the target object based on the response strategy. In some embodiments, step 530 may be performed by the response module 430.
Response words are the language information that the device or system feeds back to the target object in response to the interaction instruction. It can be understood that response words are the specific language that is output, and this language may carry information such as response emotion, response style, and/or response content. For example, the response words "The driver is great!" may be played passionately (i.e., the response emotion) and emphatically (i.e., the response style) to praise the driver's good service (i.e., the response content). It can be understood that one or more wordings may express the same response strategy (for example, response content, response style, or response emotion); that is, there may be a one-to-one or one-to-many relationship between response strategies and response words.
In some embodiments, the response module 430 may determine the response words based on the response strategy. In some embodiments, the response module 430 may determine the content to be expressed by the response words according to the determined response content, and then determine the specific way of expressing the response content according to the response emotion and/or response style, including whether to add modal particles, degree words, or other words embodying emotion and style, or the intonation of the output voice, and so on.
For example, if the response strategy is response content of praise, the output response words are words of praise, such as "You are good". If the response strategy is a strongly praising response style, degree words such as "very" or "really" are added to the response content, for example, "You are very good". As another example, if the response strategy is a joyful response emotion, the output response words may carry modal particles expressing joy, for example, "You are very good indeed".
In some embodiments, the database may store preset words for the same or different response content, response emotion, and/or response style, so that the response words corresponding to a response strategy (response content, response emotion, and/or response style) can be obtained from the database. The stored preset words may be customized and recorded in advance by users (for example, driver-side users or passenger-side users), or may be preset in advance by developers.
In some embodiments, the language or words corresponding to the response content, response emotion, and/or response style may also be extracted from a public platform (for example, Wikipedia) to generate the response words.
In some embodiments, the response words may also be generated by a model or algorithm, for example, a transformer or a BERT model. For example, a dictionary and the designated words are input into the model, and the response words are output.
As mentioned above, there may be a one-to-many or one-to-one relationship between response strategies and response words. In some embodiments, one or more preset words may be obtained from a storage device, or one or more preset words may be generated, based on the response strategy. When there are multiple preset words, one of them needs to be determined as the response words for output. In some embodiments, the terminal or the response module 430 may automatically select one of the multiple preset words as the response words according to a preset rule and output it. For example, the terminal may randomly select one preset phrase as the response words, or the terminal may use the preset phrase most frequently used by the user or by a user group as the response words. The user group may be all users, all passenger terminals, all driver terminals, all users in the area where the user is located (for example, a city, a district, or a custom area such as a circular area within a 5-kilometer radius), and so on.
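The selection rule can be as simple as a lookup keyed by the response strategy plus a frequency count. The sketch below is only illustrative: the strategy key, preset phrases, and usage log are invented placeholders.

```python
import random
from collections import Counter

# Hypothetical preset words keyed by (content, style, emotion).
PRESETS = {
    ("praise_driver", "strong", "joy"): [
        "The driver is great!",
        "Best driver ever, thank you!",
    ],
}

# Hypothetical usage log of previously output phrases for this user group.
usage_log = ["The driver is great!", "The driver is great!", "Best driver ever, thank you!"]

def pick_response(strategy, by_frequency=True):
    candidates = PRESETS.get(strategy, [])
    if not candidates:
        return None
    if by_frequency:
        counts = Counter(p for p in usage_log if p in candidates)
        if counts:
            return counts.most_common(1)[0][0]   # most frequently used preset
    return random.choice(candidates)             # otherwise pick one at random

print(pick_response(("praise_driver", "strong", "joy")))
```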
In some embodiments, the response module 430 may output the response words to the target object in the form of response text and/or response voice. For example, the text corresponding to the response words is converted into speech and output to the target object. In some embodiments, whether to output response voice or response text may be determined according to the actual scene. For example, when outputting response words, if the current terminal is the driver terminal and the driver terminal is currently in a vehicle-driving state, only the response voice may be output; outputting response text is avoided so as not to distract the driver and cause driving safety problems. In addition, in this scenario, the response voice and the response text may also be output at the same time. As another example, when outputting response words, it can be detected whether the terminal is in an audio or video playback state; if so, the response text is output; otherwise, one or more of the response text and the response voice may be output. As another example, the response words may be displayed in a preset display interface, or in the status bar or notification bar. For example, if the driver terminal is in a vehicle-driving state, the response words may be output on the current display interface; if the current terminal is in an audio or video playback state, the response words may be displayed in a small window in the status bar or notification bar.
In some embodiments, the semantics of the response voice and the response text may be the same or different, which may be set according to the scene. For example, for the same response content "praise the driver for good service", the response voice may be "The driver is the sunniest" and the response text may also be "The driver is the sunniest", which are semantically identical; alternatively, the response text may be "Through wind and rain, thank you for your hard work".
In some embodiments, the response module 430 may also output the response words to the target object in other ways, for example, as an image or a video, such as producing a video or picture expressing the actual content of the response words as the output.
As mentioned above, the target object may be the user to which the current terminal belongs or a communication counterpart. In some embodiments, the interaction instruction or voice data may be directed at the user himself or herself, or at the counterpart user of the communication.
Take an interaction instruction that contains designated words praising a user as an example, and specifically take a pair of driver and passenger terminals that are currently communicating as an example; for the designated words, see Figure 6 and its related description. For the driver terminal, if the voice data sent by the driver terminal contains the designated words "praise me" or "praise the driver", the designated words are directed at the driver himself or herself; if the voice data sent by the driver terminal contains the designated words "praise the passenger", the designated words are directed at the counterpart user of the current communication, that is, at the passenger terminal. Conversely, for the passenger terminal, if the voice data sent by the passenger terminal contains the designated words "praise me" or "praise the passenger", the designated words are directed at the passenger himself or herself; if the voice data sent by the passenger terminal contains the designated words "praise the driver", the designated words are directed at the counterpart user of the current communication, that is, at the driver terminal.
Depending on which target object the interaction instruction is directed at, when the response words are output, they may be output to the user's own terminal (that is, the current terminal) and/or to the counterpart user's terminal. In other words, when the interaction instruction is directed at the user himself or herself, the response words are output directly on the current terminal; when the interaction instruction is directed at the counterpart user of the current communication, the response words are output to the counterpart user's terminal, and may also be output to the current terminal at the same time.
Figure 6 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application. In some embodiments, the process 600 may be executed by a processing device (for example, the processing device 112, the processor 210, or the central processing unit 340).
Step 610: extract associated features of the target object based on an interaction instruction directed at the target object. In some embodiments, step 610 may be performed by the extraction module 410.
In some embodiments, the interaction instruction of the target object includes voice data, and the associated features of the target object may include voice features of the voice data. As described in Figure 5, the voice data may be voice data in any one or a combination of multiple languages, and the voice features may be voice features of multilingual voice data. In some embodiments, the associated features of the target object may further include text features of the text data corresponding to the interaction instruction. For the extraction of voice features and text features, see step 510 and its related description, which will not be repeated here.
Step 620: process the voice features based on the first classification model, and determine whether the voice data contains designated words in any one of multiple languages. In some embodiments, step 620 may be performed by the determining module 420.
Designated words may be language that contains (but is not limited to) specific words or sentences. Designated words can be used to determine the semantic information of the response words. Designated words may be the characteristic words or semantically similar words that the response words need to contain, or the object at which the response words are directed, and so on. Designated words can be determined according to actual needs. For example, for a scene of praising the service provider, the designated words may include "praise", "compliment", "encourage", "reward", or "good", and may also include "passenger", "driver", "service provider", "service requester", "me", and so on.
The first classification model refers to a computational model implemented by a computing device, and is a model that determines whether designated words are contained. In some embodiments, the first classification model may process the voice features and output a classification recognition result to determine whether designated words are contained. The first classification model may process the text features and output a classification recognition result to determine whether designated words are contained. The first classification model may also process both the voice features and the text features to determine whether designated words are contained. For more details about the classification recognition result, see step 1730. In some embodiments, if the first classification model can process both voice features and text features, the first classification model may include two sub-classification models, namely a classification model that processes voice features and a classification model that processes text features. The first classification model may also be a single model for processing text features and voice features, where the voice features and text features can be input into the first classification model in the same form (for example, a vector or a normalized vector). The first classification model can be obtained through end-to-end training.
In some embodiments, the first classification model may have a one-to-one or one-to-many relationship with languages. For example, all languages may correspond to the same first classification model, or different languages may correspond to different first classification models.
In some embodiments, when different languages correspond to different first classification models, the voice data may be recognized (for example, by a speech decoder) before the voice features are input into the first classification model, so as to identify the language of the voice data; further, the voice features are input into the first classification model corresponding to the recognized language to determine whether the voice data contains designated words.
In some embodiments, multiple languages may correspond to the same first classification model; that is, the first classification model can process voice features of multiple languages and determine whether the voice data contains any one of the multilingual designated words.
The first classification model or sub-classification model is a machine learning model. In some embodiments, the first classification model or sub-classification model may be a classification model or a regression model. The types of the first classification model or sub-classification model include but are not limited to a neural network (NN), a convolutional neural network (CNN), a deep neural network (DNN), a recurrent neural network (RNN), and the like, or any combination thereof; for example, the first classification model or sub-classification model may be a model formed by combining a convolutional neural network and a deep neural network.
In some embodiments, the first classification model or sub-classification model may be composed of a multi-layer convolutional neural network (CNN) and a multi-layer fully connected network, or of a multi-layer CNN residual network and a multi-layer fully connected network. For example, the first classification model may be a 5-layer CNN residual network and a 3-layer fully connected network.
In some embodiments, the first classification model may build a residual network on the CNN to extract hidden-layer features of the voice data, and then use the multi-layer fully connected network to map the hidden-layer features output by the residual network, so that a multi-class recognition result is obtained through a softmax classification output.
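A minimal PyTorch sketch of such a structure (CNN residual blocks followed by fully connected layers and a softmax output) is shown below. The feature dimension, channel width, pooling choice, and block count are illustrative assumptions and do not reproduce the exact model of this application.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """One 1-D convolutional residual block over a frame-feature sequence."""
    def __init__(self, channels=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return torch.relu(x + self.conv(x))      # residual connection

class FirstClassifier(nn.Module):
    """CNN residual blocks + fully connected layers + softmax output."""
    def __init__(self, feat_dim=40, channels=32, n_classes=2):
        super().__init__()
        self.stem = nn.Conv1d(feat_dim, channels, kernel_size=3, padding=1)
        self.res = nn.Sequential(*[ResBlock(channels) for _ in range(5)])
        self.fc = nn.Sequential(
            nn.Linear(channels, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, n_classes),
        )

    def forward(self, x):                         # x: (batch, frames, feat_dim)
        h = self.res(self.stem(x.transpose(1, 2)))
        h = h.mean(dim=-1)                        # pool hidden features over time
        return torch.softmax(self.fc(h), dim=-1)  # class probabilities

probs = FirstClassifier()(torch.randn(4, 100, 40))  # 4 utterances, 100 frames of Fbank
```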
In some embodiments, in the first classification model used, the CNN network structure can be used to extract features instead of a single fully connected network, so that the scale of the network parameters can be effectively kept from becoming too large while the recognition accuracy is guaranteed, avoiding the problem that the first classification model is too large to be effectively deployed on the terminal side.
In some embodiments, the sub-classification model used to process text features may also be a text processing model, for example, a BERT model.
In some embodiments, the first classification model may be obtained by offline training in advance and deployed on the terminal device. The first classification model may also be obtained by offline training in advance and stored in a storage device or deployed on the cloud, with the terminal device having access rights to the storage device or the cloud. The first classification model may also be obtained by online training in real time based on current data. For details about the training of the first classification model, see Figure 7 and its related description, which will not be repeated here.
In some embodiments, the determining module 420 may also determine in other ways whether the voice data contains designated words in any one of multiple languages. For example, the text features may be processed by rules to determine whether designated words are contained. As another example, the result determined by processing the voice features based on the first classification model and the result determined by processing the text features based on rules may be fused (for example, by weighted summation or weighted averaging) to finally determine whether the interaction instruction contains designated words.
Step 630: in response to the voice data containing designated words, determine the response content corresponding to the designated words. In some embodiments, step 630 may be performed by the response module 430.
In some embodiments, when the recognition result of the first classification model indicates that the voice data contains designated words, the response content corresponding to the designated words is determined, and the response words are further determined and output based on the response content.
As mentioned above, the response content may be semantic information. When the voice data contains designated words, the response content may be the designated words themselves, or content semantically similar to the designated words.
As mentioned above, the same response content can be expressed in multiple wordings, that is, it corresponds to multiple response words. It can be understood that designated words may correspond to one or more response words; that is, there may be a one-to-one or one-to-many relationship between designated words and their corresponding response words. The response words corresponding to different designated words may be the same or different. In some implementations, the designated words and their corresponding response words may be stored in a database or a storage device, and the response words corresponding to designated words may be obtained from the storage device.
In some embodiments, after the response content is determined, the response emotion or response style may also be determined, so that the response words are output based on a combination of the response content, the response emotion, and/or the response style. For the determination of the response emotion and the response style, see Figures 8 and 10 and their related descriptions. For more details about obtaining the response words, see step 530 and its related description.
Figure 7 is an exemplary flowchart of training the first classification model according to some embodiments of the present application. In some embodiments, the process 700 may be executed by a processing device (for example, the processing device 112, the processor 210, or the central processing unit 340).
Step 710: obtain multiple first training samples. In some embodiments, step 710 may be performed by the obtaining module 440.
In some embodiments, the first classification model may be trained based on multiple first training samples. Each first training sample may include voice data, that is, first sample voice data. In some embodiments, the first sample voice data may be multilingual voice data; for example, it may be English voice data, Japanese voice data, Chinese voice data, Korean voice data, etc., without being exhaustive.
The multiple first training samples include positive samples and negative samples. A positive sample is first sample voice data related to the designated words, for example, first sample voice data carrying the designated words or carrying semantics similar to the designated words. A negative sample is first sample voice data that is not related to the designated words, for example, voice data that does not contain the designated words or that does not contain words with the same semantics as the designated words.
In some embodiments, positive and negative samples can be labeled; for example, a positive sample is labeled 1 and a negative sample is labeled 0. The result output by the trained first classification model can be a probability between 0 and 1, or a classification result indicating whether the designated words are contained. For more details about the result output by the first classification model (i.e., the classification recognition result), see Figure 17 and its related description.
In some embodiments, the first training samples can be obtained from a storage device or a database. Historical data can also be obtained from the service platform, the client, etc. as the first training samples.
In some embodiments, a sample words recognition result of the first sample voice data may be obtained first; the sample words recognition result indicates whether the first sample voice data is related to the designated words. Further, the first sample voice data can be labeled, or classified into positive and negative samples, according to the sample words recognition result.
In some embodiments, the sample words recognition result may be a sample text recognition result, a label obtained by manual annotation, or a combination of the two. In other words, the first sample voice data may be converted into first sample text data, and text recognition for the designated words may be performed on the first sample text data to obtain the sample words recognition result of the first sample text data; and/or a sample words recognition result determined manually based on the first sample voice data and/or the first sample text data may be received.
In some embodiments, the first sample voice data can be converted into first sample text data based on a speech decoder (also called a speech converter); further, by recognizing or analyzing the first sample text data, it is determined whether the first sample text data is related to the designated words, so as to determine the positive/negative label of the first sample voice data. For example, whether it is related to the designated words is determined by keyword matching, text similarity, or the like. For example, the text similarity between the first sample text data and the designated-words text is calculated by text matching (for example, Euclidean distance); if the text similarity reaches (is greater than, or greater than or equal to) a preset similarity threshold, the first sample text data is a positive sample; otherwise, it is a negative sample. The similarity threshold is not particularly limited and may be, for example, 80%. In some embodiments, the character accuracy of the first sample text data can also be calculated and used as an evaluation criterion for calculating the text similarity.
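A minimal sketch of threshold-based labeling is shown below. It uses a simple character-level similarity ratio in place of a specific text-matching metric, and the designated phrases and threshold are example values, not values fixed by this application.

```python
from difflib import SequenceMatcher

DESIGNATED = ["praise the driver", "praise me"]   # example designated words
THRESHOLD = 0.8                                   # example similarity threshold

def label_sample(transcript):
    """Label a transcribed sample as positive (1) or negative (0).

    Any text similarity measure (keyword matching, embedding distance, etc.)
    could replace the character-level ratio used here.
    """
    best = max(SequenceMatcher(None, transcript, d).ratio() for d in DESIGNATED)
    return 1 if best >= THRESHOLD else 0

print(label_sample("please praise the driver"))   # likely labeled 1
print(label_sample("where is my order"))          # labeled 0
```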
As mentioned above, the first sample speech data may be multilingual speech data. In some embodiments, the first sample speech data may be converted into the first sample text data based on a speech converter corresponding to the language of the first sample speech data, and it is then determined whether the specified phrase is contained, that is, whether the first sample speech data is a positive sample or a negative sample.
In some embodiments, if the first sample text data cannot be obtained, if the decoding accuracy of the speech decoder cannot meet a preset recognition requirement, or if the accuracy of determining positive and negative samples through text similarity or keyword matching is low, the classification may also be completed in combination with manual annotation. For example, the results output based on text similarity, or the first sample speech data that was not successfully recognized or was recognized with low accuracy, may be displayed on a screen so that a user can verify or correct (annotate) the automatic classification results; the manually annotated results are then used as the labels.
In some embodiments, the ratio of positive samples to negative samples may be controlled. For example, the ratio of positive samples to negative samples may be controlled to be 7:3. Accordingly, this step may also involve screening the positive and negative samples so that the ratio of positive to negative samples falls within a preset ratio range.
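The screening step can be sketched as simple downsampling of whichever class is over-represented; this is only one possible way to reach the preset ratio, and the 7:3 default mirrors the example above.

```python
import random

def balance_samples(positives, negatives, ratio=(7, 3), seed=0):
    """Downsample the over-represented class so positives:negatives is at most 7:3."""
    rng = random.Random(seed)
    p, n = ratio
    if len(positives) * n > len(negatives) * p:                  # too many positives
        positives = rng.sample(positives, len(negatives) * p // n)
    else:                                                        # too many negatives
        negatives = rng.sample(negatives, len(positives) * n // p)
    return positives, negatives
```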
As mentioned above, the first classification model may process text features. In some embodiments, the first training samples may also include first sample text data. Correspondingly, samples related to the specified phrase are positive samples, and samples unrelated to the specified phrase are negative samples.
Step 720: train an initial first classification model based on the multiple first training samples to obtain the first classification model. In some embodiments, step 720 may be performed by the acquisition module 440.
In some embodiments, the first classification model may be trained through various methods based on the first training samples, and the parameters of the initial first classification model are updated to obtain the trained first classification model. Training methods include, but are not limited to, gradient descent, the least squares method, variable learning rates, cross entropy, stochastic gradient methods, and cross-validation. It can be understood that, after training with the positive and negative samples is completed, the obtained first classification model has the same network structure as the initial first classification model.
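A minimal training sketch is given below, assuming a small feed-forward network over fixed-size speech-feature vectors with cross-entropy loss and gradient descent; the architecture, feature dimension, learning rate, and early-stop threshold are all assumptions rather than values fixed by this disclosure.

```python
import torch
from torch import nn

feature_dim = 40                                               # assumed feature size
model = nn.Sequential(nn.Linear(feature_dim, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)       # gradient descent
loss_fn = nn.BCEWithLogitsLoss()                               # cross-entropy on 0/1 labels

def train(features: torch.Tensor, labels: torch.Tensor, epochs: int = 100):
    """features: (num_samples, feature_dim); labels: (num_samples,), 1 = positive, 0 = negative."""
    for _ in range(epochs):
        optimizer.zero_grad()
        logits = model(features).squeeze(-1)
        loss = loss_fn(logits, labels.float())
        loss.backward()
        optimizer.step()
        if loss.item() < 1e-3:      # a preset loss threshold ends training early
            break
    return model
```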
In some embodiments, speech features of the positive and negative samples may be extracted, and the model may then be trained using these speech features. It should be understood that speech feature extraction for the positive and negative samples is performed in the same manner as in the foregoing step 610, and will not be repeated here.
In some embodiments, the positive and negative sample data may be mixed in a certain ratio during training, for example, a ratio of 7:3. In some embodiments, training may be performed on whole sentences; for example, the first sample speech data may be the speech data of a complete sentence.
In some embodiments, training ends when the trained first classification model meets a preset condition. The preset condition may be that the loss function converges or falls below a preset threshold, or that the number of training epochs reaches a threshold.
In some embodiments, the parameters in the initial first classification model may be initialized, the model is then trained with the positive and negative samples, and the parameters of the initial first classification model are adjusted according to the accuracy of the outputs on the positive and negative samples. The training process is repeated multiple times until parameters yielding a high classification accuracy are obtained, and these parameters are used as the parameters of the first classification model.
In some embodiments, a test set may be constructed and used to test the classification results of the first classification model. Specifically, the real performance of the first classification model may be evaluated by computing the accuracy and the false recognition rate of the predictions, and based on this performance, the parameters of the first classification model may be adjusted, or further training may be performed.
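For instance, the two evaluation quantities mentioned above could be computed on a held-out test set as follows; treating "false recognition" as a negative sample predicted as positive is an interpretation, not a definition from this disclosure.

```python
def evaluate(predictions, labels):
    """predictions/labels are equal-length lists of 0/1 values."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    false_pos = sum(p == 1 and y == 0 for p, y in zip(predictions, labels))
    negatives = sum(y == 0 for y in labels)
    accuracy = correct / len(labels)
    false_recognition_rate = false_pos / negatives if negatives else 0.0
    return accuracy, false_recognition_rate
```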
As mentioned above, the first classification model may have a one-to-one or many-to-many relationship with languages. In a one-to-one relationship, the first classification model is trained with first training samples of its corresponding language. In a one-to-many relationship, the first classification model is trained with first training samples of the multiple corresponding languages.
In some embodiments, similar to the second classification model and the third classification model, the acquisition module 440 may verify, test, and update the first classification model; see steps 920 and 1120 and their related descriptions for details.
FIG. 8 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application. In some embodiments, the process 800 may be executed by a processing device (for example, the processing device 112, the processor 210, or the central processing unit 340).
Step 810: extract associated features of a target object based on an interactive instruction for the target object. In some embodiments, step 810 may be performed by the extraction module 410.
In some embodiments, the interactive instruction may include speech data, and the associated features may include at least one of a speech feature of the speech data and a text feature of text data corresponding to the interactive instruction. For the associated features and their extraction, refer to steps 510 and 610, which will not be repeated here.
Step 820: process at least one of the speech feature and the text feature based on a second classification model to determine a responded-to emotion of the speech data. In some embodiments, step 820 may be performed by the determining module 420.
The responded-to emotion refers to the emotion carried by the interactive instruction. The types of responded-to emotion may include loss, calm, enthusiasm, passion, joy, sadness, pain, relief, excitement, and the like. The responded-to emotion may be obtained by processing the interactive instruction.
The second classification model refers to a computing model implemented by a computing device, and it is a model for determining the responded-to emotion of the interactive instruction. The second classification model may be a binary classification model or a multi-class classification model. For the types of the second classification model, refer to the first classification model, that is, step 620 and its related description. The first classification model and the second classification model may be of the same type or of different types.
In some embodiments, the second classification model may process the speech feature to determine the responded-to emotion; it may process the text feature to determine the responded-to emotion; or it may process both the speech feature and the text feature to determine the responded-to emotion. It can be understood that, similar to the first classification model, if the second classification model processes both speech and text features, it may include two sub-classification models: one that processes the speech feature and one that processes the text feature. The second classification model may be obtained through end-to-end training. For determining the responded-to emotion based on the output of the second classification model, refer to the related descriptions of FIG. 20 or FIG. 21.
The second classification model may be obtained through training. For details on the training of the second classification model, refer to the related description of FIG. 9, which will not be repeated here. The deployment of the second classification model may be similar to that of the first classification model; see step 720 and its related description.
In some embodiments, a second emotion may be obtained by processing the text feature, a first emotion may be obtained by processing the speech feature, and the responded-to emotion may be determined based on the first emotion and the second emotion.
The first emotion refers to the emotion determined based on the speech feature. In some embodiments, the first emotion may be obtained by processing the speech feature with the second classification model. It can be understood that, when determining the first emotion based on the speech feature, the second classification model may combine intonation or tone information in the speech data in addition to the semantic information of the language, making the determined emotion more accurate.
The second emotion refers to the emotion determined based on the text feature. The type of the second emotion may be the same as or different from that of the first emotion. The text feature includes a keyword feature; in some embodiments, the second emotion may be determined based on the keyword feature.
The types of the first emotion and the second emotion may include, but are not limited to, loss, calm, enthusiasm, passion, joy, sadness, pain, relief, excitement, and the like.
In some embodiments, the keyword feature includes emotion-related words, and the determining module 420 may determine the second emotion based on the emotion-related words.
Emotion-related words refer to words in the interactive instruction that indicate emotion. In some embodiments, emotion-related words may include, but are not limited to, one or more of modal particles and degree words. For example, modal particles may include "请" (please), "吧", "呀", "吗", etc., and degree words may include, but are not limited to, "非常" (very), "很" (quite), "狠" (intensely), etc.; this list is not exhaustive.
In some embodiments, the emotion-related words in the text data may be identified, and the second emotion determined according to the emotion-related words. For example, based on preset rules and the recognition result of the emotion-related words, the emotion corresponding to the recognition result may be taken as the second emotion. Different emotion-related words may have a mapping relationship with emotions, and the mapping relationship may be manually preset and stored in a storage device. For example, the emotion corresponding to "呀" may be preset as joy, and the emotion corresponding to "吗" as sadness. Taking the praise scenario as an example, if the speech data uttered by the user is converted into text whose content is "可以夸夸我吗" (can you praise me?), the second emotion is sadness; if the content is "夸夸我呀" (praise me!), the second emotion is joy. For another example, it may be preset that the emotion corresponding to a degree word together with "呀" is excitement, and the emotion corresponding to a degree word together with "吗" is sadness.
In some embodiments, an emotion score may be preset for each emotion-related word, that is, the scores of an emotion-related word with respect to different emotions. All emotion-related words in the text data may be identified, their emotion scores may be summed or weighted, and the second emotion may be determined based on the resulting value.
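A minimal sketch of this rule-based scoring is shown below; the word list, scores, and cut-off values are hypothetical and only illustrate how per-word emotion scores could be summed into a second emotion.

```python
# Hypothetical per-word emotion scores: positive values lean toward joy, negative toward sadness.
EMOTION_SCORES = {"呀": 1.0, "吧": 0.5, "吗": -1.0, "非常": 0.5, "很": 0.3}

def second_emotion(text: str) -> str:
    """Sum the scores of all emotion-related words in the text and map the total to an emotion."""
    total = sum(score for word, score in EMOTION_SCORES.items() if word in text)
    if total > 0.5:
        return "joy"
    if total < -0.5:
        return "sadness"
    return "calm"

print(second_emotion("可以夸夸我吗"))  # -> "sadness"
print(second_emotion("夸夸我呀"))      # -> "joy"
```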
In some embodiments, the second emotion may also be determined in other ways. For example, the text feature may be input into a text processing model (for example, BERT) to determine the second emotion.
In some embodiments, the first emotion and the second emotion may be represented numerically; the two values may then be combined or weighted (for example, by weighted averaging) to obtain a weighted value, and the responded-to emotion is determined based on the weighted value. For example, different emotions may correspond to different values or value ranges, e.g., 2 for excitement, 1 for joy, -1 for pain, and so on. These values may be set in advance according to requirements or rules.
In some embodiments, the second classification model may output a probability value corresponding to the first emotion, and a probability value corresponding to the second emotion may be obtained from text processing; the probability values may be weighted (for example, multiplied by or added to weights), and the emotion with the largest probability value is taken as the responded-to emotion.
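One possible way to weight the two sets of probabilities and pick the largest is sketched below; the 0.6/0.4 weights are assumptions, not values from this disclosure.

```python
def combine_emotions(speech_probs: dict, text_probs: dict,
                     speech_weight: float = 0.6, text_weight: float = 0.4) -> str:
    """Weight the per-emotion probabilities of the first (speech) and second (text) emotions
    and return the emotion with the largest combined value."""
    emotions = set(speech_probs) | set(text_probs)
    scores = {e: speech_weight * speech_probs.get(e, 0.0) + text_weight * text_probs.get(e, 0.0)
              for e in emotions}
    return max(scores, key=scores.get)

print(combine_emotions({"joy": 0.7, "sadness": 0.3}, {"joy": 0.4, "sadness": 0.6}))  # -> "joy"
```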
In some embodiments, when the first emotion and the second emotion are the same or similar, the first emotion or the second emotion is taken as the responded-to emotion. When the first emotion and the second emotion are dissimilar (for example, completely opposite), the first emotion may be taken as the responded-to emotion, or a manual determination of the responded-to emotion may be requested.
Step 830: determine a response emotion based on the responded-to emotion. In some embodiments, step 830 may be performed by the determining module 420.
In some embodiments, the responded-to emotion and the response emotion may be similar, identical, or opposite. For example, if the responded-to emotion is joy, the response emotion may be joy or excitement. For another example, the responded-to emotion may be loss, and the response emotion may be calm or joy. In some embodiments, the response emotion may include only positive emotions, so as to soothe a user's negative emotion.
In some embodiments, a correspondence between responded-to emotions and response emotions may be preset. The relationship between responded-to emotions and response emotions may be one-to-one or many-to-many. The preset correspondence may be determined based on rules, or determined or optimized from historical feedback data, for example, according to users' feedback on response phrases.
In some embodiments, the correspondence between responded-to emotions and response emotions may be stored in a storage device in advance, for example, in the memory of the terminal or in another storage location readable by the terminal, such as the cloud, which is not limited here. Based on the responded-to emotion, the response emotion can be obtained from the storage device.
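The stored correspondence can be as simple as a lookup table; the entries below are hypothetical and only illustrate a mapping whose outputs are positive emotions.

```python
# Hypothetical preset mapping from responded-to emotion to response emotion.
RESPONSE_EMOTION_MAP = {
    "joy": "excitement",
    "loss": "calm",
    "sadness": "relief",
    "excitement": "excitement",
}

def response_emotion(responded_to: str) -> str:
    """Look up the response emotion for a responded-to emotion, defaulting to calm."""
    return RESPONSE_EMOTION_MAP.get(responded_to, "calm")
```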
In some embodiments, the response module 430 may determine a response phrase based on the response emotion and output it. For determining the response phrase based on the response emotion, refer to step 530 and its related description.
FIG. 9 is an exemplary flowchart of training the second classification model according to some embodiments of the present application. In some embodiments, the process 900 may be executed by a processing device (for example, the processing device 112, the processor 210, or the central processing unit 340).
Step 910: obtain multiple second training samples. In some embodiments, step 910 may be performed by the acquisition module 440.
In some embodiments, a second training sample may include one or more of second sample speech data and second sample text data, and a corresponding emotion label. For example, a second training sample may include second sample speech data and its corresponding emotion label. For another example, a second training sample may include second sample speech data, second sample text data, and the corresponding emotion label. The emotion label is the emotion corresponding to the second sample speech data or the second sample text data. In some embodiments, the emotion label may be a one-hot label.
The second sample text data is text data corresponding to the second sample speech data; for example, the second sample text data is obtained by performing text recognition on the second sample speech data. For another example, the second sample speech data is obtained by machine conversion of, or manual reading of, the second sample text data. In some embodiments, the second sample speech data may come from real online speech data and/or from customized data. For example, the second sample speech data may be generated by reading the content of the second sample text data aloud. For instance, the second sample text content and corresponding tone standards may be specified, and the text may then be read aloud with emotion to obtain second sample speech data with different tones and emotions.
In some embodiments, the text length corresponding to the second sample speech data or the second sample text data generally should not be too long (for example, the number of characters or words is less than a threshold, which may be 20, 10, etc.), because overly long text leads to overly long speech, which may in turn cause greater fluctuation in tone and make the environmental noise more random and complex.
In some embodiments, the second training samples may be obtained from a storage device or a database. Historical data may also be obtained from a service platform, a client, or the like and used as the second training samples.
Step 920: train an initial second classification model based on the multiple second training samples to obtain the second classification model. In some embodiments, step 920 may be performed by the acquisition module 440.
In some embodiments, the second classification model may be trained through various methods based on the second training samples, and the parameters of the initial second classification model are updated to obtain the trained second classification model. The training method may be similar to that of the first classification model; see step 710 and its related description. It can be understood that, after training with the second training samples is completed, the obtained second classification model has the same network structure as the initial second classification model.
In some embodiments, feature extraction may be performed on the second sample speech data and/or the second sample text data, and the second classification model is trained based on the extracted speech features and/or text features.
In some embodiments, if the second classification model is used to process speech features, the second training samples used for training include the second sample speech data and its emotion labels. If the second classification model is used to process text features, the second training samples include the second sample text data and its emotion labels. If the second classification model is used to process both speech and text features, the second training samples include the second sample speech data, the second sample text data, and their emotion labels, and training is performed end to end.
In some embodiments, after the sample features are extracted, whole sentences of variable length may be used for training: the features extracted from a sentence are used as the input of the classifier to obtain the emotion output by the second classification model, and the difference between the output emotion and the emotion label is then used to adjust the parameters of the second classification model, finally yielding a second classification model with high classification accuracy.
In some embodiments, model verification may also be performed on the currently trained model. Model verification consists of two processes: test environment setup and model testing. The test environment setup is used to check whether the current model can be successfully deployed and run normally on different terminals, for example, mobile phones of different brands. Therefore, offline testing should be performed under real-scenario conditions.
The testing process may include, but is not limited to, the following two methods. In the first method, real users test the model in real time multiple times, and the accuracy of the recognition results is then computed; the advantage of this method is that it better simulates user behavior in real scenarios, so the test is more credible. In the second method, real users record a test set in real scenarios; one or more test sets may be recorded as needed, they are reusable, cost less, and can guarantee the objective validity of the test to a certain extent.
FIG. 10 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application. In some embodiments, the process 1000 may be executed by a processing device (for example, the processing device 112, the processor 210, or the central processing unit 340).
Step 1010: extract associated features of a target object based on an interactive instruction for the target object. In some embodiments, step 1010 may be performed by the extraction module 410.
In some embodiments, the associated features include at least one of a first feature corresponding to the interactive instruction and a second feature corresponding to historical data of the target object. For the associated features and their extraction, refer to step 510, which will not be repeated here.
Step 1020: process at least one of the first feature and the second feature based on a third classification model to determine a response style for the target object. In some embodiments, step 1020 may be performed by the determining module 420.
The third classification model refers to a computing model implemented by a computing device, and it is a model for determining the response style for the target object. The third classification model may be a multi-class or binary classification model. For the types of the third classification model, refer to the first classification model, which will not be repeated here. The first classification model and the third classification model may be of the same type or of different types. The third classification model may be obtained through training; for details, refer to the related description of FIG. 11. For the deployment of the third classification model, refer to the deployment of the first classification model, that is, step 620 and its related description.
In some embodiments, the processing device may process at least one of the first feature and the second feature based on the third classification model to determine the response style. For example, the first feature and the second feature are input into the third classification model, which outputs the response style; optionally, during processing, the third classification model may assign weights to the first feature and the second feature respectively (for example, the weight of the first feature is greater than that of the second feature). For another example, the first feature or the second feature is input into the third classification model, which outputs the response style. For details on the first feature and the second feature, refer to step 510.
In some embodiments, the processing device may process the text feature of the text data corresponding to the interactive instruction, and process at least one of the first feature and the second feature based on the third classification model, to determine the response style.
The first style is a style determined based on the first feature and/or the second feature, for example, a style determined by processing at least one of the first feature and the second feature with the third classification model.
The second style is a style obtained by processing the text feature. In some embodiments, the second style may be the same as or different from the first style, including but not limited to strong praise, normal praise, mild praise, and the like.
In some embodiments, the text feature may be processed based on a model, an algorithm, or rules to obtain the second style. For example, the second style may be determined by identifying whether the text contains style-related keywords (such as "非常" (very) or "很" (quite)) and then determining the second style based on those keywords. For another example, the text feature may be processed by a text processing model (for example, BERT or a DNN) to output the second style.
In some embodiments, the response style may be determined based on at least one of the first style and the second style. For example, either the first style or the second style may be determined as the response style. For another example, if the first style and the second style are different, the first style may be taken as the response style.
In some embodiments, the first style and the second style may be fused to determine the response style. For example, different styles may correspond to different scores (e.g., 3 for strong praise, 2 for normal praise), different weights may be set for the first style and the second style, and the weights and scores of the two styles are fused; the response style is then determined based on the fusion result. The response style may also be determined through other fusion methods, which is not limited in this embodiment.
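One possible fusion is to weight the two style scores and snap the result back to the nearest style, as sketched below; the scores and the 0.7/0.3 weights are assumptions.

```python
STYLE_SCORES = {"strong_praise": 3, "normal_praise": 2, "mild_praise": 1}  # assumed scores

def fuse_styles(first_style: str, second_style: str,
                w_first: float = 0.7, w_second: float = 0.3) -> str:
    """Weight the scores of the first and second styles and return the nearest style."""
    fused = w_first * STYLE_SCORES[first_style] + w_second * STYLE_SCORES[second_style]
    return min(STYLE_SCORES, key=lambda s: abs(STYLE_SCORES[s] - fused))

print(fuse_styles("strong_praise", "normal_praise"))  # -> "strong_praise"
```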
FIG. 11 is an exemplary flowchart of training the third classification model according to some embodiments of the present application. In some embodiments, the process 1100 may be executed by a processing device (for example, the processing device 112, the processor 210, or the central processing unit 340).
Step 1110: obtain multiple third training samples. In some embodiments, step 1110 may be performed by the acquisition module 440.
The third training samples may come from real data (e.g., historical data) or from specially prepared data. For example, a developer may prepare sample data and input it into the terminal so that the terminal can train the third classification model.
In some embodiments, a third training sample includes a sample interactive instruction for a sample target object, sample historical data of the sample target object, and a corresponding style label. The form and content of the sample historical data may be the same as those of the aforementioned historical data, and will not be repeated here.
The style label represents the sample response style for the sample target object, and may specifically be a one-hot label. In some embodiments, the style label may be determined based on reputation data and feedback data of the sample target object. Specifically, determining the style label includes: obtaining feedback data of the sample target object on a sample response phrase, the sample response phrase being determined based on the sample interactive instruction; obtaining reputation data of the sample target object; and determining the style label based on the reputation data and/or the feedback data.
Feedback data refers to data related to the sample target object's reaction to the sample response phrase; the reaction may be, but is not limited to, positive, negative, approving, disapproving, and so on. For clarity, the reaction can be used to directly characterize the feedback data. For example, if the sample target object gives no feedback, its feedback data may default to positive; if the feedback is an approving or positive evaluation, the feedback data is positive.
Reputation data refers to data related to the credit status of the sample target object and may be determined based on historical data. Reputation data may be embodied as a reputation score; the calculation of the reputation score is not described here. Exemplarily, when the reputation score reaches (is greater than, or greater than or equal to) 80 points, the sample target object is a high-reputation user, and the information it feeds back is relatively reliable, which can also serve as a reference for developers and assist them in manually annotating the sample data.
In some embodiments, a correspondence may be established between the reputation data and/or the feedback data and the response style, and the style label is determined based on this correspondence. The reputation data and the feedback data may also be processed by a model to determine the style label, where the model may be a DNN model, a CNN model, an RNN model, or the like.
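As one hypothetical rule-based correspondence (a model such as a DNN could be used instead), the sketch below trusts feedback only from high-reputation users (a score of 80 or above, following the example above) and softens the style when that feedback is negative; the style ordering and the softening rule are assumptions.

```python
def style_label(reputation_score: float, feedback: str, current_style: str) -> str:
    """Derive a style label from reputation data, feedback data, and the style actually used."""
    order = ["strong_praise", "normal_praise", "mild_praise"]   # assumed ordering
    if reputation_score < 80 or feedback == "positive":
        return current_style                                    # keep the style that was used
    idx = order.index(current_style)
    return order[min(idx + 1, len(order) - 1)]                  # soften the style one step
```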
In some embodiments, the style label may also be determined based on a manual evaluation of the reputation data and the feedback data. For example, the terminal may obtain the sample target object's feedback data on historical response phrases, output the reputation data and the feedback data of the sample target object, and receive manual evaluation data for the reputation data and the feedback data, the manual evaluation data indicating the style label. In this embodiment, although the style label is still obtained through manual annotation by developers, the annotation is based on the feedback data and reputation data output by the terminal, which helps developers complete the annotation quickly and reduces labor time and cost as much as possible.
In some embodiments, the third training samples may be obtained from a storage device or a database. Historical data may also be obtained from a service platform, a client, or the like and used as the third training samples.
Step 1120: train an initial third classification model based on the multiple third training samples to obtain the third classification model. In some embodiments, step 1120 may be performed by the acquisition module 440.
In some embodiments, the third classification model may be trained through various methods based on the third training samples, and the parameters of the initial third classification model are updated to obtain the trained third classification model. Training methods include, but are not limited to, gradient descent, the least squares method, cross-entropy loss, cross-validation, variable learning rates, and the like. It can be understood that, after training with the third training samples is completed, the obtained third classification model has the same network structure as the initial third classification model.
In some embodiments, features of the third training samples (for example, speech features and text features) may be extracted, and the model is then trained using these features.
In some embodiments, training ends when the trained third classification model meets a preset condition. The preset condition may be that the loss function converges or falls below a preset threshold, or that the number of training epochs reaches a threshold.
In some embodiments, an end-to-end training approach is used: the features extracted from the third training samples are taken as input, and a style recognition result is output. The difference between the output style recognition result and the style label is then used to adjust the parameters of the initial third classification model, finally yielding a third classification model with high classification accuracy.
In addition, in some embodiments, real-time data may also be used to update the third classification model. Taking the embodiment shown in FIG. 5 or FIG. 10 as an example, after outputting the response phrase, the terminal may also obtain operation information concerning the response phrase, and then use the response phrase and the operation information to update the third classification model. The operation information may be information on the target object's evaluation of or feedback on the response phrase, and may also serve as third training samples for real-time updates of the third classification model.
In some embodiments, the third classification model may also be tested and verified, similarly to the second classification model, which will not be repeated here.
FIG. 12 is a schematic diagram of human-computer interaction according to some embodiments of the present application. As mentioned above, the issuer of the interactive instruction for the target object may be a driver issuing it through the driver terminal, and the target object may be the driver. As shown in FIG. 12, take a scenario in which a driver-side user praises himself as an example. As shown in FIG. 12A, the driver-side user can tap the function control 1201 in the driver-side display interface of the ride-hailing APP to enter the praise interface, and the terminal can then display the interface shown in FIG. 12B. FIG. 12B is the display interface of the praise function; on this interface, the driver-side user can speak, and the terminal correspondingly collects real-time speech data, that is, receives the interactive instruction. After collecting the speech data, the terminal can determine whether the collected speech data contains the specified phrase. If it recognizes that the real-time speech data from the driver-side user contains either "夸夸司机" (praise the driver) or "夸夸我吧" (praise me), the terminal can display the interface shown in FIG. 12C. As shown in FIG. 12C, the current display interface shows the response phrase 1203 for "夸夸我吧", specifically: "风里雨里，感谢不辞辛苦来接我" (through wind and rain, thank you for taking the trouble to pick me up).
In addition, in the display interface shown in FIG. 12B, the driver-side user can also tap the praise control 1202 to trigger the praise function and then display the interface shown in FIG. 12C, which will not be repeated here. In the display interface shown in FIG. 12A, the function control 1201 can also prompt the driver side about newly received praise.
FIG. 13 is a schematic diagram of human-computer interaction according to some embodiments of the present application. As shown in FIG. 13, take a scenario in which a driver-side user praises himself as an example. As shown in FIG. 13A, the driver-side user can tap the function control 1301 in the driver-side display interface of the ride-hailing APP to enter the praise interface, and the terminal can then display the interface shown in FIG. 13B. FIG. 13B is the display interface of the praise function; on this interface, the driver-side user can speak, and the terminal correspondingly collects the real-time speech data "夸夸我吧" (or "夸夸司机"), that is, receives the interactive instruction. After collecting the speech data, the terminal can determine whether the collected speech data contains the specified phrase. If it recognizes that the real-time speech data from the driver-side user contains either "夸夸司机" or "夸夸我吧", the terminal can display the interface shown in FIG. 13C. As shown in FIG. 13C, the current display interface shows the response phrase 1303 for "夸夸我吧", specifically: "司机师傅最阳光，最热心，最善良，最知冷知热！" (the driver is the sunniest, the most enthusiastic, the kindest, and the most considerate!). Similar to FIG. 12, the driver-side user can also tap the praise control 1302 to trigger the praise function and then display the interface shown in FIG. 13C.
Comparing FIG. 12 with FIG. 13, the two figures correspond to different response styles. As shown in FIG. 12, after collecting the speech data, the terminal can determine that the response style preferred by the driver-side user (target object) is normal praise, and accordingly determine a response phrase with that response style. As shown in FIG. 12C, the terminal displays the response phrase 1203 for "夸夸司机" or "夸夸我吧" on the current display interface, specifically: "风里雨里，感谢不辞辛苦来接我". As shown in FIG. 13, after collecting the speech data, the terminal determines that the response style preferred by the driver-side user (target object) is strong praise, and accordingly determines a response phrase with that response style. As shown in FIG. 13C, the terminal displays the response phrase 1303 for "夸夸我吧" on the current display interface, specifically: "司机师傅最阳光，最热心，最善良，最知冷知热！".
It can thus be seen that, for different target objects, their preferred response styles can be obtained based on their historical data, and the terminal can give praise (responses) of different degrees based on the different response styles. In addition to self-praise, a user can also praise the other user, which is not described in detail here. Moreover, the user may also have the permission to modify the response phrase.
FIG. 14 is a schematic diagram of human-computer interaction according to some embodiments of the present application. As shown in FIG. 14, take a scenario in which a passenger-side user praises the driver side as an example. FIG. 14A is the communication interface between the passenger terminal and the driver terminal; on this interface, the passenger-side user can tap the voice switching control 1401 to trigger the voice input function. The terminal then displays the interface shown in FIG. 14B; on this interface, if the user presses and holds the voice input control 1402, the terminal can collect real-time speech data, that is, receive the interactive instruction. After collecting the speech data, the terminal can determine whether the collected speech data contains the specified phrase. If it recognizes that the real-time speech data from the user contains "夸夸司机" (praise the driver), the terminal can display the interface shown in FIG. 14C. As shown in FIG. 14C, in the current communication interface, the passenger terminal sends the response phrase 1403 to the driver terminal, specifically: "司机师傅最阳光，最热心，最善良，最知冷知热！". Correspondingly, on the driver side, the user can be prompted that praise has been received from the passenger side, for example, via the function control 1301 in the interface shown in FIG. 13A, or via the notification bar or status bar.
In addition, in the scenario shown in FIG. 14, as shown in FIG. 14A, the passenger-side user can also tap the praise control 1404 on the display interface to trigger the praise function. In this case, when the user taps the praise control 1404 to give praise, the voice collection step can be entered, and the praise can be realized in the manner shown in FIG. 12 or FIG. 13.
FIG. 15 is a schematic diagram of human-computer interaction according to some embodiments of the present application. As shown in FIG. 15, the praise interface can also be entered directly. The communication interface shown in FIG. 15A is the same as that shown in FIG. 14A. The passenger-side user can tap the praise control 1404, and the terminal then displays the interface shown in FIG. 15B. On this interface, the terminal determines that praise is to be given and can directly determine the response phrase for the driver-side user. If the passenger-side user then taps the send control 1405 for the response phrase, the interface shown in FIG. 15C is entered, and the passenger terminal sends the response phrase 1403 to the driver terminal, specifically: "司机师傅最阳光，最热心，最善良，最知冷知热！".
FIG. 16 is a schematic diagram of human-computer interaction according to some embodiments of the present application. As shown in FIG. 16, the user may also have the permission to modify the response phrase. The display interface shown in FIG. 16A is the same as that shown in FIG. 15B. On the interface shown in FIG. 16A, the response phrase currently determined by the terminal is "司机师傅最阳光，最热心，最善良，最知冷知热！". If the passenger-side user is not satisfied with this response phrase, they can tap the phrase switching control 1601 to switch the response phrase, and the terminal then displays the controls shown in FIG. 16B. As shown in FIG. 16B, after the passenger-side user's operation, the currently determined response phrase is "司机师傅最阳光，是最可靠的人" (the driver is the sunniest and the most reliable person). In this way, switching of the response phrase is achieved. Afterwards, the user taps the send control 1405 on the display interface, and the terminal sends the response phrase to the driver terminal.
In some embodiments, the current terminal may also perform statistical processing on historical response phrases and display the results. In some embodiments, the terminal may also perform the following steps: obtain historical response phrases from other users; determine the total number of output historical response phrases; determine one or more phrase tags according to the historical response phrases; and then display the total output number and the phrase tags. In some embodiments, the phrase tags may be designed according to actual needs. For example, the scenario of a historical response phrase may be used as a tag, or the scenario together with the number of historical response phrases in that scenario may be used as the phrase tag. For another example, the response style or emotion of the response phrase may be used as a tag.
Still taking the praise scenarios shown in FIG. 12 to FIG. 16 as examples: considering that a user can praise himself, in an actual scenario, self-praise may be excluded, and historical praise data from other users is obtained for statistical analysis. For example, when the current terminal is the driver terminal, the praise data for the driver-side user from each passenger terminal or other driver terminals can be collected, the total number of such praises and the phrase tags are computed, and they are displayed on the terminal's display interface. For example, in the scenario shown in FIG. 12, the display interface of FIG. 12B shows that the driver side has received praise 108 times in total, which is the total number of output historical response phrases. In addition, FIG. 12B also shows three phrase tags, namely "雨天夸夸999+" (rainy-day praise 999+), "深夜夸夸3" (late-night praise 3), and "假日夸夸66" (holiday praise 66). The phrase tags in this scenario consist of the praise scenario and the number of praises in that scenario.
FIG. 17 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application. As described in step 510, the interactive instruction may include speech data. The first classification model may also be referred to as a trained multilingual speech classifier ("multilingual speech classifier" for short). In some embodiments, the method for human-computer interaction includes the following steps.
Step 1710: collect current speech data.
In some embodiments, the method can be applied to a voice interaction scenario, in which the terminal can collect the speech data uttered by the user in real time and perform subsequent processing. The collected speech data may be any of Chinese speech, English speech, Japanese speech, Korean speech, and the like, which is not limited here.
In some embodiments, after the user instructs the voice interaction function to start, the terminal may automatically listen for and collect the speech data uttered by the user. Alternatively, the collection of speech data may be triggered by the user pressing a voice input button on the display interface.
Step 1720: extract speech features from the speech data.
In some embodiments, the speech features may be multi-dimensional fbank (filter bank) features. Specifically, the human ear's response to the sound spectrum is nonlinear, and fbank features are obtained by processing the audio in a manner similar to the human ear, which helps improve speech recognition performance. Specifically, the fbank features of the speech data may be extracted through the following steps: converting the speech data from the time domain to the frequency domain to obtain frequency-domain speech data; and computing the energy spectrum of the frequency-domain speech data to obtain the speech features. The speech data collected by the terminal device is a linear time-domain signal, and the (time-domain) speech signal can be transformed into a frequency-domain speech signal through a Fourier transform (FFT). Specifically, the speech data may be sampled during the signal conversion. On this basis, the energy in each frequency band of the frequency-domain signal differs, and different phonemes have different energy spectra; therefore, the energy spectrum of the frequency-domain speech data can be computed to obtain the speech features. The method for computing the energy spectrum is not described here. For example, if the sampling frequency of the speech data collected in step 1710 is 16 kHz, 40-dimensional fbank features may be extracted in this step.
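A compact sketch of this extraction is shown below, using librosa's mel spectrogram as a stand-in for the fbank computation (framing, FFT, and energy pooling into 40 bands); the window and hop sizes are common defaults for 16 kHz audio, not values fixed by this disclosure.

```python
import numpy as np
import librosa

def extract_fbank(waveform: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    """Frame the signal, apply an FFT, and pool the energy spectrum into 40 mel bands."""
    mel = librosa.feature.melspectrogram(
        y=waveform, sr=sample_rate,
        n_fft=400, hop_length=160,   # 25 ms windows with a 10 ms shift at 16 kHz
        n_mels=40)                   # 40 filter banks -> 40-dimensional features
    return librosa.power_to_db(mel).T   # shape: (num_frames, 40)
```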
In some embodiments, the voice data may also be preprocessed before the feature extraction step. For more details on preprocessing, see FIG. 5 and its related description.
Step 1730: processing the voice features with the trained multilingual speech classifier to obtain a classification recognition result, the multilingual speech classifier being used to determine whether the voice data contains any one of the specified phrases in multiple languages.
In some embodiments, the multilingual speech classifier can classify and recognize voice data in multiple languages. The language types that the multilingual speech classifier can recognize are consistent with the language types of the voice samples used in the training process of the multilingual speech classifier.
In some embodiments, the classification recognition result may be a multi-class result, which includes the binary (two-class) case. Specifically, the classification recognition result indicates that the voice data is a positive sample or a negative sample; alternatively, the classification recognition result is a degree level between the positive sample and the negative sample, and each degree level corresponds to either the positive sample or the negative sample. Accordingly, when the degree level corresponds to a positive sample, the classification recognition result indicates that the voice data contains a specified phrase; when the degree level corresponds to a negative sample, the classification recognition result indicates that the voice data does not contain a specified phrase.
In some embodiments, the classification recognition result may take one of two values: "yes" or "no". A result of "yes" indicates that the voice data contains a specified phrase in one of the supported languages; conversely, a result of "no" indicates that the voice data is unrelated to the specified phrases in any language, i.e., the voice data does not contain a specified phrase.
It should be understood that the classification recognition result may take other forms, for example one or more of symbols, numbers, and characters (including characters of various languages, such as Chinese characters and English characters). For example, the classification recognition result may be "+" or "-"; or "positive" or "negative"; or "result 1" or "result 2"; or "positive sample" or "negative sample". For a binary result, the meaning indicated by each of the above representations can be defined as needed. For example, a result of "yes" may instead indicate that the voice data is unrelated to the specified phrases in any language and does not contain a specified phrase, while a result of "no" may indicate that the voice data contains a specified phrase in one of the supported languages.
In embodiments with a binary classification result, the indication of the classification recognition result can be determined directly from the binary result.
In some embodiments, the classification recognition result may also have n levels, where n is an integer greater than 1. The n levels grade the degree to which the voice data is recognized as belonging to the positive sample versus the negative sample. For example, the higher the level, the higher the degree to which the voice data is judged to belong to the positive sample; conversely, the lower the level, the higher the degree to which it is judged to belong to the negative sample and the lower the degree to which it belongs to the positive sample. For instance, a classification recognition result of n (the highest level) indicates a high degree of belonging to the positive sample, while a result of 1 (the lowest level) indicates a low degree of belonging to the positive sample. The opposite convention is equally valid: the highest level may indicate the lowest degree of belonging to the positive sample, and the lower the level, the higher that degree.
In some embodiments, for a multi-level result, the indication of the classification recognition result also needs to be determined based on the level. In this case, the levels corresponding to the positive sample and the negative sample can be preset. For example, for a 10-level result (n is 10, 10 levels in total), levels 1 to 5 may correspond to the negative sample and levels 6 to 10 to the positive sample. Then, if the level result is 1, the classification recognition result indicates that the voice data does not contain a specified phrase; if the level result is 8, the classification recognition result indicates that the voice data contains a specified phrase.
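By way of illustration only, the following Python sketch shows the level-to-indication mapping described in the preceding paragraph; the 10-level scale and the boundary between levels 5 and 6 are the example values given above, not fixed by the method.
def level_to_indication(level, n_levels=10, positive_from=6):
    """Map an n-level classification result to a binary indication.
    Levels below positive_from are treated as negative-sample levels (no specified
    phrase); levels at or above it are positive-sample levels (contains a specified phrase)."""
    if not 1 <= level <= n_levels:
        raise ValueError("level out of range")
    return "contains specified phrase" if level >= positive_from else "no specified phrase"
# Examples from the text: level 1 -> no specified phrase, level 8 -> contains specified phrase.
assert level_to_indication(1) == "no specified phrase"
assert level_to_indication(8) == "contains specified phrase"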
In some embodiments, the positive samples and negative samples are the training samples used in the training phase of the multilingual speech classifier, where a positive sample is multilingual voice data carrying a specified phrase and a negative sample is multilingual voice data unrelated to the specified phrases. It should be understood that the positive samples (or negative samples) in the training set contain voice data in multiple languages, whereas the positive sample (or negative sample) referred to in the classification recognition result means that the voice data is recognized as voice data in one of the languages of the positive samples (or negative samples). For the training process of the multilingual speech classifier, see FIG. 7 and its related description.
Step 1740: when the classification recognition result indicates that the voice data contains a specified phrase, outputting a response phrase for the specified phrase.
In some embodiments, the response phrase for the specified phrase may be output directly. The response phrase may include, but is not limited to, one or more of a response voice and response text. For more details on the output form of the response phrase, see step 530 and its related description.
As described above, the interaction instruction or voice data may be directed at the user himself or herself, or at the user on the other side of the communication. For details, see step 530 and its related description.
In a voice interaction scenario, and more specifically in a voice interaction scenario for multilingual users, the terminal may collect voice data, perform semantic recognition on it, and output a response phrase after recognizing the user's meaning. The terminal may use a monolingual acoustic model to recognize the semantics of the voice data; however, in a multilingual voice interaction scenario, a monolingual acoustic model cannot meet the voice interaction needs of multilingual users. The multilingual speech classifier in some embodiments of this specification can classify specified phrases in multiple languages. While ensuring classification quality, it converts a complex speech recognition problem into a simple classification problem, so there is no need to train and maintain a separate acoustic model for each language, which saves maintenance resources. Moreover, compared with separate processing by multiple monolingual acoustic models, the classifier is more efficient, which helps improve speech recognition efficiency and, in turn, the response accuracy of the response phrase, reduces the interference of invalid voice interactions with the user, and improves the voice interaction experience.
FIG. 18 is a block diagram of a terminal according to some embodiments of the present application. In some embodiments, the methods of FIGS. 5-7 and FIG. 17 may be executed on a mobile device or terminal (for example, a passenger terminal or a driver terminal), for example by the processor 340 of the mobile device.
In some embodiments, the acquisition module 440 may include a collection module 1810, the determination module 420 may include a processing module 1820, and the response module 430 may include an output module 1830. In some embodiments, the terminal 1800 may include: a collection module 1810, an extraction module 410, a processing module 1820, and an output module 1830.
The collection module 1810 is configured to collect current voice data.
The extraction module 410 is configured to extract voice features from the voice data. In some embodiments, the extraction module 410 is further configured to preprocess the voice data before extracting the voice features. For details, see FIG. 17 and its related description.
The processing module 1820 is configured to process the voice features with the trained multilingual speech classifier to obtain the classification recognition result.
The output module 1830 is configured to output a response phrase for the specified phrase when the classification recognition result indicates that the voice data contains the specified phrase. In some embodiments, the output module 1830 may be configured to: when the specified phrase is directed at the user himself or herself, directly output the response phrase for the specified phrase; or, when the specified phrase is directed at the other party of the current communication, output the response phrase to that user.
In some embodiments, the acquisition module 440 may include a training module (not shown in FIG. 18), which is configured to obtain the multilingual speech classifier through training. For details, see FIG. 7 and its related description.
In some embodiments, the acquisition module 440 in the terminal 1800 may also be configured to obtain historical response phrases from other users, determine the total output count of the historical response phrases, and determine one or more phrase tags based on the historical response phrases. The output module 1830 is also configured to display the total output count and the phrase tags.
FIG. 19 is an exemplary flowchart of a human-computer interaction method according to some embodiments of the present application. The second classification model may also be referred to as a trained emotion classifier (hereinafter the "emotion classifier"). In some embodiments, the human-computer interaction method includes:
Step 1910: collecting current voice data. For details, see step 1710, which is not repeated here.
Step 1920: recognizing the responded-to emotion of the voice data, the responded-to emotion being obtained from one or more of a text emotion recognition result and a speech emotion recognition result.
As described above, the voice data or voice features may be processed and the result taken as the speech emotion recognition result; the text data or text features may be processed and the result taken as the text emotion recognition result. The emotion recognition result includes the speech emotion recognition result and/or the text emotion recognition result.
In some embodiments, one or more of text emotion recognition and speech emotion recognition may be performed on the voice data; the text emotion recognition result is obtained from text emotion recognition, the speech emotion recognition result is obtained from speech emotion recognition, and one or both of them are then used to determine the emotion of the voice data. For details, see FIG. 20 and its related description.
Step 1930: determining the response emotion corresponding to the responded-to emotional state.
As described above, the responded-to emotion is the emotion of the collected voice data uttered by the user, while the response emotion is used when responding to that voice data, i.e., it is the emotion of the response voice.
In some embodiments, the emotion types involved (including the responded-to emotion and the response emotion) may include, but are not limited to, dejection, calm, enthusiasm, passion, and the like, and may be customized as needed in an actual scenario. For example, in some embodiments the emotions may also include joy, sadness, pain, relief, excitement, and so on; this list is not exhaustive.
In addition, the emotion categories covered by the responded-to emotion and by the response emotion may be the same or different. For example, both may consist of the four emotions of dejection, calm, enthusiasm, and passion. As another example, the responded-to emotion may include positive emotions (e.g., happiness, excitement, joy) and negative emotions (e.g., dejection, sadness, pain), while the response emotion may include only positive emotions, so as to soothe the user's negative emotions.
In some embodiments, a correspondence between responded-to emotions and response emotions may also be preset. This correspondence may be stored in the terminal in advance, or in a storage location readable by the terminal, such as the cloud, which is not limited here. Specifically, one responded-to emotion may correspond to one or more response emotions. For example, if the responded-to emotion is dejection, the corresponding response emotion may be joy or comfort.
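By way of illustration only, a preset correspondence of this kind can be held in a simple lookup table, as in the Python sketch below; the emotion names and pairings are assumed values for the example, and choosing randomly among several candidate response emotions is only one possible selection rule.
import random
# Hypothetical preset mapping: one responded-to emotion -> one or more candidate response emotions.
EMOTION_MAP = {
    "dejection": ["joy", "comfort"],
    "calm": ["calm"],
    "enthusiasm": ["enthusiasm"],
    "passion": ["passion"],
}
def response_emotion_for(responded_emotion):
    """Look up the response emotion(s) for a responded-to emotion and pick one."""
    candidates = EMOTION_MAP.get(responded_emotion, ["calm"])  # fall back to a neutral emotion
    return random.choice(candidates)
print(response_emotion_for("dejection"))  # e.g. "joy" or "comfort"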
Step 1940: outputting a response voice for the voice data, the response voice having the response emotion.
As described above, the response phrase may be output as a response voice. In some embodiments, the response content for the voice data may be obtained first, and the response voice may then be generated from the response emotion and the response content and output. In this way, the output response voice carries the response emotion.
In some embodiments, the way the response content is determined is not particularly limited. For example, a correspondence between keywords and response content may be preset, so that the response content corresponding to a keyword carried in the voice data can be obtained by recognizing that keyword and used as the response content for the voice data. As another example, a neural network model may be used to process the voice data, and the response content output by the model may be obtained. As yet another example, the response content may be determined by the method corresponding to FIG. 5 or FIG. 6.
When generating the response voice from the response content and the response emotion, a default voice (timbre) or a timbre selected by the user may be used. For example, the user may select a celebrity's timbre as the timbre of the response voice, and the terminal then generates the response voice with the timbre of the celebrity selected by the user. Of course, this implementation presupposes that the terminal can obtain the celebrity's timbre and the corresponding authorization, which is not elaborated here.
In some embodiments, multiple candidate voices with different emotions may also be generated in advance for all possible response contents and pre-stored in a readable storage location. The terminal device then only needs, after determining the response emotion, to retrieve from the storage location one candidate voice corresponding to that response emotion and the response content, and output it as the response voice. In some embodiments, the candidate voices in the storage location may also be recorded manually in advance.
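By way of illustration only, the following Python sketch assumes the pre-generated candidate voices are keyed by the pair (response content, response emotion); the keys and file paths are hypothetical.
# Hypothetical pre-built store: (response content, response emotion) -> audio file path.
CANDIDATE_VOICES = {
    ("You did a great job today!", "enthusiasm"): "voices/praise_enthusiasm.wav",
    ("You did a great job today!", "calm"): "voices/praise_calm.wav",
    ("Take care, it has been a long day.", "comfort"): "voices/comfort.wav",
}
def pick_response_voice(response_content, response_emotion):
    """Retrieve the pre-generated candidate voice matching the content and emotion, if any."""
    return CANDIDATE_VOICES.get((response_content, response_emotion))
path = pick_response_voice("You did a great job today!", "enthusiasm")
if path is not None:
    print("play", path)   # hand the file to the audio player
else:
    print("no pre-stored candidate; synthesize the response voice instead")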
In prior-art voice interaction scenarios, the terminal generally outputs response data with a default intonation and tone. The response emotion of such human-computer interaction is monotonous and cannot meet the user's need for emotional voice in personalized scenarios. Some embodiments of this specification can select different response emotions in real time according to the user's emotion, which effectively improves how well the response voice matches the user's mood, meets the user's emotional needs in different emotional states, gives a stronger sense of realism and immersion, and improves the voice interaction experience. This also solves the problem that, in existing voice interaction scenarios, the response voice matches the user's emotion poorly.
FIG. 20 is an exemplary flowchart of a method for determining the responded-to emotion according to some embodiments of the present application. In some embodiments, determining the responded-to emotion may include the following steps:
Step 1922: extracting voice features of the voice data.
In some embodiments, audio features of the voice data may be extracted, normalized, and assembled into a feature vector to obtain the voice features of the voice data. For example, the fundamental frequency feature, short-time energy feature, short-time amplitude feature, and short-time zero-crossing rate feature of the voice data may be extracted and then normalized separately to form one frame of an n-dimensional feature vector, where n is an integer greater than 1. In an actual scenario, the dimensions of the feature vectors obtained from different voice data may differ; in other words, the value of n can be adjusted according to the needs of the actual scenario or project, or according to empirical values, which is not limited here.
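By way of illustration only, the following Python sketch computes the per-frame features named above and min-max normalizes them into one feature vector per frame. It assumes frames of roughly 25 ms at 16 kHz; the autocorrelation-based pitch estimate and the normalization scheme are simplifications chosen for the example.
import numpy as np
def frame_features(frame, sample_rate=16000):
    """Compute the per-frame acoustic features named above for one speech frame."""
    frame = np.asarray(frame, dtype=np.float64)
    energy = np.sum(frame ** 2)                                   # short-time energy
    amplitude = np.mean(np.abs(frame))                            # short-time amplitude
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0          # short-time zero-crossing rate
    # Crude fundamental-frequency estimate via the autocorrelation peak in 50-400 Hz (illustrative only).
    ac = np.correlate(frame, frame, mode="full")[len(frame):]
    lag = np.argmax(ac[int(sample_rate / 400):int(sample_rate / 50)]) + int(sample_rate / 400)
    f0 = sample_rate / lag if lag > 0 else 0.0
    return np.array([f0, energy, amplitude, zcr])
def normalized_feature_vector(frames, sample_rate=16000):
    """Stack per-frame features and min-max normalize each dimension to [0, 1]."""
    feats = np.stack([frame_features(f, sample_rate) for f in frames])
    lo, hi = feats.min(axis=0), feats.max(axis=0)
    return (feats - lo) / np.maximum(hi - lo, 1e-8)               # one n-dimensional vector per frame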
Step 1924: processing the voice features with the trained emotion classifier to obtain an emotion recognition result.
In some embodiments, the emotion classifier is used to recognize the emotion of the voice data. For the training of the emotion classifier, see FIG. 9 and its related description; for more details on the emotion classification model, see FIG. 8 and its related description.
Step 1926: determining the emotion indicated by the emotion recognition result as the responded-to emotion.
The output of the emotion classifier is the emotion recognition result, and the emotion indicated by the emotion recognition result depends on how the result is represented.
The emotion recognition result may be a multi-class result; for example, emotions may be divided into four categories: dejection, calm, enthusiasm, and passion. Exemplarily, the emotion recognition result may be the probability of the voice data for each emotion, and the emotion indicated by the result is the emotion with the highest probability; or the emotion indicated by the result is the one emotion carrying an indicator; or the emotion recognition result may be a score of the voice data for each emotion, and the emotion indicated by the result is the emotion corresponding to the score interval in which the score falls.
Specifically, the emotion recognition result may be the emotion probabilities of the voice data, where the emotion indicated by the result (the first emotion) is the emotion with the highest probability. For example, if the emotion recognition result output by the emotion classifier is: dejection 2%, calm 20%, enthusiasm 80%, passion 60%, the emotion indicated by the result is enthusiasm.
In addition, the emotion recognition result may be a multi-class result carrying one indicator, in which case the emotion indicated by the result is the emotion carrying the indicator. The indicator may be one or more of words, numbers, characters, and so on. For example, if 1 is the indicator and the emotion recognition result output by the emotion classifier is: dejection 1, calm 0, enthusiasm 0, passion 0, the emotion indicated by the result is dejection.
Furthermore, the emotion recognition result may be an emotion score, with each emotion corresponding to a different score interval; the emotion indicated by the result is then the emotion corresponding to the score interval in which the score falls.
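By way of illustration only, the following Python sketch covers the three result representations just described (probabilities, an indicator flag, and a score falling into an interval); the category names, the indicator value 1, and the score intervals are the illustrative values used above.
def emotion_from_probs(probs):
    """Probability form: the emotion with the highest probability is indicated."""
    return max(probs, key=probs.get)
def emotion_from_indicator(flags, indicator=1):
    """Indicator form: the emotion whose flag equals the indicator is indicated."""
    return next(e for e, v in flags.items() if v == indicator)
def emotion_from_score(score, intervals):
    """Score form: the emotion whose score interval contains the score is indicated."""
    return next(e for e, (lo, hi) in intervals.items() if lo <= score < hi)
# Examples matching the text above.
print(emotion_from_probs({"dejection": 0.02, "calm": 0.20, "enthusiasm": 0.80, "passion": 0.60}))  # enthusiasm
print(emotion_from_indicator({"dejection": 1, "calm": 0, "enthusiasm": 0, "passion": 0}))          # dejection
print(emotion_from_score(0.7, {"dejection": (0.0, 0.25), "calm": (0.25, 0.5),
                               "enthusiasm": (0.5, 0.75), "passion": (0.75, 1.01)}))               # enthusiasm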
Determining the responded-to emotion from the voice data alone starts only from the dimension of sound and recognizes the emotion carried in the voice, which is simple and feasible to implement.
FIG. 21 is an exemplary flowchart of a method for determining the responded-to emotion according to some embodiments of the present application. In some embodiments, the responded-to emotion may be determined by the following steps:
Step 1922: extracting voice features of the voice data.
Step 1924: processing the voice features with the trained emotion classifier to obtain an emotion recognition result.
Steps 1922-1924 are performed as described above and are not repeated here.
Step 1926: converting the voice data into text data.
Steps 1926-1928 are used to obtain an emotion parsing result from the perspective of the content. It should be understood that there is no required execution order between steps 1922-1924 and steps 1926-1928: apart from steps 1922 and 1924 being executed in order and steps 1926 and 1928 being executed in order, some embodiments of this specification do not particularly limit the execution order of these steps. They may be executed sequentially as shown in FIG. 21, or simultaneously, or step 1926 may start after step 1922 is executed, and so on; the possibilities are not listed exhaustively.
In some embodiments, the voice data may be converted into text data by a speech decoder, which is not detailed here.
Step 1928: performing emotion parsing on the text data to obtain an emotion parsing result.
In some embodiments, emotion-related words in the text data may be recognized, and the emotion parsing result of the text data is then determined from the emotion-related words. For more details about emotion-related words, see FIG. 8 and its related description.
In addition, an emotion score may be preset for each emotion-related word. All emotion-related words in the text data can then be recognized, their emotion scores weighted (or directly summed or averaged), and the weighted score taken as the emotion parsing result.
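By way of illustration only, the following Python sketch implements the word-level scoring described in the preceding paragraph as a weighted average; the lexicon entries, weights, and scores are made-up values for the example.
# Hypothetical preset lexicon: emotion-related word -> (emotion score, weight).
EMOTION_LEXICON = {
    "great": (0.9, 1.0),
    "thanks": (0.7, 0.8),
    "tired": (-0.6, 1.0),
    "terrible": (-0.9, 1.2),
}
def emotion_parsing_score(text):
    """Weighted average of the scores of the emotion-related words found in the text."""
    hits = [EMOTION_LEXICON[w] for w in text.lower().split() if w in EMOTION_LEXICON]
    if not hits:
        return 0.0                       # no emotion-related words: treat as neutral
    total_weight = sum(w for _, w in hits)
    return sum(s * w for s, w in hits) / total_weight
print(emotion_parsing_score("thanks the ride was great"))             # positive weighted score
print(emotion_parsing_score("I am tired and the day was terrible"))   # negative weighted score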
Step 19210: determining the responded-to emotion from the emotion recognition result and the emotion parsing result.
In some embodiments, if the emotion recognition result and the emotion parsing result are both in score form, the two can be weighted (summed or averaged), and the emotion corresponding to the score interval in which the weighted value falls is taken as the responded-to emotion. If one or both of them are not in score form, the emotion recognition result (or the emotion parsing result) can be converted into score form according to a preset algorithm, and the weighting is then performed to determine the responded-to emotion.
In some embodiments, when the emotion categories indicated by the emotion recognition result and the emotion parsing result are the same, the emotion category indicated by the emotion recognition result is taken as the responded-to emotion. Alternatively, when the emotion categories indicated by the two are different, the emotion recognition result and the emotion parsing result are weighted, and the emotion category indicated after the weighting is taken as the responded-to emotion (converted to scores and then weighted, as above, which is not repeated).
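By way of illustration only, the following Python sketch combines the two rules in the last two paragraphs, assuming both results have already been converted to a score in [-1, 1]; the score intervals and the weights are assumed values.
# Hypothetical score intervals (in [-1, 1]) shared by both branches.
SCORE_INTERVALS = {
    "dejection": (-1.0, -0.3),
    "calm": (-0.3, 0.3),
    "enthusiasm": (0.3, 0.7),
    "passion": (0.7, 1.01),
}
def emotion_of(score):
    return next(e for e, (lo, hi) in SCORE_INTERVALS.items() if lo <= score < hi)
def fuse(recognition_score, parsing_score, w_speech=0.6, w_text=0.4):
    """If both scores indicate the same emotion, keep it; otherwise weight the two scores."""
    if emotion_of(recognition_score) == emotion_of(parsing_score):
        return emotion_of(recognition_score)
    fused = w_speech * recognition_score + w_text * parsing_score
    return emotion_of(fused)
print(fuse(0.5, 0.6))    # both indicate "enthusiasm" -> enthusiasm
print(fuse(0.8, -0.5))   # disagreement -> the weighted score decides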
By determining the responded-to emotion from both the voice data and the text data converted from it, the emotional state of the voice data uttered by the user can be analyzed more comprehensively from the two dimensions of sound and content (text). This helps improve the accuracy of the recognition result and, in turn, narrows the gap between the response voice and the user's emotional needs, making the interaction more humane and more realistic.
FIG. 22 is a block diagram of a terminal according to some embodiments of the present application. In some embodiments, the methods of FIG. 5, FIGS. 8-9, and FIGS. 19-21 may be executed on a mobile device or terminal (for example, a passenger terminal or a driver terminal), for example by the processor 340 of the mobile device.
In some embodiments, the determination module 420 may include a recognition module 2210 and a response emotion determination module 2220. In some embodiments, the terminal 2200 may include: a collection module 1810, a recognition module 2210, a response emotion determination module 2220, and an output module 1830.
The collection module 1810 may also be configured to collect current voice data.
The recognition module 2210 is configured to recognize the responded-to emotion of the voice data.
The response emotion determination module 2220 is configured to determine the response emotion corresponding to the responded-to emotional state.
The output module 1830 is further configured to output a response voice for the voice data, the response voice having the response emotion.
In some embodiments, the recognition module 2210 may be configured to extract the voice features of the voice data; to process the voice features with the trained emotion classifier to obtain an emotion recognition result; to convert the voice data into text data and perform emotion parsing on the text data to obtain an emotion parsing result; to determine the emotion parsing result from the emotion-related words recognized in the text data; and to determine the responded-to emotion from the emotion recognition result and the emotion parsing result.
In some embodiments, the acquisition module 440 may include a training module (not shown in FIG. 22), which may be configured to obtain the second classification model (also referred to as the emotion classifier) through training. For details, see FIG. 9 and its related description.
In some embodiments, the response module 430 may include a generation module (not shown in FIG. 22), which may be configured to generate the response voice from the response emotion and the response content.
In existing voice interaction scenarios, the terminal determines, from the voice or text sent by the user, the response content corresponding to that voice or text, and outputs the response content to the user. This approach can only produce a reply to the interaction instruction; the human-computer interaction is too monotonous and cannot meet the user's personalized interaction needs. For example, in the aforementioned praise scenario, when a user says "praise me", the terminal outputs default praise content in response to this human-computer interaction instruction. When different users say "praise me", the terminal outputs exactly the same praise content, which obviously cannot meet personalized interaction needs and makes for a poor interaction experience. Some embodiments of this specification, for example those of FIG. 5, FIG. 10, or FIG. 23, can solve these technical problems of the prior art.
FIG. 23 is an exemplary flowchart of a human-computer interaction method according to some embodiments of the present application. The third classification model may also be referred to as a trained style classifier (hereinafter the "style classifier"). In some embodiments, the human-computer interaction method includes:
Step 2310: receiving an interaction instruction for a target object.
In some embodiments, the interaction instruction means that voice data or text data is received from the user. Taking the aforementioned praise scenario as an example, this may be receiving the text data "praise me" from the user, or collecting the voice data "praise me" uttered by the user.
The user may be the user to whom the terminal belongs. The target object may be the user to whom the current terminal belongs, or the target object may be the other-party user communicating with the current terminal. That is, the interaction instruction may be directed at different target objects; the specific examples are similar to those of the specified phrase being directed at different target objects, see step 1710, which is not repeated here.
Step 2320: determining the response style of the target object, the response style being related to historical data of the target object.
For the historical data, see FIG. 5 and its related description. The historical data can directly or indirectly reflect the response style that the target object personally prefers; therefore, the response style of the target object can be determined based on the historical data. For details, see FIG. 24 and its related description.
Step 2330: determining a response phrase according to the response style and the interaction instruction.
After the response style is determined, a response phrase having that response style can be obtained based on the aforementioned interaction instruction. It should be understood that the content of the response phrase is related to the interaction instruction. For example, if the received interaction instruction is "praise the driver", the content of the response phrase is praise for the driver; if the received interaction instruction is "praise the passenger", the content of the response phrase is praise for the passenger.
In some embodiments, candidate phrases corresponding to each response style may be preset, so that one response phrase can be determined from the multiple candidate phrases corresponding to the determined response style. For determining a response phrase from multiple candidate phrases, see step 530 and its related description.
In some embodiments, priorities may be preset for the candidate phrases corresponding to each response style, and a candidate phrase with a higher priority is preferentially selected as the response phrase.
In some embodiments, the interaction instruction may be parsed to obtain the emotional style of the interaction instruction; the final response style is then determined by combining the history-based response style with this emotional style, and the response phrase corresponding to the resulting response style is determined. As described above, the first style is a style determined based on the first feature and/or the second feature, and the first feature is a feature determined based on the interaction instruction. In some embodiments, the first style includes the emotional style of the interaction instruction. In some embodiments, the emotional style of the interaction instruction may include the responded-to emotion of the interaction instruction; for more details, see FIG. 8 or FIG. 19 and their related descriptions. For example, when the recognition result of the interaction instruction contains very positive keywords such as "very good" or "praise me hard", the user's personalized emotional style leans toward a very enthusiastic praise style.
When combining the response style and the emotional style to determine the final response style, the two can be normalized into numerical values and the normalized scores weighted (or directly summed or averaged); the style corresponding to the weighted score is then taken as the response style corresponding to the interaction instruction.
In an actual scenario, it is also possible to first judge whether the response style and the emotional style are consistent. If they are consistent, the style indicated by both is the response style; if not, the aforementioned weighting can be used to determine the response style.
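By way of illustration only, the following Python sketch applies the combination rule of the last two paragraphs, assuming the history-based response style and the instruction-derived emotional style are each expressed as a normalized score in [0, 1] mapped onto the three praise-intensity styles used later in this description; the intervals and weights are assumed values.
# Hypothetical score intervals for the three praise styles.
STYLE_INTERVALS = {"mild praise": (0.0, 0.34), "normal praise": (0.34, 0.67), "strong praise": (0.67, 1.01)}
def style_of(score):
    return next(s for s, (lo, hi) in STYLE_INTERVALS.items() if lo <= score < hi)
def combine_styles(history_score, instruction_score, w_history=0.5, w_instruction=0.5):
    """Keep the style when both agree; otherwise weight the two normalized scores."""
    if style_of(history_score) == style_of(instruction_score):
        return style_of(history_score)
    fused = w_history * history_score + w_instruction * instruction_score
    return style_of(fused)
print(combine_styles(0.8, 0.9))   # both indicate "strong praise" -> strong praise
print(combine_styles(0.8, 0.2))   # disagreement -> the weighted score decides ("normal praise")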
In some embodiments, the text content obtained from speech recognition, together with historical data such as the target object's behavior and account information, serves as one reference factor for the personalized response phrase; in addition, parsing the interaction instruction on its own serves as another reference factor for the personalized response. The two reference factors are then weighted together, and the weighted result is used as the final result for evaluating which type of personality tendency the target object has shown over a period of time.
It should be understood that this weighted result is not fixed: as the target object's usage data is continuously updated, the target object's response style is periodically updated offline, so as to better adapt to the target object's fluctuations over different periods (on the premise that the personality of the target object is not single but diverse and fluctuates with the environment). Using this weighting can better fit the target object's personality tendency and thus produce better personalized recommendations for the target object.
Step 2340: outputting the response phrase to the target object.
In this step, based on the response phrase that has already been determined, it is only necessary to output the response phrase to the target object. For more details on outputting the response phrase, see step 530 and its related description.
In some embodiments of this specification, when the terminal interacts with the user, the response style the target object is likely to prefer can be determined from the target object's historical data, and the response phrase can then be determined and output by combining the response style with the interaction instruction. In this way, the response phrase is closer to the target object's personalized style: even for the same interaction instruction, the response phrases for different target objects may differ. This solves the problem that existing human-computer interaction is monotonous and cannot meet users' personalized interaction needs, and it also makes the interaction process more realistic and engaging.
FIG. 24 is an exemplary flowchart of a human-computer interaction method according to some embodiments of the present application.
Step 2410: obtaining historical data of the target object.
In some embodiments, only the target object's historical data from a recent period, for example the last week, the last month, or the last three days, may be obtained, so as to reduce the influence of older historical data on the response style and make the response style better match the user's preferences over the current period.
It should be understood that, for any target object, the historical data is continuously updated over time. Therefore, for the same interaction instruction from the same target object, the response phrase output by the terminal may be the same or different. For example, if the user's preferences change, the response style determined by the terminal will differ, and the response phrase output may also differ.
Step 2420: processing the historical data to obtain object features of the target object.
As described above, the first feature is determined based on the historical data; the first feature may be referred to as an object feature.
In step 2410, text data may be collected, or voice data may be collected. In the latter case, the text data corresponding to the voice data can be obtained by performing semantic recognition on the voice data.
On this basis, it is only necessary to extract feature words from the historical data (which has all been converted into text data), normalize them, and integrate them into a feature vector. The extracted feature words may include, but are not limited to, word frequency features.
Step 2430: processing the object features with the trained style classifier to obtain the response style of the target object.
In some embodiments, the style classifier may be used to classify the historical data by style. The style classifier can be obtained by training; for the training and deployment of the style classifier, see FIG. 11 and its related description.
The style classifier can be used offline and may be embodied as a model with a small number of parameters. For the type of the style classifier, see FIG. 5 and its related description.
In some embodiments, the style recognition result output by the style classifier may be a multi-class result. Exemplarily, for ease of explanation, the response styles are divided by praise intensity into three styles: strong praise (intense praise, higher degree of praise), normal praise, and mild praise (slight praise, lower degree of praise).
On this basis, the multi-class result output by the style classifier can identify a style probability for each response style, and the style with the highest probability indicated by the style classification result is taken as the response style of the target object. For example, if the style recognition result output by the style classifier is: strong praise 70%, normal praise 50%, mild praise 10%, the response style indicated by the result is strong praise.
In addition, the style recognition result may be a multi-class result carrying one indicator, in which case the style indicated by the result is the style carrying the indicator. The indicator may be one or more of words, numbers, characters, and so on. For example, if 1 is the indicator and the style recognition result output by the style classifier is: strong praise 0, normal praise 1, mild praise 0, the response style indicated by the result is normal praise.
Furthermore, the style recognition result may be a style score, with each style corresponding to a different score interval; the style indicated by the result is then the style corresponding to the score interval in which the style score falls.
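By way of illustration only, the following Python sketch interprets the probability form and the indicator form of the style recognition result described above, using the example values from the preceding paragraphs.
def style_from_probs(probs):
    """Probability form: the style with the highest probability is the response style."""
    return max(probs, key=probs.get)
def style_from_indicator(flags, indicator=1):
    """Indicator form: the style whose flag equals the indicator is the response style."""
    return next(s for s, v in flags.items() if v == indicator)
# Examples matching the text above.
print(style_from_probs({"strong praise": 0.7, "normal praise": 0.5, "mild praise": 0.1}))  # strong praise
print(style_from_indicator({"strong praise": 0, "normal praise": 1, "mild praise": 0}))    # normal praise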
FIG. 25 is a block diagram of a terminal according to some embodiments of the present application. In some embodiments, the methods of FIG. 5, FIGS. 10-11, and FIGS. 23-24 may be executed on a mobile device or terminal (for example, a passenger terminal or a driver terminal), for example by the processor 340 of the mobile device.
In some embodiments, the acquisition module 440 may include a receiving module 2510, the determination module 420 may include a response style determination module 2520, and the response module 430 may include a response phrase determination module 2530. In some embodiments, the terminal 2500 includes: a receiving module 2510, a response style determination module 2520, a response phrase determination module 2530, and an output module 1830.
The receiving module 2510 is configured to receive an interaction instruction for a target object.
The response style determination module 2520 is configured to determine the response style of the target object, the response style being related to the historical data of the target object. In some embodiments, the response style determination module 2520 may be configured to process the obtained historical data to obtain object features of the target object, and to process the object features with the trained style classifier to obtain the response style of the target object.
The response phrase determination module 2530 is configured to determine the response phrase according to the response style and the interaction instruction.
The output module 1830 is configured to output the response phrase to the target object.
In some embodiments, the terminal 2500 further includes a training module (not shown in FIG. 25), which is used to train the style classifier. In some embodiments, the training module is also used to determine style labels and to update the style classifier.
In some embodiments, the terminal in this specification (for example, the terminal 1800, 2200, or 2500) may be a server or a terminal device.
It should be understood that the division of the modules in the block diagram 400 shown in FIG. 4, the terminal 1800 shown in FIG. 18, the terminal 2200 shown in FIG. 22, and the terminal 2500 shown in FIG. 25 is merely a division of logical functions. In actual implementation, the modules may be fully or partially integrated into one physical entity, or may be physically separated. These modules may all be implemented in the form of software invoked by a processing element, or all in the form of hardware, or some in the form of software invoked by a processing element and others in the form of hardware. For example, the extraction module 410 may be a separately provided processing element, or may be integrated in the terminal 1800, for example implemented in a certain chip of the terminal; it may also be stored in the memory of the terminal 1800 in the form of a program, with a processing element of the terminal 1800 invoking and executing the functions of the above modules. The implementation of the other modules is similar. In addition, all or some of these modules may be integrated together or implemented independently. The processing element described here may be an integrated circuit with signal processing capability. In the implementation process, each step of the above methods, or each of the above modules, may be completed by an integrated logic circuit of hardware in the processor element or by instructions in the form of software. For example, the above modules may be one or more integrated circuits configured to implement the above methods, such as one or more application-specific integrated circuits (ASICs), one or more digital signal processors (DSPs), or one or more field-programmable gate arrays (FPGAs). As another example, when one of the above modules is implemented in the form of a processing element scheduling a program, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor that can invoke programs. As yet another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SoC).
上文已对基本概念做了描述,显然,对于本领域技术人员来说,上述详细披露仅仅作为示例,而并不构成对本说明书的限定。虽然此处并没有明确说明,本领域技术人员可能会对本说明书进行各种修改、改进和修正。该类修改、改进和修正在本说明书中被建议,所以该类修改、改进、修正仍属于本说明书示范实施例的精神和范围。The basic concepts have been described above. Obviously, for those skilled in the art, the above detailed disclosure is only an example, and does not constitute a limitation to this specification. Although it is not explicitly stated here, those skilled in the art may make various modifications, improvements and amendments to this specification. Such modifications, improvements, and corrections are suggested in this specification, so such modifications, improvements, and corrections still belong to the spirit and scope of the exemplary embodiments of this specification.
同时,本说明书使用了特定词语来描述本说明书的实施例。如“一个实施例”、“一实施例”、和/或“一些实施例”意指与本说明书至少一个实施例相关的某一特征、结构或特点。因此,应强调并注意的是,本说明书中在不同位置两次或多次提及的“一实施例”或“一个实施例”或“一个替代性实施例”并不一定是指同一实施例。此外,本说明书的一个或多个实施例中的某些特征、结构或特点可以进行适当的组合。Meanwhile, this specification uses specific words to describe the embodiments of this specification. For example, "one embodiment", "an embodiment", and/or "some embodiments" mean a certain feature, structure, or characteristic related to at least one embodiment of this specification. Therefore, it should be emphasized and noted that “one embodiment” or “one embodiment” or “an alternative embodiment” mentioned twice or more in different positions in this specification does not necessarily refer to the same embodiment. . In addition, some features, structures, or characteristics in one or more embodiments of this specification can be appropriately combined.
此外,除非权利要求中明确说明,本说明书所述处理元素和序列的顺序、数字字母的使用、或其他名称的使用,并非用于限定本说明书流程和方法的顺序。尽管上述披露中通过各种示例讨论了一些目前认为有用的发明实施例,但应当理解的是,该类细节仅起到说明的目的,附加的权利要求并不仅限于披露的实施例,相反,权利要求旨在覆盖所有符合本说明书实施例实质和范围的修正和等价组合。例如,虽然以上所描述的系统组件可以通过硬件设备实现,但是也可以只通过软件的解决方案得以实现,如在现有的服务器或移动设备上安装所描述的系统。In addition, unless explicitly stated in the claims, the order of processing elements and sequences, the use of numbers and letters, or the use of other names in this specification are not used to limit the order of the processes and methods in this specification. Although the foregoing disclosure uses various examples to discuss some embodiments of the invention that are currently considered useful, it should be understood that such details are only for illustrative purposes, and the appended claims are not limited to the disclosed embodiments. On the contrary, the rights are The requirements are intended to cover all modifications and equivalent combinations that conform to the essence and scope of the embodiments of this specification. For example, although the system components described above can be implemented by hardware devices, they can also be implemented only by software solutions, such as installing the described system on an existing server or mobile device.
同理,应当注意的是,为了简化本说明书披露的表述,从而帮助对一个或多个发明实施例的理解,前文对本说明书实施例的描述中,有时会将多种特征归并至一个实施例、附图或对其的描述中。但是,这种披露方法并不意味着本说明书对象所需要的特征比权利要求中提及的特征多。实际上,实施例的特征要少于上述披露的单个实施例的全部特征。For the same reason, it should be noted that, in order to simplify the expressions disclosed in this specification and help the understanding of one or more embodiments of the invention, in the foregoing description of the embodiments of this specification, multiple features are sometimes combined into one embodiment. In the drawings or its description. However, this method of disclosure does not mean that the subject of the specification requires more features than those mentioned in the claims. In fact, the features of the embodiment are less than all the features of the single embodiment disclosed above.
Some embodiments use numbers describing quantities of components and attributes. It should be understood that such numbers used in the description of the embodiments are, in some examples, qualified by the modifiers "about", "approximately", or "substantially". Unless otherwise stated, "about", "approximately", or "substantially" indicates that the stated number is allowed to vary by ±20%. Accordingly, in some embodiments, the numerical parameters used in the specification and claims are approximations that may change depending on the desired characteristics of individual embodiments. In some embodiments, numerical parameters should take into account the specified number of significant digits and apply an ordinary digit-retention (rounding) method. Although the numerical ranges and parameters used to establish the breadth of the ranges in some embodiments of this specification are approximations, in specific embodiments such numerical values are set as precisely as is feasible.
Each patent, patent application, patent application publication, and other material cited in this specification, such as articles, books, specifications, publications, and documents, is hereby incorporated into this specification by reference in its entirety. Application history documents that are inconsistent with or conflict with the content of this specification are excluded, as are documents (currently or later appended to this specification) that limit the broadest scope of the claims of this specification. It should be noted that if there is any inconsistency or conflict between the descriptions, definitions, and/or use of terms in the materials accompanying this specification and the content of this specification, the descriptions, definitions, and/or use of terms in this specification shall prevail.
Finally, it should be understood that the embodiments described in this specification are only intended to illustrate the principles of the embodiments of this specification. Other variations may also fall within the scope of this specification. Therefore, by way of example and not limitation, alternative configurations of the embodiments of this specification may be regarded as consistent with the teachings of this specification. Accordingly, the embodiments of this specification are not limited to those explicitly introduced and described herein.
Claims (21)
- A method of human-computer interaction, characterized by comprising: extracting an associated feature of a target object based on an interaction instruction directed to the target object, the associated feature being related to at least one of the interaction instruction and historical data of the target object; the associated feature including a first feature corresponding to the interaction instruction; if the interaction instruction does not include voice data, the first feature including a text feature of text data corresponding to the interaction instruction; if the interaction instruction includes voice data, the first feature including at least one of a voice feature of the voice data and a text feature of text data corresponding to the interaction instruction; determining a response strategy for the target object based on processing the associated feature, the response strategy being related to at least one of response content, response style, and response emotion; and determining, based on the response strategy, a response script for the target object.
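Purely as an illustration, and not as part of the claimed subject matter, the Python sketch below arranges the three steps recited in claim 1: extracting the associated feature, determining the response strategy, and determining the response script. Every name in it (extract_features, ResponseStrategy, generate_response, the template contents) is a hypothetical placeholder introduced for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResponseStrategy:
    """Hypothetical container for the three strategy dimensions in claim 1."""
    content: Optional[str] = None   # response content
    style: Optional[str] = None     # response style
    emotion: Optional[str] = None   # response emotion


def extract_features(instruction: dict, history: dict) -> dict:
    """Build the associated feature: a first feature from the instruction and,
    optionally, a second feature from the target object's historical data."""
    features = {"history": history}
    if "audio" in instruction:
        # Voice feature and/or text feature when voice data is present.
        features["voice"] = instruction["audio"]
        features["text"] = instruction.get("text", "")
    else:
        # Text feature only when the instruction carries no voice data.
        features["text"] = instruction["text"]
    return features


def decide_strategy(features: dict) -> ResponseStrategy:
    """Placeholder for the classification models described in claims 5 to 17."""
    return ResponseStrategy(content="greeting", style="concise", emotion="neutral")


def generate_response(strategy: ResponseStrategy) -> str:
    """Placeholder mapping from a strategy to a concrete response script."""
    templates = {("greeting", "concise", "neutral"): "Hello, how can I help you?"}
    return templates.get((strategy.content, strategy.style, strategy.emotion),
                         "Sorry, could you say that again?")


if __name__ == "__main__":
    instruction = {"text": "hi there"}
    strategy = decide_strategy(extract_features(instruction, history={}))
    print(generate_response(strategy))
```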
- The method according to claim 1, characterized in that the voice feature comprises: an energy feature and an audio feature of the voice data.
- The method according to claim 2, characterized in that: the audio feature comprises at least one of a fundamental frequency feature, a short-time energy feature, a short-time amplitude feature, and a short-time zero-crossing rate feature; and the energy feature comprises at least one of an Fbank feature and an MFCC feature.
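As a non-authoritative illustration of the features listed in claim 3, the following sketch computes a fundamental frequency contour, short-time (RMS) energy, short-time zero-crossing rate, log-Mel filterbank (Fbank) energies, and MFCCs with the librosa library; the file name, sampling rate, and filter counts are assumptions made for the example only.

```python
import librosa

# Hypothetical input file and sampling rate; any mono speech clip would do.
y, sr = librosa.load("utterance.wav", sr=16000)

# Fundamental frequency (F0), estimated frame by frame with the YIN algorithm.
f0 = librosa.yin(y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)

# Short-time energy (root-mean-square per frame) and short-time zero-crossing rate.
energy = librosa.feature.rms(y=y)
zcr = librosa.feature.zero_crossing_rate(y)

# Fbank: log-Mel filterbank energies; MFCC: cepstral coefficients derived from them.
fbank = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40))
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

print(f0.shape, energy.shape, zcr.shape, fbank.shape, mfcc.shape)
```

The short-time amplitude feature of claim 3 can be obtained analogously by summing the absolute sample values within each analysis frame.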
- The method according to claim 2, characterized in that the method further comprises: before extracting the voice feature, preprocessing the voice data, the preprocessing comprising at least one of framing, pre-emphasis, windowing, and denoising.
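A minimal numpy sketch of three of the preprocessing steps named in claim 4 (pre-emphasis, framing, and windowing), under assumed frame and hop lengths; denoising is omitted for brevity.

```python
import numpy as np


def preprocess(signal: np.ndarray, sr: int = 16000,
               frame_ms: float = 25.0, hop_ms: float = 10.0,
               pre_emphasis: float = 0.97) -> np.ndarray:
    """Return an array of windowed frames with shape (n_frames, frame_len)."""
    # Pre-emphasis: boost high frequencies, y[n] = x[n] - a * x[n-1].
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])

    # Framing: split the signal into overlapping frames.
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop_len)
    frames = np.stack([emphasized[i * hop_len: i * hop_len + frame_len]
                       for i in range(n_frames)])

    # Windowing: apply a Hamming window per frame to reduce spectral leakage.
    return frames * np.hamming(frame_len)


if __name__ == "__main__":
    dummy = np.random.randn(16000)  # one second of synthetic audio
    print(preprocess(dummy).shape)  # -> (98, 400) with the assumed settings
```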
- The method according to claim 1, characterized in that the interaction instruction includes voice data, and determining the response strategy for the target object based on processing the associated feature comprises: processing the voice feature based on a first classification model to determine whether the voice data contains a specified script in any one of a plurality of languages; and in response to the voice data containing the specified script, determining response content corresponding to the specified script.
- The method according to claim 5, characterized in that the first classification model is obtained through a training process, the training process comprising: obtaining a plurality of first training samples, the plurality of first training samples being first sample voice data in the plurality of languages, the plurality of first training samples including positive samples and negative samples, a positive sample being first sample voice data related to the specified script, and a negative sample being first sample voice data unrelated to the specified script; and training an initial first classification model based on the plurality of first training samples to obtain the first classification model.
- The method according to claim 6, characterized in that obtaining the plurality of first training samples comprises: converting the first sample voice data into first sample text data based on a speech converter corresponding to the language of the first sample voice data; determining, based on the first sample text data, whether the first sample voice data is related to the specified script; in response to the first sample voice data being related to the specified script, determining the first sample voice data as a positive sample; and in response to the first sample voice data being unrelated to the specified script, determining the first sample voice data as a negative sample.
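A minimal sketch of the labelling procedure of claim 7, assuming a pluggable speech-to-text converter per language and treating a sample as related to the specified script when its transcript contains any phrase of that script; the asr_converters mapping and the phrase lists are illustrative assumptions rather than disclosed components.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical per-language speech-to-text converters (one per supported language).
AsrFn = Callable[[bytes], str]


def label_first_training_samples(
        samples: List[Tuple[bytes, str]],          # (raw audio, language code)
        asr_converters: Dict[str, AsrFn],
        specified_script: Dict[str, List[str]],    # language code -> script phrases
) -> List[Tuple[bytes, int]]:
    """Return (audio, label) pairs: 1 = positive (related to the specified
    script), 0 = negative (unrelated), following the steps of claim 7."""
    labelled = []
    for audio, lang in samples:
        # Convert with the converter that matches the sample's language.
        transcript = asr_converters[lang](audio).lower()
        # Related if any phrase of the specified script occurs in the transcript.
        related = any(phrase.lower() in transcript
                      for phrase in specified_script.get(lang, []))
        labelled.append((audio, 1 if related else 0))
    return labelled
```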
- The method according to claim 1, characterized in that the interaction instruction includes voice data, and determining the response strategy for the target object based on processing the associated feature comprises: processing at least one of the voice feature and the text feature based on a second classification model to determine an emotion to be responded to that is conveyed by the voice data; and determining the response emotion based on the emotion to be responded to.
- The method according to claim 8, characterized in that determining the emotion to be responded to of the voice data based on processing at least one of the voice feature and the text feature comprises: processing the voice feature based on the second classification model to determine a first emotion; determining a second emotion based on processing the text feature; and determining the emotion to be responded to based on the first emotion and/or the second emotion.
- The method according to claim 9, characterized in that the text feature includes a keyword feature, the keyword feature including emotion-related words; and determining the second emotion based on processing the text feature comprises: determining the second emotion based on the emotion-related words.
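An illustrative sketch covering claims 9 and 10: an assumed emotion-word lexicon yields the text-based second emotion, and a simple precedence rule fuses it with the first emotion produced by the speech-side model. Both the lexicon contents and the fusion rule are assumptions, not the disclosed implementation.

```python
from typing import Optional

# Hypothetical emotion-related word lexicon for the keyword feature of claim 10.
EMOTION_LEXICON = {
    "angry": ["furious", "ridiculous", "complaint"],
    "happy": ["great", "thanks", "awesome"],
    "sad": ["disappointed", "upset"],
}


def second_emotion_from_text(text: str) -> Optional[str]:
    """Text-based (second) emotion derived from emotion-related keywords."""
    lowered = text.lower()
    for emotion, words in EMOTION_LEXICON.items():
        if any(w in lowered for w in words):
            return emotion
    return None


def fuse_emotions(first: Optional[str], second: Optional[str]) -> str:
    """Emotion to be responded to, from the first and/or second emotion.
    Assumed rule: agreement wins, otherwise prefer the text cue, else speech."""
    if first and second and first == second:
        return first
    return second or first or "neutral"


if __name__ == "__main__":
    first = "neutral"   # e.g. output of the second classification model on voice features
    second = second_emotion_from_text("This is ridiculous, I want a refund")
    print(fuse_emotions(first, second))  # -> "angry"
```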
- The method according to claim 9, characterized in that the second classification model is obtained through a training process, the training process comprising: obtaining a plurality of second training samples, each of the plurality of second training samples including second sample voice data and a corresponding emotion label, the emotion label representing a sample response emotion for the second sample voice data; and training an initial second classification model based on the plurality of second training samples to obtain the second classification model.
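A condensed PyTorch training loop, included only to sketch how an initial second classification model might be fitted to pairs of voice-derived features and emotion labels as described in claim 11; the feature dimensionality, number of emotion classes, and hyperparameters are assumed values.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Assumed toy data: 200 samples of 40-dim averaged Fbank features, 4 emotion classes.
features = torch.randn(200, 40)
labels = torch.randint(0, 4, (200,))
loader = DataLoader(TensorDataset(features, labels), batch_size=32, shuffle=True)

# A small initial classifier standing in for the initial second classification model.
model = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(5):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)   # compare predicted emotion logits with labels
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```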
- The method according to claim 1, characterized in that the associated feature includes a second feature corresponding to the historical data; and determining the response strategy for the target object based on processing the associated feature comprises: processing at least one of the first feature and the second feature based on a third classification model to determine the response style.
- The method according to claim 12, characterized in that processing at least one of the first feature and the second feature based on the third classification model to determine the response style of the target object comprises: processing at least one of the first feature and the second feature based on the third classification model to determine a first style for the target object; determining a second style for the target object based on processing the text feature; and determining the response style based on at least one of the first style and the second style.
- The method according to claim 12, characterized in that the third classification model is obtained through a training process, the training process comprising: obtaining a plurality of third training samples, each of the plurality of third training samples including a sample interaction instruction directed to a sample target object, sample historical data of the sample target object, and a corresponding style label, the style label representing a sample response style for the sample target object; and training an initial third classification model based on the plurality of third training samples to obtain the third classification model.
- The method according to claim 14, characterized in that obtaining the plurality of third training samples comprises: obtaining feedback data of the sample target object on a sample response script, the sample response script being determined based on the sample interaction instruction; obtaining reputation data of the sample target object; and determining the style label based on the reputation data and/or the feedback data.
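A hedged sketch of one way the style label of claim 15 could be derived from reputation data and feedback data; the thresholds and label names are purely illustrative assumptions.

```python
from typing import Optional


def style_label(reputation_score: float, feedback_rating: Optional[float]) -> str:
    """Map a sample target object's reputation score (assumed 0-100) and feedback
    rating (assumed 1-5, or None when no feedback exists) to a style label."""
    # Assumed rule: clearly negative feedback dominates; otherwise use reputation tiers.
    if feedback_rating is not None and feedback_rating <= 2:
        return "apologetic"
    if reputation_score >= 80:
        return "warm"
    if reputation_score >= 50:
        return "neutral"
    return "formal"


if __name__ == "__main__":
    print(style_label(reputation_score=90, feedback_rating=None))  # -> "warm"
```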
- The method according to claim 1, characterized in that the historical data includes at least one of personal account data, behavior data, and offline record data.
- The method according to claim 1, characterized in that determining the response strategy for the target object based on processing the associated feature comprises: processing the associated feature through a model to determine the response strategy for the target object; the model being composed of a multi-layer residual network and a multi-layer fully connected network, the multi-layer residual network being composed of convolutional neural networks; or the model being composed of a multi-layer convolutional neural network and a multi-layer fully connected network.
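As a non-authoritative example of the architecture option in claim 17 (a multi-layer residual network built from convolutional layers followed by a multi-layer fully connected network), here is a compact PyTorch sketch; the use of 1-D convolutions over feature sequences and all layer sizes are assumptions.

```python
import torch
from torch import nn


class ResidualBlock(nn.Module):
    """One residual unit built from 1-D convolutions (input shape: N, C, T)."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x + self.body(x))   # skip connection


class StrategyModel(nn.Module):
    """Multi-layer residual network followed by a multi-layer fully connected head."""
    def __init__(self, in_channels: int = 40, num_classes: int = 4, blocks: int = 3):
        super().__init__()
        self.stem = nn.Conv1d(in_channels, 64, kernel_size=3, padding=1)
        self.res = nn.Sequential(*[ResidualBlock(64) for _ in range(blocks)])
        self.pool = nn.AdaptiveAvgPool1d(1)    # collapse the time axis
        self.fc = nn.Sequential(
            nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, num_classes)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.pool(self.res(self.stem(x))).squeeze(-1)
        return self.fc(h)


if __name__ == "__main__":
    dummy = torch.randn(8, 40, 100)            # batch of 8 feature sequences
    print(StrategyModel()(dummy).shape)        # -> torch.Size([8, 4])
```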
- The method according to claim 1, characterized in that the target object is the user to whom the current terminal belongs; or the target object is a counterpart user in communication with the current terminal.
- The method according to claim 1, characterized in that the response script is output to the target object in the form of response text and/or response speech.
- A system for human-computer interaction, characterized by comprising: an extraction module configured to extract an associated feature of a target object based on an interaction instruction directed to the target object, the associated feature being related to at least one of the interaction instruction and historical data of the target object; the associated feature including a first feature corresponding to the interaction instruction; if the interaction instruction does not include voice data, the first feature including a text feature of text data corresponding to the interaction instruction; if the interaction instruction includes voice data, the first feature including at least one of a voice feature of the voice data and a text feature of text data corresponding to the interaction instruction; a determination module configured to determine a response strategy for the target object based on processing the associated feature, the response strategy being related to at least one of response content, response style, and response emotion; and a response module configured to determine, based on the response strategy, a response script for the target object.
- A computer-readable storage medium storing computer instructions, wherein after a computer reads the computer instructions in the storage medium, the computer performs the following method of human-computer interaction: extracting an associated feature of a target object based on an interaction instruction directed to the target object, the associated feature being related to at least one of the interaction instruction and historical data of the target object; the associated feature including a first feature corresponding to the interaction instruction; if the interaction instruction does not include voice data, the first feature including a text feature of text data corresponding to the interaction instruction; if the interaction instruction includes voice data, the first feature including at least one of a voice feature of the voice data and a text feature of text data corresponding to the interaction instruction; determining a response strategy for the target object based on processing the associated feature, the response strategy being related to at least one of response content, response style, and response emotion; and determining, based on the response strategy, a response script for the target object.
Applications Claiming Priority (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010018047.6A CN111833854B (en) | 2020-01-08 | 2020-01-08 | Man-machine interaction method, terminal and computer readable storage medium |
CN202010017735.0 | 2020-01-08 | ||
CN202010017735.0A CN111833907B (en) | 2020-01-08 | 2020-01-08 | Man-machine interaction method, terminal and computer readable storage medium |
CN202010016725.5 | 2020-01-08 | ||
CN202010018047.6 | 2020-01-08 | ||
CN202010016725.5A CN111833865B (en) | 2020-01-08 | 2020-01-08 | Man-machine interaction method, terminal and computer readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021139737A1 true WO2021139737A1 (en) | 2021-07-15 |
Family
ID=76787752
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/070720 WO2021139737A1 (en) | 2020-01-08 | 2021-01-07 | Method and system for man-machine interaction |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2021139737A1 (en) |
Application filing event (2021-01-07): PCT/CN2021/070720, published as WO2021139737A1 (en), active Application Filing
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106683672A (en) * | 2016-12-21 | 2017-05-17 | 竹间智能科技(上海)有限公司 | Intelligent dialogue method and system based on emotion and semantics |
US20190295533A1 (en) * | 2018-01-26 | 2019-09-26 | Shanghai Xiaoi Robot Technology Co., Ltd. | Intelligent interactive method and apparatus, computer device and computer readable storage medium |
CN109272984A (en) * | 2018-10-17 | 2019-01-25 | 百度在线网络技术(北京)有限公司 | Method and apparatus for interactive voice |
CN109587360A (en) * | 2018-11-12 | 2019-04-05 | 平安科技(深圳)有限公司 | Electronic device should talk with art recommended method and computer readable storage medium |
CN109979457A (en) * | 2019-05-29 | 2019-07-05 | 南京硅基智能科技有限公司 | A method of thousand people, thousand face applied to Intelligent dialogue robot |
CN111833907A (en) * | 2020-01-08 | 2020-10-27 | 北京嘀嘀无限科技发展有限公司 | Man-machine interaction method, terminal and computer readable storage medium |
CN111833854A (en) * | 2020-01-08 | 2020-10-27 | 北京嘀嘀无限科技发展有限公司 | Man-machine interaction method, terminal and computer readable storage medium |
CN111833865A (en) * | 2020-01-08 | 2020-10-27 | 北京嘀嘀无限科技发展有限公司 | Man-machine interaction method, terminal and computer readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112259106B (en) | Voiceprint recognition method and device, storage medium and computer equipment | |
US10977452B2 (en) | Multi-lingual virtual personal assistant | |
US11908468B2 (en) | Dialog management for multiple users | |
WO2021208287A1 (en) | Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium | |
JP7022062B2 (en) | VPA with integrated object recognition and facial expression recognition | |
CN111312245B (en) | Voice response method, device and storage medium | |
KR100586767B1 (en) | System and method for multi-modal focus detection, referential ambiguity resolution and mood classification using multi-modal input | |
CN112069484A (en) | Multi-mode interactive information acquisition method and system | |
US11574637B1 (en) | Spoken language understanding models | |
WO2021047319A1 (en) | Voice-based personal credit assessment method and apparatus, terminal and storage medium | |
KR20210070213A (en) | Voice user interface | |
CN109119069B (en) | Specific crowd identification method, electronic device and computer readable storage medium | |
CN111383138B (en) | Restaurant data processing method, device, computer equipment and storage medium | |
CN110459242A (en) | Change of voice detection method, terminal and computer readable storage medium | |
CN114127849A (en) | Speech emotion recognition method and device | |
WO2023226239A1 (en) | Object emotion analysis method and apparatus and electronic device | |
CN109947971A (en) | Image search method, device, electronic equipment and storage medium | |
US11437043B1 (en) | Presence data determination and utilization | |
CN117352000A (en) | Speech classification method, device, electronic equipment and computer readable medium | |
CN114373443A (en) | Speech synthesis method and apparatus, computing device, storage medium, and program product | |
CN110809796B (en) | Speech recognition system and method with decoupled wake phrases | |
CN117690456A (en) | Small language spoken language intelligent training method, system and equipment based on neural network | |
CN115132195B (en) | Voice wakeup method, device, equipment, storage medium and program product | |
WO2021139737A1 (en) | Method and system for man-machine interaction | |
CN113053409B (en) | Audio evaluation method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21738696; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 21738696; Country of ref document: EP; Kind code of ref document: A1 |