WO2021139737A1 - Method and system for man-machine interaction
- Publication number
- WO2021139737A1 (PCT/CN2021/070720)
- Authority
- WO
- WIPO (PCT)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
Description
- This specification relates to the field of computer technology, in particular to a human-computer interaction method and system.
- Terminals can provide automatic responses to users to realize human-computer interaction.
- The terminal can determine the response content corresponding to the voice or text sent by the user, and output the response content to the user.
- However, the terminal generally responds with a default tone, intonation, or content, which cannot meet the user's personalized interaction needs.
- The embodiments of this specification propose a human-computer interaction method and system to realize personalized interaction for different languages.
- One of the embodiments of this specification provides a human-computer interaction method. The method includes: extracting an associated feature of a target object based on an interaction instruction directed to the target object, where the associated feature is related to at least one of the interaction instruction and historical data of the target object; the associated feature includes a first feature corresponding to the interaction instruction; if the interaction instruction does not include voice data, the first feature includes the text feature of the text data corresponding to the interaction instruction; if the interaction instruction includes voice data, the first feature includes at least one of the voice feature of the voice data and the text feature of the text data corresponding to the interaction instruction; determining a response strategy for the target object based on processing the associated feature, the response strategy being related to at least one of response content, response style, and response emotion; and, based on the response strategy, determining response words for the target object.
- One of the embodiments of the present specification provides a human-computer interaction system. The system includes: an extraction module for extracting an associated feature of a target object based on an interaction instruction directed to the target object, where the associated feature is related to at least one of the interaction instruction and historical data of the target object; the associated feature includes a first feature corresponding to the interaction instruction; if the interaction instruction does not include voice data, the first feature includes the text feature of the text data corresponding to the interaction instruction; if the interaction instruction includes voice data, the first feature includes at least one of the voice feature of the voice data and the text feature of the text data corresponding to the interaction instruction; a determining module for determining a response strategy for the target object based on processing the associated feature, the response strategy being related to at least one of response content, response style, and response emotion; and a response module for determining, based on the response strategy, the response words for the target object.
- One of the embodiments of the present application provides a computer-readable storage medium that stores computer instructions. After the computer reads the computer instructions in the storage medium, the computer executes the aforementioned human-computer interaction method.
- Fig. 1 is a schematic diagram of an application scenario of a human-computer interaction system according to some embodiments of the present application
- Fig. 2 is a schematic diagram of exemplary hardware components and/or software components of an exemplary computing device according to some embodiments of the present application;
- Fig. 3 is a schematic diagram of exemplary hardware components and/or software components of an exemplary mobile device according to some embodiments of the present application;
- Fig. 4 is a block diagram of an exemplary processing device according to some embodiments of the present application.
- Fig. 5 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- Fig. 6 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- Fig. 7 is an exemplary flow chart of training a first classification model according to some embodiments of the present application.
- Fig. 8 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- Fig. 9 is an exemplary flowchart of training a second classification model according to some embodiments of the present application.
- Fig. 10 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- Fig. 11 is an exemplary flowchart of training a third classification model according to some embodiments of the present application.
- FIG. 12 is a schematic diagram of human-computer interaction according to some embodiments of the present application.
- FIG. 13 is a schematic diagram of human-computer interaction according to some embodiments of the present application.
- FIG. 14 is a schematic diagram of human-computer interaction according to some embodiments of the present application.
- FIG. 15 is a schematic diagram of human-computer interaction according to some embodiments of the present application.
- FIG. 16 is a schematic diagram of human-computer interaction according to some embodiments of the present application.
- Fig. 17 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- Figure 18 is a block diagram of a terminal according to some embodiments of the present application.
- Fig. 19 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- FIG. 20 is an exemplary flowchart of a method for determining a responded emotion according to some embodiments of the present application.
- FIG. 21 is an exemplary flowchart of a method for determining a response emotion according to some embodiments of the present application.
- Figure 22 is a block diagram of a terminal according to some embodiments of the present application.
- FIG. 23 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- FIG. 24 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- Fig. 25 is a block diagram of a terminal according to some embodiments of the present application.
- The word "system" used herein is a way of distinguishing different components, elements, parts, or assemblies at different levels.
- The word can be replaced by other expressions.
- Fig. 1 is a schematic diagram of an application scenario of a human-computer interaction system according to some embodiments of the present application.
- the human-computer interaction system 100 may include a server 110, a network 120, a first client 130, a second client 140 and a storage 150.
- the server 110 may process data and/or information obtained from at least one component of the system 100 (for example, the first client 130, the second client 140, and the storage 150) or an external data source (for example, a cloud data center). For example, the server 110 may obtain the interaction instruction from the first user terminal 130 (for example, the passenger terminal). For another example, the server 110 may also obtain historical data from the storage 150.
- the server 110 may include a processing device 112.
- the processing device 112 may process information and/or data related to the human-computer interaction system to perform one or more functions described in this specification. For example, the processing device 112 may determine response speech based on interactive instructions and/or historical data.
- the processing device 112 may include at least one processing unit (for example, a single-core processing engine or a multi-core processing engine). In some embodiments, the processing device 112 may be a part of the first client 130 and/or the second client 140.
- the network 120 may provide channels for information exchange.
- the network 120 may include one or more network access points.
- One or more components of the system 100 may be connected to the network 120 through an access point to exchange data and/or information.
- at least one component in the system 100 can access data or instructions stored in the memory 150 via the network 120.
- the owner of the first user terminal 130 may be the user himself or someone other than the user himself.
- the owner A of the first client 130 may use the first client 130 to send a service request for the user B.
- the first client 130 may include various types of devices with information receiving and/or sending functions.
- the first client 130 can process information and/or data.
- the first client 130 may be a device with a positioning function.
- the first client 130 may be a device with a display function, so that the response words fed back by the server 110 to the first client 130 may be displayed as an interface, pop-up window, floating window, small window, text, etc.
- the first client 130 may be a device with a voice function, so that the response words fed back by the server 110 to the first client 130 can be played.
- the second client 140 can communicate with the first client 130.
- the first client 130 and the second client 140 may communicate through a short-range communication device.
- the type of the second client 140 may be the same as or different from that of the first client 130.
- the first client 130 and the second client 140 may include, but are not limited to, a tablet computer, a notebook computer, a mobile device, a desktop computer, etc., or any combination thereof.
- the memory 150 may store data and/or instructions that can be executed or used by the processing device 112 to complete the exemplary methods described in this specification.
- the memory 150 may store historical data, a model used to determine the response speech, an audio file of the response speech, a text file, and the like.
- the storage 150 may be directly connected to the server 110 as a back-end storage.
- the storage 150 may be a part of the server 110, the first client 130 and/or the second client 140.
- Fig. 2 is a schematic diagram of exemplary hardware components and/or software components of an exemplary computing device according to some embodiments of the present application.
- the computing device 200 may include a processor 210, a memory 220, an input/output 230, and a communication port 240.
- the processor 210 can execute calculation instructions (program code) and perform the functions of the human-computer interaction system 100 described in the present invention.
- the calculation instructions may include programs, objects, components, data structures, procedures, modules, and functions (the functions refer to the specific functions described in the present invention).
- the processor 210 may process image or text data obtained from any other components of the human-computer interaction system 100.
- the computing device 200 in FIG. 2 only describes one processor, but it should be noted that the computing device 200 in the present invention may also include multiple processors.
- the memory 220 may store data/information obtained from any other components of the human-computer interaction system 100.
- the input/output 230 may be used to input or output signals, data or information.
- the input/output 230 may enable the user to communicate with the human-computer interaction system 100.
- the input/output 230 may include an input device and an output device.
- the communication port 240 may be connected to a network for data communication.
- the communication port 240 may be a standardized port or a specially designed port.
- Fig. 3 is a schematic diagram of exemplary hardware components and/or software components of an exemplary mobile device according to some embodiments of the present application.
- the terminal may be implemented by the mobile device 300.
- the mobile device 300 may include a communication platform 310, a display 320, a graphics processing unit (GPU) 330, a central processing unit (CPU) 340, an input/output 350, a memory 360, and a storage 390.
- the mobile device 300 may also include any other suitable components, including but not limited to a system bus or a controller (not shown in the figure).
- the mobile operating system 370 and one or more application programs 380 may be loaded from the storage 390 into the memory 360 so as to be executed by the central processing unit 340.
- the application program 380 may include a browser or any other suitable mobile application program for receiving and presenting prompt information or other related information from the server 110.
- the user interaction of the information flow can be implemented through the input/output 350 and provided to the server 110 and/or other components of the human-computer interaction system 100 through the network 120.
- Fig. 4 is a block diagram of an exemplary processing device according to some embodiments of the present application. As shown in FIG. 4, the system may include: an extraction module 410, a determination module 420, a response module 430, and an acquisition module 440.
- the extraction module 410 may be used to extract the associated features of the target object based on the interactive instruction for the target object.
- the associated feature includes a first feature corresponding to an interactive instruction and/or a second feature corresponding to historical data.
- the first feature includes at least one of a voice feature of the voice data in the interactive instruction and/or a text feature of the text data corresponding to the interactive instruction.
- the extraction module 410 may be used to preprocess the interaction instructions before extracting the associated features. See Figure 5 and related descriptions for more details.
- the determining module 420 may be used to determine a response strategy for the target object based on processing the associated features. In some embodiments, the determining module 420 may be used to process the associated features based on the model and determine a response strategy.
- the model may be the first classification model, the second classification model, or the third classification model. See Figures 5, 6, 8 and 10 and related descriptions for more details.
- the response module 430 may be used to determine and output response words for the target object based on the response strategy.
- the response words are output to the target object as response text and/or response speech.
- the obtaining module 440 may be used to obtain interactive instructions and models. For example, the obtaining module 440 may obtain the model through a training process. The obtaining module 440 may be used to obtain training samples.
- the training sample includes a first training sample, a second training sample, and a third training sample. See Figures 7, 9 and 11 and related descriptions for more details.
- Fig. 5 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- the process 500 may be executed by a processing device (for example, the processing device 112, the processor 210, or the central processing unit 340).
- Step 510 Extract the associated feature of the target object based on the interaction instruction for the target object, where the associated feature is related to at least one of the interaction instruction and historical data of the target object.
- step 510 may be performed by the extraction module 410.
- The target object refers to a person who can exchange information with a system or device (for example, a mobile device, a terminal device, etc.).
- the target object refers to the object that the system or device needs to respond to.
- Target objects may include device-associated objects (for example, users or communicators), debuggers, testers, implementation personnel, maintenance personnel, customer service personnel, and so on.
- the target object is a user to which the current terminal belongs.
- For example, the passenger user belongs to the passenger terminal, the driver user belongs to the driver terminal, and so on.
- the current terminal may refer to a device or system that performs human-computer interaction, for example, it may be a terminal that receives interactive instructions. It is understandable that when the target object is the user to which the current terminal belongs, the target object is actually the user who initiated the interactive instruction.
- the target object is a counterpart user in communication with the current terminal, for example, a driver user who communicates with the passenger terminal, or a passenger user who communicates with the driver terminal.
- Interactive instructions are instructions sent to the device.
- An interactive instruction for the target object may refer to an instruction sent to a system or device so that it can determine how to give a response to the target object.
- it may be an interaction instruction sent to the device by an associated object of the device (such as a user of the device).
- the associated objects can convey specific intentions to the device through interactive instructions (for example, praising the driver, praising oneself, complaining, etc.), so that the device or system can give a corresponding response.
- the interactive instruction can be obtained from the current terminal.
- the interactive instruction may be in the form of voice, text, video, image, facial motion, gesture, touch screen operation, etc., and any combination thereof.
- the voice can be one or any combination of Chinese, English, French, Japanese, etc.
- the associated feature may refer to the feature related to the target object, the sender of the interactive instruction, and/or the interactive instruction.
- the association feature may be related to at least one of the interaction instruction and the historical data of the target object.
- the associated feature includes the first feature corresponding to the interactive instruction. It can be understood that the first feature is a feature obtained based on an interactive instruction.
- the first feature may include at least one of the voice feature of the voice data in the interaction instruction, the text feature of the text data corresponding to the interaction instruction, and the image feature corresponding to the image data in the interaction instruction.
- The text data corresponding to the interactive instruction (from which the text feature is extracted) can be the text data contained in the interactive instruction itself, or the text data obtained by recognizing the voice data and/or other data (for example, image data) in the interactive instruction. For example, the voice data can be recognized by a speech decoder.
- the voice data can be multilingual voice data.
- The speech decoder can have a one-to-many or one-to-one relationship with languages; that is, one speech decoder may convert the voice data of multiple languages into text data, or one speech decoder may only decode the voice data of a certain language.
- the interactive instruction may include voice data
- the first feature may include a voice feature corresponding to the voice data and/or a text feature of text data corresponding to the interactive instruction.
- the voice may be multilingual
- the voice data may be data corresponding to multiple languages
- the voice feature may be the voice feature corresponding to the multilingual voice data.
- the interaction instruction may not include voice data
- the first feature may include the text feature of the text data corresponding to the interaction instruction
- the first feature may also include features corresponding to other data in the interactive instruction.
- the interactive instruction includes image data
- the first feature includes image features corresponding to the image data.
- the interaction instruction includes gestures, postures, and other actions
- the first feature may include screen gesture features, facial features, fingerprint features, and the like.
- Screen gesture features represent screen operation information in interactive instructions, such as operations such as sliding, turning pages, and touching.
- The facial features represent the facial information of the user in the interactive instructions.
- the processing device can obtain different interactive instructions according to different facial features.
- the facial features can also include pupil features, iris features, and the like.
- the fingerprint feature represents the fingerprint information of the user's finger.
- the processing device can obtain different interaction commands according to different fingerprint features.
- the voice feature includes one or a combination of audio features and energy features of the voice data.
- The audio feature of the voice data refers to the feature of the voice data in terms of its audio properties.
- the audio feature may include at least one of a fundamental frequency feature, a short-term energy feature, a short-term amplitude feature, a short-term zero-crossing rate feature, and the like.
- the fundamental frequency characteristic refers to the characteristic of the sound frequency in the speech data.
- the fundamental frequency corresponds to the frequency of vocal cord vibration and represents the pitch of the sound. The faster the vocal cord vibration, the higher the fundamental frequency.
- the fundamental frequency characteristics of speech data can be used to detect speech noise, special sound detection, gender discrimination, speaker recognition, parameter adaptation, and so on.
- the short-term energy feature refers to the average energy gathered by the sampling point signal in a short-term audio frame.
- For example, a continuous audio signal stream x is sampled into K sampling points, which can be divided into M short-term frames; the size of each short-term frame and of the window function is assumed to be N.
- The short-term energy of the m-th short-term frame x_m can then be calculated as E_m = \sum_{n=1}^{N} x_m^2(n).
- The short-time zero-crossing rate feature refers to the number of times the signal crosses the zero value within each frame. For a continuous speech signal with time on the horizontal axis, one can observe how often the time-domain waveform of the speech crosses the horizontal axis. For a discrete-time speech signal, a zero-crossing is said to occur when adjacent samples have different algebraic signs, so the number of zero-crossings, i.e., the zero-crossing rate, can be counted.
- the zero-crossing rate can reflect the frequency information of the signal to a certain extent.
- the short-time zero-crossing rate can be used to judge the unvoiced and voiced speech data. A high zero-crossing rate means unvoiced sound, and a low zero-crossing rate means voiced sound.
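- As an illustrative sketch (not part of the original disclosure), the short-term energy and short-time zero-crossing rate described above could be computed per frame as follows; the frame length, frame shift, and the random test signal are example assumptions:

```python
import numpy as np

def short_time_energy_and_zcr(signal, frame_len=400, hop=160):
    """Compute per-frame short-term energy E_m = sum(x_m(n)^2) and zero-crossing rate."""
    energies, zcrs = [], []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Short-term energy: sum of squared samples within the frame
        energies.append(np.sum(frame.astype(np.float64) ** 2))
        # Zero-crossing rate: fraction of adjacent sample pairs with different algebraic signs
        signs = np.sign(frame)
        zcrs.append(np.mean(signs[1:] != signs[:-1]))
    return np.array(energies), np.array(zcrs)

# Example: 1 second of 16 kHz audio (random noise used as a stand-in for voice data)
x = np.random.randn(16000)
energy, zcr = short_time_energy_and_zcr(x)
print(energy.shape, zcr.shape)
```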
- the energy feature of voice data refers to the energy distribution in the frequency domain of the voice data, and different energy distributions represent different voice features.
- the frequency domain is a coordinate system used to describe the frequency characteristics of the speech signal.
- the frequency domain diagram may display the energy value of the speech signal in each given frequency band within a frequency range.
- the energy feature includes at least one of Fbank feature and mel-frequency cepstrum (MFCC) feature.
- the MFCC feature refers to the feature of the voice signal obtained by the MFCC method. MFCC features have a good degree of discrimination and are used to identify different sounds. In some embodiments, MFCC features are commonly used for automatic speech and speaker recognition. Regarding the Fbank feature and its extraction, please refer to FIG. 17 and its related description for details, which will not be repeated here.
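- As an illustrative sketch, the Fbank (log mel filter-bank) and MFCC features could be extracted with an off-the-shelf audio library such as librosa; the library choice, file path, sampling rate, and filter counts are assumptions rather than part of the original disclosure:

```python
import librosa

# Load voice data (the path and the 16 kHz sampling rate are example assumptions)
y, sr = librosa.load("utterance.wav", sr=16000)

# Fbank feature: log mel filter-bank energies
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40)
fbank = librosa.power_to_db(mel)                     # shape: (40, num_frames)

# MFCC feature: cepstral coefficients derived from the mel spectrum
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)

print(fbank.shape, mfcc.shape)
```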
- In some embodiments, the voice features may further include Linear Prediction Coefficients (LPC), Perceptual Linear Predictive (PLP) coefficients, Tandem features, Bottleneck features, Linear Predictive Cepstral Coefficients (LPCC), formant features, Bark spectrum features, and so on.
- speech features can be extracted through algorithms or models.
- the algorithm corresponding to the voice feature type is used to extract the voice feature of the corresponding type.
- the MFCC feature is extracted through the triangular band-pass filter algorithm.
- In some embodiments, before extracting the voice features, the voice data may be pre-processed; the pre-processing includes at least one of framing processing, pre-emphasis processing, windowing processing, and noise processing.
- Framing processing is used to divide voice data into multiple voice segments, reducing the amount of data processed each time.
- In the framing process, the voice data can be divided according to a predetermined value or a predetermined range (for example, 10 ms to 30 ms per frame).
- A frame shift (offset) can be applied during framing, so that two adjacent frames overlap.
- If the voice data is a short sentence, framing may not be needed in some scenarios.
- Pre-emphasis processing is used to enhance the high-frequency part. Pre-emphasis can be achieved by passing the voice data through a high-pass filter.
- Windowing is used to eliminate signal discontinuity that may be caused at both ends of each frame. For example, multiply each frame by the Hamming window to increase the continuity between the left and right ends of the frame.
- The noise processing may involve adding random noise, which can mitigate processing errors and omissions for synthesized audio.
- Noise processing may also include noise reduction processing.
- the noise reduction processing can be achieved by noise reduction algorithms, which can include adaptive filters, spectral subtraction, Wiener filtering, and so on.
- If the voice data is collected in real time, noise processing may not be required in some scenarios.
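- A minimal sketch of the pre-processing steps described above (pre-emphasis, framing with overlap, and Hamming windowing); the pre-emphasis coefficient, frame length, and frame shift are example assumptions:

```python
import numpy as np

def preprocess(signal, sr=16000, frame_ms=25, hop_ms=10, alpha=0.97):
    """Pre-emphasis -> framing with overlap -> Hamming windowing."""
    # Pre-emphasis: boost the high-frequency part (a simple first-order high-pass filter)
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # Framing: e.g. 25 ms frames with a 10 ms shift so that adjacent frames overlap
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    num_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop:i * hop + frame_len] for i in range(num_frames)])

    # Windowing: multiply each frame by a Hamming window to smooth discontinuities at frame edges
    return frames * np.hamming(frame_len)

windowed_frames = preprocess(np.random.randn(16000))
print(windowed_frames.shape)   # (num_frames, frame_len)
```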
- the text features represent relevant information of the text data, including but not limited to: keyword features, semantic features, word frequency features, etc.
- the text features may be extracted through algorithms or models, for example, through LSTM, BERT, one-hot encoding, bag-of-words models, term frequency–inverse document frequency (TF-IDF) models, vocabulary models, etc.
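- For example, a TF-IDF text feature could be extracted with scikit-learn roughly as follows; the example corpus is hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical text data recognized from interactive instructions
corpus = [
    "please praise the driver",
    "praise me",
    "the driver service is great",
]

vectorizer = TfidfVectorizer()                       # bag-of-words weighted by TF-IDF
text_features = vectorizer.fit_transform(corpus)     # sparse matrix, one row per text

print(text_features.shape)
print(vectorizer.get_feature_names_out())
```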
- the historical data of the target object refers to the data generated by the target object in the past period of time, for example, the last week, the last month, or the last three days.
- the historical data may include, but is not limited to: one or more of online voice data, personal account data, user behavior data, or offline record data.
- Online voice data comes from the online voice of the target object, that is, the voice produced by the target object online, for example, online voice made in the past period of time.
- it may be a voice interaction instruction in which the target object requested the device to give a response in the past.
- it may be the voice data of the target object communicating with the communication user in the past.
- online voice data can be converted into text data as historical data. For example, after acquiring the historical online voice of the target object and performing voice recognition on it, the corresponding text is obtained, and the text is used as a kind of historical data.
- the personal account data comes from the account information of the target object to which the terminal belongs.
- the personal account data may include, but is not limited to: personality, occupation, the number of orders placed using the taxi application, reputation score, gender, account age, the number of historical orders, the pick-up and drop-off points in historical orders, the taxi time information in historical orders, etc.
- the target object behavior data may be data generated by historical operations or feedback of the target object. For example, it may be to obtain the evaluation feedback data of the target object's response to the historical push. For another example, it may be the evaluation information of the target object (for example, the driver or the passenger) to the correspondent (for example, the passenger or the driver). For another example, it may include the user's evaluation of historical orders, chat records with customer service, evaluation of customer service, evaluation of the system, evaluation of information pushed by the system, and other information.
- the offline recorded data may be data recorded by the terminal.
- it can be data recorded offline by the terminal.
- the historical data may also include other information, including but not limited to the user's historical input information, the user's geographic information, and the user's identity information.
- the user's historical input information includes the user's historical query information, video input information, voice input information, text input information, screen gesture operation information, unlocking information, and so on.
- the user's geographic information includes the user's home address, work address, activity range, etc.
- the user's identity information includes the user's age, work, hometown, height, weight, income, etc.
- the second feature may be a feature determined based on historical data, including a feature determined based on the historical data of the target object and a feature determined based on the historical data of the sender of the interactive instruction.
- the type of the second feature can be determined according to the type of historical data. For example, if the historical data contains voice data, the second feature contains the voice feature of the historical data.
- the extraction method of the second feature is similar to that of the first feature and will not be repeated here.
- time series behavior characteristics can also be extracted based on historical data
- the historical data of the target object or the sender of the interactive instruction can be converted into sequence characteristics in the order of time
- one or more specific behaviors in the historical data can be converted into sequence characteristics.
- Each specific behavior is represented by a vector in the sequence feature.
- Specific behaviors include service-related behaviors such as the number of orders placed and the degree of praise received. Therefore, when determining the response strategy based on the second feature, not only the actual behavior but also the time factor can be considered, so that the evaluation of the target object or the sender of the interactive instruction is more accurate, and, further, the determination of the response strategy is also more accurate.
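- A minimal sketch of turning historical behaviors into a time-ordered sequence feature with a recency weight, as described above; the record fields and the decay rate are hypothetical assumptions:

```python
import numpy as np

# Hypothetical behavior records: (days_ago, number_of_orders, praise_score)
history = [(30, 2, 0.6), (14, 5, 0.8), (3, 1, 0.9)]

def behavior_sequence(records, decay=0.05):
    """Convert behaviors into a time-ordered sequence of vectors, weighted by recency."""
    records = sorted(records, key=lambda r: -r[0])         # oldest first
    seq = []
    for days_ago, orders, praise in records:
        weight = np.exp(-decay * days_ago)                 # more recent -> larger weight
        seq.append(weight * np.array([orders, praise], dtype=np.float32))
    return np.stack(seq)                                    # shape: (num_behaviors, 2)

print(behavior_sequence(history))
```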
- the associated features can be represented by vectors.
- post-processing may be performed on the associated features, for example, normalization processing, etc. For more details about the normalization process, please refer to Figure 20 and its related descriptions.
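- As an illustrative sketch, a simple z-score normalization of the associated feature vector could look as follows (the feature dimensions are arbitrary example values):

```python
import numpy as np

def normalize(feature_vector, eps=1e-8):
    """Z-score normalization of an associated feature vector."""
    v = np.asarray(feature_vector, dtype=np.float32)
    return (v - v.mean()) / (v.std() + eps)

# Hypothetical associated feature: voice features concatenated with text/history features
associated = np.concatenate([np.random.randn(40), np.random.randn(16)])
print(normalize(associated)[:5])
```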
- Step 520 Determine a response strategy for the target object based on processing the associated features, where the response strategy is related to at least one of response content, response style, and response emotion. In some embodiments, step 520 may be performed by the determining module 420.
- the response strategy refers to a method and/or criterion for responding to interactive instructions, and the response strategy is related to at least one of response content, response style, and response emotion.
- the response content refers to the semantic content of the response to the interactive command.
- the same response content can be expressed in different language content.
- the response content is "the driver's service is very good”, which can be expressed as “the driver's service is very good”, “the driver's service is great”, “the driver's service is really good”, etc.
- the response style refers to the degree of response to interactive commands. For example, when complimenting the other party, there can be: strong praise (strong praise, high praise), normal praise, slight praise (slight praise, low praise), etc.
- Responding emotions refers to the emotions or moods associated with responding to interactive instructions. Response emotions can include loss, calmness, enthusiasm, joy, excitement, sadness, anger, irritability, pain, or passion.
- the associated features can be processed through the model to determine the response strategy for the target object.
- In some embodiments, the model may be composed of a multi-layer residual network and a multi-layer fully connected network, where the multi-layer residual network is built from convolutional neural networks; alternatively, the model may be composed of a multi-layer convolutional neural network and a multi-layer fully connected network.
- the models may be the first classification model, the second classification model, and the third classification model. For details, see the following text.
- the determination module 420 may process the associated features and determine the response content corresponding to the designated speech. For example, the determining module 420 may process the voice features based on the first classification model and determine whether the voice data contains a designated speech in any one of the multiple languages; if the voice data contains a designated speech, it determines the response content corresponding to that designated speech. For details, refer to Fig. 6 and its description, which will not be repeated here.
- the determining module 420 may determine the response emotion based on processing the associated features. For example, the determining module 420 may process at least one of the voice feature and the text feature based on the second classification model to determine the emotion of the voice data (the responded emotion); and based on the responded emotion, determine the response emotion. For details, refer to FIG. 13 and its description, which will not be repeated here.
- the determining module 420 may determine the response style based on processing the associated features. For example, the determining module may process at least one of the first feature and the second feature based on the third classification model to determine the response style. Refer to Figure 15 and its description for details, and will not be repeated here.
- the determination module 420 may determine a response strategy based on processing interaction instructions and historical data. For example, the determining module 420 may process at least one of the interaction instruction and historical data based on the response strategy model to determine the response strategy.
- the response strategy model may be a separate model, or any combination of the first classification model, the second classification model, and the third classification model.
- the determining module 420 can obtain information such as current weather and real-time road conditions. In some embodiments, the determining module 420 may adjust the response strategy according to information such as current weather and real-time road conditions. For example, when the weather is bad, the weight of "soothing" emotion in the response strategy (for example, responding to emotion) can be appropriately increased. For another example, when the weather is fine and the road conditions are smooth, the weight of the "happy" emotion in the response strategy (for example, responding to emotion) can be appropriately increased.
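- A minimal rule-based sketch of adjusting response-emotion weights according to weather and road conditions, as described above; the emotion labels, weight increments, and re-normalization step are assumptions:

```python
def adjust_emotion_weights(weights, weather, road_smooth):
    """Nudge response-emotion weights according to current conditions (illustrative rules)."""
    adjusted = dict(weights)
    if weather == "bad":
        adjusted["soothing"] = adjusted.get("soothing", 0.0) + 0.2   # soothe in bad weather
    elif weather == "fine" and road_smooth:
        adjusted["happy"] = adjusted.get("happy", 0.0) + 0.2         # happier when roads are smooth
    total = sum(adjusted.values())
    return {k: v / total for k, v in adjusted.items()}               # re-normalize to sum to 1

print(adjust_emotion_weights({"soothing": 0.3, "happy": 0.3, "calm": 0.4}, "bad", False))
```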
- the determining module 420 may compose a feature sequence from the target object's interactive instructions, the target object's historical data, the current weather, and the real-time road conditions, and input the combined feature sequence into an RNN-based embedding model to obtain an instruction representation vector. Further, features such as the instruction representation vector and information dimension data of the service platform are input into the response strategy prediction model to obtain the response strategy output by the response strategy prediction model.
- the embedding model and the response strategy prediction model can be obtained through joint training.
- the combined feature sequence is composed of feature combination values at several time points.
- the characteristic combination value at each time point is formed by the combination of the target object's interactive instruction, the target object's historical data, weather data, and real-time road condition data at that time point, and is multiplied by the time weight coefficient.
- the time weight coefficient can be different depending on the distance of the time point, and the weight coefficient of the time point closer to the current time can be larger.
- When forming the feature combination value, a transformation can be applied to the real-time road conditions according to the current weather, so that the feature value reflecting road congestion under extreme weather is reduced, thereby reducing the influence of such special data.
- the aforementioned transformation may be: reducing the weight of the real-time road condition to 0.01.
- the influence of different factors on the forecast results can be better reflected, especially the interrelationship between these factors.
- the impact of weather is related to the preceding and following time points, and the impact of real-time road conditions is likewise related.
- RNN-based processing can capture the relationship between preceding and following time points and make the response strategy more accurate.
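- A minimal PyTorch sketch of the RNN-based embedding step described above: per-time-point combined features are multiplied by time weight coefficients and fed to a GRU whose final hidden state serves as the instruction representation vector. The feature dimensions, weights, and the GRU choice are assumptions; the joint training with the response strategy prediction model is omitted:

```python
import torch
import torch.nn as nn

class InstructionEmbedder(nn.Module):
    """RNN-based embedding model: time-weighted feature sequence -> representation vector."""
    def __init__(self, feat_dim=32, hidden_dim=64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)

    def forward(self, feature_seq, time_weights):
        # feature_seq: (batch, T, feat_dim); time_weights: (batch, T), larger for recent points
        weighted = feature_seq * time_weights.unsqueeze(-1)
        _, h_n = self.rnn(weighted)            # final hidden state as the representation
        return h_n.squeeze(0)                  # (batch, hidden_dim)

# Hypothetical batch: 4 samples, 10 time points, 32-dim combined features per point
features = torch.randn(4, 10, 32)
weights = torch.linspace(0.1, 1.0, 10).expand(4, 10)   # weight grows toward the present
embedder = InstructionEmbedder()
instruction_vec = embedder(features, weights)
print(instruction_vec.shape)                   # torch.Size([4, 64])
```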
- Step 530: Based on the response strategy, determine the response words for the target object.
- step 530 may be performed by the response module 430.
- Response speech (response words) is the language information that the device or system outputs to the target object in response to the interactive instruction. It is understandable that response speech is a specific language output, and the language may contain information such as response emotion, response style, and/or response content. For example, a response phrase "The driver is great!" may be played full of passion (the response emotion), in an exaggerated manner (the response style), boasting that the driver's service is very good (the response content). It is understandable that one or more phrasings can express the same response strategy (for example, the same response content, response style, or response emotion); that is, there can be a one-to-one or one-to-many relationship between response strategies and response words.
- the response module 430 may determine response words based on the response strategy. In some embodiments, the response module 430 may determine the content that needs to be expressed in the response words according to the determined response content, and then determine the specific expression of the response content according to the response emotion and/or response style, including whether to add modal particles, degree words, or other words that embody emotions and styles, or the intonation of the output voice, etc.
- For example, the output response words are praising language, such as "you are good".
- If the response strategy specifies an exaggerated response style, degree words such as "very" are added to the response content, for example, "you are very good".
- If the response emotion is joyful, the output response words can carry some modal particles expressing joy, for example, "You are very good!" and so on.
- the database may store preset words for the same or different response content, response emotion, and/or response style, so that, according to the response strategy (response content, response emotion, and/or response style), the response words corresponding to the response strategy can be obtained from the database.
- the stored preset words can be customized and recorded in advance by the user (for example, the user on the driver side or the user on the passenger side, etc.), or can be preset by the developer in advance.
- the language or words corresponding to the response content, response emotion and/or response style can also be extracted from a public platform (for example, Wikipedia, etc.), and response words can be generated.
- the response words can also be generated through models or algorithms, for example, transformers, Bert models, and so on. For example, input the dictionary and designated words into the model, and output the response words.
- one or more preset words may be obtained from the storage device, or one or more preset words may be generated.
- the terminal or the response module 430 may automatically select a preset speech as the response speech according to a preset rule among a plurality of preset response speeches, and output the response speech.
- the terminal can randomly select a preset speech as the response speech.
- the terminal may use the user or user group's most frequently used preset speech as the response speech.
- the user group can be all users, all passengers, all drivers, and all users in the user's area (for example, a city, district, or a custom area, such as a circular area within 5 kilometers, etc.).
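- A minimal sketch of selecting a preset response speech from a database according to the response strategy, preferring the phrasing most frequently used by a user group; the keys, preset words, and usage log are hypothetical:

```python
from collections import Counter

# Hypothetical preset words keyed by (response content, response style, response emotion)
PRESET_WORDS = {
    ("praise_driver", "strong", "passion"): ["The driver is great!", "Best driver ever!"],
    ("praise_driver", "normal", "calm"): ["The driver's service is very good."],
}

# Hypothetical usage log for a user group (e.g., all passengers in one city)
USAGE_LOG = ["The driver is great!", "The driver is great!", "Best driver ever!"]

def select_response(content, style, emotion):
    candidates = PRESET_WORDS.get((content, style, emotion), [])
    if not candidates:
        return None
    counts = Counter(w for w in USAGE_LOG if w in candidates)
    # Prefer the most frequently used preset speech; fall back to the first preset
    return counts.most_common(1)[0][0] if counts else candidates[0]

print(select_response("praise_driver", "strong", "passion"))
```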
- the response module 430 may output response words to the target object as response text and/or response voice.
- the text corresponding to the response words is converted into speech and output to the target object.
- Whether to output the response voice or the response text may be determined according to the actual scene. For example, when outputting response words, if the current terminal is the driver's end and the driver's end is currently in a vehicle driving state, only the response voice may be output; in this case, outputting the response text is avoided so as not to distract the driver and cause driving safety problems. In addition, in this scenario, the response voice and the response text can also be output at the same time.
- When outputting response words, it can be detected whether the terminal is in an audio or video playback state; if so, the response text can be output; otherwise, one or more of the response text and the response voice can be output.
- The response words can be displayed in a preset display interface, and can also be displayed in the status bar or notification bar. For example, if the driver is in the vehicle driving state, the response words can be output on the current display interface; if the current terminal is in the audio or video playback state, the response words can be displayed in a small window in the status bar or notification bar.
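- A minimal rule-based sketch of choosing the output modality (response voice and/or response text) from the terminal state, following the rules described above; the state flags are hypothetical:

```python
def choose_output_modality(is_driver_terminal, is_driving, is_media_playing):
    """Decide how to present the response words (illustrative rules from the description)."""
    if is_driver_terminal and is_driving:
        return {"voice"}                  # avoid on-screen text that could distract the driver
    if is_media_playing:
        return {"text"}                   # do not interrupt audio/video playback
    return {"voice", "text"}              # otherwise either or both may be output

print(choose_output_modality(is_driver_terminal=True, is_driving=True, is_media_playing=False))
```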
- the semantics of the response voice and the response text may be the same or different, and may be specifically set according to the scene. For example, for the same response content "praising the driver for good service", the response voice can be "the driver is the most sunny" and the response text can also be "the driver is the most sunny", in which case the semantics of the two are the same; alternatively, the response text can be different, for example, "In the wind and rain, thank you for your hard work."
- the response module 430 may also output the response words to the target object in other ways.
- an image or video can be used as the response language to be output to the target object, for example, a video or picture expressing the actual content of the response language can be produced as the output.
- the target object can be the user or communication user of the current terminal.
- the interactive instruction or voice data may be directed to the user, or to the user of the communication partner.
- For example, the interactive instruction contains the designated words to praise the user himself or the counterpart user.
- For example, if the voice data sent by the driver's end includes a designated speech such as "praise me" or "praise the driver", the designated speech is directed at the driver himself; or, if the voice data from the driver's end contains the designated speech "praise the passenger", the designated speech is directed at the counterpart user of the current communication, that is, at the passenger's end.
- Similarly, if the voice data sent from the passenger's end contains the designated speech "praise the driver", the designated speech is directed at the counterpart user of the current communication, that is, at the driver's end.
- When the output of the response words is executed, the response words may be output to the own terminal (i.e., the current terminal) and/or to the terminal of the counterpart user.
- When the interactive instruction is directed at the user himself, the response words are directly output on the current terminal; or, when the interactive instruction is directed at the counterpart user of the current communication, the response words are output to the counterpart user's terminal, and the response words can also be output on the current terminal.
- Fig. 6 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- the process 600 may be executed by a processing device (for example, the processing device 112, the processor 210, or the central processing unit 340).
- Step 610 Extract the associated features of the target object based on the interactive instruction for the target object.
- step 610 may be performed by the extraction module 410.
- the interaction instruction of the target object includes voice data.
- the associated feature of the target object may include the voice feature of the voice data.
- the voice data can be any one or a combination of multiple languages, and the voice feature can be the voice feature of the multilingual voice data.
- the associated feature of the target object may further include: the text feature of the text data corresponding to the interactive instruction.
- Step 620 Process the voice features based on the first classification model, and determine whether the voice data contains a designated speech in any one of multiple languages. In some embodiments, step 620 may be performed by the determining module 420.
- The designated speech can be speech that contains (but is not limited to) specific words or sentences.
- The designated speech can be used to determine the semantic information of the response words.
- the designated words can be characteristic words or approximate semantic words that need to be included in the response words, or the target of the response words, etc.
- The designated speech can be determined according to actual needs. For example, for the scene of complimenting the service provider, the designated speech can include "praise", "compliment", "encourage", "reward", or "good", etc.; the designated speech can also include "passenger", "driver", "service provider", "service requester", "I", etc.
- The first classification model refers to a computational model implemented by a computing device; the first classification model is a model that determines whether the input contains the designated speech.
- the first classification model may process the voice features and output the classification recognition result to determine whether it contains the specified words.
- the first classification model can process the text features and output the classification recognition results to determine whether the specified words are contained.
- the first classification model can process the voice features and text features to determine whether it contains the specified words. See step 1730 for more details about the classification and recognition results.
- the first classification model can include two sub-classification models, which are respectively a classification model for processing speech features and a classification model for processing text features.
- the first classification model may also be a model for processing text features and voice features, and the voice features and text features can be input into the first classification model in the same form (for example, a vector or a normalized vector).
- the first classification model can be obtained through end-to-end training.
- the first classification model may have a one-to-one or one-to-many relationship with the language.
- all languages can correspond to the same first classification model, and for example, different languages correspond to different first classification models.
- the language of the speech data can be recognized (for example, by a speech decoder, etc.) before the speech features are input into the first classification model; further, the voice features are input into the first classification model corresponding to the recognized language to determine whether the voice data contains the specified words.
- multiple languages can correspond to the same first classification model, that is, the first classification model can process the voice features of multiple languages and determine whether the voice data contains the specified words of any one of the multiple languages.
- the first classification model or sub-classification model is a machine learning model.
- the first classification model or sub-classification model may be a classification model or a regression model.
- the types of the first classification model or sub-classification model include but are not limited to a neural network (NN), a convolutional neural network (CNN), a deep neural network (DNN), a recurrent neural network (RNN), or any combination thereof; for example, the first classification model or sub-classification model may be a model formed by a combination of a convolutional neural network and a deep neural network.
- the first classification model or sub-classification model may be composed of a multi-layer convolutional neural network CNN and a multi-layer fully connected network.
- the first classification model or sub-classification model can be composed of a multi-layer CNN residual network and a multi-layer fully connected network.
- the first classification model may be a 5-layer CNN residual network and a 3-layer fully connected network.
- the first classification model can construct a residual network on the CNN network to extract the hidden layer features of the voice data, then use the multi-layer fully connected network to map the hidden layer features output by the residual network, and obtain the multi-class recognition result through a softmax classification output.
- compared with a single fully connected network, the CNN network structure used in the first classification model can extract features so that, while the recognition accuracy is ensured, the scale of the network parameters is effectively controlled, avoiding the problem that the first classification model becomes too large to be effectively deployed on the terminal side.
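- As a non-limiting illustration, the following minimal PyTorch sketch shows such a structure: five convolutional layers with residual connections extract hidden-layer features from the fbank input, a three-layer fully connected stack maps them to classes, and a softmax produces the classification output. The channel counts, layer sizes, and input shape are assumptions for illustration only, not the exact network of this application.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One convolutional layer with a skip connection over the feature maps."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(x + self.bn(self.conv(x)))

class KeywordClassifier(nn.Module):
    """5 convolutional (residual) layers followed by 3 fully connected layers."""
    def __init__(self, n_classes=2, channels=32):
        super().__init__()
        self.stem = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.blocks = nn.Sequential(*[ResidualBlock(channels) for _ in range(4)])
        self.pool = nn.AdaptiveAvgPool2d(1)          # collapse time/frequency axes
        self.fc = nn.Sequential(
            nn.Linear(channels, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, n_classes),                # softmax applied at inference time
        )

    def forward(self, x):                            # x: (batch, 1, frames, 40)
        h = self.pool(self.blocks(self.stem(x))).flatten(1)
        return self.fc(h)

# logits -> class probabilities ("contains the specified words" vs. not)
probs = torch.softmax(KeywordClassifier()(torch.randn(2, 1, 100, 40)), dim=-1)
```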
- the sub-classification model used to process text features may also be a text processing model, for example, a Bert model.
- the first classification model may be obtained by offline training in advance and deployed on the terminal device.
- the first classification model can be obtained by offline training in advance, and is deployed and stored in a storage device or deployed on the cloud.
- the terminal device has access rights to the storage device or the cloud.
- the first classification model can also be obtained by online training based on current data in real time. For details about the training of the first classification model, refer to the related description in FIG. 7, which will not be repeated here.
- the determining module 420 may also determine whether the voice data contains the specified words of any one of multiple languages in other ways. For example, the text features may be processed through rules to determine whether the specified words are contained. For another example, the result of processing the voice features based on the first classification model and the result of processing the text features based on the rules may be fused (e.g., by weighted summation, weighted averaging, etc.) to finally determine whether the interactive instruction contains the specified words.
- Step 630 in response to the voice data containing the designated speech, determine the response content corresponding to the designated speech. In some embodiments, step 630 may be performed by the response module 430.
- when the recognition result of the first classification model indicates that the voice data contains the specified words, the response content corresponding to the specified words is determined, and the response words are further determined and output based on the response content.
- the response content can be semantic information.
- the response content can be a specified phrase or content similar to the meaning of the specified phrase.
- the same response content can be expressed in multiple languages, that is, corresponding to multiple response words.
- the designated speech technique may correspond to one or more response words, that is, the specified speech technique and its corresponding response words may have a one-to-one or one-to-many relationship.
- the corresponding response words can be the same or different.
- the specified words and their corresponding response words may be stored in a database or a storage device, and the response words corresponding to the specified words may be obtained from the storage device.
- the response emotion or response style can also be determined, so as to output response words in combination with the response content, response emotion or response style.
- for the response emotion and the response style, see FIG. 8, FIG. 10 and related descriptions.
- for more information about obtaining response words, please refer to step 530 and its related descriptions.
- Fig. 7 is an exemplary flowchart of the first classification model training according to some embodiments of the present application.
- the process 700 may be executed by a processing device (for example, the processing device 112, the processor 210, or the central processing unit 340).
- Step 710 Obtain multiple first training samples.
- step 710 may be performed by the acquisition module 440.
- the first classification model may be trained based on a plurality of first training samples.
- Each first training sample may include voice data, that is, first sample voice data.
- the first sample voice data may be multilingual voice data.
- the first sample voice data may be English voice data, Japanese voice data, Chinese voice data, Korean voice data, etc., which are not exhaustive.
- the plurality of first training samples include positive samples and negative samples.
- the positive sample is the first sample of speech data related to the specified speech.
- the positive sample is the first sample speech data carrying the specified speech, or the first sample speech data carrying the semantic similar to the specified speech.
- the negative sample is the first sample of speech data that is not related to the specified speech, for example, does not contain the specified speech or does not contain the same meaning as the specified speech.
- positive and negative samples can be labeled; for example, positive samples are labeled 1 and negative samples are labeled 0.
- the output result of the first classification model obtained by training can be a probability between 0 and 1, or a classification result indicating whether the specified words are contained.
- for the classification recognition results, please refer to FIG. 17 and related descriptions.
- the first training sample can be obtained from a storage device or a database. It is also possible to obtain historical data from the service platform, client, etc. as the first training sample.
- a sample speech recognition result of the first sample speech data may be obtained first, and the sample speech recognition result is used to indicate whether the first sample speech data is related to a specified speech. Further, the first sample voice data can be identified or positive and negative samples can be classified according to the result of sample speech recognition.
- the sample speech recognition result may be a sample text recognition result, or a label obtained by manual annotation, or a combination of the two.
- the first sample voice data can be converted into the first sample text data based on a voice decoder (also called a voice converter); further, the first sample text data can be recognized or analyzed to determine whether the first sample text data is related to the specified words, so as to determine the label of the first sample voice data as a positive or negative sample. For example, whether it is related to the specified words is determined by means of keyword matching or text similarity.
- the text similarity between the first sample text data and the text of the specified words is calculated by text matching (for example, Euclidean distance, etc.); if the text similarity reaches (is greater than, or greater than or equal to) a preset similarity threshold, the first sample text data is a positive sample; otherwise, the first sample text data is a negative sample.
- the similarity threshold is not particularly limited, and may be 80%, for example.
- the character standard of the first sample text data can also be calculated, and the character standard can be used as an evaluation criterion for calculating text similarity.
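- A minimal sketch of this labeling step is shown below; it assumes a simple character-level similarity (difflib) stands in for the text-matching metric and uses an illustrative 0.8 threshold, rather than the exact matching method of this application.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.8   # assumption: the "80%" similarity threshold mentioned above

def label_sample(sample_text: str, specified_phrases: list[str]) -> int:
    """Return 1 (positive sample) if the decoded text matches any specified phrase,
    otherwise 0 (negative sample). Keyword containment is checked first, then a
    character-level similarity stands in for the text-similarity step."""
    for phrase in specified_phrases:
        if phrase in sample_text:                        # keyword matching
            return 1
        similarity = SequenceMatcher(None, sample_text, phrase).ratio()
        if similarity >= SIMILARITY_THRESHOLD:           # text similarity matching
            return 1
    return 0

# e.g. label_sample("please praise the driver", ["praise me", "praise the driver"]) -> 1
```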
- the first sample voice data may be multilingual voice data.
- the first sample voice data may be converted into the first sample text data based on the voice converter corresponding to the language of the first sample voice data, and it is further determined whether the specified words are contained, that is, whether the first sample voice data is a positive sample or a negative sample.
- when the decoding accuracy of the speech decoder cannot meet the preset recognition requirements, or the accuracy of determining positive and negative samples through text similarity or keyword matching is low, manual labeling can also be combined to complete the classification. For example, the text similarity results, the unsuccessfully recognized data, or the first sample speech data with low recognition accuracy can be output on the screen so that users can verify or correct (label) the automatic classification results; the result of manual labeling is then used as the sample label.
- the ratio of positive samples and negative samples can be controlled.
- the ratio of positive samples to negative samples can be controlled to 7:3.
- the positive samples and the negative samples may also be screened so that the ratio of the positive samples to the negative samples is within a preset ratio range.
- the first classification model can handle text features.
- the first training sample may also include first sample text data.
- the positive samples are related to the specified speech, and the negative samples are not related to the specified speech.
- Step 720 Train an initial first classification model based on the multiple first training samples to obtain the first classification model.
- step 720 may be performed by the acquisition module 440.
- the first classification model may be trained through various methods based on the first training sample, and the parameters of the initial first classification model are updated to obtain the trained first classification model.
- Training methods include but are not limited to: the gradient descent method, the least squares method, variable learning rates, cross-entropy loss, stochastic gradient descent, and cross-validation. It is understandable that after the training with the positive samples and the negative samples is completed, the obtained first classification model has the same network structure as the initial first classification model.
- the voice features of the positive and negative samples can be extracted, and then the voice features of the positive and negative samples are used for model training. It should be understood that the manner of performing speech feature extraction for the positive and negative samples is the same as that of the foregoing step 610, and will not be repeated here.
- the positive sample data and the negative sample data can be mixed for training according to a certain ratio, for example, a ratio of 7:3. In some embodiments, this can be achieved by means of whole-sentence training.
- for example, the first sample voice data is the voice data of a complete sentence, and so on.
- when a preset condition is met, the training ends.
- the preset condition may be that the result of the loss function converges or is less than a preset threshold, or that the number of training epochs reaches a threshold.
- the parameters in the initial first classification model can be initially assigned, and then the positive and negative samples are used for training; the parameters of the initial first classification model are adjusted according to the accuracy of the output results for the positive and negative samples, and the training process is repeated multiple times until parameters with a higher classification accuracy are finally obtained, which are used as the parameters of the first classification model.
- a test set can be constructed and used to test the classification result of the first classification model.
- the real performance of the first classification model can be evaluated by calculating the accuracy rate and the false recognition rate of the prediction result, and based on the real performance, the parameters of the first classification model can be adjusted, or further training processing can be determined.
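- The following is a minimal sketch of such an evaluation on a test set, assuming `model` returns 1 when it predicts that the specified words are contained; the metric definitions (accuracy, and false recognition rate as false alarms over negative samples) are illustrative assumptions.

```python
def evaluate(model, test_samples):
    """Compute accuracy and false-recognition (false-alarm) rate on a test set.
    `test_samples` is an iterable of (features, label) pairs with label 1/0."""
    correct = false_alarms = negatives = 0
    total = 0
    for features, label in test_samples:
        pred = model(features)
        total += 1
        correct += int(pred == label)
        if label == 0:
            negatives += 1
            false_alarms += int(pred == 1)
    accuracy = correct / max(total, 1)
    false_recognition_rate = false_alarms / max(negatives, 1)
    return accuracy, false_recognition_rate
```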
- the first classification model can have a one-to-one or many-to-many relationship with a language.
- for example, training is performed with the first training samples of the single language corresponding to the first classification model; as another example, training is performed with the first training samples of the multiple languages corresponding to the first classification model.
- the acquisition module 440 can verify, test, and update the first classification model. For details, refer to steps 920, 1120 and related descriptions.
- Fig. 8 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- the process 800 may be executed by a processing device (for example, the processing device 112, the processor 210, or the central processing unit 340).
- Step 810 Extract the associated features of the target object based on the interactive instruction for the target object.
- step 810 may be performed by the extraction module 410.
- the interaction instruction may include voice data
- the associated feature may include at least one of a voice feature of the voice data and a text feature of the text data corresponding to the interaction instruction.
- Step 820 Process at least one of the voice feature and the text feature based on the second classification model, and determine the respondent emotion of the voice data. In some embodiments, step 820 may be performed by the determining module 420.
- the responded emotion refers to the emotion carried by the interactive instruction.
- the types of emotions that are responded to may include: loss, calm, enthusiasm, passion, joy, sadness, pain, comfort, excitement, etc.
- the responding emotion can be obtained based on the processing of the interactive instruction.
- the second classification model refers to a computing model implemented by a computing device, and the second classification model is a model for determining the response emotion of an interactive command.
- the second classification model can be a binary classification model or a multi-classification model.
- For the type of the second classification model refer to the first classification model, that is, step 620 and related descriptions.
- the first classification model and the second classification model type can be the same or different.
- the second classification model can process speech features to determine the responding emotion.
- the second classification model can process the text features and determine the response sentiment.
- the second classification model can process voice features and text features to determine the responded emotion. It is understandable that, similar to the first classification model, if the second classification model processes both speech features and text features, the second classification model can include two sub-classification models, which are respectively a classification model for processing speech features and a classification model for processing text features.
- the second classification model can be obtained through end-to-end training. For determining the response emotion based on the output result of the second classification model, refer to the related description in FIG. 20 or 21.
- the second classification model can be obtained through training. Regarding the training of the second classification model, refer to the related description in FIG. 9, and details are not repeated here. Regarding the deployment of the second classification model, it may be similar to the first classification model, see step 720 and related descriptions.
- a first emotion may be obtained by processing the voice feature and a second emotion may be obtained by processing the text feature, and the responded emotion may be determined based on the first emotion and the second emotion.
- the first emotion refers to the emotion determined based on the voice feature.
- the first emotion may be obtained based on the processing of the voice feature by the second classification model. It is understandable that when the second classification model determines the first emotion based on the speech features, in addition to the semantic information of the language in the speech data, the intonation information or tone information in the speech data can be combined to make the determined emotion more accurate.
- the second emotion refers to the emotion determined based on the text feature.
- the type of the second emotion may be the same as or different from the type of the first emotion.
- the text features include keyword features.
- the second emotion may be determined based on the keyword features.
- the types of the first emotion and the second emotion may include, but are not limited to, loss, peace, enthusiasm, passion, joy, sadness, pain, comfort, excitement, etc.
- the keyword features include emotion-related words
- the determining module 420 may determine the second emotion based on the emotion-related words.
- Emotion related words refer to words that can indicate emotions in interactive instructions.
- emotion-related words may include, but are not limited to: one or more of modal particles and degree words.
- modal particles can include: “please”, “ba”, “ah”, “Ma”, etc.
- degree words can include, but are not limited to: "very", "extremely", "relentless", etc., which are not exhaustive.
- emotion-related words in the text data can be identified, and the second emotion can be determined according to the emotion-related words. For example, based on preset rules and based on the recognition result of the emotion-related words, the emotion corresponding to the recognition result may be used as the second emotion.
- Different emotion-related words may have a mapping relationship with emotions, and the mapping relationship may be manually preset and stored in a storage device. For example, the emotion corresponding to "ah" may be preset as "joy", the emotion corresponding to "ma" as "sad", and so on.
- taking the exaggeration (praise) scene as an example, the voice data sent by the user is converted into text data; if the content is "Can you praise me?", the second emotion is sad; if the content is "Praise me", the second emotion is joy.
- the emotion corresponding to the co-occurrence of a degree word and "ah" may be "excited", and the emotion corresponding to the co-occurrence of a degree word and "ma" may also be "sad".
- an emotion score may be preset for each emotion-related word, that is, the scores of the emotion-related word corresponding to different emotions. All emotion-related words in the text data can be identified, their emotion scores can be summed or weighted, and the second emotion can be determined based on the calculated value.
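- As an illustration, the following sketch accumulates preset emotion scores of the emotion-related words found in the text and takes the highest-scoring emotion as the second emotion; the word-to-score mapping shown is an assumed example, not a mapping given in this application.

```python
# Assumed, illustrative mapping of emotion-related words to per-emotion scores.
EMOTION_WORD_SCORES = {
    "please": {"sad": 1.0},
    "ah":     {"joy": 1.0},
    "ma":     {"sad": 1.0},
    "very":   {"excited": 0.5},
}

def second_emotion(text_tokens, default="calm"):
    """Sum the scores of all emotion-related words found in the text and return
    the emotion with the highest accumulated score."""
    totals = {}
    for token in text_tokens:
        for emotion, score in EMOTION_WORD_SCORES.get(token, {}).items():
            totals[emotion] = totals.get(emotion, 0.0) + score
    return max(totals, key=totals.get) if totals else default

# second_emotion(["can", "you", "praise", "me", "please"]) -> "sad"
```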
- the second emotion may also be determined in other ways.
- text features can be input into a text processing model (for example, Bert, etc.) to determine the second emotion.
- the first emotion and the second emotion can be expressed numerically, and the two values can then be combined or weighted (for example, by weighted averaging) to obtain a weighted value; the responded emotion is then determined based on the weighted value.
- different emotions can correspond to different numerical values or numerical ranges, for example, excitement is 2, joy is 1, pain is -1, and so on.
- the value can be set in advance according to requirements or rules.
- the second classification model can output the probability value corresponding to the first emotion, and the text processing model can output the probability value corresponding to the second emotion; the probability values can be weighted (for example, multiplied by or added to the weights), and the emotion with the highest weighted probability value is regarded as the responded emotion.
- the first emotion or the second emotion can also be directly regarded as the responded emotion, or the responded emotion can be determined manually.
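- A minimal sketch of the probability-weighting approach described above is shown below, assuming both the voice branch and the text branch output per-emotion probabilities and that the weights are illustrative values.

```python
def fuse_emotions(first_probs, second_probs, w_first=0.6, w_second=0.4):
    """Weight the per-emotion probabilities from the voice branch (first emotion)
    and the text branch (second emotion), then return the emotion with the
    highest weighted score as the responded emotion."""
    emotions = set(first_probs) | set(second_probs)
    scores = {e: w_first * first_probs.get(e, 0.0) + w_second * second_probs.get(e, 0.0)
              for e in emotions}
    return max(scores, key=scores.get)

# fuse_emotions({"joy": 0.7, "sad": 0.3}, {"joy": 0.4, "sad": 0.6}) -> "joy"
```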
- Step 830 Determine the response emotion based on the responded emotion.
- step 830 may be performed by the determining module 420.
- the response emotion and the responded emotion may be similar, the same, or opposite.
- for example, if the responded emotion is joy, the response emotion can be joy or excitement.
- for another example, if the responded emotion is loss, the response emotion can be calm or joy.
- the response emotion may only include positive emotions to appease the user's negative emotions.
- the corresponding relationship between the responded emotion and the response emotion may be preset.
- the responded emotion and the response emotion can have a one-to-one or many-to-many relationship.
- the preset method can be determined based on rules, and can also be determined or optimized based on historical feedback data. For example, according to the user's feedback information on the response words.
- the corresponding relationship between the response emotion and the response emotion may be stored in the storage device in advance. For example, it is stored in the memory of the terminal, or in other storage locations readable by the terminal, such as the cloud, which is not limited. Based on the response emotion, the response emotion can be obtained from the storage device.
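- A minimal sketch of such a preset correspondence table is shown below; the entries are assumed examples, and in practice the table would be configured in advance and stored on the terminal, in another readable storage location, or in the cloud.

```python
# Assumed, illustrative correspondence between the responded emotion and the
# response emotion (one-to-many entries are allowed).
RESPONSE_EMOTION_MAP = {
    "joy":     ["joy", "excitement"],
    "loss":    ["calm", "joy"],
    "sadness": ["comfort"],
}

def pick_response_emotion(responded_emotion: str, default: str = "calm") -> str:
    """Look up the preset table and return one candidate response emotion."""
    return RESPONSE_EMOTION_MAP.get(responded_emotion, [default])[0]
```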
- the response module 430 may determine and output response words based on the response emotion. Refer to step 530 and related descriptions for determining response words based on response emotions.
- Fig. 9 is an exemplary flowchart of training a second classification model according to some embodiments of the present application.
- the process 900 may be executed by a processing device (for example, the processing device 112, the processor 210, or the central processing unit 340).
- Step 910 Obtain multiple second training samples.
- step 910 may be performed by the acquisition module 440.
- the second training sample may include: one or more of the second sample speech data and the second sample text data, and the corresponding emotion label.
- the second training sample may include the second sample speech data and the corresponding emotion label.
- the second training sample may include the second sample speech data, the second sample text data, and the corresponding emotion label.
- the emotion label is the emotion corresponding to the second sample speech data or the second sample text data.
- the emotion label may be a one-hot label.
- the second sample text data is text data corresponding to the second sample voice data.
- the second sample text data is obtained by performing text recognition on the second sample voice data.
- the second sample voice data is obtained based on machine conversion or manual reading of the second sample text data.
- the second sample voice data may come from real online voice data; and/or, it may also come from customized data.
- the second sample voice data is generated by manual reading.
- the second sample text content and the corresponding tone standard can be formulated, and the text is then read aloud manually with emotion to obtain second sample voice data with different tones and emotions.
- the length of the text corresponding to the second sample voice data or the second sample text data is generally not too long (for example, the number of characters or words in a single input is less than a threshold value, which may be 20, 10, etc.), because overly long text makes the speech too long, which may in turn cause greater fluctuations in tone and make the environmental noise more random and complicated.
- the second training sample can be obtained from a storage device or a database. It is also possible to obtain historical data from the service platform, the client, etc. as the second training sample.
- Step 920 Train the initial second classification model based on the multiple second training samples to obtain the second classification model.
- step 920 may be performed by the acquisition module 440.
- the second classification model may be trained through various methods based on the second training sample, and the parameters of the initial second classification model are updated to obtain the trained second classification model.
- the training method may be similar to that of the first classification model, see step 720 and related descriptions. It is understandable that after the training with the second training samples is completed, the obtained second classification model has the same network structure as the initial second classification model.
- feature extraction can be performed on the second sample voice data or/and the second sample text data, and the second classification model can be trained based on the extracted voice features and/or text features.
- if the second classification model is used to process voice features, the second training sample used for training includes the second sample speech data and its emotion label. If the second classification model is used to process text features, the second training sample used for training includes the second sample text data and the emotion label. If the second classification model is used to process both voice features and text features, the second training sample used for training includes the second sample voice data, the second sample text data and the emotion label, and the training is performed in an end-to-end manner.
- a whole-sentence, variable-length training method can be adopted: the features extracted from one sentence are used as the input of the classifier to obtain the emotion output by the second classification model, the parameters of the second classification model are then adjusted based on the difference between the output emotion and the emotion label, and finally a second classification model with higher classification accuracy is obtained.
- model verification can also be performed on the current training model.
- Model verification is divided into two processes: test environment construction and model testing. The test environment is built to check whether the current model can be successfully built and run normally on different terminals, such as different brands of mobile phones. Therefore, the test needs to be conducted offline according to the real scene.
- the test process can include but is not limited to the following two test methods.
- in the first test method, real people test the model in real time multiple times, and the accuracy of the recognition results is then counted.
- the advantage of this test method is that it can better simulate user behavior in real scenarios, and the reliability of the test is higher.
- the second test method is that a real person records a test set in a real scene. One or more test sets can be recorded as needed, which can be reused, with lower cost, and the objective validity of the test can be guaranteed to a certain extent.
- Fig. 10 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- the process 1000 may be executed by a processing device (for example, the processing device 112, the processor 210, or the central processing unit 340).
- Step 1010 Extract the associated features of the target object based on the interactive instruction for the target object.
- step 1010 may be performed by the extraction module 410.
- the associated feature includes at least one of the first feature corresponding to the interactive instruction and the second feature corresponding to the historical data of the target object. Regarding the associated feature and its extraction, refer to step 510, which will not be repeated here.
- Step 1020 Process at least one of the first feature and the second feature based on the third classification model, and determine a response style for the target object. In some embodiments, step 1020 may be performed by the determining module 420.
- the third classification model refers to a calculation model implemented by a computing device, and the third classification model is a model that determines the response style of the target object.
- the third classification model can be a multi-class or two-class model.
- the first classification model can be the same as or different from the third classification model.
- the third classification model can be obtained through training.
- For the training of the third classification model refer to the related description in FIG. 11 for details.
- For the deployment of the third classification model refer to the deployment of the first classification model, and refer to step 620 and related descriptions.
- the processing device may process at least one of the first feature and the second feature based on the third classification model to determine the response style.
- the first feature and the second feature are input to the third classification model, and the response style is output.
- the first feature and the second feature can be weighted respectively (for example, the weight of the first feature is greater than the weight of the second feature).
- the first feature or the second feature is input to the third classification model, and the response style is output.
- the first feature and the second feature refer to step 510 for details.
- the processing device may process the text feature of the text data corresponding to the interactive instruction, and process at least one of the first feature and the second feature based on the third classification model to determine the response style.
- the first style is a style determined based on the first feature and/or the second feature. For example, at least one of the first feature and the second feature is processed based on the third classification model to determine the first style.
- the second style is a style obtained based on text feature processing.
- the second style may be the same as or different from the first style; styles include but are not limited to: exaggerated, normally exaggerated, slightly exaggerated, and the like.
- the text features can be processed based on models, algorithms, or rules to obtain the second style. For example, whether the text contains style-related keywords (such as "very", "extremely", etc.) is identified, and the second style is further determined based on the keywords. For another example, the text features are processed through a text processing model (for example, Bert, DNN, etc.), and the second style is output.
- the response style may be determined based on at least one of the first style and the second style. For example, any one of the first style or the second style may be determined as the response style. For another example, if the first style is different from the second style, the first style can be used as the response style.
- fusion processing may be performed on the first style and the second style to determine the response style.
- different styles can correspond to different scores (for example, exaggerated corresponds to 3, normally exaggerated corresponds to 2, etc.); different weights can be set for the first style and the second style, the weights and scores of the two styles can be merged, and the response style can be determined based on the fusion result.
- the response style can also be determined by other fusion methods, which is not limited in this embodiment.
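- The following sketch illustrates one possible score/weight fusion of the first style and the second style; the style scores and weights are assumptions for illustration, not values fixed by this application.

```python
# Assumed, illustrative score/weight fusion of the two style estimates.
STYLE_SCORES = {"exaggerated": 3, "normally exaggerated": 2, "slightly exaggerated": 1}

def fuse_styles(first_style: str, second_style: str,
                w_first: float = 0.6, w_second: float = 0.4) -> str:
    """Weight the two style scores, then map the fused score back to the
    nearest named style."""
    fused = w_first * STYLE_SCORES[first_style] + w_second * STYLE_SCORES[second_style]
    return min(STYLE_SCORES, key=lambda s: abs(STYLE_SCORES[s] - fused))

# fuse_styles("exaggerated", "slightly exaggerated") -> "normally exaggerated"
```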
- Fig. 11 is an exemplary flowchart of training a third classification model according to some embodiments of the present application.
- the process 1100 may be executed by a processing device (for example, the processing device 112, the processor 210, or the central processing unit 340).
- Step 1110 Obtain multiple third training samples.
- step 1110 may be performed by the obtaining module 440.
- the third training sample can be derived from real data (for example, historical data), or can be derived from formulated data.
- the developer can formulate sample data and input it into the terminal so that the terminal can train the third classification model.
- the third training sample includes sample interaction instructions for the sample target object, sample history data of the sample target object, and corresponding style labels.
- the method and content of the sample historical data and the aforementioned historical data can be the same, and will not be repeated here.
- the style label represents the sample response style for the sample target object.
- the sample style label may be a one-hot label.
- the style label may be determined based on the reputation data and feedback data of the sample target object. Specifically, determining the style label includes: obtaining feedback data of the sample target object on the sample response words, where the sample response words are determined based on the sample interaction instruction; obtaining the reputation data of the sample target object; and determining the style label based on the reputation data and/or the feedback data.
- Feedback data refers to data related to the sample target object's reaction to the sample response words.
- the reaction includes, but is not limited to: positive (or favorable), negative (or unfavorable), a positive evaluation, a negative evaluation, etc.
- the reaction can be used to directly characterize the feedback data. For example, if the sample target object gives no feedback, the feedback data can be set to positive by default; if the feedback information is a positive reaction or a positive evaluation, the feedback data is positive, and so on.
- Reputation data refers to data related to the credit status of the sample target object.
- Reputation data can be determined based on historical data.
- the reputation data can be specifically expressed as a reputation score, and the calculation method of the reputation score is not repeated here. Exemplarily, when the reputation score reaches (is greater than, or greater than or equal to) 80 points, the sample target object is a user with a high reputation, the reliability of the feedback information is high, and the reputation score can also serve as a reference that assists the developer in manually annotating the sample data.
- the reputation data and/or feedback data may be associated with the response style, and based on the corresponding relationship, the style label may be determined.
- the reputation data and feedback data can also be processed through the model to determine the style label, where the model can be a DNN model, a CNN model, an RNN model, etc.
- the style label may also be determined based on manual evaluation of reputation data and feedback data.
- the terminal can obtain feedback data of the sample target object for historical response speech, and then output the reputation data and feedback data of the sample target object, and receive manual evaluation data for the reputation data and feedback data.
- the manual evaluation data is used to indicate style label.
- when the style label is obtained based on the developer's manual labeling, this kind of manual labeling is implemented based on the feedback data and reputation data output by the terminal, which helps the developer complete the labeling quickly and reduces labor time and cost as much as possible.
- the third training sample can be obtained from a storage device or a database. It is also possible to obtain historical data from the service platform, client, etc. as the third training sample.
- Step 1120 Train an initial third classification model based on the multiple third training samples to obtain a third classification model.
- step 1120 may be performed by the acquisition module 440.
- the third classification model may be trained through various methods based on the third training sample to update the parameters of the initial third classification model to obtain a trained third classification model.
- Training methods include but are not limited to: calculating the loss and updating the parameters based on the gradient descent method, the least squares method, cross-entropy loss, cross-validation, variable learning rates, etc. It is understandable that after the training with the third training samples is completed, the obtained third classification model has the same network structure as the initial third classification model.
- the features of the third training sample (for example, voice features, text features) can be extracted, and then the features of the third training sample can be used for model training.
- when a preset condition is met, the training ends.
- the preset condition may be that the result of the loss function converges or is less than a preset threshold, or that the number of training epochs reaches a threshold.
- an end-to-end training method is used, and the feature extracted based on the third training sample is used as input, and the style recognition result is output. Then, the difference between the output style recognition result and the style label is used to adjust the parameters of the initial third classification model, and finally a third classification model with higher classification accuracy is obtained.
- real-time data can also be used to update the third classification model.
- the terminal may also obtain operation information for the response words, so that the third classification model is updated by using the response words and the operation information.
- the operation information may be the operation information that the target object evaluates or feeds back on the response speech, and the operation information may also be used as a third training sample to update the third classification model in real time.
- the third classification model can also be tested and verified, which is similar to the second classification model and will not be repeated.
- Fig. 12 is a schematic diagram of human-computer interaction according to some embodiments of the present application.
- in this scenario, the interactive instruction for the target object can be issued by the driver-side user, and the target object can be the driver.
- the driver-side user can click the function control 1201 in the driver-side display interface of the taxi APP to enter the exaggeration interface, then the terminal can display the interface as shown in FIG. 12B.
- Fig. 12B is a display interface with exaggerated function. On the display interface, the user at the driver end can make a voice.
- the terminal collects real-time voice data, that is, receives an interactive instruction. Later, after the terminal collects the voice data, it can determine whether the collected voice data contains the specified speech. Then, if it is recognized that the real-time voice data from the driver's end user includes one of "praise the driver” or "praise me", the display interface as shown in FIG. 12C can be displayed on the terminal. As shown in Figure 12C, the response word 1203 for "praising me” is displayed on the current display interface, specifically: "in the wind and rain, thank you for your hard work to pick me up".
- the driver-side user can also click the boast control 1202 to trigger the boast function, and then display the interface as shown in FIG. 12C, which will not be described in detail.
- the function control 1201 can also prompt the driver of a newly received boast.
- Fig. 13 is a schematic diagram of human-computer interaction according to some embodiments of the present application.
- a scenario where a user on the driver side brags is taken as an example.
- the driver-side user can click the function control 1301 in the driver-side display interface of the taxi APP to enter the exaggeration interface, and then the terminal can display the interface as shown in FIG. 13B.
- Figure 13B is the display interface of the exaggeration function.
- the driver-side user can make a voice, and accordingly, the terminal collects the real-time voice data "praise me" (or "praise the driver"), that is, an interactive instruction is received.
- after the terminal collects the voice data, it can determine whether the collected voice data contains the specified words. Then, if it is recognized that the real-time voice data from the driver-side user includes one of "praise the driver" or "praise me", the display interface as shown in FIG. 13C can be displayed on the terminal. As shown in FIG. 13C, the response word 1303 for "praise me" is displayed on the current display interface, specifically: "The driver is the sunniest, most enthusiastic, kind, and knows the cold and knows the hot!". Similar to FIG. 12, the driver-side user can also click the boast control 1302 to trigger the boast function, and then display the interface as shown in FIG. 13C.
- after the terminal collects the voice data, it can also be determined that the response style preferred by the driver-side user (target object) is normal exaggeration, and based on this, the response words with this response style can be determined.
- the terminal displays response words 1203 for "praising the driver” or "praising me” on the current display interface, specifically: "in the wind and rain, thank you for your hard work to pick me up”.
- the terminal determines that the response style preferred by the driver-side user (target object) is exaggerated, and accordingly determines the response words with this response style.
- the terminal displays 1303 response words for "praise me” on the current display interface, specifically: "The driver is the sunniest, most enthusiastic, kind, and knows the cold and knows the hot!”.
- the preferred response style of each target object can be obtained, and the terminal can give different degrees of praise (responses) based on the different response styles.
- the user can also have the authority to modify the response speech.
- FIG. 14 is a schematic diagram of human-computer interaction according to some embodiments of the present application.
- Figure 14A shows the communication interface between the user terminal and the driver terminal.
- the passenger terminal user can click on the voice switch control 1401 to trigger the voice input function.
- the terminal displays the interface shown in Figure 14B.
- the terminal can collect real-time voice data, that is, receive interactive instructions. Later, after the terminal collects the voice data, it can determine whether the collected voice data contains the specified speech.
- if it is recognized that the voice data contains the specified words, the display interface shown in FIG. 14C can be displayed on the terminal.
- the user terminal sends response words 1403 to the driver terminal, specifically: "The driver is the most sunny, enthusiastic, kind, and knows the cold and knows the hot!”.
- on the driver side, the user can be prompted that a boast has been received from the passenger side, for example, in the function control 1301 in the interface shown in FIG. 13A, or in the notification bar or status bar.
- the passenger-side user can also click the boast control 1404 on the display interface to trigger the boast function.
- after the user clicks the exaggeration control 1404, the voice collection step can be entered, and the exaggeration can be realized in the manner shown in FIG. 12 or FIG. 13.
- Fig. 15 is a schematic diagram of human-computer interaction according to some embodiments of the present application. As shown in Figure 15, you can also directly enter the boast interface.
- the communication interface shown in FIG. 15A is the same as the communication interface shown in FIG. 14A.
- the user on the passenger terminal can click the boast control 1404.
- the terminal displays the interface shown in FIG. 15B. On this interface, the terminal determines that exaggeration is to be performed, and can directly determine the response words for the driver-side user.
- Fig. 16 is a schematic diagram of human-computer interaction according to some embodiments of the present application.
- the user can also have the authority to modify the response words.
- the display interface shown in FIG. 16A is the same as the display interface shown in FIG. 15B.
- the currently determined response language of the terminal is "The driver is the sunniest, most enthusiastic, kind-hearted, and knows cold and hot!”. If the passenger terminal user is not satisfied with the response language, he can click the language switching control 1601 to switch the response language.
- the terminal displays the controls shown in FIG. 16B.
- the currently determined response language is "the driver is the sunniest and most reliable person". In this way, the switching of response speech is realized.
- the current terminal may also perform statistical processing on historical response words and display the result. In some embodiments, the current terminal can also be used to perform the following steps: obtain the historical response words from other users, determine the total number of outputs of the historical response words, determine one or more speech tags according to the historical response words, and then display the total number of outputs and the speech tags.
- the speech tag can be designed according to actual needs. For example, the scene of a historical response word can be used as a tag, or the scene of a historical response word together with the number of times the historical response word was output in that scene can be used as the speech tag. For another example, the response style or the response emotion of the response words can be used as a tag.
- Figure 2B also shows three linguistic tags, namely: "Rainy day boast 999+", “Late night boast 3" and "Holiday boast 66".
- the language tag in this scene consists of the exaggeration scene and the number of exaggerations in the scene.
- Fig. 17 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- the interactive instruction may include voice data.
- the first classification model can also be called a trained multilingual speech classifier (referred to as "multilingual speech classifier").
- the method of human-computer interaction includes:
- Step 1710 Collect current voice data.
- the terminal can collect voice data sent by the user in real time and perform subsequent processing.
- the collected voice data can be any of Chinese voice, English voice, Japanese voice, Korean voice, etc., and there is no restriction here.
- the terminal can automatically monitor and collect the voice data sent by the user.
- the user can also press the semantic input button on the display interface to trigger and collect voice data.
- Step 1720 Extract voice features in the voice data.
- the voice feature may be a multi-dimensional fbank feature.
- since the human ear's response to the sound spectrum is non-linear, and the fbank feature is obtained by processing audio in a manner similar to the human ear, the fbank feature is beneficial to improving the performance of speech recognition.
- the Fbank feature in the voice data can be extracted through the following steps: signal conversion from time domain to frequency domain is performed on the voice data to obtain the frequency domain voice data; and the energy spectrum of the frequency domain voice data is calculated to obtain the voice feature.
- the voice data collected by the terminal device is a linear time-domain signal, and the (time-domain) voice signal can be transformed into a frequency-domain voice signal through Fourier transform (FFT).
- the voice data can be sampled.
- the energy of each frequency band in the frequency domain signal is different, and the energy spectrum of different phonemes is different. Therefore, the energy spectrum of the frequency domain speech data can be calculated, and the speech features can be obtained.
- the method of calculating the energy spectrum will not be repeated here. For example, if the sampling frequency of the voice data collected in step 1710 is 16 kHz, then a 40-dimensional fbank feature can be extracted in this step.
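- A minimal numpy sketch of such an fbank extraction (framing, windowing, FFT energy spectrum, mel filterbank, log) is shown below; the frame length, hop size, FFT size, and 40 filters are common assumptions rather than values fixed by this application.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def fbank_features(signal, sample_rate=16000, frame_len=0.025, frame_step=0.010,
                   n_fft=512, n_filters=40):
    """Rough log-fbank extraction: pre-emphasis, framing, Hamming window,
    FFT energy spectrum, triangular mel filterbank, log compression."""
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    frame_size = int(round(frame_len * sample_rate))
    step = int(round(frame_step * sample_rate))
    emphasized = np.pad(emphasized, (0, max(0, frame_size - len(emphasized))))
    num_frames = 1 + max(0, (len(emphasized) - frame_size) // step)
    frames = np.stack([emphasized[i * step: i * step + frame_size]
                       for i in range(num_frames)])

    frames = frames * np.hamming(frame_size)
    power = (np.abs(np.fft.rfft(frames, n_fft)) ** 2) / n_fft   # energy spectrum

    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)

    feats = power @ fbank.T
    return np.log(np.maximum(feats, 1e-10))   # shape: (num_frames, 40)
```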
- the voice data may also be preprocessed before the feature extraction step.
- preprocessing see Figure 5 and its related description.
- Step 1730 Use the trained multilingual speech classifier to process the speech features to obtain the classification recognition result.
- the multilingual speech classifier is used to determine whether the speech data contains any one of the multilingual specified words.
- the multilingual speech classifier can classify and recognize speech data in multiple languages.
- the language types that the multilingual speech classifier can recognize are consistent with the language types of the speech samples in the training process of the multilingual speech classifier.
- the classification recognition result may be a multi-classification result, including dual classification results.
- the classification recognition result is used to indicate whether the voice data is a positive sample or a negative sample; or, the classification recognition result is a degree level between a positive sample and a negative sample, and each degree level corresponds to a positive sample or a negative sample. Therefore, when the degree level corresponds to a positive sample, the classification recognition result indicates that the voice data contains the specified words; when the degree level corresponds to a negative sample, the classification recognition result indicates that the voice data does not contain the specified words.
- the classification recognition result can be divided into two types: "Yes" or "No". If the classification recognition result is "Yes", it means that the voice data contains the specified words of one of the multiple languages; on the contrary, if the classification recognition result is "No", it means that the voice data is irrelevant to the specified words in any language, and the specified words are not contained in the voice data.
- the classification recognition result may also have other manifestations.
- the classification recognition result may be one or more of symbols, numbers, and characters (including characters of various languages, such as Chinese characters and English characters).
- the classification recognition result can be "+” or "-”; or, the classification recognition result can also be “positive” or “negative”; or, the classification recognition result can also be "result 1" or “result 2"; or , The classification recognition result can also be "positive sample” or "negative sample”.
- the results indicated by the aforementioned representations can be customized.
- the classification recognition result is "Yes”, it can mean that the voice data has nothing to do with the specified dialects in any language, and the specified dialects are not included in the voice data; the classification recognition result is "No", which can mean that the voice data contains more than one language.
- the indication of the classification recognition result can be confirmed directly according to the dual classification result.
- the classification recognition result may also be n levels, and n is an integer greater than 1.
- the n levels represent the degree to which the speech data is recognized as belonging to a positive sample or a negative sample. For example, the higher the level, the higher the degree to which the voice data is judged to belong to a positive sample; conversely, the lower the level, the lower the degree to which the voice data is judged to belong to a positive sample.
- for example, if the classification recognition result is n, the level is the highest, and the voice data is judged to belong to a positive sample to a high degree; if the classification recognition result is 1, the level is the lowest, and the voice data is judged to belong to a positive sample to a low degree.
- the opposite can also be established, that is, the higher the level, the lower the degree to which the voice data is judged to belong to a positive sample; conversely, the lower the level, the higher the degree to which the voice data is judged to belong to a positive sample.
- the respective levels corresponding to the positive samples and the negative samples can be preset. For example, for 10 classification results (n is 10, 10 levels in total), levels 1 to 5 can correspond to negative samples, and levels 6 to 10 can correspond to positive samples. Then, if the classification level result is 1, the classification recognition result indicates that the speech data does not contain the specified speech; if the classification level result is 8, the classification recognition result indicates that the speech data contains the specified speech.
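- As an illustration, with the assumed 10-level split above (levels 1-5 mapped to negative samples, levels 6-10 to positive samples), the mapping from a level to the classification indication can be as simple as the following sketch.

```python
def contains_specified_words(level: int, n_levels: int = 10) -> bool:
    """Map a degree level to the classification indication: the upper half of the
    levels corresponds to positive samples (specified words contained)."""
    if not 1 <= level <= n_levels:
        raise ValueError("level out of range")
    return level > n_levels // 2

# contains_specified_words(8) -> True; contains_specified_words(1) -> False
```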
- the positive samples and the negative samples are training samples used in the training phase of the multilingual speech classifier, where a positive sample is multilingual speech data carrying the specified words, and a negative sample is multilingual speech data irrelevant to the specified words.
- it should be understood that the positive samples (or negative samples) in the training samples contain voice data in multiple languages, while the positive sample (or negative sample) involved in the classification recognition result refers to voice data in one language that is recognized as a positive sample (or negative sample). See FIG. 7 and related descriptions for the training process of the multilingual speech classifier.
- Step 1740 When the classification recognition result indicates that the voice data contains the specified speech, output the response speech for the specified speech.
- the response speech for the specified speech may be directly output.
- the response words may include, but are not limited to: one or more of response voice and response text. For more details about the output form of response speech, see step 530 and its related description.
- the interactive instruction or voice data can be directed to itself or to the user of the communication partner. For details, refer to step 530 and related descriptions.
- the terminal can collect voice data, perform semantic recognition on the voice data, and output response words after recognizing the semantics of the user.
- the terminal can use a monolingual acoustic model to recognize the semantics of the voice data.
- a monolingual acoustic model cannot meet the voice interaction needs of multilingual users.
- the multilingual speech classifier in some embodiments of this specification can realize the classification processing of multilingual designated words. On the basis of ensuring the classification effect, it can also convert complex speech recognition problems into simple classification problems.
- FIG. 18 is a block diagram of a terminal according to some embodiments of the present application.
- the processes shown in FIGS. 5-7 and FIG. 17 may be executed on a mobile device or terminal (for example, a passenger end or a driver end, etc.), for example, executed by the processor 340 of the mobile device.
- the acquisition module 440 may include the acquisition module 1810.
- the determining module 420 may include a processing module 1820.
- the response module 430 may include an output module 1830.
- the terminal 1800 may include: a collection module 1810, an extraction module 410, a processing module 1820, and an output module 1830.
- the collection module 1810 is used to collect current voice data.
- the extraction module 410 is used to extract voice features in voice data. In some embodiments, the extraction module 410 is also used to preprocess the voice data before extracting voice features. See Figure 17 and related descriptions for details.
- the processing module 1820 is used to process the voice features by using the trained multilingual voice classifier to obtain the classification recognition result.
- the output module 1830 is used for outputting the response words for the specified words when the classification recognition result indicates that the voice data contains the specified words.
- the output module 1830 can be used to: when the specified speech is directed to the user of the current terminal, directly output the response speech for the specified speech; or, when the specified speech is directed to the counterpart user of the current communication, output the response speech to the counterpart user.
- the acquisition module 440 may include a training module (not shown in FIG. 18), which is used to acquire a multilingual speech classifier through training. For details, refer to FIG. 7 and related descriptions.
- the acquisition module 440 in the terminal 1800 can also be used to acquire historical response speeches from other users, determine the total number of outputs of the historical response speeches, and determine one or more speech labels based on the historical response speeches.
- the output module 1830 is also used to display the total number of outputs and the speech labels.
- Fig. 19 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- the second classification model can also be called a trained emotion classifier (referred to as the "emotion classifier" for short).
- the method of human-computer interaction includes:
- Step 1910 Collect current voice data. For details, refer to step 1710, which will not be repeated here.
- Step 1920 Recognize the responded emotion of the voice data; the responded emotion is obtained from one or more of the text emotion recognition result and the voice emotion recognition result.
- Emotion recognition results include speech emotion recognition results and/or text emotion recognition results.
- one or more of text emotion recognition and speech emotion recognition may be performed on the speech data: the text emotion recognition result is obtained based on the text emotion recognition, the speech emotion recognition result is obtained based on the speech emotion recognition, and one or more of the two are then used to determine the emotion of the voice data. See Figure 20 and related descriptions for details.
- Step 1930 Determine the response emotion corresponding to the responded emotion.
- the responded emotion is the emotion of the collected voice data sent by the user, and the response emotion is the emotion used when responding to the voice data, that is, the emotion of the response voice.
- the types of emotions involved may include, but are not limited to: loss, calm, enthusiasm, or passion, etc., and the actual scene can be customized according to needs.
- emotions may also include: joy, sadness, pain, gratification, excitement, etc., and not exhaustively.
- the emotion categories contained in the responded emotion and in the response emotion may be the same or different.
- for example, both the responded emotion and the response emotion are drawn from the four emotions of loss, calm, enthusiasm, and passion.
- alternatively, the responded emotion may include positive emotions (for example, happiness, excitement, joy, etc.) and negative emotions (for example, loss, sadness, pain, etc.), while the response emotion may only include positive emotions, so as to soothe the user's negative emotions.
- the corresponding relationship between the response emotion and the response emotion can also be preset.
- the corresponding relationship may be stored in the terminal in advance, or may be stored in a storage location readable by the terminal, such as the cloud, which is not limited.
- one responded emotion can correspond to one or more response emotions. For example, if the responded emotion is loss, the corresponding response emotion can be joy or comfort.
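- A minimal sketch of such a preset correspondence is shown below; the table contents and the rule of picking the first candidate are illustrative assumptions, not a prescribed mapping.

```python
# Sketch of a preset responded-emotion -> response-emotion correspondence table.
# One responded emotion may map to one or more candidate response emotions.
RESPONSE_EMOTION_TABLE = {
    "loss":       ["joy", "comfort"],   # soothe a negative emotion with positive ones
    "calm":       ["calm"],
    "enthusiasm": ["enthusiasm"],
    "passion":    ["passion"],
}

def select_response_emotion(responded_emotion: str) -> str:
    candidates = RESPONSE_EMOTION_TABLE.get(responded_emotion, ["calm"])
    return candidates[0]                # picking the first candidate is an assumption

print(select_response_emotion("loss"))  # -> "joy"
```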
- Step 1940 Output a response voice for the voice data, the response voice carrying the response emotion.
- the response speech can be output in the form of a response voice.
- the response content for the voice data may be acquired first, and then the response voice may be generated according to the response emotion and the response content, so that the response voice may be output. In this way, the output response voice also has response emotion.
- the response content can be determined in a number of ways, described below.
- the corresponding relationship between keywords and response content can be preset in advance, so that the response content corresponding to the keyword can be obtained by recognizing the keyword carried in the voice data as the response content of the voice data.
- the neural network model can also be used to process voice data, and further, the response content output by the neural network model can be obtained.
- the response content can be determined by the method corresponding to FIG. 5 or FIG. 6.
- the default voice (timbre) or the user-selected timbre can be used to generate the response voice.
- the user may select the timbre of a certain celebrity as the timbre of the response voice, so that the terminal generates the response voice according to the timbre of the celebrity selected by the user.
- the premise of this implementation is that the terminal can obtain the celebrity's timbre and the corresponding authorization, which will not be described in detail here.
- after determining the response emotion, the terminal device only needs to fetch, from the storage location, a candidate voice corresponding to the response emotion and the response content, and output it as the response voice.
- the candidate voice stored in the storage location may also be manually recorded in advance.
- in the voice interaction scenario of the prior art, the terminal generally outputs response data with a default intonation and tone.
- This human-computer interaction method has a single response emotion and cannot meet the user's voice emotion needs in a personalized scenario.
- Some embodiments in this specification can select different response emotions according to the user's emotion in real time, which can effectively improve the matching degree of the response voice and the user's emotion, meet the user's emotional needs in different emotional states, and have a stronger sense of reality and substitution.
- the voice interaction experience is improved, which also solves the problem of low matching degree between the response voice and the user's emotion in the existing voice interaction scene.
- Fig. 20 is an exemplary flowchart of a method for determining a responded emotion according to some embodiments of the present application.
- determining the respondent emotion may include the following steps:
- Step 1922 Extract the voice features of the voice data.
- the audio features of the voice data can be extracted, and then the audio features are normalized to form a feature vector to obtain the voice features of the voice data.
- the dimensions of feature vectors obtained from different voice data may be different.
- the dimension n of the feature vector can be adjusted according to the needs of the actual scene or project, or according to empirical values; there is no restriction on this.
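- The sketch below illustrates one way to turn raw audio samples into a fixed-length, normalized feature vector; the frame length, the per-frame energy and zero-crossing-rate features, and the four retained statistics are illustrative assumptions (real systems typically use richer features such as MFCCs).

```python
# Sketch of extracting a fixed-length, normalized voice feature vector from raw samples.
import numpy as np

def extract_voice_features(samples: np.ndarray, frame_len: int = 400) -> np.ndarray:
    if len(samples) < frame_len:                       # pad very short clips
        samples = np.pad(samples, (0, frame_len - len(samples)))
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.log1p((frames ** 2).sum(axis=1))                        # per-frame log energy
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)   # zero-crossing rate
    feats = np.array([energy.mean(), energy.std(), zcr.mean(), zcr.std()])
    norm = np.linalg.norm(feats)
    return feats / norm if norm > 0 else feats          # normalized n-dimensional vector (n = 4)

print(extract_voice_features(np.random.randn(16000)))    # 1 s of 16 kHz audio -> 4-dim vector
```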
- Step 1924 Use the trained emotion classifier to process the voice features to obtain the emotion recognition result.
- the emotion classifier is used to recognize the emotion of the speech data.
- see FIG. 9 and related descriptions for the training of the emotion classifier.
- for the emotion classification model, please refer to Figure 8 and its related descriptions.
- Step 1926 The emotion indicated by the emotion recognition result is determined as the responded emotion.
- the output of the emotion classifier is the result of emotion recognition, and the emotion indicated by the result of emotion recognition is related to the way of expression of the result of emotion recognition.
- the emotion recognition result can be a multi-classification result. For example, divide emotions into four categories: loss, calm, enthusiasm, and passion.
- the emotion recognition result may be the probability of the voice data in each emotion, and the emotion indicated by the emotion recognition result is the emotion with the highest probability; or, the emotion indicated by the emotion recognition result is the emotion carrying an indication mark; or, the emotion recognition result may be the score of the voice data in each emotion, and the emotion indicated by the emotion recognition result is the emotion corresponding to the score interval in which the score falls.
- the emotion recognition result may be the emotion probability of the speech data, where the emotion (first emotion) indicated by the emotion recognition result is the emotion with the highest probability.
- the emotion recognition result output by the emotion classifier may be: loss 2%, calm 20%, enthusiasm 80%, and passion 60%. Then, the emotion indicated by the emotion recognition result is enthusiasm.
- the emotion classifier can also output a multi-classification result carrying one indication mark.
- in this case, the emotion indicated by the emotion recognition result is the emotion carrying the indication mark.
- the indication mark can be one or more of words, numbers, characters, and so on. For example, if 1 is the indication mark and the emotion recognition result output by the emotion classifier is: loss 1, calm 0, enthusiasm 0, passion 0, then the emotion indicated by the emotion recognition result is loss.
- the emotion recognition result can also be output as an emotion score, and each emotion corresponds to a different score interval. Therefore, the emotion indicated by the emotion recognition result is the emotion corresponding to the score interval into which the emotion score falls.
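- The three forms of emotion recognition result described above can be read as sketched below; the emotion names and score intervals are illustrative assumptions.

```python
# Sketch of reading the emotion indicated by the three result forms described above.
def emotion_from_probabilities(probs: dict) -> str:
    return max(probs, key=probs.get)                    # emotion with the highest probability

def emotion_from_indication_mark(flags: dict, mark=1) -> str:
    return next(e for e, v in flags.items() if v == mark)

def emotion_from_score(score: float,
                       intervals=((0, 25, "loss"), (25, 50, "calm"),
                                  (50, 75, "enthusiasm"), (75, 101, "passion"))) -> str:
    return next(name for lo, hi, name in intervals if lo <= score < hi)

print(emotion_from_probabilities({"loss": 0.02, "calm": 0.20, "enthusiasm": 0.80, "passion": 0.60}))
print(emotion_from_indication_mark({"loss": 1, "calm": 0, "enthusiasm": 0, "passion": 0}))
print(emotion_from_score(62.0))                          # falls in the "enthusiasm" interval
```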
- FIG. 21 is an exemplary flowchart of a method for determining a responded emotion according to some embodiments of the present application.
- the responded emotion can also be determined through the following steps:
- Step 1922 Extract the voice features of the voice data.
- Step 1924 Use the trained emotion classifier to process the voice features to obtain the emotion recognition result.
- steps 1922 to 1924 are the same as before, and will not be described in detail.
- Step 1926 Convert the voice data into text data.
- Steps 1926 to 1928 are used to obtain the emotion analysis result from the perspective of content. It should be understood that there is no necessary execution order between steps 1922 to 1924 and steps 1926 to 1928: apart from steps 1922 and 1924 being executed in sequence, and steps 1926 and 1928 being executed in sequence, some embodiments of this specification place no particular limitation on the execution order of these steps. They can be executed sequentially as shown in FIG. 21, executed simultaneously, or step 1926 may be started after step 1922 has been executed, and so on, which is not exhaustive.
- the voice data can be converted into text data through a voice decoder, which will not be described in detail.
- Step 1928 Perform emotion analysis on the text data to obtain an emotion analysis result.
- emotion-related words in the text data can be identified, and then, based on the emotion-related words, the emotion analysis result of the text data is determined.
- an emotion score can also be preset for each emotion-related word. Therefore, all the emotion-related words in the text data can be identified, the emotion scores of these words can then be weighted (or directly summed or averaged), and the weighted score is used as the emotion analysis result.
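- A minimal sketch of this lexicon-based scoring is shown below; the word list, scores, and the averaging rule are illustrative assumptions.

```python
# Sketch of lexicon-based emotion scoring over text data.
EMOTION_LEXICON = {"great": 2.0, "happy": 1.5, "tired": -1.0, "angry": -2.0}

def text_emotion_score(text: str, weights: dict = None) -> float:
    words = [w.strip(".,!?") for w in text.lower().split()]
    hits = [w for w in words if w in EMOTION_LEXICON]
    if not hits:
        return 0.0
    weights = weights or {}
    total = sum(EMOTION_LEXICON[w] * weights.get(w, 1.0) for w in hits)
    return total / len(hits)                    # averaged; a plain (weighted) sum also works

print(text_emotion_score("I am happy but a little tired"))  # (1.5 - 1.0) / 2 = 0.25
```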
- Step 19210 Determine the responded emotion according to the emotion recognition result and the emotion analysis result.
- the two can be weighted (or summed or averaged), and the emotion corresponding to the interval in which the weighted value falls is taken as the responded emotion. If one or both of them are not in the form of a score, the emotion recognition result (or the emotion analysis result) can first be converted into a score according to a preset algorithm, and the weighting is then performed to determine the responded emotion.
- in some embodiments, the emotion category indicated by both the emotion recognition result and the emotion analysis result is taken as the responded emotion.
- alternatively, the emotion recognition result and the emotion analysis result are weighted, and the emotion category indicated after the weighting is used as the responded emotion (the weighting is performed after conversion into scores, as described above, and will not be repeated).
- by determining the responded emotion from both the voice data and the text data converted from the voice data, the emotional state of the user can be analyzed from the two dimensions of sound and content (text), which is beneficial to improving the accuracy of the recognition result and, in turn, to narrowing the gap between the response voice and the user's emotional needs, making the interaction more humane and realistic.
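- The sketch below illustrates one way to weight a voice-based score and a text-based score and map the result to an emotion interval; the weights and intervals are illustrative assumptions, and either result could first be converted to a score by any preset algorithm.

```python
# Sketch of weighting a voice-based emotion score and a text-based emotion score,
# then mapping the weighted value to an emotion interval.
EMOTION_INTERVALS = ((0, 25, "loss"), (25, 50, "calm"),
                     (50, 75, "enthusiasm"), (75, 101, "passion"))

def fuse_emotion_results(voice_score: float, text_score: float,
                         w_voice: float = 0.6, w_text: float = 0.4) -> str:
    # Both inputs are assumed to have been converted to a 0-100 score beforehand.
    fused = w_voice * voice_score + w_text * text_score
    return next(name for lo, hi, name in EMOTION_INTERVALS if lo <= fused < hi)

print(fuse_emotion_results(voice_score=80.0, text_score=40.0))  # 64 -> "enthusiasm"
```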
- FIG. 22 is a block diagram of a terminal according to some embodiments of the present application.
- the processes shown in FIGS. 5, 8-9, and 19-21 may be executed on a mobile device or terminal (for example, a passenger end or a driver end), for example, executed by the processor 340 of the mobile device.
- the determination module 420 may include an identification module 2210 and a response emotion determination module 2220.
- the terminal 2200 may include: a collection module 1810, an identification module 2210, a response emotion determination module 2220, and an output module 1830.
- the collection module 1810 can also be used to collect current voice data.
- the recognition module 2210 is used to recognize the responded emotion of the voice data.
- the response emotion determination module 2220 is used to determine the response emotion corresponding to the responded emotion.
- the output module 1830 is also used to output a response voice for voice data, and the response voice has a response emotion.
- the recognition module 2210 may be used to extract voice features of voice data.
- the recognition module 2210 can be used to process voice features using a trained emotion classifier to obtain emotion recognition results.
- the recognition module 2210 can be used to convert voice data into text data, and perform emotion analysis on the text data to obtain an emotion analysis result.
- the recognition module 2210 may be used to determine the emotion analysis result according to the emotion related words recognized from the text data.
- the recognition module 2210 may also be used to determine the responded emotion according to the emotion recognition result and the emotion analysis result.
- the acquisition module 440 may include a training module (not shown in FIG. 22), and the training module may be used to acquire a second classification model (also referred to as an emotion classifier) through training.
- the response module 430 may include a generation module (not shown in Fig. 22), and the generation module may be used to generate a response voice according to the response emotion and the response content.
- generally, the terminal determines the response content corresponding to the voice or text sent by the user, and outputs that response content to the user.
- this processing method can only achieve responses to interactive instructions; the human-computer interaction is too monotonous and cannot meet the user's personalized interaction needs. For example, in the aforementioned praise scenario, if the user says "praise me", the terminal outputs default compliment content in response to this human-computer interaction command. When different users say "praise me", the praise content output by the terminal is the same, which obviously makes it difficult to meet the user's personalized interaction needs, and the human-computer interaction experience is also poor.
- Fig. 23 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- the third classification model can also be called a trained style classifier ("style classifier" for short).
- the method of human-computer interaction includes:
- Step 2310 Receive an interactive instruction for the target object.
- receiving the interactive instruction refers to receiving voice data or text data from the user. Taking the aforementioned praise scenario as an example, this may be the text data "praise me" received from the user, or the collected voice data "praise me" sent by the user.
- the user may be the user to whom the terminal applying the method belongs.
- the target object may be the user to whom the current terminal belongs; or, the target object may be a counterpart user who communicates with the current terminal. That is, the interactive instruction can be directed to different target objects; the specific examples are similar to those in which the specified speech is directed to different target objects. For details, refer to step 1710, which will not be repeated here.
- Step 2320 Determine the response style of the target object, and the response style is related to the historical data of the target object.
- Historical data can directly or laterally reflect the response style of the target object's personal preference. Therefore, the response style of the target object can be determined based on the historical data. See Figure 24 and related descriptions for details.
- Step 2330 Determine the response speech according to the response style and the interactive instruction.
- a response speech with the response style can be obtained.
- the content of the response speech is related to the interactive instruction. For example, if the received interactive instruction is "praise the driver", the content of the response speech is praise for the driver; if the received interactive instruction is "praise the passenger", the content of the response speech is praise for the passenger.
- the candidate speeches corresponding to each response style can be preset, so that one response speech can be determined from the multiple candidate speeches corresponding to the determined response style.
- for the manner of determining the response speech from a plurality of candidate speeches, refer to step 530 and related descriptions.
- the priority of the candidate speech corresponding to each response style may be preset, and the candidate speech with a higher priority is preferentially selected as the response speech.
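- A minimal sketch of selecting one response speech from preset, prioritized candidate speeches per response style is given below; the styles, texts, and priorities are illustrative assumptions.

```python
# Sketch of choosing one response speech from preset candidate speeches per response style,
# preferring the candidate with the higher priority (lower number = higher priority).
CANDIDATE_SPEECHES = {
    "strong praise": [(1, "You are absolutely the best driver on the road!"),
                      (2, "Outstanding driving, truly impressive!")],
    "normal praise": [(1, "Nice driving, thank you for the smooth ride.")],
    "slight praise": [(1, "Thanks for the ride.")],
}

def pick_response_speech(response_style: str) -> str:
    candidates = CANDIDATE_SPEECHES.get(response_style, CANDIDATE_SPEECHES["normal praise"])
    return min(candidates)[1]

print(pick_response_speech("strong praise"))
```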
- the interactive instruction may be parsed to obtain the emotional style of the interactive instruction, and then the response style determined from the historical data and the emotional style can be combined to determine the final response style; furthermore, the response speech corresponding to that style is determined.
- the first style is a style determined based on the first feature and/or the second feature.
- the first feature is a feature determined based on the interactive instruction.
- the first style includes the emotional style of the interactive instruction.
- the emotional style of the interactive instruction may include the responding emotion of the interactive instruction. For more details, please refer to FIG. 8 or FIG. 19 and related descriptions. For example, when the recognition result of the interactive instruction contains very positive keywords such as "very good” and "extremely praised", the user's personalized emotional style is more inclined to adopt a very enthusiastic praise style.
- the styles determined from the two sources can be normalized and then weighted (or directly summed or averaged), and the style corresponding to the weighted score is used as the response style corresponding to the interactive instruction.
- alternatively, it may be judged whether the two styles are consistent: if they are the same, the style indicated by both is taken as the response style; if they are inconsistent, the aforementioned weighting method can be used to determine the response style.
- that is, the text content obtained from speech recognition, the behavior of the target object, and historical data such as account information are used as one reference factor for personalizing the response speech; in addition, the interactive instruction is parsed separately and used as another reference factor for personalizing the response. The two reference factors are then comprehensively weighted, and the comprehensive weighted result is used as the final basis for evaluating which type of personality tendency the target object exhibits within a period of time.
- this weighting result is not static; rather, the response style of the target object is regularly updated offline as the target object's usage data is continuously updated, so as to better adapt to the fluctuation of the target object in different periods (the underlying premise is that the personality of the target object is not single but diverse, and fluctuates with changes in the environment). Using this weighting method can better fit the personality tendency of the target object, and thus better provide personalized recommendations for the target object.
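- One possible realization of this weighting, sketched under the assumption that both reference factors are expressed as per-style score distributions, is shown below; the weights, style names, and the refresh policy are illustrative only.

```python
# Sketch of combining the two reference factors as per-style score distributions.
def combine_style_scores(history_scores: dict, instruction_scores: dict,
                         w_history: float = 0.7, w_instruction: float = 0.3) -> str:
    styles = set(history_scores) | set(instruction_scores)
    combined = {s: w_history * history_scores.get(s, 0.0)
                   + w_instruction * instruction_scores.get(s, 0.0) for s in styles}
    return max(combined, key=combined.get)

# Re-run periodically (offline) as the target object's usage data is updated:
print(combine_style_scores({"strong praise": 0.6, "normal praise": 0.4},
                           {"strong praise": 0.2, "normal praise": 0.8}))  # -> "normal praise"
```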
- Step 2340 output response words to the target object.
- based on the determined response speech, it is only necessary to output the response speech to the target object; for the manner of output, refer to step 530 and related descriptions.
- in some embodiments of this specification, the response style that the target object may like can be determined from the historical data of the target object. The response style and the interactive instruction can therefore be combined to determine and output the response speech, so that the response speech is closer to the personalized style of the target object. Even for the same interactive instruction, the response speeches for different target objects may differ, which solves the problem that existing human-computer interaction methods are monotonous and cannot meet the user's personalized interaction needs, and also makes the interaction process more real and interesting.
- Fig. 24 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application.
- Step 2410 Obtain historical data of the target object.
- only the historical data of the target object in the most recent period of time (for example, the most recent week, the most recent month, or the most recent three days) may be acquired, so as to reduce the influence of distant historical data on the response style and make the response style more in line with the user's preferences in the current period.
- accordingly, the response speech output by the terminal may be the same or different over time. For example, if the user's preferences change, the response style determined by the terminal will be different, and the output response speech may also be different.
- Step 2420 Process the historical data to obtain the object characteristics of the target object.
- the first feature is determined based on historical data.
- the first feature can be referred to as an object feature.
- in step 2410, text data may be collected, or voice data may be collected.
- the text data corresponding to the voice data can be obtained by performing semantic recognition on the voice data.
- the extracted features may include, but are not limited to, word frequency features.
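- The sketch below shows one simple way to build word-frequency object features from historical text data (voice data would first be converted to text); the vocabulary is an illustrative assumption.

```python
# Sketch of building normalized word-frequency object features from historical text data.
from collections import Counter

VOCABULARY = ["praise", "thanks", "great", "fast", "safe"]

def object_features(historical_texts: list) -> list:
    counts = Counter(w.strip(".,!?") for text in historical_texts for w in text.lower().split())
    total = sum(counts[w] for w in VOCABULARY) or 1
    return [counts[w] / total for w in VOCABULARY]

print(object_features(["Great driver, thanks!", "Fast and safe, great trip"]))
```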
- Step 2430 Use the trained style classifier to process the object features to obtain the response style of the target object.
- the style classifier can be used to classify historical data.
- the style classifier can be trained and acquired. For the training and deployment of the style classifier, see Figure 11 and its related descriptions.
- the style classifier can be used offline, and it can be embodied as a model with a small parameter amount.
- the type of style classifier can be seen in Figure 5 and its related descriptions.
- the style recognition result output by the style classifier may be a multi-classification result.
- the response styles may be divided into three styles according to the degree of praise: strong praise (a higher degree of praise), normal praise, and slight praise (a lower degree of praise).
- the multi-classification result output by the style classifier can indicate the probability of each response style, so that the style with the highest probability indicated by the style classification result is used as the response style of the target object.
- for example, the style recognition result output by the style classifier may be: strong praise 70%, normal praise 50%, and slight praise 10%. Then, the response style indicated by the style recognition result is strong praise.
- the style classifier can also output a multi-classification result with one indication mark.
- in this case, the style indicated by the style recognition result is the style carrying the indication mark.
- the indication mark can be one or more of words, numbers, characters, etc. For example, if 1 is the indication mark and the style recognition result output by the style classifier is: strong praise 0, normal praise 1, slight praise 0, then the response style indicated by the style recognition result is normal praise.
- the style recognition result can also be output as a style score, and each style corresponds to a different score interval. Therefore, the style indicated by the style recognition result is the style corresponding to the score interval into which the style score falls.
- FIG. 25 is a block diagram of a terminal according to some embodiments of the present application.
- the processes shown in FIGS. 5, 10-11, and 23-24 may be executed on a mobile device or terminal (for example, a passenger end or a driver end), for example, executed by the processor 340 of the mobile device.
- the acquiring module 440 may include a receiving module 2510.
- the determining module 420 may include a response style determining module 2520.
- the response module 430 may include a response speech determination module 2530.
- the terminal 2500 includes: a receiving module 2510, a response style determining module 2520, a response speech determining module 2530, and an output module 1830.
- the receiving module 2510 is used to receive interactive instructions for the target object.
- the response style determination module 2520 is used to determine the response style of the target object, and the response style is related to the historical data of the target object.
- the response style determination module 2520 may be used to process the acquired historical data to obtain the object characteristics of the target object; and use the trained style classifier to process the object characteristics to obtain the response style of the target object.
- the response speech technique determining module 2530 is used to determine the response speech technique according to the response style and the interactive command.
- the output module 1830 is used to output response words to the target object.
- the terminal 2500 further includes a training module (not shown in FIG. 25), which is used to train the style classifier.
- the training module is also used to determine the style label and update the style classifier.
- the terminal (for example, the terminal 1800, 2200, or 2500) in this specification may be implemented as a server or as a terminal device.
- all or part of the modules in the module diagram 400 shown in FIG. 4, the terminal 1800 shown in FIG. 18, the terminal 2200 shown in FIG. 22, and the terminal 2500 shown in FIG. 25 may be integrated into one physical entity, or may be physically separated.
- these modules can all be implemented in the form of software called by processing elements; they can also be implemented in the form of hardware; part of the modules can be implemented in the form of software called by the processing elements, and some of the modules can be implemented in the form of hardware.
- the extraction module 410 may be a separately established processing element, or it may be integrated in the terminal 1800, for example, implemented in a certain chip of the terminal. In addition, it may also be stored in the memory of the terminal 1800 in the form of a program, and a certain processing element of the terminal calls and executes the functions of the above module.
- the implementation of other modules is similar.
- all or part of these modules can be integrated together or implemented independently.
- the processing element described here may be an integrated circuit with signal processing capability.
- each step of the above method or each of the above modules can be completed by an integrated logic circuit of hardware in the processor element or instructions in the form of software.
- the above modules may be one or more integrated circuits configured to implement the above methods, such as one or more application specific integrated circuits (ASIC), one or more digital signal processors (DSP), or one or more field programmable gate arrays (FPGA), etc.
- the processing element may be a general-purpose processor, such as a central processing unit (CPU) or other processors that can call programs.
- these modules can be integrated together and implemented in the form of a system-on-a-chip (SOC).
- numbers describing quantities of components and attributes are used in some places. It should be understood that such numbers used in the description of the embodiments are modified in some examples by the terms "about", "approximately" or "substantially". Unless otherwise stated, "about", "approximately" or "substantially" indicates that the number is allowed to vary by ±20%.
- the numerical parameters used in the specification and claims are approximate values, and the approximate values can be changed according to the required characteristics of individual embodiments. In some embodiments, the numerical parameters should take into account the prescribed significant digits and adopt a general digit-retention method. Although the numerical ranges and parameters used to confirm the breadth of the ranges in some embodiments of this specification are approximate values, in specific embodiments such numerical values are set as accurately as is feasible.
Abstract
Disclosed in an embodiment of the present invention is a method for man-machine interaction. The method for man-machine interaction comprises: extracting an associated feature of a target object on the basis of an interaction instruction directed at the target object, the associated feature being related to at least one of the interaction instruction and historical data of the target object; determining a response policy for the target object by processing the associated feature, the response policy being related to at least one of response content, a response style and a response emotion; and determining a response message for the target object on the basis of the response policy.
Description
Cross Reference
This application claims priority to Chinese Application No. 202010016725.5 filed on January 8, 2020, Chinese Application No. 202010017735.0 filed on January 8, 2020, and Chinese Application No. 202010018047.6 filed on January 8, 2020, the entire contents of which are incorporated herein by reference.
This specification relates to the field of computer technology, and in particular to a human-computer interaction method and system.
With the development of computer technology, terminals can realize automatic responses to users to realize human-computer interaction. Generally, a terminal can determine the response content corresponding to the voice or text sent by the user, and output the response content to the user.
However, in current human-computer interaction, a response may only be given in a single language (for example, Chinese or English). Moreover, when responding, the user's emotional needs or style needs are often not considered. For example, the terminal generally responds with a default intonation, tone, or content, which cannot meet the user's personalized interaction needs.
For this reason, the embodiments of this specification propose a human-computer interaction method and system to realize personalized interaction for different languages.
Summary of the Invention
One of the embodiments of this specification provides a human-computer interaction method. The method includes: extracting an associated feature of a target object based on an interaction instruction directed at the target object, the associated feature being related to at least one of the interaction instruction and historical data of the target object; the associated feature includes a first feature corresponding to the interaction instruction; if the interaction instruction does not include voice data, the first feature includes a text feature of the text data corresponding to the interaction instruction; if the interaction instruction includes voice data, the first feature includes at least one of a voice feature of the voice data and a text feature of the text data corresponding to the interaction instruction; determining a response strategy for the target object based on processing the associated feature, the response strategy being related to at least one of response content, response style, and response emotion; and determining a response speech for the target object based on the response strategy.
One of the embodiments of this specification provides a human-computer interaction system. The system includes: an extraction module, configured to extract an associated feature of a target object based on an interaction instruction directed at the target object, the associated feature being related to at least one of the interaction instruction and historical data of the target object; the associated feature includes a first feature corresponding to the interaction instruction; if the interaction instruction does not include voice data, the first feature includes a text feature of the text data corresponding to the interaction instruction; if the interaction instruction includes voice data, the first feature includes at least one of a voice feature of the voice data and a text feature of the text data corresponding to the interaction instruction; a determination module, configured to determine a response strategy for the target object based on processing the associated feature, the response strategy being related to at least one of response content, response style, and response emotion; and a response module, configured to determine a response speech for the target object based on the response strategy.
One of the embodiments of this application provides a computer-readable storage medium storing computer instructions. After a computer reads the computer instructions in the storage medium, the computer executes the human-computer interaction method described above.
This specification will be further described in the form of exemplary embodiments, which will be described in detail with reference to the accompanying drawings. These embodiments are not restrictive. In these embodiments, the same number represents the same structure, in which:
Fig. 1 is a schematic diagram of an application scenario of a human-computer interaction system according to some embodiments of the present application;
Fig. 2 is a schematic diagram of exemplary hardware components and/or software components of an exemplary computing device according to some embodiments of the present application;
Fig. 3 is a schematic diagram of exemplary hardware components and/or software components of an exemplary mobile device according to some embodiments of the present application;
Fig. 4 is a block diagram of an exemplary processing device according to some embodiments of the present application;
Fig. 5 is an exemplary flowchart of a human-computer interaction method according to some embodiments of the present application;
Fig. 6 is an exemplary flowchart of a human-computer interaction method according to some embodiments of the present application;
Fig. 7 is an exemplary flowchart of training a first classification model according to some embodiments of the present application;
Fig. 8 is an exemplary flowchart of a human-computer interaction method according to some embodiments of the present application;
Fig. 9 is an exemplary flowchart of training a second classification model according to some embodiments of the present application;
Fig. 10 is an exemplary flowchart of a human-computer interaction method according to some embodiments of the present application;
Fig. 11 is an exemplary flowchart of training a third classification model according to some embodiments of the present application;
Fig. 12 is a schematic diagram of human-computer interaction according to some embodiments of the present application;
Fig. 13 is a schematic diagram of human-computer interaction according to some embodiments of the present application;
Fig. 14 is a schematic diagram of human-computer interaction according to some embodiments of the present application;
Fig. 15 is a schematic diagram of human-computer interaction according to some embodiments of the present application;
Fig. 16 is a schematic diagram of human-computer interaction according to some embodiments of the present application;
Fig. 17 is an exemplary flowchart of a human-computer interaction method according to some embodiments of the present application;
Fig. 18 is a block diagram of a terminal according to some embodiments of the present application;
Fig. 19 is an exemplary flowchart of a human-computer interaction method according to some embodiments of the present application;
Fig. 20 is an exemplary flowchart of a method for determining a responded emotion according to some embodiments of the present application;
Fig. 21 is an exemplary flowchart of a method for determining a responded emotion according to some embodiments of the present application;
Fig. 22 is a block diagram of a terminal according to some embodiments of the present application;
Fig. 23 is an exemplary flowchart of a human-computer interaction method according to some embodiments of the present application;
Fig. 24 is an exemplary flowchart of a human-computer interaction method according to some embodiments of the present application;
Fig. 25 is a block diagram of a terminal according to some embodiments of the present application.
In order to more clearly describe the technical solutions of the embodiments of this specification, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some examples or embodiments of this specification. For those of ordinary skill in the art, this specification can also be applied to other similar scenarios according to these drawings without creative work. Unless it is obvious from the context or otherwise stated, the same reference numerals in the figures represent the same structure or operation.
It should be understood that "system", "device", "unit" and/or "module" as used herein is a method for distinguishing different components, elements, parts, sections, or assemblies of different levels. However, if other words can achieve the same purpose, these words may be replaced by other expressions.
As shown in this specification and the claims, unless the context clearly indicates otherwise, the words "a", "an", "one" and/or "the" do not specifically refer to the singular and may also include the plural. Generally speaking, the terms "include" and "comprise" only indicate that the clearly identified steps and elements are included; these steps and elements do not constitute an exclusive list, and the method or device may also include other steps or elements.
Flowcharts are used in this specification to illustrate the operations performed by the system according to the embodiments of this specification. It should be understood that the preceding or following operations are not necessarily performed exactly in order. Instead, the steps may be processed in reverse order or at the same time. Meanwhile, other operations may be added to these processes, or one or more steps may be removed from these processes.
Fig. 1 is a schematic diagram of an application scenario of a human-computer interaction system according to some embodiments of the present application. As shown in Fig. 1, the human-computer interaction system 100 may include a server 110, a network 120, a first client 130, a second client 140, and a storage 150.
The server 110 may process data and/or information obtained from at least one component of the system 100 (for example, the first client 130, the second client 140, and the storage 150) or an external data source (for example, a cloud data center). For example, the server 110 may obtain an interaction instruction from the first client 130 (for example, a passenger end). For another example, the server 110 may also obtain historical data from the storage 150.
In some embodiments, the server 110 may include a processing device 112. The processing device 112 may process information and/or data related to the human-computer interaction system to perform one or more functions described in this specification. For example, the processing device 112 may determine a response speech based on the interaction instruction and/or historical data. In some embodiments, the processing device 112 may include at least one processing unit (for example, a single-core processing engine or a multi-core processing engine). In some embodiments, the processing device 112 may be a part of the first client 130 and/or the second client 140.
The network 120 may provide a channel for information exchange. In some embodiments, the network 120 may include one or more network access points. One or more components of the system 100 may be connected to the network 120 through an access point to exchange data and/or information. In some embodiments, at least one component in the system 100 may access data or instructions stored in the storage 150 via the network 120.
The owner of the first client 130 may be the user himself or someone other than the user. For example, the owner A of the first client 130 may use the first client 130 to send a service request for a user B. In some embodiments, the first client 130 may include various types of devices with information receiving and/or sending functions, and can process information and/or data. In some embodiments, the first client 130 may be a device with a positioning function. The first client 130 may be a device with a display function, which displays the response speech fed back by the server 110 to the first client 130; the display may take the form of an interface, a pop-up window, a floating window, a small window, text, etc. The first client 130 may be a device with a voice function, so that the response speech fed back by the server 110 to the first client 130 can be played.
The second client 140 can communicate with the first client 130. In some embodiments, the first client 130 and the second client 140 may communicate through a short-range communication device. In some embodiments, the type of the second client 140 may be the same as or different from that of the first client 130. For example, the first client 130 and the second client 140 may include, but are not limited to, a tablet computer, a notebook computer, a mobile device, a desktop computer, etc., or any combination thereof.
In some embodiments, the storage 150 may store data and/or instructions that the processing device 112 can execute or use to complete the exemplary methods described in this specification. For example, the storage 150 may store historical data, models used to determine the response speech, and audio files and text files of the response speech. In some embodiments, the storage 150 may be directly connected to the server 110 as a back-end storage. In some embodiments, the storage 150 may be a part of the server 110, the first client 130, and/or the second client 140.
Fig. 2 is a schematic diagram of exemplary hardware components and/or software components of an exemplary computing device according to some embodiments of the present application. As shown in Fig. 2, the computing device 200 may include a processor 210, a memory 220, an input/output 230, and a communication port 240.
The processor 210 can execute computing instructions (program code) and perform the functions of the human-computer interaction system 100 described in this specification. The computing instructions may include programs, objects, components, data structures, procedures, modules, and functions (the functions refer to the specific functions described in this specification). For example, the processor 210 may process image or text data obtained from any other component of the human-computer interaction system 100. For illustration only, the computing device 200 in Fig. 2 describes only one processor, but it should be noted that the computing device 200 may also include multiple processors. The memory 220 may store data/information obtained from any other component of the on-demand service system 100. The input/output 230 may be used to input or output signals, data, or information. In some embodiments, the input/output 230 may enable the user to interact with the human-computer interaction system 100. In some embodiments, the input/output 230 may include an input device and an output device. The communication port 240 may be connected to a network for data communication. In some embodiments, the communication port 240 may be a standardized port or a specially designed port.
Fig. 3 is a schematic diagram of exemplary hardware components and/or software components of an exemplary mobile device according to some embodiments of the present application.
In some embodiments, a terminal (for example, the first client 130 or the second client 140) may be implemented by the mobile device 300. As shown in Fig. 3, the mobile device 300 may include a communication platform 310, a display 320, a graphics processing unit (GPU) 330, a central processing unit (CPU) 340, an input/output 350, a memory 360, and a storage 390. In some embodiments, the mobile device 300 may also include any other suitable components, including but not limited to a system bus or a controller (not shown in the figure). In some embodiments, a mobile operating system 370 and one or more application programs 380 may be loaded from the storage 390 into the memory 360 so as to be executable by the central processing unit 340. The application program 380 may include a browser or any other suitable mobile application for receiving and presenting prompt information or other related information from the server 110. User interaction with the information flow may be implemented through the input/output 350 and provided to the server 110 and/or other components of the human-computer interaction system 100 through the network 120.
Fig. 4 is a block diagram of an exemplary processing device according to some embodiments of the present application. As shown in Fig. 4, the system may include: an extraction module 410, a determination module 420, a response module 430, and an acquisition module 440.
The extraction module 410 may be used to extract an associated feature of the target object based on an interaction instruction directed at the target object. The associated feature includes a first feature corresponding to the interaction instruction and/or a second feature corresponding to historical data, and the first feature includes at least one of a voice feature of the voice data in the interaction instruction and a text feature of the text data corresponding to the interaction instruction. In some embodiments, the extraction module 410 may be used to preprocess the interaction instruction before extracting the associated feature. See Fig. 5 and related descriptions for more details.
The determination module 420 may be used to determine a response strategy for the target object based on processing the associated feature. In some embodiments, the determination module 420 may be used to process the associated feature based on a model to determine the response strategy. The model may be a first classification model, a second classification model, or a third classification model. See Figs. 5, 6, 8, and 10 and related descriptions for more details.
The response module 430 may be used to determine and output a response speech for the target object based on the response strategy. In some embodiments, the response speech is output to the target object in the form of response text and/or response voice.
The acquisition module 440 may be used to acquire interaction instructions and models. For example, the acquisition module 440 may acquire a model through a training process. The acquisition module 440 may be used to acquire training samples. The training samples include a first training sample, a second training sample, and a third training sample. See Figs. 7, 9, and 11 and related descriptions for more details.
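As a non-limiting illustration, the cooperation of the extraction module 410, the determination module 420, and the response module 430 can be sketched as follows; the helper functions are placeholders for the module behavior described above and are assumptions for illustration only.

```python
# Illustrative pipeline: extract associated features, determine a response strategy,
# and determine the response speech. The helpers stand in for modules 410/420/430.
def extract_associated_features(interaction_instruction, historical_data):
    return {"first": interaction_instruction, "second": historical_data}

def determine_response_strategy(features):
    # related to response content, response style and/or response emotion
    return {"content": "praise", "style": "normal praise", "emotion": "enthusiasm"}

def determine_response_speech(strategy):
    return f"[{strategy['emotion']}/{strategy['style']}] {strategy['content']}"

features = extract_associated_features("praise me", ["historical order records"])
strategy = determine_response_strategy(features)
print(determine_response_speech(strategy))
```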
Figure 5 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application. In some embodiments, the process 500 may be executed by a processing device (for example, the processing device 112, the processor 210, or the central processing unit 340).
Step 510: extract associated features of the target object based on an interaction instruction directed at the target object, where the associated features are related to at least one of the interaction instruction and historical data of the target object. In some embodiments, step 510 may be performed by the extraction module 410.
The target object refers to a person who can exchange information with a system or device (for example, a mobile device, a terminal device, etc.), that is, the object to which the system or device needs to respond. Target objects may include device-associated objects (for example, users or communication counterparts), debuggers, testers, implementation personnel, maintenance personnel, customer service personnel, and so on.
In some embodiments, the target object is the user to which the current terminal belongs, for example, the passenger user to which a passenger terminal belongs. The current terminal may refer to the device or system that performs the human-computer interaction, for example, the terminal that receives the interaction instruction. It can be understood that when the target object is the user to which the current terminal belongs, the target object is actually the user who initiated the interaction instruction. In some embodiments, the target object is a counterpart user in communication with the current terminal, for example, a driver user in communication with a passenger terminal, or a passenger user in communication with a driver terminal.
An interaction instruction is an instruction sent to a device. An interaction instruction directed at a target object may refer to an instruction sent to a system or device that enables it to determine how to respond to the target object. In some embodiments, the interaction instruction may be sent to the device by an object associated with the device (such as a user of the device). Through the interaction instruction, the associated object can convey a specific intention to the device (for example, a favorable comment, a like, praising the driver, praising oneself, a complaint, etc.), so that the device or system gives a corresponding response. In some embodiments, the interaction instruction may be obtained from the current terminal.
In some embodiments, the interaction instruction may take the form of voice, text, video, image, facial motion, gesture, touch-screen operation, etc., or any combination thereof. The voice may be in one language or any combination of languages such as Chinese, English, French, and Japanese.
The associated features may refer to features related to the target object, the sender of the interaction instruction, and/or the interaction instruction itself. In some embodiments, the associated features may be related to at least one of the interaction instruction and the historical data of the target object.
In some embodiments, the associated features include a first feature corresponding to the interaction instruction. It can be understood that the first feature is a feature obtained based on the interaction instruction.
As mentioned above, the interaction instruction may take multiple forms, and correspondingly, the features corresponding to the interaction instruction may also be of multiple types. In some embodiments, the first feature may include at least one of a voice feature of voice data in the interaction instruction, a text feature of text data corresponding to the interaction instruction, an image feature corresponding to image data in the interaction instruction, and the like. The text data corresponding to the interaction instruction may be text data contained in the interaction instruction itself, or text data obtained by recognizing voice data and/or other data (for example, image data) in the interaction instruction, for example, by a speech decoder. As mentioned above, the voice data may be multilingual. Correspondingly, a speech decoder may have a one-to-many or one-to-one relationship with languages; that is, one speech decoder may convert voice data of multiple languages into text data, or one speech decoder may only decode voice data of a certain language.
In some embodiments, the interaction instruction may include voice data, and the first feature may include a voice feature corresponding to the voice data and/or a text feature of text data corresponding to the interaction instruction. As mentioned above, the voice may be multilingual, the voice data may be data of multiple languages, and the voice feature may be the voice feature of the multilingual voice data.
In some embodiments, the interaction instruction may not include voice data, and the first feature may include the text feature of the text data corresponding to the interaction instruction.
In some embodiments, the first feature may also include features corresponding to other data in the interaction instruction. For example, if the interaction instruction contains image data, the first feature includes image features corresponding to the image data. As another example, if the interaction instruction contains gestures, postures, or other actions, the first feature may include screen gesture features, facial features, fingerprint features, and the like. Screen gesture features represent screen operation information in the interaction instruction, such as sliding, page-turning, and touching operations. Facial features represent the user's face in the interaction instruction; for example, the processing device may derive different interaction instructions from different facial features, and the facial features may further include pupil features, facial landmark features, iris features, and the like. Fingerprint features represent the fingerprint information of the user's finger; for example, the processing device may derive different interaction instructions from different fingerprint features.
In some embodiments, the voice feature includes one or a combination of audio features and energy features of the voice data.
The audio features of the voice data refer to features of the audio of the voice data. In some embodiments, the audio features may include at least one of a fundamental frequency feature, a short-time energy feature, a short-time amplitude feature, a short-time zero-crossing rate feature, and the like.
The fundamental frequency feature refers to the characteristics of the sound frequency in the voice data. The fundamental frequency corresponds to the frequency of vocal cord vibration and represents the pitch of the sound: the faster the vocal cords vibrate, the higher the fundamental frequency. The fundamental frequency features of voice data can be used for speech noise detection, special sound detection, gender discrimination, speaker recognition, parameter adaptation, and so on.
The short-time energy feature refers to the average energy of the sampled signal within a short-time audio frame. For example, a continuous audio signal stream x yields K sample points, and these K sample points can be divided into M short-time frames; assuming that each short-time frame and the window function have size N, the short-time energy of the m-th short-time frame can be calculated according to the formula E_m = Σ_{n=0}^{N-1} [x_m(n)]², where x_m(n) denotes the n-th windowed sample of the m-th frame.
The short-time zero-crossing rate feature refers to the number of times the signal crosses zero within each frame. For a continuous speech signal with a time axis, the time-domain waveform of the speech can be observed crossing the horizontal axis. In the case of a discrete-time speech signal, a zero-crossing is said to occur when adjacent samples have different algebraic signs, so the number of zero-crossings, i.e., the zero-crossing rate, can be counted. The zero-crossing rate reflects the frequency information of the signal to a certain extent. The short-time zero-crossing rate can be used to distinguish unvoiced and voiced speech: a high zero-crossing rate indicates unvoiced sound, and a low zero-crossing rate indicates voiced sound.
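By way of illustration only, the following minimal sketch computes per-frame short-time energy and zero-crossing rate with NumPy. The frame length, hop length, and windowing choices are assumptions for the example and are not values prescribed by this application.

```python
import numpy as np

def short_time_features(signal, frame_len=400, hop_len=160):
    """Compute per-frame short-time energy and zero-crossing rate.

    `signal` is a 1-D array of audio samples; `frame_len` and `hop_len`
    are in samples (for example 25 ms / 10 ms at 16 kHz).
    """
    window = np.hamming(frame_len)
    energies, zcrs = [], []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len] * window
        # Short-time energy: mean of squared (windowed) samples in the frame.
        energies.append(np.mean(frame ** 2))
        # Zero-crossing rate: fraction of adjacent sample pairs with opposite signs.
        signs = np.sign(frame)
        zcrs.append(np.mean(signs[:-1] * signs[1:] < 0))
    return np.array(energies), np.array(zcrs)

# Example: one second of random audio at 16 kHz as a stand-in for speech.
energy, zcr = short_time_features(np.random.randn(16000))
```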
The energy features of voice data refer to the energy distribution of the voice data in the frequency domain; different energy distributions represent different voice characteristics. The frequency domain is a coordinate system used to describe the frequency characteristics of a speech signal. In some embodiments, a frequency-domain diagram may display the energy value of the speech signal in each given frequency band within a frequency range.
In some embodiments, the energy features include at least one of Fbank features and mel-frequency cepstral coefficient (MFCC) features.
MFCC features refer to features of the speech signal obtained by the MFCC method. MFCC features have good discriminability and are used to distinguish different sounds. In some embodiments, MFCC features are commonly used for automatic speech and speaker recognition. For details about the Fbank feature and its extraction, see Figure 17 and its related description, which will not be repeated here.
In some embodiments, the voice features further include linear prediction coefficients (LPC), perceptual linear predictive (PLP) coefficients, Tandem features, Bottleneck features, linear predictive cepstral coefficients (LPCC), formants, the Bark spectrum, and so on.
In some embodiments, voice features can be extracted through algorithms or models, using the algorithm corresponding to each voice feature type to extract that type of feature. For example, MFCC features are extracted using a triangular band-pass filter bank.
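As one possible illustration, the open-source librosa library (an assumption of this sketch, not part of the present application) can extract MFCC and log-mel (Fbank-style) features; librosa's mel filter bank is itself a set of triangular band-pass filters. The signal, sampling rate, and frame parameters below are placeholders.

```python
import numpy as np
import librosa

sr = 16000
y = np.random.randn(sr).astype(np.float32)   # stand-in for one second of speech
# In practice: y, sr = librosa.load("utterance.wav", sr=16000)

# 13 MFCCs per frame, computed from a triangular mel filter bank.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)

# Log-mel filter-bank energies, often used as "Fbank" features.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40)
fbank = np.log(mel + 1e-6)

print(mfcc.shape, fbank.shape)   # (13, n_frames), (40, n_frames)
```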
In some embodiments, before the voice feature is extracted, the voice data may be preprocessed. The preprocessing includes at least one of framing, pre-emphasis, windowing, and noise processing.
Framing is used to split the voice data into multiple voice segments, reducing the amount of data processed at a time. The data may be split according to a predetermined value or a predetermined range (for example, a frame of 10 ms to 30 ms). To avoid missing information, an offset may be applied during framing so that adjacent frames overlap. In some embodiments, the voice data is a short sentence, and segmentation is unnecessary in some scenarios.
Pre-emphasis is used to strengthen the high-frequency part. Pre-emphasis can be implemented by passing the voice data through a high-pass filter.
Windowing is used to eliminate signal discontinuities that may arise at both ends of each frame. For example, each frame is multiplied by a Hamming window to increase the continuity between the left and right ends of the frame.
Noise processing may be the addition of random noise, which can resolve processing errors and omissions in synthesized audio. Noise processing may also include denoising, which can be implemented by denoising algorithms such as adaptive filters, spectral subtraction, and Wiener filtering. In some embodiments, the voice data is collected in real time, and noise processing is unnecessary in some scenarios.
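A minimal sketch of the pre-emphasis, framing, and windowing steps is given below, assuming a signal at least one frame long; the pre-emphasis coefficient and frame sizes are typical example values rather than values fixed by this application.

```python
import numpy as np

def preprocess(signal, frame_len=400, hop_len=160, alpha=0.97):
    """Pre-emphasis, overlapping framing, and Hamming windowing.

    Returns an array of shape (n_frames, frame_len).
    """
    # Pre-emphasis: first-order high-pass filter y[n] = x[n] - alpha * x[n-1].
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # Framing with overlap (hop_len < frame_len keeps adjacent frames overlapping).
    n_frames = 1 + (len(emphasized) - frame_len) // hop_len
    frames = np.stack([
        emphasized[i * hop_len: i * hop_len + frame_len] for i in range(n_frames)
    ])

    # Windowing: multiply each frame by a Hamming window to smooth frame edges.
    return frames * np.hamming(frame_len)

frames = preprocess(np.random.randn(16000))  # e.g. one second of audio at 16 kHz
```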
文本特征代表文本数据的相关信息,包括但不限于:关键词特征、语义特征、词频特征等。在一些实施例中,可以通过算法或模型提取文本特征,例如,通过LSTM、BERT、one-hot、词袋(Bag-of-words)模型、词频与逆向文件频率(term frequency–inverse document frequency,TF-IDF)模型、词汇表模型等。The text features represent relevant information of the text data, including but not limited to: keyword features, semantic features, word frequency features, etc. In some embodiments, the text features may be extracted through algorithms or models, for example, through LSTM, BERT, one-hot, bag-of-words model, term frequency and inverse document frequency (term frequency—inverse document frequency, TF-IDF) model, vocabulary model, etc.
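As an illustration of one of the listed options, the sketch below extracts TF-IDF word-frequency features with scikit-learn; the example sentences are invented placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A few example utterances transcribed from interaction instructions
# (the sentences are illustrative only).
texts = [
    "please praise the driver",
    "the driver's service is very good",
    "I would like to file a complaint",
]

vectorizer = TfidfVectorizer()          # bag-of-words weighted by TF-IDF
features = vectorizer.fit_transform(texts)

print(features.shape)                   # (3, vocabulary size)
print(vectorizer.get_feature_names_out())
```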
The historical data of the target object refers to data generated by the target object over a past period of time, for example, the last week, the last month, or the last three days. The historical data may include, but is not limited to, one or more of online voice data, personal account data, user behavior data, and offline recorded data.
Online voice data comes from the online voice of the target object, that is, voice uttered by the target object online, for example, online voice uttered over a past period of time. For example, it may be voice interaction instructions with which the target object previously requested the device to respond, or voice data of past communication between the target object and a communication counterpart. In some embodiments, online voice data can be converted into text data and used as historical data; for example, the historical online voice of the target object is acquired and recognized to obtain the corresponding text, and that text is used as a kind of historical data.
Personal account data comes from the account information of the target object to which the terminal belongs. Personal account data may include, but is not limited to, personality, occupation, the number of orders placed with a ride-hailing APP (application), reputation score, gender, account age, number of historical orders, pick-up and drop-off points in historical orders, ride time information in historical orders, and so on.
Target object behavior data may be data generated by historical operations or feedback of the target object. For example, it may be evaluation feedback data of the target object on previously pushed response words, or evaluation information of the target object (for example, a driver or a passenger) about a communication counterpart (for example, a passenger or a driver). As another example, it may include the user's evaluations of historical orders, chat records with customer service, evaluations of customer service, evaluations of the system, evaluations of information pushed by the system, and other information.
Offline recorded data may be data recorded by the terminal, for example, data recorded by the terminal while offline, such as the target object's historical voice recognition results or input text recorded on the offline side.
In some embodiments, the historical data may also include other information, including but not limited to the user's historical input information, the user's geographic information, and the user's identity information. The user's historical input information includes the user's historical query information, video input information, voice input information, text input information, screen gesture operation information, unlocking information, and so on. The user's geographic information includes the user's home address, work address, activity range, etc. The user's identity information includes the user's age, occupation, place of origin, height, weight, income, etc.
It should be understood that, for any target object, its historical data is continuously updated over time.
The second feature may be a feature determined based on historical data, including features determined based on the historical data of the target object and features determined based on the historical data of the sender of the interaction instruction. In some embodiments, the type of the second feature may be determined according to the type of the historical data; for example, if the historical data contains voice data, the second feature contains voice features of the historical data, and correspondingly, the extraction of the second feature is similar to that of the first feature and is not repeated here. In some embodiments, time-series behavior features may also be extracted based on the historical data: the historical data of the target object or of the sender of the interaction instruction is ordered by time, one or more specific behaviors in the historical data are converted into a sequence feature, and each behavior in the sequence feature is represented by a vector. For travel scenarios, the specific behaviors include service-related behaviors such as the number of orders placed and the degree of favorable comments. Thus, when the response strategy is determined based on the second feature, not only the actual behaviors but also the time factor can be considered, so that the assessment of the target object or of the sender of the interaction instruction is more accurate, and further, the determination of the response strategy is also more accurate.
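A minimal sketch of turning historical behavior records into a time-ordered sequence of vectors is shown below. The record fields ("orders", "praise_rate") and the linear time weighting are illustrative assumptions, not elements defined by this application.

```python
import numpy as np

# Hypothetical per-day behavior records for one user, ordered by time.
history = [
    {"orders": 3, "praise_rate": 0.9},
    {"orders": 1, "praise_rate": 1.0},
    {"orders": 5, "praise_rate": 0.8},
]

# Each behavior record becomes one vector in the sequence; more recent
# records can be given larger time weights.
weights = np.linspace(0.5, 1.0, num=len(history))
sequence = np.stack([
    w * np.array([rec["orders"], rec["praise_rate"]])
    for w, rec in zip(weights, history)
])

print(sequence.shape)  # (3 time steps, 2 behavior dimensions)
```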
In some embodiments, the associated features may be represented by vectors. In some embodiments, after the associated features are extracted, post-processing such as normalization may be performed on them. For more details about normalization, see Figure 20 and its related description.
Step 520: determine a response strategy for the target object based on processing of the associated features, where the response strategy is related to at least one of response content, response style, and response emotion. In some embodiments, step 520 may be performed by the determining module 420.
A response strategy refers to the method and/or criterion for responding to the interaction instruction, and the response strategy is related to at least one of response content, response style, and response emotion.
Response content refers to the semantic content of the response to the interaction instruction. In some embodiments, the same response content can be expressed in different wording. For example, the response content "the driver's service is good" can be expressed as "the driver's service is very good", "the driver's service is great", "the driver's service is really good", and so on.
Response style refers to the degree of the response to the interaction instruction. For example, when praising someone, the style may be strong praise (intense praise, a high degree of praise), normal praise, or slight praise (mild praise, a low degree of praise).
Response emotion refers to the emotion or mood carried when responding to the interaction instruction. Response emotions may include loss, calm, enthusiasm, happiness, excitement, joy, exhilaration, sadness, anger, irritability, pain, passion, and so on.
In some embodiments, the associated features can be processed through a model to determine the response strategy for the target object. The model is composed of a multi-layer residual network and a multi-layer fully connected network, where the multi-layer residual network is built from convolutional neural networks; or it is composed of a multi-layer convolutional neural network and a multi-layer fully connected network. The model may be the first classification model, the second classification model, or the third classification model, as described below.
In some embodiments, the determining module 420 may process the associated features to determine the response content corresponding to designated words. For example, the determining module 420 may process the voice features based on the first classification model to determine whether the voice data contains designated words in any one of multiple languages; if the voice data contains designated words, the response content corresponding to the designated words is determined. For details, see Figure 6 and its description, which will not be repeated here.
In some embodiments, the determining module 420 may determine the response emotion based on processing of the associated features. For example, the determining module 420 may process at least one of the voice features and the text features based on the second classification model to determine the emotion to be responded to in the voice data, and then determine the response emotion based on that emotion. For details, see Figure 13 and its description, which will not be repeated here.
In some embodiments, the determining module 420 may determine the response style based on processing of the associated features. For example, the determining module may process at least one of the first feature and the second feature based on the third classification model to determine the response style. For details, see Figure 15 and its description, which will not be repeated here.
In some embodiments, the determining module 420 may determine the response strategy based on processing of the interaction instruction and the historical data. For example, the determining module 420 may process at least one of the interaction instruction and the historical data based on a response strategy model to determine the response strategy. In some embodiments, the response strategy model may be a separate model, or any combination of the first classification model, the second classification model, and the third classification model.
In some embodiments, the determining module 420 may obtain information such as the current weather and real-time road conditions, and may adjust the response strategy according to such information. For example, in bad weather, the weight of the "soothing" emotion in the response strategy (for example, the response emotion) can be appropriately increased. As another example, when the weather is fine and the road conditions are smooth, the weight of the "happy" emotion in the response strategy (for example, the response emotion) can be appropriately increased.
In some embodiments, the determining module 420 may combine the interaction instruction of the target object, the historical data of the target object, the current weather, and the real-time road conditions into a feature sequence, and input the combined feature sequence into an RNN-based embedding model to obtain an instruction representation vector. Further, the instruction representation vector, information dimension data of the service platform, and other features are input into a response strategy prediction model to obtain the response strategy output by the response strategy prediction model. The embedding model and the response strategy prediction model can be obtained through joint training.
The combined feature sequence consists of feature combination values at several time points. The feature combination value at each time point is formed by combining the interaction instruction of the target object, the historical data of the target object, the weather data, and the real-time road condition data at that time point, multiplied by a time weight coefficient. The time weight coefficient may differ according to how far the time point is from the present; the weight coefficient of a time point closer to the present may be larger. In some embodiments, when forming the feature combination value, the real-time road condition may be transformed according to the current weather, so that the feature value reflecting road congestion under extreme weather is reduced, thereby weakening the influence of such special data. In some embodiments, the above transformation may be: reducing the weight of the real-time road condition to 0.01.
Through the combination of the above methods, the influence of different factors on the prediction result, and especially the interrelationships among these factors, can be better reflected. For example, the influence of weather is temporally correlated, and the influence of real-time road conditions is also correlated; RNN-based processing can reflect the relationship between earlier and later time points and make the response strategy more accurate.
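A minimal sketch of this arrangement is given below, assuming PyTorch, a GRU as the RNN, and invented feature dimensions and time-weighting values; it only illustrates the shape of the pipeline (time-weighted sequence into an embedding model, then a strategy predictor), not the models actually trained by this application.

```python
import torch
import torch.nn as nn

class InstructionEmbedder(nn.Module):
    """GRU-based embedding model over a time-weighted feature sequence."""
    def __init__(self, feat_dim=16, hidden=32):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)

    def forward(self, seq):                      # seq: (batch, time, feat_dim)
        steps = seq.size(1)
        # Later time points get larger weights (example weighting scheme).
        w = torch.linspace(0.5, 1.0, steps).view(1, steps, 1)
        _, h = self.rnn(seq * w)
        return h[-1]                             # instruction representation vector

class StrategyPredictor(nn.Module):
    """Maps the instruction vector plus platform features to a strategy class."""
    def __init__(self, embed_dim=32, platform_dim=8, n_strategies=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(embed_dim + platform_dim, 64), nn.ReLU(),
            nn.Linear(64, n_strategies),
        )

    def forward(self, embed, platform_feats):
        return self.fc(torch.cat([embed, platform_feats], dim=-1))

embedder, predictor = InstructionEmbedder(), StrategyPredictor()
vec = embedder(torch.randn(2, 5, 16))            # 2 samples, 5 time points
logits = predictor(vec, torch.randn(2, 8))       # strategy scores per sample
```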
Step 530: determine response words for the target object based on the response strategy. In some embodiments, step 530 may be performed by the response module 430.
Response words are the language information that the device or system feeds back to the target object in response to the interaction instruction. It can be understood that response words are the specific language that is output, and this language may carry information such as response emotion, response style, and/or response content. For example, the response words "The driver is great!" may be played passionately (i.e., the response emotion) and emphatically (i.e., the response style) to praise the driver's good service (i.e., the response content). It can be understood that one or more wordings may express the same response strategy (for example, response content, response style, or response emotion); that is, there may be a one-to-one or one-to-many relationship between response strategies and response words.
In some embodiments, the response module 430 may determine the response words based on the response strategy. In some embodiments, the response module 430 may determine the content to be expressed by the response words according to the determined response content, and then determine the specific way of expressing the response content according to the response emotion and/or response style, including whether to add modal particles, degree words, or other words embodying emotion and style, or the intonation of the output voice, and so on.
For example, if the response strategy is response content of praise, the output response words are words of praise, such as "You are good". If the response strategy is a strongly praising response style, degree words such as "very" or "really" are added to the response content, for example, "You are very good". As another example, if the response strategy is a joyful response emotion, the output response words may carry modal particles expressing joy, for example, "You are very good indeed".
In some embodiments, the database may store preset words for the same or different response content, response emotion, and/or response style, so that the response words corresponding to a response strategy (response content, response emotion, and/or response style) can be obtained from the database. The stored preset words may be customized and recorded in advance by users (for example, driver-side users or passenger-side users), or may be preset in advance by developers.
In some embodiments, the language or words corresponding to the response content, response emotion, and/or response style may also be extracted from a public platform (for example, Wikipedia) to generate the response words.
In some embodiments, the response words may also be generated by a model or algorithm, for example, a transformer or a BERT model. For example, a dictionary and the designated words are input into the model, and the response words are output.
As mentioned above, there may be a one-to-many or one-to-one relationship between response strategies and response words. In some embodiments, one or more preset words may be obtained from a storage device, or one or more preset words may be generated, based on the response strategy. When there are multiple preset words, one of them needs to be determined as the response words for output. In some embodiments, the terminal or the response module 430 may automatically select one of the multiple preset words as the response words according to a preset rule and output it. For example, the terminal may randomly select one preset phrase as the response words, or the terminal may use the preset phrase most frequently used by the user or by a user group as the response words. The user group may be all users, all passenger terminals, all driver terminals, all users in the area where the user is located (for example, a city, a district, or a custom area such as a circular area within a 5-kilometer radius), and so on.
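The selection rule can be as simple as a lookup keyed by the response strategy plus a frequency count. The sketch below is only illustrative: the strategy key, preset phrases, and usage log are invented placeholders.

```python
import random
from collections import Counter

# Hypothetical preset words keyed by (content, style, emotion).
PRESETS = {
    ("praise_driver", "strong", "joy"): [
        "The driver is great!",
        "Best driver ever, thank you!",
    ],
}

# Hypothetical usage log of previously output phrases for this user group.
usage_log = ["The driver is great!", "The driver is great!", "Best driver ever, thank you!"]

def pick_response(strategy, by_frequency=True):
    candidates = PRESETS.get(strategy, [])
    if not candidates:
        return None
    if by_frequency:
        counts = Counter(p for p in usage_log if p in candidates)
        if counts:
            return counts.most_common(1)[0][0]   # most frequently used preset
    return random.choice(candidates)             # otherwise pick one at random

print(pick_response(("praise_driver", "strong", "joy")))
```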
In some embodiments, the response module 430 may output the response words to the target object in the form of response text and/or response voice. For example, the text corresponding to the response words is converted into speech and output to the target object. In some embodiments, whether to output response voice or response text may be determined according to the actual scene. For example, when outputting response words, if the current terminal is the driver terminal and the driver terminal is currently in a vehicle-driving state, only the response voice may be output; outputting response text is avoided so as not to distract the driver and cause driving safety problems. In addition, in this scenario, the response voice and the response text may also be output at the same time. As another example, when outputting response words, it can be detected whether the terminal is in an audio or video playback state; if so, the response text is output; otherwise, one or more of the response text and the response voice may be output. As another example, the response words may be displayed in a preset display interface, or in the status bar or notification bar. For example, if the driver terminal is in a vehicle-driving state, the response words may be output on the current display interface; if the current terminal is in an audio or video playback state, the response words may be displayed in a small window in the status bar or notification bar.
In some embodiments, the semantics of the response voice and the response text may be the same or different, which may be set according to the scene. For example, for the same response content "praise the driver for good service", the response voice may be "The driver is the sunniest" and the response text may also be "The driver is the sunniest", which are semantically identical; alternatively, the response text may be "Through wind and rain, thank you for your hard work".
In some embodiments, the response module 430 may also output the response words to the target object in other ways, for example, as an image or a video, such as producing a video or picture expressing the actual content of the response words as the output.
As mentioned above, the target object may be the user to which the current terminal belongs or a communication counterpart. In some embodiments, the interaction instruction or voice data may be directed at the user himself or herself, or at the counterpart user of the communication.
Take an interaction instruction that contains designated words praising a user as an example, and specifically take a pair of driver and passenger terminals that are currently communicating as an example; for the designated words, see Figure 6 and its related description. For the driver terminal, if the voice data sent by the driver terminal contains the designated words "praise me" or "praise the driver", the designated words are directed at the driver himself or herself; if the voice data sent by the driver terminal contains the designated words "praise the passenger", the designated words are directed at the counterpart user of the current communication, that is, at the passenger terminal. Conversely, for the passenger terminal, if the voice data sent by the passenger terminal contains the designated words "praise me" or "praise the passenger", the designated words are directed at the passenger himself or herself; if the voice data sent by the passenger terminal contains the designated words "praise the driver", the designated words are directed at the counterpart user of the current communication, that is, at the driver terminal.
Depending on which target object the interaction instruction is directed at, when the response words are output, they may be output to the user's own terminal (that is, the current terminal) and/or to the counterpart user's terminal. In other words, when the interaction instruction is directed at the user himself or herself, the response words are output directly on the current terminal; when the interaction instruction is directed at the counterpart user of the current communication, the response words are output to the counterpart user's terminal, and may also be output to the current terminal at the same time.
Figure 6 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application. In some embodiments, the process 600 may be executed by a processing device (for example, the processing device 112, the processor 210, or the central processing unit 340).
Step 610: extract associated features of the target object based on an interaction instruction directed at the target object. In some embodiments, step 610 may be performed by the extraction module 410.
In some embodiments, the interaction instruction of the target object includes voice data, and the associated features of the target object may include voice features of the voice data. As described in Figure 5, the voice data may be voice data in any one or a combination of multiple languages, and the voice features may be voice features of multilingual voice data. In some embodiments, the associated features of the target object may further include text features of the text data corresponding to the interaction instruction. For the extraction of voice features and text features, see step 510 and its related description, which will not be repeated here.
Step 620: process the voice features based on the first classification model, and determine whether the voice data contains designated words in any one of multiple languages. In some embodiments, step 620 may be performed by the determining module 420.
Designated words may be language that contains (but is not limited to) specific words or sentences. Designated words can be used to determine the semantic information of the response words. Designated words may be the characteristic words or semantically similar words that the response words need to contain, or the object at which the response words are directed, and so on. Designated words can be determined according to actual needs. For example, for a scene of praising the service provider, the designated words may include "praise", "compliment", "encourage", "reward", or "good", and may also include "passenger", "driver", "service provider", "service requester", "me", and so on.
The first classification model refers to a computational model implemented by a computing device, and is a model that determines whether designated words are contained. In some embodiments, the first classification model may process the voice features and output a classification recognition result to determine whether designated words are contained. The first classification model may process the text features and output a classification recognition result to determine whether designated words are contained. The first classification model may also process both the voice features and the text features to determine whether designated words are contained. For more details about the classification recognition result, see step 1730. In some embodiments, if the first classification model can process both voice features and text features, the first classification model may include two sub-classification models, namely a classification model that processes voice features and a classification model that processes text features. The first classification model may also be a single model for processing text features and voice features, where the voice features and text features can be input into the first classification model in the same form (for example, a vector or a normalized vector). The first classification model can be obtained through end-to-end training.
In some embodiments, the first classification model may have a one-to-one or one-to-many relationship with languages. For example, all languages may correspond to the same first classification model, or different languages may correspond to different first classification models.
In some embodiments, when different languages correspond to different first classification models, the voice data may be recognized (for example, by a speech decoder) before the voice features are input into the first classification model, so as to identify the language of the voice data; further, the voice features are input into the first classification model corresponding to the recognized language to determine whether the voice data contains designated words.
In some embodiments, multiple languages may correspond to the same first classification model; that is, the first classification model can process voice features of multiple languages and determine whether the voice data contains any one of the multilingual designated words.
The first classification model or sub-classification model is a machine learning model. In some embodiments, the first classification model or sub-classification model may be a classification model or a regression model. The types of the first classification model or sub-classification model include but are not limited to a neural network (NN), a convolutional neural network (CNN), a deep neural network (DNN), a recurrent neural network (RNN), and the like, or any combination thereof; for example, the first classification model or sub-classification model may be a model formed by combining a convolutional neural network and a deep neural network.
In some embodiments, the first classification model or sub-classification model may be composed of a multi-layer convolutional neural network (CNN) and a multi-layer fully connected network, or of a multi-layer CNN residual network and a multi-layer fully connected network. For example, the first classification model may be a 5-layer CNN residual network and a 3-layer fully connected network.
In some embodiments, the first classification model may build a residual network on the CNN to extract hidden-layer features of the voice data, and then use the multi-layer fully connected network to map the hidden-layer features output by the residual network, so that a multi-class recognition result is obtained through a softmax classification output.
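A minimal PyTorch sketch of such a structure (CNN residual blocks followed by fully connected layers and a softmax output) is shown below. The feature dimension, channel width, pooling choice, and block count are illustrative assumptions and do not reproduce the exact model of this application.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """One 1-D convolutional residual block over a frame-feature sequence."""
    def __init__(self, channels=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return torch.relu(x + self.conv(x))      # residual connection

class FirstClassifier(nn.Module):
    """CNN residual blocks + fully connected layers + softmax output."""
    def __init__(self, feat_dim=40, channels=32, n_classes=2):
        super().__init__()
        self.stem = nn.Conv1d(feat_dim, channels, kernel_size=3, padding=1)
        self.res = nn.Sequential(*[ResBlock(channels) for _ in range(5)])
        self.fc = nn.Sequential(
            nn.Linear(channels, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, n_classes),
        )

    def forward(self, x):                         # x: (batch, frames, feat_dim)
        h = self.res(self.stem(x.transpose(1, 2)))
        h = h.mean(dim=-1)                        # pool hidden features over time
        return torch.softmax(self.fc(h), dim=-1)  # class probabilities

probs = FirstClassifier()(torch.randn(4, 100, 40))  # 4 utterances, 100 frames of Fbank
```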
In some embodiments, in the first classification model used, the CNN network structure can be used to extract features instead of a single fully connected network, so that the scale of the network parameters can be effectively kept from becoming too large while the recognition accuracy is guaranteed, avoiding the problem that the first classification model is too large to be effectively deployed on the terminal side.
In some embodiments, the sub-classification model used to process text features may also be a text processing model, for example, a BERT model.
In some embodiments, the first classification model may be obtained by offline training in advance and deployed on the terminal device. The first classification model may also be obtained by offline training in advance and stored in a storage device or deployed on the cloud, with the terminal device having access rights to the storage device or the cloud. The first classification model may also be obtained by online training in real time based on current data. For details about the training of the first classification model, see Figure 7 and its related description, which will not be repeated here.
In some embodiments, the determining module 420 may also determine in other ways whether the voice data contains designated words in any one of multiple languages. For example, the text features may be processed by rules to determine whether designated words are contained. As another example, the result determined by processing the voice features based on the first classification model and the result determined by processing the text features based on rules may be fused (for example, by weighted summation or weighted averaging) to finally determine whether the interaction instruction contains designated words.
Step 630: in response to the voice data containing designated words, determine the response content corresponding to the designated words. In some embodiments, step 630 may be performed by the response module 430.
In some embodiments, when the recognition result of the first classification model indicates that the voice data contains designated words, the response content corresponding to the designated words is determined, and the response words are further determined and output based on the response content.
As mentioned above, the response content may be semantic information. When the voice data contains designated words, the response content may be the designated words themselves, or content semantically similar to the designated words.
As mentioned above, the same response content can be expressed in multiple wordings, that is, it corresponds to multiple response words. It can be understood that designated words may correspond to one or more response words; that is, there may be a one-to-one or one-to-many relationship between designated words and their corresponding response words. The response words corresponding to different designated words may be the same or different. In some implementations, the designated words and their corresponding response words may be stored in a database or a storage device, and the response words corresponding to designated words may be obtained from the storage device.
In some embodiments, after the response content is determined, the response emotion or response style may also be determined, so that the response words are output based on a combination of the response content, the response emotion, and/or the response style. For the determination of the response emotion and the response style, see Figures 8 and 10 and their related descriptions. For more details about obtaining the response words, see step 530 and its related description.
Figure 7 is an exemplary flowchart of training the first classification model according to some embodiments of the present application. In some embodiments, the process 700 may be executed by a processing device (for example, the processing device 112, the processor 210, or the central processing unit 340).
Step 710: obtain multiple first training samples. In some embodiments, step 710 may be performed by the obtaining module 440.
In some embodiments, the first classification model may be trained based on multiple first training samples. Each first training sample may include voice data, that is, first sample voice data. In some embodiments, the first sample voice data may be multilingual voice data; for example, it may be English voice data, Japanese voice data, Chinese voice data, Korean voice data, etc., without being exhaustive.
The multiple first training samples include positive samples and negative samples. A positive sample is first sample voice data related to the designated words, for example, first sample voice data carrying the designated words or carrying semantics similar to the designated words. A negative sample is first sample voice data that is not related to the designated words, for example, voice data that does not contain the designated words or that does not contain words with the same semantics as the designated words.
In some embodiments, positive and negative samples can be labeled; for example, a positive sample is labeled 1 and a negative sample is labeled 0. The result output by the trained first classification model can be a probability between 0 and 1, or a classification result indicating whether the designated words are contained. For more details about the result output by the first classification model (i.e., the classification recognition result), see Figure 17 and its related description.
In some embodiments, the first training samples can be obtained from a storage device or a database. Historical data can also be obtained from the service platform, the client, etc. as the first training samples.
In some embodiments, a sample words recognition result of the first sample voice data may be obtained first; the sample words recognition result indicates whether the first sample voice data is related to the designated words. Further, the first sample voice data can be labeled, or classified into positive and negative samples, according to the sample words recognition result.
In some embodiments, the sample words recognition result may be a sample text recognition result, a label obtained by manual annotation, or a combination of the two. In other words, the first sample voice data may be converted into first sample text data, and text recognition for the designated words may be performed on the first sample text data to obtain the sample words recognition result of the first sample text data; and/or a sample words recognition result determined manually based on the first sample voice data and/or the first sample text data may be received.
In some embodiments, the first sample voice data can be converted into first sample text data based on a speech decoder (also called a speech converter); further, by recognizing or analyzing the first sample text data, it is determined whether the first sample text data is related to the designated words, so as to determine the positive/negative label of the first sample voice data. For example, whether it is related to the designated words is determined by keyword matching, text similarity, or the like. For example, the text similarity between the first sample text data and the designated-words text is calculated by text matching (for example, Euclidean distance); if the text similarity reaches (is greater than, or greater than or equal to) a preset similarity threshold, the first sample text data is a positive sample; otherwise, it is a negative sample. The similarity threshold is not particularly limited and may be, for example, 80%. In some embodiments, the character accuracy of the first sample text data can also be calculated and used as an evaluation criterion for calculating the text similarity.
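A minimal sketch of threshold-based labeling is shown below. It uses a simple character-level similarity ratio in place of a specific text-matching metric, and the designated phrases and threshold are example values, not values fixed by this application.

```python
from difflib import SequenceMatcher

DESIGNATED = ["praise the driver", "praise me"]   # example designated words
THRESHOLD = 0.8                                   # example similarity threshold

def label_sample(transcript):
    """Label a transcribed sample as positive (1) or negative (0).

    Any text similarity measure (keyword matching, embedding distance, etc.)
    could replace the character-level ratio used here.
    """
    best = max(SequenceMatcher(None, transcript, d).ratio() for d in DESIGNATED)
    return 1 if best >= THRESHOLD else 0

print(label_sample("please praise the driver"))   # likely labeled 1
print(label_sample("where is my order"))          # labeled 0
```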
As mentioned above, the first sample speech data may be multilingual speech data. In some embodiments, the first sample speech data may be converted into the first sample text data based on a speech converter corresponding to the language of the first sample speech data, and it is then determined whether the specified phrase is contained, that is, whether the first sample speech data is a positive sample or a negative sample.
In some embodiments, if the first sample text data cannot be obtained, if the decoding accuracy of the speech decoder cannot meet a preset recognition requirement, or if the accuracy of determining positive and negative samples through text similarity or keyword matching is low, the classification may also be completed in combination with manual annotation. For example, the results output based on text similarity, or the first sample speech data that was not successfully recognized or was recognized with low accuracy, may be displayed on a screen so that a user can verify or correct (annotate) the automatic classification results; the manually annotated results are then used as the labels.
In some embodiments, the ratio of positive samples to negative samples may be controlled. For example, the ratio of positive samples to negative samples may be controlled to be 7:3. Accordingly, this step may also involve screening the positive and negative samples so that the ratio of positive to negative samples falls within a preset ratio range.
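The screening step can be sketched as simple downsampling of whichever class is over-represented; this is only one possible way to reach the preset ratio, and the 7:3 default mirrors the example above.

```python
import random

def balance_samples(positives, negatives, ratio=(7, 3), seed=0):
    """Downsample the over-represented class so positives:negatives is at most 7:3."""
    rng = random.Random(seed)
    p, n = ratio
    if len(positives) * n > len(negatives) * p:                  # too many positives
        positives = rng.sample(positives, len(negatives) * p // n)
    else:                                                        # too many negatives
        negatives = rng.sample(negatives, len(positives) * n // p)
    return positives, negatives
```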
As mentioned above, the first classification model may process text features. In some embodiments, the first training samples may also include first sample text data. Correspondingly, samples related to the specified phrase are positive samples, and samples unrelated to the specified phrase are negative samples.
Step 720: train an initial first classification model based on the multiple first training samples to obtain the first classification model. In some embodiments, step 720 may be performed by the acquisition module 440.
In some embodiments, the first classification model may be trained through various methods based on the first training samples, and the parameters of the initial first classification model are updated to obtain the trained first classification model. Training methods include, but are not limited to, gradient descent, the least squares method, variable learning rates, cross entropy, stochastic gradient methods, and cross-validation. It can be understood that, after training with the positive and negative samples is completed, the obtained first classification model has the same network structure as the initial first classification model.
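A minimal training sketch is given below, assuming a small feed-forward network over fixed-size speech-feature vectors with cross-entropy loss and gradient descent; the architecture, feature dimension, learning rate, and early-stop threshold are all assumptions rather than values fixed by this disclosure.

```python
import torch
from torch import nn

feature_dim = 40                                               # assumed feature size
model = nn.Sequential(nn.Linear(feature_dim, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)       # gradient descent
loss_fn = nn.BCEWithLogitsLoss()                               # cross-entropy on 0/1 labels

def train(features: torch.Tensor, labels: torch.Tensor, epochs: int = 100):
    """features: (num_samples, feature_dim); labels: (num_samples,), 1 = positive, 0 = negative."""
    for _ in range(epochs):
        optimizer.zero_grad()
        logits = model(features).squeeze(-1)
        loss = loss_fn(logits, labels.float())
        loss.backward()
        optimizer.step()
        if loss.item() < 1e-3:      # a preset loss threshold ends training early
            break
    return model
```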
In some embodiments, speech features of the positive and negative samples may be extracted, and the model may then be trained using these speech features. It should be understood that speech feature extraction for the positive and negative samples is performed in the same manner as in the foregoing step 610, and will not be repeated here.
In some embodiments, the positive and negative sample data may be mixed in a certain ratio during training, for example, a ratio of 7:3. In some embodiments, training may be performed on whole sentences; for example, the first sample speech data may be the speech data of a complete sentence.
In some embodiments, training ends when the trained first classification model meets a preset condition. The preset condition may be that the loss function converges or falls below a preset threshold, or that the number of training epochs reaches a threshold.
In some embodiments, the parameters in the initial first classification model may be initialized, the model is then trained with the positive and negative samples, and the parameters of the initial first classification model are adjusted according to the accuracy of the outputs on the positive and negative samples. The training process is repeated multiple times until parameters yielding a high classification accuracy are obtained, and these parameters are used as the parameters of the first classification model.
In some embodiments, a test set may be constructed and used to test the classification results of the first classification model. Specifically, the real performance of the first classification model may be evaluated by computing the accuracy and the false recognition rate of the predictions, and based on this performance, the parameters of the first classification model may be adjusted, or further training may be performed.
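For instance, the two evaluation quantities mentioned above could be computed on a held-out test set as follows; treating "false recognition" as a negative sample predicted as positive is an interpretation, not a definition from this disclosure.

```python
def evaluate(predictions, labels):
    """predictions/labels are equal-length lists of 0/1 values."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    false_pos = sum(p == 1 and y == 0 for p, y in zip(predictions, labels))
    negatives = sum(y == 0 for y in labels)
    accuracy = correct / len(labels)
    false_recognition_rate = false_pos / negatives if negatives else 0.0
    return accuracy, false_recognition_rate
```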
As mentioned above, the first classification model may have a one-to-one or many-to-many relationship with languages. In a one-to-one relationship, the first classification model is trained with first training samples of its corresponding language. In a one-to-many relationship, the first classification model is trained with first training samples of the multiple corresponding languages.
In some embodiments, similar to the second classification model and the third classification model, the acquisition module 440 may verify, test, and update the first classification model; see steps 920 and 1120 and their related descriptions for details.
FIG. 8 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application. In some embodiments, the process 800 may be executed by a processing device (for example, the processing device 112, the processor 210, or the central processing unit 340).
Step 810: extract associated features of a target object based on an interactive instruction for the target object. In some embodiments, step 810 may be performed by the extraction module 410.
In some embodiments, the interactive instruction may include speech data, and the associated features may include at least one of a speech feature of the speech data and a text feature of text data corresponding to the interactive instruction. For the associated features and their extraction, refer to steps 510 and 610, which will not be repeated here.
Step 820: process at least one of the speech feature and the text feature based on a second classification model to determine a responded-to emotion of the speech data. In some embodiments, step 820 may be performed by the determining module 420.
The responded-to emotion refers to the emotion carried by the interactive instruction. The types of responded-to emotion may include loss, calm, enthusiasm, passion, joy, sadness, pain, relief, excitement, and the like. The responded-to emotion may be obtained by processing the interactive instruction.
The second classification model refers to a computing model implemented by a computing device, and it is a model for determining the responded-to emotion of the interactive instruction. The second classification model may be a binary classification model or a multi-class classification model. For the types of the second classification model, refer to the first classification model, that is, step 620 and its related description. The first classification model and the second classification model may be of the same type or of different types.
In some embodiments, the second classification model may process the speech feature to determine the responded-to emotion; it may process the text feature to determine the responded-to emotion; or it may process both the speech feature and the text feature to determine the responded-to emotion. It can be understood that, similar to the first classification model, if the second classification model processes both speech and text features, it may include two sub-classification models: one that processes the speech feature and one that processes the text feature. The second classification model may be obtained through end-to-end training. For determining the responded-to emotion based on the output of the second classification model, refer to the related descriptions of FIG. 20 or FIG. 21.
The second classification model may be obtained through training. For details on the training of the second classification model, refer to the related description of FIG. 9, which will not be repeated here. The deployment of the second classification model may be similar to that of the first classification model; see step 720 and its related description.
In some embodiments, a second emotion may be obtained by processing the text feature, a first emotion may be obtained by processing the speech feature, and the responded-to emotion may be determined based on the first emotion and the second emotion.
The first emotion refers to the emotion determined based on the speech feature. In some embodiments, the first emotion may be obtained by processing the speech feature with the second classification model. It can be understood that, when determining the first emotion based on the speech feature, the second classification model may combine intonation or tone information in the speech data in addition to the semantic information of the language, making the determined emotion more accurate.
The second emotion refers to the emotion determined based on the text feature. The type of the second emotion may be the same as or different from that of the first emotion. The text feature includes a keyword feature; in some embodiments, the second emotion may be determined based on the keyword feature.
The types of the first emotion and the second emotion may include, but are not limited to, loss, calm, enthusiasm, passion, joy, sadness, pain, relief, excitement, and the like.
In some embodiments, the keyword feature includes emotion-related words, and the determining module 420 may determine the second emotion based on the emotion-related words.
Emotion-related words refer to words in the interactive instruction that indicate emotion. In some embodiments, emotion-related words may include, but are not limited to, one or more of modal particles and degree words. For example, modal particles may include "请" (please), "吧", "呀", "吗", etc., and degree words may include, but are not limited to, "非常" (very), "很" (quite), "狠" (intensely), etc.; this list is not exhaustive.
In some embodiments, the emotion-related words in the text data may be identified, and the second emotion determined according to the emotion-related words. For example, based on preset rules and the recognition result of the emotion-related words, the emotion corresponding to the recognition result may be taken as the second emotion. Different emotion-related words may have a mapping relationship with emotions, and the mapping relationship may be manually preset and stored in a storage device. For example, the emotion corresponding to "呀" may be preset as joy, and the emotion corresponding to "吗" as sadness. Taking the praise scenario as an example, if the speech data uttered by the user is converted into text whose content is "可以夸夸我吗" (can you praise me?), the second emotion is sadness; if the content is "夸夸我呀" (praise me!), the second emotion is joy. For another example, it may be preset that the emotion corresponding to a degree word together with "呀" is excitement, and the emotion corresponding to a degree word together with "吗" is sadness.
In some embodiments, an emotion score may be preset for each emotion-related word, that is, the scores of an emotion-related word with respect to different emotions. All emotion-related words in the text data may be identified, their emotion scores may be summed or weighted, and the second emotion may be determined based on the resulting value.
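A minimal sketch of this rule-based scoring is shown below; the word list, scores, and cut-off values are hypothetical and only illustrate how per-word emotion scores could be summed into a second emotion.

```python
# Hypothetical per-word emotion scores: positive values lean toward joy, negative toward sadness.
EMOTION_SCORES = {"呀": 1.0, "吧": 0.5, "吗": -1.0, "非常": 0.5, "很": 0.3}

def second_emotion(text: str) -> str:
    """Sum the scores of all emotion-related words in the text and map the total to an emotion."""
    total = sum(score for word, score in EMOTION_SCORES.items() if word in text)
    if total > 0.5:
        return "joy"
    if total < -0.5:
        return "sadness"
    return "calm"

print(second_emotion("可以夸夸我吗"))  # -> "sadness"
print(second_emotion("夸夸我呀"))      # -> "joy"
```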
In some embodiments, the second emotion may also be determined in other ways. For example, the text feature may be input into a text processing model (for example, BERT) to determine the second emotion.
In some embodiments, the first emotion and the second emotion may be represented numerically; the two values may then be combined or weighted (for example, by weighted averaging) to obtain a weighted value, and the responded-to emotion is determined based on the weighted value. For example, different emotions may correspond to different values or value ranges, e.g., 2 for excitement, 1 for joy, -1 for pain, and so on. These values may be set in advance according to requirements or rules.
In some embodiments, the second classification model may output a probability value corresponding to the first emotion, and a probability value corresponding to the second emotion may be obtained from text processing; the probability values may be weighted (for example, multiplied by or added to weights), and the emotion with the largest probability value is taken as the responded-to emotion.
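One possible way to weight the two sets of probabilities and pick the largest is sketched below; the 0.6/0.4 weights are assumptions, not values from this disclosure.

```python
def combine_emotions(speech_probs: dict, text_probs: dict,
                     speech_weight: float = 0.6, text_weight: float = 0.4) -> str:
    """Weight the per-emotion probabilities of the first (speech) and second (text) emotions
    and return the emotion with the largest combined value."""
    emotions = set(speech_probs) | set(text_probs)
    scores = {e: speech_weight * speech_probs.get(e, 0.0) + text_weight * text_probs.get(e, 0.0)
              for e in emotions}
    return max(scores, key=scores.get)

print(combine_emotions({"joy": 0.7, "sadness": 0.3}, {"joy": 0.4, "sadness": 0.6}))  # -> "joy"
```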
In some embodiments, when the first emotion and the second emotion are the same or similar, the first emotion or the second emotion is taken as the responded-to emotion. When the first emotion and the second emotion are dissimilar (for example, completely opposite), the first emotion may be taken as the responded-to emotion, or a manual determination of the responded-to emotion may be requested.
Step 830: determine a response emotion based on the responded-to emotion. In some embodiments, step 830 may be performed by the determining module 420.
In some embodiments, the responded-to emotion and the response emotion may be similar, identical, or opposite. For example, if the responded-to emotion is joy, the response emotion may be joy or excitement. For another example, the responded-to emotion may be loss, and the response emotion may be calm or joy. In some embodiments, the response emotion may include only positive emotions, so as to soothe a user's negative emotion.
In some embodiments, a correspondence between responded-to emotions and response emotions may be preset. The relationship between responded-to emotions and response emotions may be one-to-one or many-to-many. The preset correspondence may be determined based on rules, or determined or optimized from historical feedback data, for example, according to users' feedback on response phrases.
In some embodiments, the correspondence between responded-to emotions and response emotions may be stored in a storage device in advance, for example, in the memory of the terminal or in another storage location readable by the terminal, such as the cloud, which is not limited here. Based on the responded-to emotion, the response emotion can be obtained from the storage device.
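The stored correspondence can be as simple as a lookup table; the entries below are hypothetical and only illustrate a mapping whose outputs are positive emotions.

```python
# Hypothetical preset mapping from responded-to emotion to response emotion.
RESPONSE_EMOTION_MAP = {
    "joy": "excitement",
    "loss": "calm",
    "sadness": "relief",
    "excitement": "excitement",
}

def response_emotion(responded_to: str) -> str:
    """Look up the response emotion for a responded-to emotion, defaulting to calm."""
    return RESPONSE_EMOTION_MAP.get(responded_to, "calm")
```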
In some embodiments, the response module 430 may determine a response phrase based on the response emotion and output it. For determining the response phrase based on the response emotion, refer to step 530 and its related description.
FIG. 9 is an exemplary flowchart of training the second classification model according to some embodiments of the present application. In some embodiments, the process 900 may be executed by a processing device (for example, the processing device 112, the processor 210, or the central processing unit 340).
Step 910: obtain multiple second training samples. In some embodiments, step 910 may be performed by the acquisition module 440.
In some embodiments, a second training sample may include one or more of second sample speech data and second sample text data, and a corresponding emotion label. For example, a second training sample may include second sample speech data and its corresponding emotion label. For another example, a second training sample may include second sample speech data, second sample text data, and the corresponding emotion label. The emotion label is the emotion corresponding to the second sample speech data or the second sample text data. In some embodiments, the emotion label may be a one-hot label.
The second sample text data is text data corresponding to the second sample speech data; for example, the second sample text data is obtained by performing text recognition on the second sample speech data. For another example, the second sample speech data is obtained by machine conversion of, or manual reading of, the second sample text data. In some embodiments, the second sample speech data may come from real online speech data and/or from customized data. For example, the second sample speech data may be generated by reading the content of the second sample text data aloud. For instance, the second sample text content and corresponding tone standards may be specified, and the text may then be read aloud with emotion to obtain second sample speech data with different tones and emotions.
In some embodiments, the text length corresponding to the second sample speech data or the second sample text data generally should not be too long (for example, the number of characters or words is less than a threshold, which may be 20, 10, etc.), because overly long text leads to overly long speech, which may in turn cause greater fluctuation in tone and make the environmental noise more random and complex.
In some embodiments, the second training samples may be obtained from a storage device or a database. Historical data may also be obtained from a service platform, a client, or the like and used as the second training samples.
Step 920: train an initial second classification model based on the multiple second training samples to obtain the second classification model. In some embodiments, step 920 may be performed by the acquisition module 440.
In some embodiments, the second classification model may be trained through various methods based on the second training samples, and the parameters of the initial second classification model are updated to obtain the trained second classification model. The training method may be similar to that of the first classification model; see step 710 and its related description. It can be understood that, after training with the second training samples is completed, the obtained second classification model has the same network structure as the initial second classification model.
In some embodiments, feature extraction may be performed on the second sample speech data and/or the second sample text data, and the second classification model is trained based on the extracted speech features and/or text features.
In some embodiments, if the second classification model is used to process speech features, the second training samples used for training include the second sample speech data and its emotion labels. If the second classification model is used to process text features, the second training samples include the second sample text data and its emotion labels. If the second classification model is used to process both speech and text features, the second training samples include the second sample speech data, the second sample text data, and their emotion labels, and training is performed end to end.
In some embodiments, after the sample features are extracted, whole sentences of variable length may be used for training: the features extracted from a sentence are used as the input of the classifier to obtain the emotion output by the second classification model, and the difference between the output emotion and the emotion label is then used to adjust the parameters of the second classification model, finally yielding a second classification model with high classification accuracy.
In some embodiments, model verification may also be performed on the currently trained model. Model verification consists of two processes: test environment setup and model testing. The test environment setup is used to check whether the current model can be successfully deployed and run normally on different terminals, for example, mobile phones of different brands. Therefore, offline testing should be performed under real-scenario conditions.
The testing process may include, but is not limited to, the following two methods. In the first method, real users test the model in real time multiple times, and the accuracy of the recognition results is then computed; the advantage of this method is that it better simulates user behavior in real scenarios, so the test is more credible. In the second method, real users record a test set in real scenarios; one or more test sets may be recorded as needed, they are reusable, cost less, and can guarantee the objective validity of the test to a certain extent.
FIG. 10 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application. In some embodiments, the process 1000 may be executed by a processing device (for example, the processing device 112, the processor 210, or the central processing unit 340).
Step 1010: extract associated features of a target object based on an interactive instruction for the target object. In some embodiments, step 1010 may be performed by the extraction module 410.
In some embodiments, the associated features include at least one of a first feature corresponding to the interactive instruction and a second feature corresponding to historical data of the target object. For the associated features and their extraction, refer to step 510, which will not be repeated here.
Step 1020: process at least one of the first feature and the second feature based on a third classification model to determine a response style for the target object. In some embodiments, step 1020 may be performed by the determining module 420.
The third classification model refers to a computing model implemented by a computing device, and it is a model for determining the response style for the target object. The third classification model may be a multi-class or binary classification model. For the types of the third classification model, refer to the first classification model, which will not be repeated here. The first classification model and the third classification model may be of the same type or of different types. The third classification model may be obtained through training; for details, refer to the related description of FIG. 11. For the deployment of the third classification model, refer to the deployment of the first classification model, that is, step 620 and its related description.
In some embodiments, the processing device may process at least one of the first feature and the second feature based on the third classification model to determine the response style. For example, the first feature and the second feature are input into the third classification model, which outputs the response style; optionally, during processing, the third classification model may assign weights to the first feature and the second feature respectively (for example, the weight of the first feature is greater than that of the second feature). For another example, the first feature or the second feature is input into the third classification model, which outputs the response style. For details on the first feature and the second feature, refer to step 510.
In some embodiments, the processing device may process the text feature of the text data corresponding to the interactive instruction, and process at least one of the first feature and the second feature based on the third classification model, to determine the response style.
The first style is a style determined based on the first feature and/or the second feature, for example, a style determined by processing at least one of the first feature and the second feature with the third classification model.
The second style is a style obtained by processing the text feature. In some embodiments, the second style may be the same as or different from the first style, including but not limited to strong praise, normal praise, mild praise, and the like.
In some embodiments, the text feature may be processed based on a model, an algorithm, or rules to obtain the second style. For example, the second style may be determined by identifying whether the text contains style-related keywords (such as "非常" (very) or "很" (quite)) and then determining the second style based on those keywords. For another example, the text feature may be processed by a text processing model (for example, BERT or a DNN) to output the second style.
In some embodiments, the response style may be determined based on at least one of the first style and the second style. For example, either the first style or the second style may be determined as the response style. For another example, if the first style and the second style are different, the first style may be taken as the response style.
In some embodiments, the first style and the second style may be fused to determine the response style. For example, different styles may correspond to different scores (e.g., 3 for strong praise, 2 for normal praise), different weights may be set for the first style and the second style, and the weights and scores of the two styles are fused; the response style is then determined based on the fusion result. The response style may also be determined through other fusion methods, which is not limited in this embodiment.
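One possible fusion is to weight the two style scores and snap the result back to the nearest style, as sketched below; the scores and the 0.7/0.3 weights are assumptions.

```python
STYLE_SCORES = {"strong_praise": 3, "normal_praise": 2, "mild_praise": 1}  # assumed scores

def fuse_styles(first_style: str, second_style: str,
                w_first: float = 0.7, w_second: float = 0.3) -> str:
    """Weight the scores of the first and second styles and return the nearest style."""
    fused = w_first * STYLE_SCORES[first_style] + w_second * STYLE_SCORES[second_style]
    return min(STYLE_SCORES, key=lambda s: abs(STYLE_SCORES[s] - fused))

print(fuse_styles("strong_praise", "normal_praise"))  # -> "strong_praise"
```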
FIG. 11 is an exemplary flowchart of training the third classification model according to some embodiments of the present application. In some embodiments, the process 1100 may be executed by a processing device (for example, the processing device 112, the processor 210, or the central processing unit 340).
Step 1110: obtain multiple third training samples. In some embodiments, step 1110 may be performed by the acquisition module 440.
The third training samples may come from real data (e.g., historical data) or from specially prepared data. For example, a developer may prepare sample data and input it into the terminal so that the terminal can train the third classification model.
In some embodiments, a third training sample includes a sample interactive instruction for a sample target object, sample historical data of the sample target object, and a corresponding style label. The form and content of the sample historical data may be the same as those of the aforementioned historical data, and will not be repeated here.
The style label represents the sample response style for the sample target object, and may specifically be a one-hot label. In some embodiments, the style label may be determined based on reputation data and feedback data of the sample target object. Specifically, determining the style label includes: obtaining feedback data of the sample target object on a sample response phrase, the sample response phrase being determined based on the sample interactive instruction; obtaining reputation data of the sample target object; and determining the style label based on the reputation data and/or the feedback data.
Feedback data refers to data related to the sample target object's reaction to the sample response phrase; the reaction may be, but is not limited to, positive, negative, approving, disapproving, and so on. For clarity, the reaction can be used to directly characterize the feedback data. For example, if the sample target object gives no feedback, its feedback data may default to positive; if the feedback is an approving or positive evaluation, the feedback data is positive.
Reputation data refers to data related to the credit status of the sample target object and may be determined based on historical data. Reputation data may be embodied as a reputation score; the calculation of the reputation score is not described here. Exemplarily, when the reputation score reaches (is greater than, or greater than or equal to) 80 points, the sample target object is a high-reputation user, and the information it feeds back is relatively reliable, which can also serve as a reference for developers and assist them in manually annotating the sample data.
In some embodiments, a correspondence may be established between the reputation data and/or the feedback data and the response style, and the style label is determined based on this correspondence. The reputation data and the feedback data may also be processed by a model to determine the style label, where the model may be a DNN model, a CNN model, an RNN model, or the like.
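As one hypothetical rule-based correspondence (a model such as a DNN could be used instead), the sketch below trusts feedback only from high-reputation users (a score of 80 or above, following the example above) and softens the style when that feedback is negative; the style ordering and the softening rule are assumptions.

```python
def style_label(reputation_score: float, feedback: str, current_style: str) -> str:
    """Derive a style label from reputation data, feedback data, and the style actually used."""
    order = ["strong_praise", "normal_praise", "mild_praise"]   # assumed ordering
    if reputation_score < 80 or feedback == "positive":
        return current_style                                    # keep the style that was used
    idx = order.index(current_style)
    return order[min(idx + 1, len(order) - 1)]                  # soften the style one step
```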
In some embodiments, the style label may also be determined based on a manual evaluation of the reputation data and the feedback data. For example, the terminal may obtain the sample target object's feedback data on historical response phrases, output the reputation data and the feedback data of the sample target object, and receive manual evaluation data for the reputation data and the feedback data, the manual evaluation data indicating the style label. In this embodiment, although the style label is still obtained through manual annotation by developers, the annotation is based on the feedback data and reputation data output by the terminal, which helps developers complete the annotation quickly and reduces labor time and cost as much as possible.
In some embodiments, the third training samples may be obtained from a storage device or a database. Historical data may also be obtained from a service platform, a client, or the like and used as the third training samples.
Step 1120: train an initial third classification model based on the multiple third training samples to obtain the third classification model. In some embodiments, step 1120 may be performed by the acquisition module 440.
In some embodiments, the third classification model may be trained through various methods based on the third training samples, and the parameters of the initial third classification model are updated to obtain the trained third classification model. Training methods include, but are not limited to, gradient descent, the least squares method, cross-entropy loss, cross-validation, variable learning rates, and the like. It can be understood that, after training with the third training samples is completed, the obtained third classification model has the same network structure as the initial third classification model.
In some embodiments, features of the third training samples (for example, speech features and text features) may be extracted, and the model is then trained using these features.
In some embodiments, training ends when the trained third classification model meets a preset condition. The preset condition may be that the loss function converges or falls below a preset threshold, or that the number of training epochs reaches a threshold.
In some embodiments, an end-to-end training approach is used: the features extracted from the third training samples are taken as input, and a style recognition result is output. The difference between the output style recognition result and the style label is then used to adjust the parameters of the initial third classification model, finally yielding a third classification model with high classification accuracy.
In addition, in some embodiments, real-time data may also be used to update the third classification model. Taking the embodiment shown in FIG. 5 or FIG. 10 as an example, after outputting the response phrase, the terminal may also obtain operation information concerning the response phrase, and then use the response phrase and the operation information to update the third classification model. The operation information may be information on the target object's evaluation of or feedback on the response phrase, and may also serve as third training samples for real-time updates of the third classification model.
In some embodiments, the third classification model may also be tested and verified, similarly to the second classification model, which will not be repeated here.
FIG. 12 is a schematic diagram of human-computer interaction according to some embodiments of the present application. As mentioned above, the issuer of the interactive instruction for the target object may be a driver issuing it through the driver terminal, and the target object may be the driver. As shown in FIG. 12, take a scenario in which a driver-side user praises himself as an example. As shown in FIG. 12A, the driver-side user can tap the function control 1201 in the driver-side display interface of the ride-hailing APP to enter the praise interface, and the terminal can then display the interface shown in FIG. 12B. FIG. 12B is the display interface of the praise function; on this interface, the driver-side user can speak, and the terminal correspondingly collects real-time speech data, that is, receives the interactive instruction. After collecting the speech data, the terminal can determine whether the collected speech data contains the specified phrase. If it recognizes that the real-time speech data from the driver-side user contains either "夸夸司机" (praise the driver) or "夸夸我吧" (praise me), the terminal can display the interface shown in FIG. 12C. As shown in FIG. 12C, the current display interface shows the response phrase 1203 for "夸夸我吧", specifically: "风里雨里，感谢不辞辛苦来接我" (through wind and rain, thank you for taking the trouble to pick me up).
In addition, in the display interface shown in FIG. 12B, the driver-side user can also tap the praise control 1202 to trigger the praise function and then display the interface shown in FIG. 12C, which will not be repeated here. In the display interface shown in FIG. 12A, the function control 1201 can also prompt the driver side about newly received praise.
FIG. 13 is a schematic diagram of human-computer interaction according to some embodiments of the present application. As shown in FIG. 13, take a scenario in which a driver-side user praises himself as an example. As shown in FIG. 13A, the driver-side user can tap the function control 1301 in the driver-side display interface of the ride-hailing APP to enter the praise interface, and the terminal can then display the interface shown in FIG. 13B. FIG. 13B is the display interface of the praise function; on this interface, the driver-side user can speak, and the terminal correspondingly collects the real-time speech data "夸夸我吧" (or "夸夸司机"), that is, receives the interactive instruction. After collecting the speech data, the terminal can determine whether the collected speech data contains the specified phrase. If it recognizes that the real-time speech data from the driver-side user contains either "夸夸司机" or "夸夸我吧", the terminal can display the interface shown in FIG. 13C. As shown in FIG. 13C, the current display interface shows the response phrase 1303 for "夸夸我吧", specifically: "司机师傅最阳光，最热心，最善良，最知冷知热！" (the driver is the sunniest, the most enthusiastic, the kindest, and the most considerate!). Similar to FIG. 12, the driver-side user can also tap the praise control 1302 to trigger the praise function and then display the interface shown in FIG. 13C.
Comparing FIG. 12 with FIG. 13, the two figures correspond to different response styles. As shown in FIG. 12, after collecting the speech data, the terminal can determine that the response style preferred by the driver-side user (target object) is normal praise, and accordingly determine a response phrase with that response style. As shown in FIG. 12C, the terminal displays the response phrase 1203 for "夸夸司机" or "夸夸我吧" on the current display interface, specifically: "风里雨里，感谢不辞辛苦来接我". As shown in FIG. 13, after collecting the speech data, the terminal determines that the response style preferred by the driver-side user (target object) is strong praise, and accordingly determines a response phrase with that response style. As shown in FIG. 13C, the terminal displays the response phrase 1303 for "夸夸我吧" on the current display interface, specifically: "司机师傅最阳光，最热心，最善良，最知冷知热！".
It can thus be seen that, for different target objects, their preferred response styles can be obtained based on their historical data, and the terminal can give praise (responses) of different degrees based on the different response styles. In addition to self-praise, a user can also praise the other user, which is not described in detail here. Moreover, the user may also have the permission to modify the response phrase.
FIG. 14 is a schematic diagram of human-computer interaction according to some embodiments of the present application. As shown in FIG. 14, take a scenario in which a passenger-side user praises the driver side as an example. FIG. 14A is the communication interface between the passenger terminal and the driver terminal; on this interface, the passenger-side user can tap the voice switching control 1401 to trigger the voice input function. The terminal then displays the interface shown in FIG. 14B; on this interface, if the user presses and holds the voice input control 1402, the terminal can collect real-time speech data, that is, receive the interactive instruction. After collecting the speech data, the terminal can determine whether the collected speech data contains the specified phrase. If it recognizes that the real-time speech data from the user contains "夸夸司机" (praise the driver), the terminal can display the interface shown in FIG. 14C. As shown in FIG. 14C, in the current communication interface, the passenger terminal sends the response phrase 1403 to the driver terminal, specifically: "司机师傅最阳光，最热心，最善良，最知冷知热！". Correspondingly, on the driver side, the user can be prompted that praise has been received from the passenger side, for example, via the function control 1301 in the interface shown in FIG. 13A, or via the notification bar or status bar.
In addition, in the scenario shown in FIG. 14, as shown in FIG. 14A, the passenger-side user can also tap the praise control 1404 on the display interface to trigger the praise function. In this case, when the user taps the praise control 1404 to give praise, the voice collection step can be entered, and the praise can be realized in the manner shown in FIG. 12 or FIG. 13.
FIG. 15 is a schematic diagram of human-computer interaction according to some embodiments of the present application. As shown in FIG. 15, the praise interface can also be entered directly. The communication interface shown in FIG. 15A is the same as that shown in FIG. 14A. The passenger-side user can tap the praise control 1404, and the terminal then displays the interface shown in FIG. 15B. On this interface, the terminal determines that praise is to be given and can directly determine the response phrase for the driver-side user. If the passenger-side user then taps the send control 1405 for the response phrase, the interface shown in FIG. 15C is entered, and the passenger terminal sends the response phrase 1403 to the driver terminal, specifically: "司机师傅最阳光，最热心，最善良，最知冷知热！".
FIG. 16 is a schematic diagram of human-computer interaction according to some embodiments of the present application. As shown in FIG. 16, the user may also have the permission to modify the response phrase. The display interface shown in FIG. 16A is the same as that shown in FIG. 15B. On the interface shown in FIG. 16A, the response phrase currently determined by the terminal is "司机师傅最阳光，最热心，最善良，最知冷知热！". If the passenger-side user is not satisfied with this response phrase, they can tap the phrase switching control 1601 to switch the response phrase, and the terminal then displays the controls shown in FIG. 16B. As shown in FIG. 16B, after the passenger-side user's operation, the currently determined response phrase is "司机师傅最阳光，是最可靠的人" (the driver is the sunniest and the most reliable person). In this way, switching of the response phrase is achieved. Afterwards, the user taps the send control 1405 on the display interface, and the terminal sends the response phrase to the driver terminal.
In some embodiments, the current terminal may also perform statistical processing on historical response phrases and display the results. In some embodiments, the terminal may also perform the following steps: obtain historical response phrases from other users; determine the total number of output historical response phrases; determine one or more phrase tags according to the historical response phrases; and then display the total output number and the phrase tags. In some embodiments, the phrase tags may be designed according to actual needs. For example, the scenario of a historical response phrase may be used as a tag, or the scenario together with the number of historical response phrases in that scenario may be used as the phrase tag. For another example, the response style or emotion of the response phrase may be used as a tag.
Still taking the praise scenarios shown in FIG. 12 to FIG. 16 as examples: considering that a user can praise himself, in an actual scenario, self-praise may be excluded, and historical praise data from other users is obtained for statistical analysis. For example, when the current terminal is the driver terminal, the praise data for the driver-side user from each passenger terminal or other driver terminals can be collected, the total number of such praises and the phrase tags are computed, and they are displayed on the terminal's display interface. For example, in the scenario shown in FIG. 12, the display interface of FIG. 12B shows that the driver side has received praise 108 times in total, which is the total number of output historical response phrases. In addition, FIG. 12B also shows three phrase tags, namely "雨天夸夸999+" (rainy-day praise 999+), "深夜夸夸3" (late-night praise 3), and "假日夸夸66" (holiday praise 66). The phrase tags in this scenario consist of the praise scenario and the number of praises in that scenario.
FIG. 17 is an exemplary flowchart of a method for human-computer interaction according to some embodiments of the present application. As described in step 510, the interactive instruction may include speech data. The first classification model may also be referred to as a trained multilingual speech classifier ("multilingual speech classifier" for short). In some embodiments, the method for human-computer interaction includes the following steps.
Step 1710: collect current speech data.
In some embodiments, the method can be applied to a voice interaction scenario, in which the terminal can collect the speech data uttered by the user in real time and perform subsequent processing. The collected speech data may be any of Chinese speech, English speech, Japanese speech, Korean speech, and the like, which is not limited here.
In some embodiments, after the user instructs the voice interaction function to start, the terminal may automatically listen for and collect the speech data uttered by the user. Alternatively, the collection of speech data may be triggered by the user pressing a voice input button on the display interface.
Step 1720: extract speech features from the speech data.
In some embodiments, the speech features may be multi-dimensional fbank (filter bank) features. Specifically, the human ear's response to the sound spectrum is nonlinear, and fbank features are obtained by processing the audio in a manner similar to the human ear, which helps improve speech recognition performance. Specifically, the fbank features of the speech data may be extracted through the following steps: converting the speech data from the time domain to the frequency domain to obtain frequency-domain speech data; and computing the energy spectrum of the frequency-domain speech data to obtain the speech features. The speech data collected by the terminal device is a linear time-domain signal, and the (time-domain) speech signal can be transformed into a frequency-domain speech signal through a Fourier transform (FFT). Specifically, the speech data may be sampled during the signal conversion. On this basis, the energy in each frequency band of the frequency-domain signal differs, and different phonemes have different energy spectra; therefore, the energy spectrum of the frequency-domain speech data can be computed to obtain the speech features. The method for computing the energy spectrum is not described here. For example, if the sampling frequency of the speech data collected in step 1710 is 16 kHz, 40-dimensional fbank features may be extracted in this step.
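A compact sketch of this extraction is shown below, using librosa's mel spectrogram as a stand-in for the fbank computation (framing, FFT, and energy pooling into 40 bands); the window and hop sizes are common defaults for 16 kHz audio, not values fixed by this disclosure.

```python
import numpy as np
import librosa

def extract_fbank(waveform: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    """Frame the signal, apply an FFT, and pool the energy spectrum into 40 mel bands."""
    mel = librosa.feature.melspectrogram(
        y=waveform, sr=sample_rate,
        n_fft=400, hop_length=160,   # 25 ms windows with a 10 ms shift at 16 kHz
        n_mels=40)                   # 40 filter banks -> 40-dimensional features
    return librosa.power_to_db(mel).T   # shape: (num_frames, 40)
```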
In some embodiments, the voice data may also be preprocessed before the feature extraction step. For more details on preprocessing, see FIG. 5 and its related description.
Step 1730: processing the voice features with the trained multilingual speech classifier to obtain a classification recognition result, the multilingual speech classifier being used to determine whether the voice data contains any one of the specified phrases in multiple languages.
In some embodiments, the multilingual speech classifier can classify and recognize voice data in multiple languages. The language types that the multilingual speech classifier can recognize are consistent with the language types of the voice samples used in the training process of the multilingual speech classifier.
In some embodiments, the classification recognition result may be a multi-class result, which includes the binary (two-class) case. Specifically, the classification recognition result indicates that the voice data is a positive sample or a negative sample; alternatively, the classification recognition result is a degree level between the positive sample and the negative sample, and each degree level corresponds to either the positive sample or the negative sample. Accordingly, when the degree level corresponds to a positive sample, the classification recognition result indicates that the voice data contains a specified phrase; when the degree level corresponds to a negative sample, the classification recognition result indicates that the voice data does not contain a specified phrase.
In some embodiments, the classification recognition result may take one of two values: "yes" or "no". A result of "yes" indicates that the voice data contains a specified phrase in one of the supported languages; conversely, a result of "no" indicates that the voice data is unrelated to the specified phrases in any language, i.e., the voice data does not contain a specified phrase.
It should be understood that the classification recognition result may take other forms, for example one or more of symbols, numbers, and characters (including characters of various languages, such as Chinese characters and English characters). For example, the classification recognition result may be "+" or "-"; or "positive" or "negative"; or "result 1" or "result 2"; or "positive sample" or "negative sample". For a binary result, the meaning indicated by each of the above representations can be defined as needed. For example, a result of "yes" may instead indicate that the voice data is unrelated to the specified phrases in any language and does not contain a specified phrase, while a result of "no" may indicate that the voice data contains a specified phrase in one of the supported languages.
In embodiments with a binary classification result, the indication of the classification recognition result can be determined directly from the binary result.
In some embodiments, the classification recognition result may also have n levels, where n is an integer greater than 1. The n levels grade the degree to which the voice data is recognized as belonging to the positive sample versus the negative sample. For example, the higher the level, the higher the degree to which the voice data is judged to belong to the positive sample; conversely, the lower the level, the higher the degree to which it is judged to belong to the negative sample and the lower the degree to which it belongs to the positive sample. For instance, a classification recognition result of n (the highest level) indicates a high degree of belonging to the positive sample, while a result of 1 (the lowest level) indicates a low degree of belonging to the positive sample. The opposite convention is equally valid: the highest level may indicate the lowest degree of belonging to the positive sample, and the lower the level, the higher that degree.
In some embodiments, for a multi-level result, the indication of the classification recognition result also needs to be determined based on the level. In this case, the levels corresponding to the positive sample and the negative sample can be preset. For example, for a 10-level result (n is 10, 10 levels in total), levels 1 to 5 may correspond to the negative sample and levels 6 to 10 to the positive sample. Then, if the level result is 1, the classification recognition result indicates that the voice data does not contain a specified phrase; if the level result is 8, the classification recognition result indicates that the voice data contains a specified phrase.
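By way of illustration only, the following Python sketch shows the level-to-indication mapping described in the preceding paragraph; the 10-level scale and the boundary between levels 5 and 6 are the example values given above, not fixed by the method.
def level_to_indication(level, n_levels=10, positive_from=6):
    """Map an n-level classification result to a binary indication.
    Levels below positive_from are treated as negative-sample levels (no specified
    phrase); levels at or above it are positive-sample levels (contains a specified phrase)."""
    if not 1 <= level <= n_levels:
        raise ValueError("level out of range")
    return "contains specified phrase" if level >= positive_from else "no specified phrase"
# Examples from the text: level 1 -> no specified phrase, level 8 -> contains specified phrase.
assert level_to_indication(1) == "no specified phrase"
assert level_to_indication(8) == "contains specified phrase"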
In some embodiments, the positive samples and negative samples are the training samples used in the training phase of the multilingual speech classifier, where a positive sample is multilingual voice data carrying a specified phrase and a negative sample is multilingual voice data unrelated to the specified phrases. It should be understood that the positive samples (or negative samples) in the training set contain voice data in multiple languages, whereas the positive sample (or negative sample) referred to in the classification recognition result means that the voice data is recognized as voice data in one of the languages of the positive samples (or negative samples). For the training process of the multilingual speech classifier, see FIG. 7 and its related description.
Step 1740: when the classification recognition result indicates that the voice data contains a specified phrase, outputting a response phrase for the specified phrase.
In some embodiments, the response phrase for the specified phrase may be output directly. The response phrase may include, but is not limited to, one or more of a response voice and response text. For more details on the output form of the response phrase, see step 530 and its related description.
As described above, the interaction instruction or voice data may be directed at the user himself or herself, or at the user on the other side of the communication. For details, see step 530 and its related description.
In a voice interaction scenario, and more specifically in a voice interaction scenario for multilingual users, the terminal may collect voice data, perform semantic recognition on it, and output a response phrase after recognizing the user's meaning. The terminal may use a monolingual acoustic model to recognize the semantics of the voice data; however, in a multilingual voice interaction scenario, a monolingual acoustic model cannot meet the voice interaction needs of multilingual users. The multilingual speech classifier in some embodiments of this specification can classify specified phrases in multiple languages. While ensuring classification quality, it converts a complex speech recognition problem into a simple classification problem, so there is no need to train and maintain a separate acoustic model for each language, which saves maintenance resources. Moreover, compared with separate processing by multiple monolingual acoustic models, the classifier is more efficient, which helps improve speech recognition efficiency and, in turn, the response accuracy of the response phrase, reduces the interference of invalid voice interactions with the user, and improves the voice interaction experience.
FIG. 18 is a block diagram of a terminal according to some embodiments of the present application. In some embodiments, the methods of FIGS. 5-7 and FIG. 17 may be executed on a mobile device or terminal (for example, a passenger terminal or a driver terminal), for example by the processor 340 of the mobile device.
In some embodiments, the acquisition module 440 may include a collection module 1810, the determination module 420 may include a processing module 1820, and the response module 430 may include an output module 1830. In some embodiments, the terminal 1800 may include: a collection module 1810, an extraction module 410, a processing module 1820, and an output module 1830.
The collection module 1810 is configured to collect current voice data.
The extraction module 410 is configured to extract voice features from the voice data. In some embodiments, the extraction module 410 is further configured to preprocess the voice data before extracting the voice features. For details, see FIG. 17 and its related description.
The processing module 1820 is configured to process the voice features with the trained multilingual speech classifier to obtain the classification recognition result.
The output module 1830 is configured to output a response phrase for the specified phrase when the classification recognition result indicates that the voice data contains the specified phrase. In some embodiments, the output module 1830 may be configured to: when the specified phrase is directed at the user himself or herself, directly output the response phrase for the specified phrase; or, when the specified phrase is directed at the other party of the current communication, output the response phrase to that user.
In some embodiments, the acquisition module 440 may include a training module (not shown in FIG. 18), which is configured to obtain the multilingual speech classifier through training. For details, see FIG. 7 and its related description.
In some embodiments, the acquisition module 440 in the terminal 1800 may also be configured to obtain historical response phrases from other users, determine the total output count of the historical response phrases, and determine one or more phrase tags based on the historical response phrases. The output module 1830 is also configured to display the total output count and the phrase tags.
FIG. 19 is an exemplary flowchart of a human-computer interaction method according to some embodiments of the present application. The second classification model may also be referred to as a trained emotion classifier (hereinafter the "emotion classifier"). In some embodiments, the human-computer interaction method includes:
Step 1910: collecting current voice data. For details, see step 1710, which is not repeated here.
Step 1920: recognizing the responded-to emotion of the voice data, the responded-to emotion being obtained from one or more of a text emotion recognition result and a speech emotion recognition result.
As described above, the voice data or voice features may be processed and the result taken as the speech emotion recognition result; the text data or text features may be processed and the result taken as the text emotion recognition result. The emotion recognition result includes the speech emotion recognition result and/or the text emotion recognition result.
In some embodiments, one or more of text emotion recognition and speech emotion recognition may be performed on the voice data; the text emotion recognition result is obtained from text emotion recognition, the speech emotion recognition result is obtained from speech emotion recognition, and one or both of them are then used to determine the emotion of the voice data. For details, see FIG. 20 and its related description.
Step 1930: determining the response emotion corresponding to the responded-to emotional state.
As described above, the responded-to emotion is the emotion of the collected voice data uttered by the user, while the response emotion is used when responding to that voice data, i.e., it is the emotion of the response voice.
In some embodiments, the emotion types involved (including the responded-to emotion and the response emotion) may include, but are not limited to, dejection, calm, enthusiasm, passion, and the like, and may be customized as needed in an actual scenario. For example, in some embodiments the emotions may also include joy, sadness, pain, relief, excitement, and so on; this list is not exhaustive.
In addition, the emotion categories covered by the responded-to emotion and by the response emotion may be the same or different. For example, both may consist of the four emotions of dejection, calm, enthusiasm, and passion. As another example, the responded-to emotion may include positive emotions (e.g., happiness, excitement, joy) and negative emotions (e.g., dejection, sadness, pain), while the response emotion may include only positive emotions, so as to soothe the user's negative emotions.
In some embodiments, a correspondence between responded-to emotions and response emotions may also be preset. This correspondence may be stored in the terminal in advance, or in a storage location readable by the terminal, such as the cloud, which is not limited here. Specifically, one responded-to emotion may correspond to one or more response emotions. For example, if the responded-to emotion is dejection, the corresponding response emotion may be joy or comfort.
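By way of illustration only, a preset correspondence of this kind can be held in a simple lookup table, as in the Python sketch below; the emotion names and pairings are assumed values for the example, and choosing randomly among several candidate response emotions is only one possible selection rule.
import random
# Hypothetical preset mapping: one responded-to emotion -> one or more candidate response emotions.
EMOTION_MAP = {
    "dejection": ["joy", "comfort"],
    "calm": ["calm"],
    "enthusiasm": ["enthusiasm"],
    "passion": ["passion"],
}
def response_emotion_for(responded_emotion):
    """Look up the response emotion(s) for a responded-to emotion and pick one."""
    candidates = EMOTION_MAP.get(responded_emotion, ["calm"])  # fall back to a neutral emotion
    return random.choice(candidates)
print(response_emotion_for("dejection"))  # e.g. "joy" or "comfort"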
Step 1940: outputting a response voice for the voice data, the response voice having the response emotion.
As described above, the response phrase may be output as a response voice. In some embodiments, the response content for the voice data may be obtained first, and the response voice may then be generated from the response emotion and the response content and output. In this way, the output response voice carries the response emotion.
In some embodiments, the way the response content is determined is not particularly limited. For example, a correspondence between keywords and response content may be preset, so that the response content corresponding to a keyword carried in the voice data can be obtained by recognizing that keyword and used as the response content for the voice data. As another example, a neural network model may be used to process the voice data, and the response content output by the model may be obtained. As yet another example, the response content may be determined by the method corresponding to FIG. 5 or FIG. 6.
When generating the response voice from the response content and the response emotion, a default voice (timbre) or a timbre selected by the user may be used. For example, the user may select a celebrity's timbre as the timbre of the response voice, and the terminal then generates the response voice with the timbre of the celebrity selected by the user. Of course, this implementation presupposes that the terminal can obtain the celebrity's timbre and the corresponding authorization, which is not elaborated here.
In some embodiments, multiple candidate voices with different emotions may also be generated in advance for all possible response contents and pre-stored in a readable storage location. The terminal device then only needs, after determining the response emotion, to retrieve from the storage location one candidate voice corresponding to that response emotion and the response content, and output it as the response voice. In some embodiments, the candidate voices in the storage location may also be recorded manually in advance.
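By way of illustration only, the following Python sketch assumes the pre-generated candidate voices are keyed by the pair (response content, response emotion); the keys and file paths are hypothetical.
# Hypothetical pre-built store: (response content, response emotion) -> audio file path.
CANDIDATE_VOICES = {
    ("You did a great job today!", "enthusiasm"): "voices/praise_enthusiasm.wav",
    ("You did a great job today!", "calm"): "voices/praise_calm.wav",
    ("Take care, it has been a long day.", "comfort"): "voices/comfort.wav",
}
def pick_response_voice(response_content, response_emotion):
    """Retrieve the pre-generated candidate voice matching the content and emotion, if any."""
    return CANDIDATE_VOICES.get((response_content, response_emotion))
path = pick_response_voice("You did a great job today!", "enthusiasm")
if path is not None:
    print("play", path)   # hand the file to the audio player
else:
    print("no pre-stored candidate; synthesize the response voice instead")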
In prior-art voice interaction scenarios, the terminal generally outputs response data with a default intonation and tone. The response emotion of such human-computer interaction is monotonous and cannot meet the user's need for emotional voice in personalized scenarios. Some embodiments of this specification can select different response emotions in real time according to the user's emotion, which effectively improves how well the response voice matches the user's mood, meets the user's emotional needs in different emotional states, gives a stronger sense of realism and immersion, and improves the voice interaction experience. This also solves the problem that, in existing voice interaction scenarios, the response voice matches the user's emotion poorly.
FIG. 20 is an exemplary flowchart of a method for determining the responded-to emotion according to some embodiments of the present application. In some embodiments, determining the responded-to emotion may include the following steps:
Step 1922: extracting voice features of the voice data.
In some embodiments, audio features of the voice data may be extracted, normalized, and assembled into a feature vector to obtain the voice features of the voice data. For example, the fundamental frequency feature, short-time energy feature, short-time amplitude feature, and short-time zero-crossing rate feature of the voice data may be extracted and then normalized separately to form one frame of an n-dimensional feature vector, where n is an integer greater than 1. In an actual scenario, the dimensions of the feature vectors obtained from different voice data may differ; in other words, the value of n can be adjusted according to the needs of the actual scenario or project, or according to empirical values, which is not limited here.
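By way of illustration only, the following Python sketch computes the per-frame features named above and min-max normalizes them into one feature vector per frame. It assumes frames of roughly 25 ms at 16 kHz; the autocorrelation-based pitch estimate and the normalization scheme are simplifications chosen for the example.
import numpy as np
def frame_features(frame, sample_rate=16000):
    """Compute the per-frame acoustic features named above for one speech frame."""
    frame = np.asarray(frame, dtype=np.float64)
    energy = np.sum(frame ** 2)                                   # short-time energy
    amplitude = np.mean(np.abs(frame))                            # short-time amplitude
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0          # short-time zero-crossing rate
    # Crude fundamental-frequency estimate via the autocorrelation peak in 50-400 Hz (illustrative only).
    ac = np.correlate(frame, frame, mode="full")[len(frame):]
    lag = np.argmax(ac[int(sample_rate / 400):int(sample_rate / 50)]) + int(sample_rate / 400)
    f0 = sample_rate / lag if lag > 0 else 0.0
    return np.array([f0, energy, amplitude, zcr])
def normalized_feature_vector(frames, sample_rate=16000):
    """Stack per-frame features and min-max normalize each dimension to [0, 1]."""
    feats = np.stack([frame_features(f, sample_rate) for f in frames])
    lo, hi = feats.min(axis=0), feats.max(axis=0)
    return (feats - lo) / np.maximum(hi - lo, 1e-8)               # one n-dimensional vector per frame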
Step 1924: processing the voice features with the trained emotion classifier to obtain an emotion recognition result.
In some embodiments, the emotion classifier is used to recognize the emotion of the voice data. For the training of the emotion classifier, see FIG. 9 and its related description; for more details on the emotion classification model, see FIG. 8 and its related description.
Step 1926: determining the emotion indicated by the emotion recognition result as the responded-to emotion.
The output of the emotion classifier is the emotion recognition result, and the emotion indicated by the emotion recognition result depends on how the result is represented.
The emotion recognition result may be a multi-class result; for example, emotions may be divided into four categories: dejection, calm, enthusiasm, and passion. Exemplarily, the emotion recognition result may be the probability of the voice data for each emotion, and the emotion indicated by the result is the emotion with the highest probability; or the emotion indicated by the result is the one emotion carrying an indicator; or the emotion recognition result may be a score of the voice data for each emotion, and the emotion indicated by the result is the emotion corresponding to the score interval in which the score falls.
Specifically, the emotion recognition result may be the emotion probabilities of the voice data, where the emotion indicated by the result (the first emotion) is the emotion with the highest probability. For example, if the emotion recognition result output by the emotion classifier is: dejection 2%, calm 20%, enthusiasm 80%, passion 60%, the emotion indicated by the result is enthusiasm.
In addition, the emotion recognition result may be a multi-class result carrying one indicator, in which case the emotion indicated by the result is the emotion carrying the indicator. The indicator may be one or more of words, numbers, characters, and so on. For example, if 1 is the indicator and the emotion recognition result output by the emotion classifier is: dejection 1, calm 0, enthusiasm 0, passion 0, the emotion indicated by the result is dejection.
Furthermore, the emotion recognition result may be an emotion score, with each emotion corresponding to a different score interval; the emotion indicated by the result is then the emotion corresponding to the score interval in which the score falls.
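By way of illustration only, the following Python sketch covers the three result representations just described (probabilities, an indicator flag, and a score falling into an interval); the category names, the indicator value 1, and the score intervals are the illustrative values used above.
def emotion_from_probs(probs):
    """Probability form: the emotion with the highest probability is indicated."""
    return max(probs, key=probs.get)
def emotion_from_indicator(flags, indicator=1):
    """Indicator form: the emotion whose flag equals the indicator is indicated."""
    return next(e for e, v in flags.items() if v == indicator)
def emotion_from_score(score, intervals):
    """Score form: the emotion whose score interval contains the score is indicated."""
    return next(e for e, (lo, hi) in intervals.items() if lo <= score < hi)
# Examples matching the text above.
print(emotion_from_probs({"dejection": 0.02, "calm": 0.20, "enthusiasm": 0.80, "passion": 0.60}))  # enthusiasm
print(emotion_from_indicator({"dejection": 1, "calm": 0, "enthusiasm": 0, "passion": 0}))          # dejection
print(emotion_from_score(0.7, {"dejection": (0.0, 0.25), "calm": (0.25, 0.5),
                               "enthusiasm": (0.5, 0.75), "passion": (0.75, 1.01)}))               # enthusiasm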
Determining the responded-to emotion from the voice data alone starts only from the dimension of sound and recognizes the emotion carried in the voice, which is simple and feasible to implement.
FIG. 21 is an exemplary flowchart of a method for determining the responded-to emotion according to some embodiments of the present application. In some embodiments, the responded-to emotion may be determined by the following steps:
Step 1922: extracting voice features of the voice data.
Step 1924: processing the voice features with the trained emotion classifier to obtain an emotion recognition result.
Steps 1922-1924 are performed as described above and are not repeated here.
Step 1926: converting the voice data into text data.
Steps 1926-1928 are used to obtain an emotion parsing result from the perspective of the content. It should be understood that there is no required execution order between steps 1922-1924 and steps 1926-1928: apart from steps 1922 and 1924 being executed in order and steps 1926 and 1928 being executed in order, some embodiments of this specification do not particularly limit the execution order of these steps. They may be executed sequentially as shown in FIG. 21, or simultaneously, or step 1926 may start after step 1922 is executed, and so on; the possibilities are not listed exhaustively.
In some embodiments, the voice data may be converted into text data by a speech decoder, which is not detailed here.
Step 1928: performing emotion parsing on the text data to obtain an emotion parsing result.
In some embodiments, emotion-related words in the text data may be recognized, and the emotion parsing result of the text data is then determined from the emotion-related words. For more details about emotion-related words, see FIG. 8 and its related description.
In addition, an emotion score may be preset for each emotion-related word. All emotion-related words in the text data can then be recognized, their emotion scores weighted (or directly summed or averaged), and the weighted score taken as the emotion parsing result.
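By way of illustration only, the following Python sketch implements the word-level scoring described in the preceding paragraph as a weighted average; the lexicon entries, weights, and scores are made-up values for the example.
# Hypothetical preset lexicon: emotion-related word -> (emotion score, weight).
EMOTION_LEXICON = {
    "great": (0.9, 1.0),
    "thanks": (0.7, 0.8),
    "tired": (-0.6, 1.0),
    "terrible": (-0.9, 1.2),
}
def emotion_parsing_score(text):
    """Weighted average of the scores of the emotion-related words found in the text."""
    hits = [EMOTION_LEXICON[w] for w in text.lower().split() if w in EMOTION_LEXICON]
    if not hits:
        return 0.0                       # no emotion-related words: treat as neutral
    total_weight = sum(w for _, w in hits)
    return sum(s * w for s, w in hits) / total_weight
print(emotion_parsing_score("thanks the ride was great"))             # positive weighted score
print(emotion_parsing_score("I am tired and the day was terrible"))   # negative weighted score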
Step 19210: determining the responded-to emotion from the emotion recognition result and the emotion parsing result.
In some embodiments, if the emotion recognition result and the emotion parsing result are both in score form, the two can be weighted (summed or averaged), and the emotion corresponding to the score interval in which the weighted value falls is taken as the responded-to emotion. If one or both of them are not in score form, the emotion recognition result (or the emotion parsing result) can be converted into score form according to a preset algorithm, and the weighting is then performed to determine the responded-to emotion.
In some embodiments, when the emotion categories indicated by the emotion recognition result and the emotion parsing result are the same, the emotion category indicated by the emotion recognition result is taken as the responded-to emotion. Alternatively, when the emotion categories indicated by the two are different, the emotion recognition result and the emotion parsing result are weighted, and the emotion category indicated after the weighting is taken as the responded-to emotion (converted to scores and then weighted, as above, which is not repeated).
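By way of illustration only, the following Python sketch combines the two rules in the last two paragraphs, assuming both results have already been converted to a score in [-1, 1]; the score intervals and the weights are assumed values.
# Hypothetical score intervals (in [-1, 1]) shared by both branches.
SCORE_INTERVALS = {
    "dejection": (-1.0, -0.3),
    "calm": (-0.3, 0.3),
    "enthusiasm": (0.3, 0.7),
    "passion": (0.7, 1.01),
}
def emotion_of(score):
    return next(e for e, (lo, hi) in SCORE_INTERVALS.items() if lo <= score < hi)
def fuse(recognition_score, parsing_score, w_speech=0.6, w_text=0.4):
    """If both scores indicate the same emotion, keep it; otherwise weight the two scores."""
    if emotion_of(recognition_score) == emotion_of(parsing_score):
        return emotion_of(recognition_score)
    fused = w_speech * recognition_score + w_text * parsing_score
    return emotion_of(fused)
print(fuse(0.5, 0.6))    # both indicate "enthusiasm" -> enthusiasm
print(fuse(0.8, -0.5))   # disagreement -> the weighted score decides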
By determining the responded-to emotion from both the voice data and the text data converted from it, the emotional state of the voice data uttered by the user can be analyzed more comprehensively from the two dimensions of sound and content (text). This helps improve the accuracy of the recognition result and, in turn, narrows the gap between the response voice and the user's emotional needs, making the interaction more humane and more realistic.
FIG. 22 is a block diagram of a terminal according to some embodiments of the present application. In some embodiments, the methods of FIG. 5, FIGS. 8-9, and FIGS. 19-21 may be executed on a mobile device or terminal (for example, a passenger terminal or a driver terminal), for example by the processor 340 of the mobile device.
In some embodiments, the determination module 420 may include a recognition module 2210 and a response emotion determination module 2220. In some embodiments, the terminal 2200 may include: a collection module 1810, a recognition module 2210, a response emotion determination module 2220, and an output module 1830.
The collection module 1810 may also be configured to collect current voice data.
The recognition module 2210 is configured to recognize the responded-to emotion of the voice data.
The response emotion determination module 2220 is configured to determine the response emotion corresponding to the responded-to emotional state.
The output module 1830 is further configured to output a response voice for the voice data, the response voice having the response emotion.
In some embodiments, the recognition module 2210 may be configured to extract the voice features of the voice data; to process the voice features with the trained emotion classifier to obtain an emotion recognition result; to convert the voice data into text data and perform emotion parsing on the text data to obtain an emotion parsing result; to determine the emotion parsing result from the emotion-related words recognized in the text data; and to determine the responded-to emotion from the emotion recognition result and the emotion parsing result.
In some embodiments, the acquisition module 440 may include a training module (not shown in FIG. 22), which may be configured to obtain the second classification model (also referred to as the emotion classifier) through training. For details, see FIG. 9 and its related description.
In some embodiments, the response module 430 may include a generation module (not shown in FIG. 22), which may be configured to generate the response voice from the response emotion and the response content.
In existing voice interaction scenarios, the terminal determines, from the voice or text sent by the user, the response content corresponding to that voice or text, and outputs the response content to the user. This approach can only produce a reply to the interaction instruction; the human-computer interaction is too monotonous and cannot meet the user's personalized interaction needs. For example, in the aforementioned praise scenario, when a user says "praise me", the terminal outputs default praise content in response to this human-computer interaction instruction. When different users say "praise me", the terminal outputs exactly the same praise content, which obviously cannot meet personalized interaction needs and makes for a poor interaction experience. Some embodiments of this specification, for example those of FIG. 5, FIG. 10, or FIG. 23, can solve these technical problems of the prior art.
FIG. 23 is an exemplary flowchart of a human-computer interaction method according to some embodiments of the present application. The third classification model may also be referred to as a trained style classifier (hereinafter the "style classifier"). In some embodiments, the human-computer interaction method includes:
Step 2310: receiving an interaction instruction for a target object.
In some embodiments, the interaction instruction means that voice data or text data is received from the user. Taking the aforementioned praise scenario as an example, this may be receiving the text data "praise me" from the user, or collecting the voice data "praise me" uttered by the user.
The user may be the user to whom the terminal belongs. The target object may be the user to whom the current terminal belongs, or the target object may be the other-party user communicating with the current terminal. That is, the interaction instruction may be directed at different target objects; the specific examples are similar to those of the specified phrase being directed at different target objects, see step 1710, which is not repeated here.
Step 2320: determining the response style of the target object, the response style being related to historical data of the target object.
For the historical data, see FIG. 5 and its related description. The historical data can directly or indirectly reflect the response style that the target object personally prefers; therefore, the response style of the target object can be determined based on the historical data. For details, see FIG. 24 and its related description.
Step 2330: determining a response phrase according to the response style and the interaction instruction.
After the response style is determined, a response phrase having that response style can be obtained based on the aforementioned interaction instruction. It should be understood that the content of the response phrase is related to the interaction instruction. For example, if the received interaction instruction is "praise the driver", the content of the response phrase is praise for the driver; if the received interaction instruction is "praise the passenger", the content of the response phrase is praise for the passenger.
In some embodiments, candidate phrases corresponding to each response style may be preset, so that one response phrase can be determined from the multiple candidate phrases corresponding to the determined response style. For determining a response phrase from multiple candidate phrases, see step 530 and its related description.
In some embodiments, priorities may be preset for the candidate phrases corresponding to each response style, and a candidate phrase with a higher priority is preferentially selected as the response phrase.
In some embodiments, the interaction instruction may be parsed to obtain the emotional style of the interaction instruction; the final response style is then determined by combining the history-based response style with this emotional style, and the response phrase corresponding to the resulting response style is determined. As described above, the first style is a style determined based on the first feature and/or the second feature, and the first feature is a feature determined based on the interaction instruction. In some embodiments, the first style includes the emotional style of the interaction instruction. In some embodiments, the emotional style of the interaction instruction may include the responded-to emotion of the interaction instruction; for more details, see FIG. 8 or FIG. 19 and their related descriptions. For example, when the recognition result of the interaction instruction contains very positive keywords such as "very good" or "praise me hard", the user's personalized emotional style leans toward a very enthusiastic praise style.
When combining the response style and the emotional style to determine the final response style, the two can be normalized into numerical values and the normalized scores weighted (or directly summed or averaged); the style corresponding to the weighted score is then taken as the response style corresponding to the interaction instruction.
In an actual scenario, it is also possible to first judge whether the response style and the emotional style are consistent. If they are consistent, the style indicated by both is the response style; if not, the aforementioned weighting can be used to determine the response style.
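By way of illustration only, the following Python sketch applies the combination rule of the last two paragraphs, assuming the history-based response style and the instruction-derived emotional style are each expressed as a normalized score in [0, 1] mapped onto the three praise-intensity styles used later in this description; the intervals and weights are assumed values.
# Hypothetical score intervals for the three praise styles.
STYLE_INTERVALS = {"mild praise": (0.0, 0.34), "normal praise": (0.34, 0.67), "strong praise": (0.67, 1.01)}
def style_of(score):
    return next(s for s, (lo, hi) in STYLE_INTERVALS.items() if lo <= score < hi)
def combine_styles(history_score, instruction_score, w_history=0.5, w_instruction=0.5):
    """Keep the style when both agree; otherwise weight the two normalized scores."""
    if style_of(history_score) == style_of(instruction_score):
        return style_of(history_score)
    fused = w_history * history_score + w_instruction * instruction_score
    return style_of(fused)
print(combine_styles(0.8, 0.9))   # both indicate "strong praise" -> strong praise
print(combine_styles(0.8, 0.2))   # disagreement -> the weighted score decides ("normal praise")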
In some embodiments, the text content obtained from speech recognition, together with historical data such as the target object's behavior and account information, serves as one reference factor for the personalized response phrase; in addition, parsing the interaction instruction on its own serves as another reference factor for the personalized response. The two reference factors are then weighted together, and the weighted result is used as the final result for evaluating which type of personality tendency the target object has shown over a period of time.
It should be understood that this weighted result is not fixed: as the target object's usage data is continuously updated, the target object's response style is periodically updated offline, so as to better adapt to the target object's fluctuations over different periods (on the premise that the personality of the target object is not single but diverse and fluctuates with the environment). Using this weighting can better fit the target object's personality tendency and thus produce better personalized recommendations for the target object.
Step 2340: outputting the response phrase to the target object.
In this step, based on the response phrase that has already been determined, it is only necessary to output the response phrase to the target object. For more details on outputting the response phrase, see step 530 and its related description.
In some embodiments of this specification, when the terminal interacts with the user, the response style the target object is likely to prefer can be determined from the target object's historical data, and the response phrase can then be determined and output by combining the response style with the interaction instruction. In this way, the response phrase is closer to the target object's personalized style: even for the same interaction instruction, the response phrases for different target objects may differ. This solves the problem that existing human-computer interaction is monotonous and cannot meet users' personalized interaction needs, and it also makes the interaction process more realistic and engaging.
FIG. 24 is an exemplary flowchart of a human-computer interaction method according to some embodiments of the present application.
Step 2410: obtaining historical data of the target object.
In some embodiments, only the target object's historical data from a recent period, for example the last week, the last month, or the last three days, may be obtained, so as to reduce the influence of older historical data on the response style and make the response style better match the user's preferences over the current period.
It should be understood that, for any target object, the historical data is continuously updated over time. Therefore, for the same interaction instruction from the same target object, the response phrase output by the terminal may be the same or different. For example, if the user's preferences change, the response style determined by the terminal will differ, and the response phrase output may also differ.
Step 2420: processing the historical data to obtain object features of the target object.
As described above, the first feature is determined based on the historical data; the first feature may be referred to as an object feature.
In step 2410, text data may be collected, or voice data may be collected. In the latter case, the text data corresponding to the voice data can be obtained by performing semantic recognition on the voice data.
On this basis, it is only necessary to extract feature words from the historical data (which has all been converted into text data), normalize them, and integrate them into a feature vector. The extracted feature words may include, but are not limited to, word frequency features.
Step 2430: processing the object features with the trained style classifier to obtain the response style of the target object.
In some embodiments, the style classifier may be used to classify the historical data by style. The style classifier can be obtained by training; for the training and deployment of the style classifier, see FIG. 11 and its related description.
The style classifier can be used offline and may be embodied as a model with a small number of parameters. For the type of the style classifier, see FIG. 5 and its related description.
In some embodiments, the style recognition result output by the style classifier may be a multi-class result. Exemplarily, for ease of explanation, the response styles are divided by praise intensity into three styles: strong praise (intense praise, higher degree of praise), normal praise, and mild praise (slight praise, lower degree of praise).
On this basis, the multi-class result output by the style classifier can identify a style probability for each response style, and the style with the highest probability indicated by the style classification result is taken as the response style of the target object. For example, if the style recognition result output by the style classifier is: strong praise 70%, normal praise 50%, mild praise 10%, the response style indicated by the result is strong praise.
In addition, the style recognition result may be a multi-class result carrying one indicator, in which case the style indicated by the result is the style carrying the indicator. The indicator may be one or more of words, numbers, characters, and so on. For example, if 1 is the indicator and the style recognition result output by the style classifier is: strong praise 0, normal praise 1, mild praise 0, the response style indicated by the result is normal praise.
Furthermore, the style recognition result may be a style score, with each style corresponding to a different score interval; the style indicated by the result is then the style corresponding to the score interval in which the style score falls.
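By way of illustration only, the following Python sketch interprets the probability form and the indicator form of the style recognition result described above, using the example values from the preceding paragraphs.
def style_from_probs(probs):
    """Probability form: the style with the highest probability is the response style."""
    return max(probs, key=probs.get)
def style_from_indicator(flags, indicator=1):
    """Indicator form: the style whose flag equals the indicator is the response style."""
    return next(s for s, v in flags.items() if v == indicator)
# Examples matching the text above.
print(style_from_probs({"strong praise": 0.7, "normal praise": 0.5, "mild praise": 0.1}))  # strong praise
print(style_from_indicator({"strong praise": 0, "normal praise": 1, "mild praise": 0}))    # normal praise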
FIG. 25 is a block diagram of a terminal according to some embodiments of the present application. In some embodiments, the methods of FIG. 5, FIGS. 10-11, and FIGS. 23-24 may be executed on a mobile device or terminal (for example, a passenger terminal or a driver terminal), for example by the processor 340 of the mobile device.
In some embodiments, the acquisition module 440 may include a receiving module 2510, the determination module 420 may include a response style determination module 2520, and the response module 430 may include a response phrase determination module 2530. In some embodiments, the terminal 2500 includes: a receiving module 2510, a response style determination module 2520, a response phrase determination module 2530, and an output module 1830.
The receiving module 2510 is configured to receive an interaction instruction for a target object.
The response style determination module 2520 is configured to determine the response style of the target object, the response style being related to the historical data of the target object. In some embodiments, the response style determination module 2520 may be configured to process the obtained historical data to obtain object features of the target object, and to process the object features with the trained style classifier to obtain the response style of the target object.
The response phrase determination module 2530 is configured to determine the response phrase according to the response style and the interaction instruction.
The output module 1830 is configured to output the response phrase to the target object.
In some embodiments, the terminal 2500 further includes a training module (not shown in FIG. 25), which is used to train the style classifier. In some embodiments, the training module is also used to determine style labels and to update the style classifier.
In some embodiments, the terminal in this specification (for example, the terminal 1800, 2200, or 2500) may be a server or a terminal device.
It should be understood that the division of the modules in the block diagram 400 shown in FIG. 4, the terminal 1800 shown in FIG. 18, the terminal 2200 shown in FIG. 22, and the terminal 2500 shown in FIG. 25 is merely a division of logical functions. In actual implementation, the modules may be fully or partially integrated into one physical entity, or may be physically separated. These modules may all be implemented in the form of software invoked by a processing element, or all in the form of hardware, or some in the form of software invoked by a processing element and others in the form of hardware. For example, the extraction module 410 may be a separately provided processing element, or may be integrated in the terminal 1800, for example implemented in a certain chip of the terminal; it may also be stored in the memory of the terminal 1800 in the form of a program, with a processing element of the terminal 1800 invoking and executing the functions of the above modules. The implementation of the other modules is similar. In addition, all or some of these modules may be integrated together or implemented independently. The processing element described here may be an integrated circuit with signal processing capability. In the implementation process, each step of the above methods, or each of the above modules, may be completed by an integrated logic circuit of hardware in the processor element or by instructions in the form of software. For example, the above modules may be one or more integrated circuits configured to implement the above methods, such as one or more application-specific integrated circuits (ASICs), one or more digital signal processors (DSPs), or one or more field-programmable gate arrays (FPGAs). As another example, when one of the above modules is implemented in the form of a processing element scheduling a program, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor that can invoke programs. As yet another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SoC).
上文已对基本概念做了描述,显然,对于本领域技术人员来说,上述详细披露仅仅作为示例,而并不构成对本说明书的限定。虽然此处并没有明确说明,本领域技术人员可能会对本说明书进行各种修改、改进和修正。该类修改、改进和修正在本说明书中被建议,所以该类修改、改进、修正仍属于本说明书示范实施例的精神和范围。The basic concepts have been described above. Obviously, for those skilled in the art, the above detailed disclosure is only an example, and does not constitute a limitation to this specification. Although it is not explicitly stated here, those skilled in the art may make various modifications, improvements and amendments to this specification. Such modifications, improvements, and corrections are suggested in this specification, so such modifications, improvements, and corrections still belong to the spirit and scope of the exemplary embodiments of this specification.
同时,本说明书使用了特定词语来描述本说明书的实施例。如“一个实施例”、“一实施例”、和/或“一些实施例”意指与本说明书至少一个实施例相关的某一特征、结构或特点。因此,应强调并注意的是,本说明书中在不同位置两次或多次提及的“一实施例”或“一个实施例”或“一个替代性实施例”并不一定是指同一实施例。此外,本说明书的一个或多个实施例中的某些特征、结构或特点可以进行适当的组合。Meanwhile, this specification uses specific words to describe the embodiments of this specification. For example, "one embodiment", "an embodiment", and/or "some embodiments" mean a certain feature, structure, or characteristic related to at least one embodiment of this specification. Therefore, it should be emphasized and noted that “one embodiment” or “one embodiment” or “an alternative embodiment” mentioned twice or more in different positions in this specification does not necessarily refer to the same embodiment. . In addition, some features, structures, or characteristics in one or more embodiments of this specification can be appropriately combined.
此外,除非权利要求中明确说明,本说明书所述处理元素和序列的顺序、数字字母的使用、或其他名称的使用,并非用于限定本说明书流程和方法的顺序。尽管上述披露中通过各种示例讨论了一些目前认为有用的发明实施例,但应当理解的是,该类细节仅起到说明的目的,附加的权利要求并不仅限于披露的实施例,相反,权利要求旨在覆盖所有符合本说明书实施例实质和范围的修正和等价组合。例如,虽然以上所描述的系统组件可以通过硬件设备实现,但是也可以只通过软件的解决方案得以实现,如在现有的服务器或移动设备上安装所描述的系统。In addition, unless explicitly stated in the claims, the order of processing elements and sequences, the use of numbers and letters, or the use of other names in this specification are not used to limit the order of the processes and methods in this specification. Although the foregoing disclosure uses various examples to discuss some embodiments of the invention that are currently considered useful, it should be understood that such details are only for illustrative purposes, and the appended claims are not limited to the disclosed embodiments. On the contrary, the rights are The requirements are intended to cover all modifications and equivalent combinations that conform to the essence and scope of the embodiments of this specification. For example, although the system components described above can be implemented by hardware devices, they can also be implemented only by software solutions, such as installing the described system on an existing server or mobile device.
同理,应当注意的是,为了简化本说明书披露的表述,从而帮助对一个或多个发明实施例的理解,前文对本说明书实施例的描述中,有时会将多种特征归并至一个实施例、附图或对其的描述中。但是,这种披露方法并不意味着本说明书对象所需要的特征比权利要求中提及的特征多。实际上,实施例的特征要少于上述披露的单个实施例的全部特征。For the same reason, it should be noted that, in order to simplify the expressions disclosed in this specification and help the understanding of one or more embodiments of the invention, in the foregoing description of the embodiments of this specification, multiple features are sometimes combined into one embodiment. In the drawings or its description. However, this method of disclosure does not mean that the subject of the specification requires more features than those mentioned in the claims. In fact, the features of the embodiment are less than all the features of the single embodiment disclosed above.
Some embodiments use numbers describing quantities of components and attributes. It should be understood that such numbers used in the description of the embodiments are, in some examples, qualified by the modifiers "about", "approximately", or "substantially". Unless otherwise stated, "about", "approximately", or "substantially" indicates that the stated number is allowed to vary by ±20%. Accordingly, in some embodiments, the numerical parameters used in the specification and claims are approximations that may change depending on the desired characteristics of individual embodiments. In some embodiments, numerical parameters should take into account the specified number of significant digits and apply an ordinary digit-retention (rounding) method. Although the numerical ranges and parameters used to establish the breadth of the ranges in some embodiments of this specification are approximations, in specific embodiments such numerical values are set as precisely as is feasible.
Each patent, patent application, patent application publication, and other material cited in this specification, such as articles, books, specifications, publications, and documents, is hereby incorporated into this specification by reference in its entirety. Application history documents that are inconsistent with or conflict with the content of this specification are excluded, as are documents (currently or later appended to this specification) that limit the broadest scope of the claims of this specification. It should be noted that if there is any inconsistency or conflict between the descriptions, definitions, and/or use of terms in the materials accompanying this specification and the content of this specification, the descriptions, definitions, and/or use of terms in this specification shall prevail.
Finally, it should be understood that the embodiments described in this specification are only intended to illustrate the principles of the embodiments of this specification. Other variations may also fall within the scope of this specification. Therefore, by way of example and not limitation, alternative configurations of the embodiments of this specification may be regarded as consistent with the teachings of this specification. Accordingly, the embodiments of this specification are not limited to those explicitly introduced and described herein.
Claims (21)
- A method of human-computer interaction, characterized by comprising: extracting an associated feature of a target object based on an interaction instruction directed to the target object, the associated feature being related to at least one of the interaction instruction and historical data of the target object; the associated feature including a first feature corresponding to the interaction instruction; if the interaction instruction does not include voice data, the first feature including a text feature of text data corresponding to the interaction instruction; if the interaction instruction includes voice data, the first feature including at least one of a voice feature of the voice data and a text feature of text data corresponding to the interaction instruction; determining a response strategy for the target object based on processing the associated feature, the response strategy being related to at least one of response content, response style, and response emotion; and determining, based on the response strategy, a response script for the target object.
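Purely as an illustration, and not as part of the claimed subject matter, the Python sketch below arranges the three steps recited in claim 1: extracting the associated feature, determining the response strategy, and determining the response script. Every name in it (extract_features, ResponseStrategy, generate_response, the template contents) is a hypothetical placeholder introduced for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResponseStrategy:
    """Hypothetical container for the three strategy dimensions in claim 1."""
    content: Optional[str] = None   # response content
    style: Optional[str] = None     # response style
    emotion: Optional[str] = None   # response emotion


def extract_features(instruction: dict, history: dict) -> dict:
    """Build the associated feature: a first feature from the instruction and,
    optionally, a second feature from the target object's historical data."""
    features = {"history": history}
    if "audio" in instruction:
        # Voice feature and/or text feature when voice data is present.
        features["voice"] = instruction["audio"]
        features["text"] = instruction.get("text", "")
    else:
        # Text feature only when the instruction carries no voice data.
        features["text"] = instruction["text"]
    return features


def decide_strategy(features: dict) -> ResponseStrategy:
    """Placeholder for the classification models described in claims 5 to 17."""
    return ResponseStrategy(content="greeting", style="concise", emotion="neutral")


def generate_response(strategy: ResponseStrategy) -> str:
    """Placeholder mapping from a strategy to a concrete response script."""
    templates = {("greeting", "concise", "neutral"): "Hello, how can I help you?"}
    return templates.get((strategy.content, strategy.style, strategy.emotion),
                         "Sorry, could you say that again?")


if __name__ == "__main__":
    instruction = {"text": "hi there"}
    strategy = decide_strategy(extract_features(instruction, history={}))
    print(generate_response(strategy))
```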
- The method according to claim 1, characterized in that the voice feature comprises: an energy feature and an audio feature of the voice data.
- The method according to claim 2, characterized in that: the audio feature comprises at least one of a fundamental frequency feature, a short-time energy feature, a short-time amplitude feature, and a short-time zero-crossing rate feature; and the energy feature comprises at least one of an Fbank feature and an MFCC feature.
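As a non-authoritative illustration of the features listed in claim 3, the following sketch computes a fundamental frequency contour, short-time (RMS) energy, short-time zero-crossing rate, log-Mel filterbank (Fbank) energies, and MFCCs with the librosa library; the file name, sampling rate, and filter counts are assumptions made for the example only.

```python
import librosa

# Hypothetical input file and sampling rate; any mono speech clip would do.
y, sr = librosa.load("utterance.wav", sr=16000)

# Fundamental frequency (F0), estimated frame by frame with the YIN algorithm.
f0 = librosa.yin(y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)

# Short-time energy (root-mean-square per frame) and short-time zero-crossing rate.
energy = librosa.feature.rms(y=y)
zcr = librosa.feature.zero_crossing_rate(y)

# Fbank: log-Mel filterbank energies; MFCC: cepstral coefficients derived from them.
fbank = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40))
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

print(f0.shape, energy.shape, zcr.shape, fbank.shape, mfcc.shape)
```

The short-time amplitude feature of claim 3 can be obtained analogously by summing the absolute sample values within each analysis frame.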
- The method according to claim 2, characterized in that the method further comprises: before extracting the voice feature, preprocessing the voice data, the preprocessing comprising at least one of framing, pre-emphasis, windowing, and denoising.
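A minimal numpy sketch of three of the preprocessing steps named in claim 4 (pre-emphasis, framing, and windowing), under assumed frame and hop lengths; denoising is omitted for brevity.

```python
import numpy as np


def preprocess(signal: np.ndarray, sr: int = 16000,
               frame_ms: float = 25.0, hop_ms: float = 10.0,
               pre_emphasis: float = 0.97) -> np.ndarray:
    """Return an array of windowed frames with shape (n_frames, frame_len)."""
    # Pre-emphasis: boost high frequencies, y[n] = x[n] - a * x[n-1].
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])

    # Framing: split the signal into overlapping frames.
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop_len)
    frames = np.stack([emphasized[i * hop_len: i * hop_len + frame_len]
                       for i in range(n_frames)])

    # Windowing: apply a Hamming window per frame to reduce spectral leakage.
    return frames * np.hamming(frame_len)


if __name__ == "__main__":
    dummy = np.random.randn(16000)  # one second of synthetic audio
    print(preprocess(dummy).shape)  # -> (98, 400) with the assumed settings
```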
- The method according to claim 1, characterized in that the interaction instruction includes voice data, and determining the response strategy for the target object based on processing the associated feature comprises: processing the voice feature based on a first classification model to determine whether the voice data contains a specified script in any one of a plurality of languages; and in response to the voice data containing the specified script, determining response content corresponding to the specified script.
- The method according to claim 5, characterized in that the first classification model is obtained through a training process, the training process comprising: obtaining a plurality of first training samples, the plurality of first training samples being first sample voice data in the plurality of languages, the plurality of first training samples including positive samples and negative samples, a positive sample being first sample voice data related to the specified script, and a negative sample being first sample voice data unrelated to the specified script; and training an initial first classification model based on the plurality of first training samples to obtain the first classification model.
- The method according to claim 6, characterized in that obtaining the plurality of first training samples comprises: converting the first sample voice data into first sample text data based on a speech converter corresponding to the language of the first sample voice data; determining, based on the first sample text data, whether the first sample voice data is related to the specified script; in response to the first sample voice data being related to the specified script, determining the first sample voice data as a positive sample; and in response to the first sample voice data being unrelated to the specified script, determining the first sample voice data as a negative sample.
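A minimal sketch of the labelling procedure of claim 7, assuming a pluggable speech-to-text converter per language and treating a sample as related to the specified script when its transcript contains any phrase of that script; the asr_converters mapping and the phrase lists are illustrative assumptions rather than disclosed components.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical per-language speech-to-text converters (one per supported language).
AsrFn = Callable[[bytes], str]


def label_first_training_samples(
        samples: List[Tuple[bytes, str]],          # (raw audio, language code)
        asr_converters: Dict[str, AsrFn],
        specified_script: Dict[str, List[str]],    # language code -> script phrases
) -> List[Tuple[bytes, int]]:
    """Return (audio, label) pairs: 1 = positive (related to the specified
    script), 0 = negative (unrelated), following the steps of claim 7."""
    labelled = []
    for audio, lang in samples:
        # Convert with the converter that matches the sample's language.
        transcript = asr_converters[lang](audio).lower()
        # Related if any phrase of the specified script occurs in the transcript.
        related = any(phrase.lower() in transcript
                      for phrase in specified_script.get(lang, []))
        labelled.append((audio, 1 if related else 0))
    return labelled
```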
- The method according to claim 1, characterized in that the interaction instruction includes voice data, and determining the response strategy for the target object based on processing the associated feature comprises: processing at least one of the voice feature and the text feature based on a second classification model to determine an emotion to be responded to that is conveyed by the voice data; and determining the response emotion based on the emotion to be responded to.
- The method according to claim 8, characterized in that determining the emotion to be responded to of the voice data based on processing at least one of the voice feature and the text feature comprises: processing the voice feature based on the second classification model to determine a first emotion; determining a second emotion based on processing the text feature; and determining the emotion to be responded to based on the first emotion and/or the second emotion.
- The method according to claim 9, characterized in that the text feature includes a keyword feature, the keyword feature including emotion-related words; and determining the second emotion based on processing the text feature comprises: determining the second emotion based on the emotion-related words.
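An illustrative sketch covering claims 9 and 10: an assumed emotion-word lexicon yields the text-based second emotion, and a simple precedence rule fuses it with the first emotion produced by the speech-side model. Both the lexicon contents and the fusion rule are assumptions, not the disclosed implementation.

```python
from typing import Optional

# Hypothetical emotion-related word lexicon for the keyword feature of claim 10.
EMOTION_LEXICON = {
    "angry": ["furious", "ridiculous", "complaint"],
    "happy": ["great", "thanks", "awesome"],
    "sad": ["disappointed", "upset"],
}


def second_emotion_from_text(text: str) -> Optional[str]:
    """Text-based (second) emotion derived from emotion-related keywords."""
    lowered = text.lower()
    for emotion, words in EMOTION_LEXICON.items():
        if any(w in lowered for w in words):
            return emotion
    return None


def fuse_emotions(first: Optional[str], second: Optional[str]) -> str:
    """Emotion to be responded to, from the first and/or second emotion.
    Assumed rule: agreement wins, otherwise prefer the text cue, else speech."""
    if first and second and first == second:
        return first
    return second or first or "neutral"


if __name__ == "__main__":
    first = "neutral"   # e.g. output of the second classification model on voice features
    second = second_emotion_from_text("This is ridiculous, I want a refund")
    print(fuse_emotions(first, second))  # -> "angry"
```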
- The method according to claim 9, characterized in that the second classification model is obtained through a training process, the training process comprising: obtaining a plurality of second training samples, each of the plurality of second training samples including second sample voice data and a corresponding emotion label, the emotion label representing a sample response emotion for the second sample voice data; and training an initial second classification model based on the plurality of second training samples to obtain the second classification model.
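A condensed PyTorch training loop, included only to sketch how an initial second classification model might be fitted to pairs of voice-derived features and emotion labels as described in claim 11; the feature dimensionality, number of emotion classes, and hyperparameters are assumed values.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Assumed toy data: 200 samples of 40-dim averaged Fbank features, 4 emotion classes.
features = torch.randn(200, 40)
labels = torch.randint(0, 4, (200,))
loader = DataLoader(TensorDataset(features, labels), batch_size=32, shuffle=True)

# A small initial classifier standing in for the initial second classification model.
model = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(5):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)   # compare predicted emotion logits with labels
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```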
- The method according to claim 1, characterized in that the associated feature includes a second feature corresponding to the historical data; and determining the response strategy for the target object based on processing the associated feature comprises: processing at least one of the first feature and the second feature based on a third classification model to determine the response style.
- The method according to claim 12, characterized in that processing at least one of the first feature and the second feature based on the third classification model to determine the response style of the target object comprises: processing at least one of the first feature and the second feature based on the third classification model to determine a first style for the target object; determining a second style for the target object based on processing the text feature; and determining the response style based on at least one of the first style and the second style.
- The method according to claim 12, characterized in that the third classification model is obtained through a training process, the training process comprising: obtaining a plurality of third training samples, each of the plurality of third training samples including a sample interaction instruction directed to a sample target object, sample historical data of the sample target object, and a corresponding style label, the style label representing a sample response style for the sample target object; and training an initial third classification model based on the plurality of third training samples to obtain the third classification model.
- The method according to claim 14, characterized in that obtaining the plurality of third training samples comprises: obtaining feedback data of the sample target object on a sample response script, the sample response script being determined based on the sample interaction instruction; obtaining reputation data of the sample target object; and determining the style label based on the reputation data and/or the feedback data.
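A hedged sketch of one way the style label of claim 15 could be derived from reputation data and feedback data; the thresholds and label names are purely illustrative assumptions.

```python
from typing import Optional


def style_label(reputation_score: float, feedback_rating: Optional[float]) -> str:
    """Map a sample target object's reputation score (assumed 0-100) and feedback
    rating (assumed 1-5, or None when no feedback exists) to a style label."""
    # Assumed rule: clearly negative feedback dominates; otherwise use reputation tiers.
    if feedback_rating is not None and feedback_rating <= 2:
        return "apologetic"
    if reputation_score >= 80:
        return "warm"
    if reputation_score >= 50:
        return "neutral"
    return "formal"


if __name__ == "__main__":
    print(style_label(reputation_score=90, feedback_rating=None))  # -> "warm"
```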
- The method according to claim 1, characterized in that the historical data includes at least one of personal account data, behavior data, and offline record data.
- The method according to claim 1, characterized in that determining the response strategy for the target object based on processing the associated feature comprises: processing the associated feature through a model to determine the response strategy for the target object; the model being composed of a multi-layer residual network and a multi-layer fully connected network, the multi-layer residual network being composed of convolutional neural networks; or the model being composed of a multi-layer convolutional neural network and a multi-layer fully connected network.
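As a non-authoritative example of the architecture option in claim 17 (a multi-layer residual network built from convolutional layers followed by a multi-layer fully connected network), here is a compact PyTorch sketch; the use of 1-D convolutions over feature sequences and all layer sizes are assumptions.

```python
import torch
from torch import nn


class ResidualBlock(nn.Module):
    """One residual unit built from 1-D convolutions (input shape: N, C, T)."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x + self.body(x))   # skip connection


class StrategyModel(nn.Module):
    """Multi-layer residual network followed by a multi-layer fully connected head."""
    def __init__(self, in_channels: int = 40, num_classes: int = 4, blocks: int = 3):
        super().__init__()
        self.stem = nn.Conv1d(in_channels, 64, kernel_size=3, padding=1)
        self.res = nn.Sequential(*[ResidualBlock(64) for _ in range(blocks)])
        self.pool = nn.AdaptiveAvgPool1d(1)    # collapse the time axis
        self.fc = nn.Sequential(
            nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, num_classes)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.pool(self.res(self.stem(x))).squeeze(-1)
        return self.fc(h)


if __name__ == "__main__":
    dummy = torch.randn(8, 40, 100)            # batch of 8 feature sequences
    print(StrategyModel()(dummy).shape)        # -> torch.Size([8, 4])
```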
- The method according to claim 1, characterized in that the target object is the user to whom the current terminal belongs; or the target object is a counterpart user in communication with the current terminal.
- The method according to claim 1, characterized in that the response script is output to the target object in the form of response text and/or response speech.
- A system for human-computer interaction, characterized by comprising: an extraction module configured to extract an associated feature of a target object based on an interaction instruction directed to the target object, the associated feature being related to at least one of the interaction instruction and historical data of the target object; the associated feature including a first feature corresponding to the interaction instruction; if the interaction instruction does not include voice data, the first feature including a text feature of text data corresponding to the interaction instruction; if the interaction instruction includes voice data, the first feature including at least one of a voice feature of the voice data and a text feature of text data corresponding to the interaction instruction; a determination module configured to determine a response strategy for the target object based on processing the associated feature, the response strategy being related to at least one of response content, response style, and response emotion; and a response module configured to determine, based on the response strategy, a response script for the target object.
- A computer-readable storage medium storing computer instructions, wherein after a computer reads the computer instructions in the storage medium, the computer performs the following method of human-computer interaction: extracting an associated feature of a target object based on an interaction instruction directed to the target object, the associated feature being related to at least one of the interaction instruction and historical data of the target object; the associated feature including a first feature corresponding to the interaction instruction; if the interaction instruction does not include voice data, the first feature including a text feature of text data corresponding to the interaction instruction; if the interaction instruction includes voice data, the first feature including at least one of a voice feature of the voice data and a text feature of text data corresponding to the interaction instruction; determining a response strategy for the target object based on processing the associated feature, the response strategy being related to at least one of response content, response style, and response emotion; and determining, based on the response strategy, a response script for the target object.
Applications Claiming Priority (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010018047.6A CN111833854B (en) | 2020-01-08 | 2020-01-08 | Man-machine interaction method, terminal and computer readable storage medium |
CN202010017735.0 | 2020-01-08 | ||
CN202010017735.0A CN111833907B (en) | 2020-01-08 | 2020-01-08 | Man-machine interaction method, terminal and computer readable storage medium |
CN202010016725.5 | 2020-01-08 | ||
CN202010018047.6 | 2020-01-08 | ||
CN202010016725.5A CN111833865B (en) | 2020-01-08 | 2020-01-08 | Man-machine interaction method, terminal and computer readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021139737A1 true WO2021139737A1 (en) | 2021-07-15 |
Family
ID=76787752
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/070720 WO2021139737A1 (en) | 2020-01-08 | 2021-01-07 | Method and system for man-machine interaction |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2021139737A1 (en) |
Application filing event (2021-01-07): PCT/CN2021/070720, published as WO2021139737A1 (en), active Application Filing
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106683672A (en) * | 2016-12-21 | 2017-05-17 | 竹间智能科技(上海)有限公司 | Intelligent dialogue method and system based on emotion and semantics |
US20190295533A1 (en) * | 2018-01-26 | 2019-09-26 | Shanghai Xiaoi Robot Technology Co., Ltd. | Intelligent interactive method and apparatus, computer device and computer readable storage medium |
CN109272984A (en) * | 2018-10-17 | 2019-01-25 | 百度在线网络技术(北京)有限公司 | Method and apparatus for interactive voice |
CN109587360A (en) * | 2018-11-12 | 2019-04-05 | 平安科技(深圳)有限公司 | Electronic device should talk with art recommended method and computer readable storage medium |
CN109979457A (en) * | 2019-05-29 | 2019-07-05 | 南京硅基智能科技有限公司 | A method of thousand people, thousand face applied to Intelligent dialogue robot |
CN111833907A (en) * | 2020-01-08 | 2020-10-27 | 北京嘀嘀无限科技发展有限公司 | Man-machine interaction method, terminal and computer readable storage medium |
CN111833854A (en) * | 2020-01-08 | 2020-10-27 | 北京嘀嘀无限科技发展有限公司 | Man-machine interaction method, terminal and computer readable storage medium |
CN111833865A (en) * | 2020-01-08 | 2020-10-27 | 北京嘀嘀无限科技发展有限公司 | Man-machine interaction method, terminal and computer readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112259106B (en) | Voiceprint recognition method and device, storage medium and computer equipment | |
US10977452B2 (en) | Multi-lingual virtual personal assistant | |
US11908468B2 (en) | Dialog management for multiple users | |
WO2021208287A1 (en) | Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium | |
JP7022062B2 (en) | VPA with integrated object recognition and facial expression recognition | |
CN111312245B (en) | Voice response method, device and storage medium | |
KR100586767B1 (en) | System and method for multi-modal focus detection, referential ambiguity resolution and mood classification using multi-modal input | |
CN112069484A (en) | Multi-mode interactive information acquisition method and system | |
US11574637B1 (en) | Spoken language understanding models | |
WO2021047319A1 (en) | Voice-based personal credit assessment method and apparatus, terminal and storage medium | |
KR20210070213A (en) | Voice user interface | |
CN109119069B (en) | Specific crowd identification method, electronic device and computer readable storage medium | |
CN111383138B (en) | Restaurant data processing method, device, computer equipment and storage medium | |
CN110459242A (en) | Change of voice detection method, terminal and computer readable storage medium | |
CN114127849A (en) | Speech emotion recognition method and device | |
WO2023226239A1 (en) | Object emotion analysis method and apparatus and electronic device | |
CN109947971A (en) | Image search method, device, electronic equipment and storage medium | |
US11437043B1 (en) | Presence data determination and utilization | |
CN117352000A (en) | Speech classification method, device, electronic equipment and computer readable medium | |
CN114373443A (en) | Speech synthesis method and apparatus, computing device, storage medium, and program product | |
CN110809796B (en) | Speech recognition system and method with decoupled wake phrases | |
CN117690456A (en) | Small language spoken language intelligent training method, system and equipment based on neural network | |
CN115132195B (en) | Voice wakeup method, device, equipment, storage medium and program product | |
WO2021139737A1 (en) | Method and system for man-machine interaction | |
CN113053409B (en) | Audio evaluation method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21738696; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 21738696; Country of ref document: EP; Kind code of ref document: A1 |