CN114154636A - Data processing method, electronic device and computer program product

Info

Publication number
CN114154636A
Authority
CN
China
Prior art keywords
feedback
voice
interaction
data
determining
Prior art date
2021-11-29
Legal status
Pending
Application number
CN202111428251.6A
Other languages
Chinese (zh)
Inventor
朱益
赵冬迪
钱能锋
鲍懋
韩翀蛟
王欣
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date
2021-11-29
Filing date
2021-11-29
Publication date
2022-03-08
Application filed by Alibaba China Co Ltd
Priority to CN202111428251.6A
Publication of CN114154636A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

An embodiment of the present application provides a data processing method, an electronic device and a computer program product. The data processing method includes: acquiring interaction data of a current interaction, wherein the interaction data includes at least one of the following: feature data of the voice of an interactive object, emotion data, and interactive-environment feature data; determining, according to the interaction data, a language style corresponding to the feedback voice of the current interaction; determining a feedback sentence text according to the language style; and converting the feedback sentence text into feedback voice of a tone corresponding to the language style, so as to interact with the interactive object using the feedback voice. The method can improve the intelligence of the interaction.

Description

Data processing method, electronic device and computer program product
Technical Field
Embodiments of the present application relate to the field of computer technology, and in particular to a data processing method, an electronic device and a computer program product.
Background
With the spread of deep learning, speech recognition technology has developed rapidly, and smart devices based on it (such as smart speakers) are becoming increasingly popular. For ease of use, a smart device can be woken by voice and can interact with the user by voice. However, the voice-interaction feedback of current smart devices is monotonous and cannot adapt to complex and changeable usage environments.
Disclosure of Invention
In view of the above, embodiments of the present application provide a data processing scheme to at least partially solve the above problems.
According to a first aspect of the embodiments of the present application, there is provided a data processing method, including: acquiring interaction data of a current interaction, wherein the interaction data includes at least one of the following: feature data of the voice of an interactive object, emotion data, and interactive-environment feature data; determining, according to the interaction data, a language style corresponding to the feedback voice of the current interaction; determining a feedback sentence text according to the language style; and converting the feedback sentence text into feedback voice of a tone corresponding to the language style, so as to interact with the interactive object using the feedback voice.
According to a second aspect of the embodiments of the present application, there is provided a data processing apparatus, including: an obtaining module, configured to obtain interaction data of a current interaction, wherein the interaction data includes at least one of the following: feature data of the voice of an interactive object, emotion data, and interactive-environment feature data; a first determining module, configured to determine, according to the interaction data, the language style corresponding to the feedback voice of the current interaction; a second determining module, configured to determine a feedback sentence text according to the language style; and a conversion module, configured to convert the feedback sentence text into feedback voice of a tone corresponding to the language style, so as to interact with the interactive object using the feedback voice.
According to a third aspect of the embodiments of the present application, there is provided an intelligent speech device, including: a loudspeaker and a processor, wherein the processor is configured to acquire interaction data of a current interaction, the interaction data including at least one of the following: feature data of the voice of an interactive object, emotion data, and interactive-environment feature data; determine, according to the interaction data, the language style corresponding to the feedback voice of the current interaction; determine a feedback sentence text according to the language style; convert the feedback sentence text into feedback voice of a tone corresponding to the language style; and send the feedback voice to the loudspeaker, the loudspeaker being configured to play the feedback voice so as to interact with the interactive object.
According to a fourth aspect of the embodiments of the present application, there is provided an electronic device, including: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another via the communication bus; the memory is configured to store at least one executable instruction that causes the processor to perform the operations corresponding to the method according to the first aspect.
According to a fifth aspect of the embodiments of the present application, there is provided a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the method according to the first aspect.
According to a sixth aspect of the embodiments of the present application, there is provided a computer program product including computer instructions that instruct a computing device to perform the operations corresponding to the method described above.
According to the solutions provided by the embodiments of the present application, a language style suitable for the feedback voice is selected based on the emotion of the interactive object or the information about the interactive environment indicated by the interaction data, the corresponding feedback sentence text is determined according to that language style, and the feedback sentence text is converted into feedback voice of a tone corresponding to the language style. Feedback is thus given dynamically in different language styles in different scenes, which makes the feedback easier for the interactive object to understand and accept, makes the interaction more coherent, and achieves more natural and more intelligent interaction between the interactive object and the smart device.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some of the embodiments of the present application, and those skilled in the art can derive other drawings from them.
FIG. 1 is a schematic view of a usage scenario according to an embodiment of the present application;
FIG. 2 is a flow chart illustrating the steps of a data processing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of feature data of speech in different language styles according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating the flow of step S204 of a data processing method according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating the flow of step S206 of a data processing method according to an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating the flow of step S208 of a data processing method according to an embodiment of the present application;
FIG. 7 is a flow diagram of a usage scenario according to an embodiment of the present application;
FIG. 8 is a block diagram of a data processing apparatus according to a second embodiment of the present application;
FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions in the embodiments of the present application, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application shall fall within the protection scope of the embodiments of the present application.
The following further describes specific implementations of embodiments of the present application with reference to the drawings of the embodiments of the present application.
The present application relates to solutions for intelligent interaction based on the speech, emotion and the like of an interactive object, such as a user. The voice of the interactive object can be collected by a sound-pickup device such as a microphone mounted on the smart device, or acquired from another device connected to the smart device (such as a mobile phone, a tablet or a computer). The smart device is, for example, a smart speaker, but may also be any other device capable of interacting with the interactive object, such as a smart television or a smart watch.
Taking a smart speaker as an example, FIG. 1 shows a schematic diagram of a scene in which an interactive object interacts with the smart speaker. The interactive object controls the smart speaker 100, or other devices connected to it, through voice commands. For example, the interactive object instructs the smart speaker 100 by voice to play music or to report the weather at its location; or it controls a lamp connected to the smart speaker 100 to turn on or off, or a curtain connected to the smart speaker 100 to roll up or down, and so on.
While the interactive object interacts with the smart device by voice, the smart device needs to respond to the interactive object's speech by voice, so that the interactive object knows whether its voice command has been received and responded to. Moreover, in some scenes the voice command of the interactive object may lack information, and the smart device may need to converse with the interactive object through voice feedback to obtain the information required to respond to the command. This requires the smart device to be able to give voice feedback. The existing voice feedback of smart devices is basically based on TTS (Text To Speech) technology, i.e., the feedback text is simply converted into speech and read out. As a result, the feedback voice of the smart device is monotonous, its volume may be inappropriate, and so on, which easily gives the interactive object a poor experience.
To solve this problem, an embodiment of the present application provides a data processing method. As shown in FIG. 2, the method includes the following steps:
Step S202: acquiring interaction data of the current interaction.
An interaction process between an interactive object and a smart device may include multiple rounds of interaction; for example, one question and one response may constitute a round of interaction (which may also be regarded as one interaction). The current interaction may be the latest round of interaction between the interactive object and the smart device; of course, a round that needs processing may also be selected from the completed rounds as the current interaction, which is not limited in this embodiment.
The interaction data includes at least one of the following: feature data of the voice of the interactive object, emotion data, and interactive-environment feature data.
The voice data of the interactive object may be speech collected by a microphone mounted on the smart device, speech read from a storage device, speech acquired from another device connected to the smart device, and so on. The feature data of the speech may be a spectrum of the speech. By suitable processing of the speech, which may be chosen as needed, a spectrum of the speech can be obtained. This spectrum indicates, to a certain extent, the language style that the interactive object considers suitable in the current environment, or in other words it represents the language style the interactive object prefers. FIG. 3 illustrates feature data of speech in three different language styles. Language styles include, but are not limited to: normal, whisper and auditory feedback.
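As a minimal illustrative sketch (not part of the patent text, which leaves the exact processing open), the spectral feature data could be computed roughly as follows; the use of librosa, the mel scale and every parameter value below are assumptions.

```python
import librosa
import numpy as np

def extract_speech_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Load an utterance and compute a log-mel spectrogram as feature data.

    The embodiment only says the feature data "may be a spectrum of the
    speech"; the mel scale and parameter values are illustrative assumptions.
    """
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=80)
    return librosa.power_to_db(mel, ref=np.max)  # shape: (80, n_frames)
```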
Here, "normal" indicates that the pitch, intonation and so on of the voice are moderate, and "whisper" indicates a low, quiet voice. Auditory feedback mainly involves the pitch, duration, intensity and timbre of the perceived sound and the psychological mechanisms for feeding it back. Auditory-feedback capability, an important component of human abstract language capability, plays an extremely important role in spoken interaction. Self-monitoring is a processing mechanism of auditory feedback: for example, when environmental noise increases, the acoustic characteristics of the interactive object's voice may change under physiological and psychological influences, e.g., the sound intensity increases. This phenomenon is also known as the Lombard effect.
The emotion data may be acquired based on the voice, facial expression, body movement and so on of the interactive object. For example, the interactive object may smile or show happy micro-expressions when pleased, and may frown when displeased. Its emotion is also reflected in the speech, such as a lowered intonation. In a specific implementation, images of the interactive object can be captured and multimodal analysis performed on the images and speech to obtain multimodal emotion data. If the interactive object and the smart device have already performed one or more rounds of interaction, the emotion data characterizes, to a certain extent, the interactive object's attitude toward the smart device's previous feedback voice, such as whether the interactive object likes that language style.
The interactive-environment feature data may be obtained based on captured environmental sounds. It indicates at least the magnitude of noise in the interactive environment, but is not limited to this.
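A minimal sketch of deriving such a noise indicator from captured environment audio; the RMS/dBFS measure and the threshold value are assumptions made for illustration, not taken from the embodiment.

```python
import numpy as np

def ambient_noise_dbfs(env_audio: np.ndarray) -> float:
    """Estimate the loudness of captured environment sound in dBFS."""
    rms = np.sqrt(np.mean(np.square(env_audio.astype(np.float64))))
    return 20.0 * np.log10(max(rms, 1e-10))  # 0 dBFS = full-scale signal

def is_noisy(env_audio: np.ndarray, threshold_dbfs: float = -30.0) -> bool:
    # The -30 dBFS threshold is an assumed tuning value, not from the patent.
    return ambient_noise_dbfs(env_audio) > threshold_dbfs
```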
Step S204: determining, according to the interaction data, the language style corresponding to the feedback voice of the current interaction.
To make the feedback voice more anthropomorphic and intelligent and to improve the richness and realism of the interaction, the language style corresponding to the feedback voice is determined from the interaction data, so that the feedback voice better fits the scene.
For example, in one specific implementation, as shown in FIG. 4, step S204 may be implemented by sub-steps S2041 and S2042 described below.
Sub-step S2041: determining the language style of the interactive object according to the feature data of the speech in the interaction data.
For example, the feature data of the speech is matched against preset feature data corresponding to different language styles, and the language style of the preset feature data with the highest similarity is taken as the language style of the speech in the interaction data.
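Such matching could look roughly like the sketch below; the cosine measure, the time-axis averaging and the placeholder presets are all assumptions for illustration.

```python
import numpy as np

# Hypothetical preset feature data per style (e.g. averaged log-mel spectra);
# in practice these would come from reference recordings of each style.
STYLE_PRESETS = {
    "normal": np.random.rand(80),            # placeholders for illustration
    "whisper": np.random.rand(80),
    "auditory_feedback": np.random.rand(80),
}

def match_language_style(features: np.ndarray) -> str:
    """Return the style whose preset features are most similar (cosine)."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))
    # Collapse the time axis to a fixed-length vector; an assumed simplification.
    vec = features.mean(axis=-1) if features.ndim > 1 else features
    return max(STYLE_PRESETS, key=lambda s: cosine(vec, STYLE_PRESETS[s]))
```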
Alternatively, in other examples, the feature data of the speech may be input into a neural network model, recognized by the model, and the recognized language style output. The language style may also be determined in other suitable ways, which is not limited here.
The language style may be normal, whisper, auditory feedback and so on. Feature data (a spectrum) of the normal language style is shown in FIG. 3(a), feature data of the whisper style in FIG. 3(c), and feature data of the auditory-feedback style in FIG. 3(b).
Sub-step S2042: determining the language style corresponding to the feedback voice of the current interaction according to the interactive-environment feature data, the language style of the interactive object, and the emotion data in the interaction data.
In one example, besides the magnitude of noise in the interactive environment, the interactive-environment feature data may also indicate relevant attributes of the interactive object (e.g., gender, age, etc.). The language style corresponding to the feedback voice of the current interaction is determined based on the interactive-environment feature data, the language style of the interactive object, and the emotion data.
For example, if the interactive-environment feature data indicates that the environment is noisy, the language style of the interactive object is "auditory feedback", and the emotion data of the interactive object indicates "unpleasant", this suggests that the interactive environment is noisy, that the interactive object has difficulty hearing the smart device's feedback voice, or that it is dissatisfied with the language style of the previous feedback sentence. The language style corresponding to the feedback voice of the current interaction may therefore be "auditory feedback", that is, some correction needs to be made in the current feedback sentence.
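A toy rule combining the three signals, paraphrasing only the single example given above; an actual implementation would presumably use a learned model or a much fuller rule table.

```python
def decide_feedback_style(noisy_env: bool, object_style: str, emotion: str) -> str:
    """Map (environment, user style, emotion) to a feedback language style.

    Only the first rule is drawn from the worked example in the text; the
    other branches are assumed defaults added to make the sketch total.
    """
    if noisy_env and object_style == "auditory_feedback" and emotion == "unpleasant":
        # Noisy scene + dissatisfied user: correct/repair in the next feedback.
        return "auditory_feedback"
    if object_style == "whisper":
        return "whisper"  # mirror the user's quiet, private register
    return "normal"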
Step S206: determining a feedback sentence text according to the language style.
In one example, as shown in FIG. 5, step S206 may be implemented by sub-steps S2061 and S2062.
Sub-step S2061: determining, from the candidate feedback sentence texts corresponding to the language style, a candidate feedback sentence text matching the semantics of the speech of the interactive object.
One or more candidate feedback sentence texts may be preset for each language style. For example, candidate texts preset for the normal style include: "I'm here, go ahead", "Yes, what is it", and so on. Candidate texts preset for the whisper style include: "Mm-hm", and the like. Candidate texts preset for the auditory-feedback style include: "I was a bit loud just now, what do you need", "I'll speak more quietly, could you say that again", and so on.
The same language style may contain multiple candidate feedback sentence texts with different semantics, to handle different conversations. If the language style of the current interaction is auditory feedback, the candidate feedback sentence texts corresponding to auditory feedback are selected first, and then a candidate matching the semantics of the interactive object's speech is chosen from them, preventing irrelevant answers.
A specific implementation is, for example: using a trained neural network model to select, from the candidate feedback sentence texts, a candidate whose semantics match the speech of the interactive object. The neural network model can learn the context of the interaction process and then select a candidate feedback sentence text that fits the semantics of the interactive object's speech.
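Since the embodiment does not disclose the selection model itself, the sketch below substitutes an off-the-shelf sentence-embedding similarity search as a stand-in for the trained model; the sentence-transformers model name is an assumption.

```python
from sentence_transformers import SentenceTransformer, util

# A stand-in for the trained selection model described in the embodiment;
# the model name below is an assumption, not something the patent specifies.
_encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def select_candidate(user_utterance: str, candidates: list[str]) -> str:
    """Pick the candidate feedback sentence closest in meaning to the query."""
    query = _encoder.encode(user_utterance, convert_to_tensor=True)
    scores = util.cos_sim(query,
                          _encoder.encode(candidates, convert_to_tensor=True))
    return candidates[int(scores.argmax())]
```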
Sub-step S2062: determining the feedback sentence text according to the candidate feedback sentence text.
The way the feedback sentence text is determined from the candidate may differ between language styles. For example, if the language style is normal or whisper, the candidate feedback sentence text may be used directly as the feedback sentence text. If the language style is auditory feedback, insertion words (interjections) may be acquired, and the feedback sentence text determined from the insertion words and the candidate feedback sentence text. Insertion words include, for example, "uh", "hmm", "oh", "right" and the like. An insertion word is added at an appropriate position in the candidate feedback sentence text, and the candidate text with the insertion word added is used as the feedback sentence text.
For example, if the candidate feedback sentence text is "I was a bit loud just now, what do you need" and the insertion word is "uh", the feedback sentence text may be "Uh, I was a bit loud just now, what do you need?".
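A sketch of the insertion-word step; prepending the interjection is the simplest placement consistent with the worked example, and the English interjection list is only an approximation of the original fillers.

```python
import random

# Interjections ("insertion words") for the auditory-feedback style.
INSERTION_WORDS = ["Uh", "Hmm", "Oh", "Right", "Er"]

def add_insertion_word(candidate_text: str) -> str:
    """Prepend an interjection to form the final feedback sentence text.

    The embodiment says interjections go at "appropriate positions";
    prepending is an assumed simplification matching its worked example.
    """
    return f"{random.choice(INSERTION_WORDS)}, {candidate_text}"
```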
Step S208: converting the feedback sentence text into feedback voice of the tone corresponding to the language style, so as to interact with the interactive object using the feedback voice.
In a specific implementation, as shown in FIG. 6, step S208 may be implemented by the following sub-steps.
Sub-step S2081: inputting the feedback sentence text into a neural network model to obtain candidate feedback voices output by the model.
For example, the feedback sentence text is input into a neural network model, and candidate feedback voices with the tones corresponding to multiple different language styles are obtained from the model's output.
In this embodiment, the neural network model is used to convert the input feedback sentence text into candidate feedback voices of different language styles.
Of course, in other embodiments, the neural network model may be used to output only the feedback voice corresponding to the language style of the current interaction, which is not limited here.
As shown in FIG. 7, one usable neural network model includes an attention-based seq2seq network and a recurrent neural network (RNN). It can be obtained by fine-tuning a pre-trained neural network model with generalization capability, the pre-trained model itself being obtained by training with samples of different language styles. The feedback sentence text is input into the attention-based seq2seq network of the trained model and processed into a mel spectrum; the mel spectrum is input into the recurrent neural network and further processed, so that candidate feedback voices of different language styles, represented as waveforms, are output.
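The toy PyTorch sketch below mirrors the FIG. 7 pipeline (text tokens to mel frames via an attention-based seq2seq, then mel frames to a waveform via an RNN). Every dimension, the single-head dot-product attention and the one-sample-per-frame vocoder are assumptions, not the patent's actual architecture.

```python
import torch
import torch.nn as nn

class TextToMel(nn.Module):
    """Minimal attention-based seq2seq: character ids -> mel frames."""
    def __init__(self, vocab_size=100, emb=128, hid=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.decoder_cell = nn.GRUCell(n_mels + hid, hid)
        self.to_mel = nn.Linear(hid, n_mels)
        self.n_mels = n_mels

    def forward(self, token_ids, n_frames=200):
        enc_out, h = self.encoder(self.embed(token_ids))  # enc_out: (B, T, hid)
        state = h[0]                                      # (B, hid)
        frame = torch.zeros(token_ids.size(0), self.n_mels)
        mels = []
        for _ in range(n_frames):
            # Dot-product attention over the encoder states.
            attn = torch.softmax(torch.bmm(enc_out, state.unsqueeze(2)), dim=1)
            context = (attn * enc_out).sum(dim=1)         # (B, hid)
            state = self.decoder_cell(torch.cat([frame, context], dim=1), state)
            frame = self.to_mel(state)
            mels.append(frame)
        return torch.stack(mels, dim=1)                   # (B, n_frames, n_mels)

class MelToWave(nn.Module):
    """Toy recurrent 'vocoder': mel frames -> one waveform sample per frame."""
    def __init__(self, n_mels=80, hid=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hid, batch_first=True)
        self.out = nn.Linear(hid, 1)

    def forward(self, mels):
        hidden, _ = self.rnn(mels)
        return torch.tanh(self.out(hidden)).squeeze(-1)   # (B, n_frames)

text_to_mel, vocoder = TextToMel(), MelToWave()
wave = vocoder(text_to_mel(torch.randint(0, 100, (1, 16))))  # toy run: (1, 200)
```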
For example, the feedback sentence text is input into the neural network model, which outputs a waveform corresponding to the normal language style, a waveform corresponding to the whisper style, and a waveform corresponding to the auditory-feedback style.
Optionally, the neural network model of this embodiment may be trained iteratively with the interaction voices and feedback sentence texts of historical rounds during the interaction with the interactive object, so that it interacts better with the interactive object.
Sub-step S2082: determining, from the candidate feedback voices, the feedback voice that matches the tone corresponding to the language style.
If the neural network model outputs feedback voices of multiple different language styles, sub-step S2082 may be implemented as: selecting, from the candidate feedback voices of the different language styles, the one corresponding to the language style of the current interaction as the target feedback voice, processing the target feedback voice, and using the processed target feedback voice as the feedback voice matching the tone corresponding to the language style.
For example, if the language style of the current interaction is whisper, the candidate feedback voice corresponding to whisper is selected as the target feedback voice. To further improve anthropomorphism and realism, the target feedback voice can be processed so that it better fits the context of the current interaction and the intended semantics.
In a specific example, processing the target feedback voice may include: performing at least one of fundamental-frequency adjustment, energy transfer, vowel elongation and formant adjustment on the target feedback voice according to at least one of the interactive-environment feature data and the semantics of the feedback sentence text.
Since background noise has no significant influence on speech rate or pitch but has a significant influence on sound intensity, if the interactive-environment feature data indicates heavy noise, the sound intensity can be enhanced by performing at least one of fundamental-frequency adjustment, energy transfer, vowel elongation and formant adjustment on the target feedback voice, so that the interactive object can hear the feedback voice more easily.
Because the semantics and the context of the feedback sentence text influence the sound intensity and the pitch, when different semantics need to be expressed, at least one of fundamental-frequency adjustment, energy transfer, vowel elongation and formant adjustment can likewise be performed on the target feedback voice, so that the resulting feedback voice matches the semantics to be expressed.
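A rough sketch of such post-processing with librosa: amplitude scaling stands in for energy transfer, pitch_shift for fundamental-frequency adjustment, and global time stretching is a crude proxy for vowel elongation (formant adjustment is omitted). All gains and factors are assumed tuning values.

```python
import librosa
import numpy as np

def lombard_adjust(y: np.ndarray, sr: int, noisy: bool, gain: float = 1.6,
                   pitch_steps: float = 1.0, stretch: float = 0.95) -> np.ndarray:
    """Lombard-style enhancement of the target feedback voice in noisy scenes."""
    if not noisy:
        return y
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=pitch_steps)  # raise F0
    y = librosa.effects.time_stretch(y, rate=stretch)  # rate<1 lengthens speech
    return np.clip(y * gain, -1.0, 1.0)                # boost intensity, no clipping
```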
After the feedback voice is obtained, it can be played through a device such as a loudspeaker mounted on the smart device, so that the interactive object hears it. Of course, in other embodiments the feedback voice may also be sent through a network or the like to other connected devices, which play it, as long as the interactive object can hear it.
The method enables self-monitoring by the smart device: this self-monitoring is the auditory feedback. During the interaction, when the smart device determines that the interactive object's response does not meet its expectation, or that its own feedback was wrong, it can recognize the error in its own feedback voice and, according to self-correction logic (such as self-interruption) or by adding insertion words, perform self-repair on the feedback voice, so that the final feedback voice is more emotional and anthropomorphic.
In this way, the smart device can monitor and adjust its own feedback voice based on the speech, emotion and so on of the interactive object, and, with a suitable network architecture and data processing scheme, produce different kinds of voice feedback from limited data; meanwhile, the auditory-feedback mechanism makes the feedback voice more emotional and natural.
The method can also adjust the language style of the feedback dynamically under different contexts and interactive environments, to adapt to the current context and environment. For example, in a quieter environment the voice interaction between the interactive object and the smart device may take a more personal, private form, while in a noisier environment it may take a form better suited to noise, for example with greater sound intensity and a distinctive pitch.
In summary, the method of this embodiment selects a language style suitable for the feedback voice based on the emotion of the interactive object or the information about the interactive environment indicated by the interaction data, determines the corresponding feedback sentence text according to that language style, and converts the feedback sentence text into feedback voice of a tone corresponding to the language style. Feedback is thus given dynamically in different language styles in different scenes, which makes the feedback easier for the interactive object to understand and accept, makes the interaction more harmonious, and achieves more natural and more intelligent interaction between the interactive object and the smart device.
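To tie the steps together, the sketch below composes the illustrative helpers from the earlier examples into one S202-S208 pipeline; encode_text is a toy tokenizer invented here, and none of the names come from the patent itself.

```python
import torch

def encode_text(text: str) -> torch.Tensor:
    # Toy character-level tokenizer; a real system would use a trained one.
    return torch.tensor([[min(ord(c), 99) for c in text]])

def respond(user_wav, user_text, env_audio, emotion, candidates_by_style):
    """Sketch of steps S202-S208, composing the earlier illustrative helpers."""
    features = extract_speech_features(user_wav)                    # S202
    noisy = is_noisy(env_audio)
    style = decide_feedback_style(noisy,                            # S204
                                  match_language_style(features), emotion)
    text = select_candidate(user_text, candidates_by_style[style])  # S206
    if style == "auditory_feedback":
        text = add_insertion_word(text)
    wave = vocoder(text_to_mel(encode_text(text)))                  # S208
    return lombard_adjust(wave.detach().numpy()[0], sr=16000, noisy=noisy)
```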
The method of this embodiment may be performed by any suitable electronic device with data processing capabilities, including but not limited to: servers, mobile terminals (such as mobile phones and tablets), PCs, and so on.
Example Two
Referring to FIG. 8, there is shown a block diagram of a data processing apparatus according to the second embodiment of the present application.
The device includes:
an obtaining module 802, configured to obtain interaction data of the current interaction, wherein the interaction data includes at least one of the following: feature data of the voice of the interactive object, emotion data, and interactive-environment feature data;
a first determining module 804, configured to determine, according to the interaction data, the language style corresponding to the feedback voice of the current interaction;
a second determining module 806, configured to determine a feedback sentence text according to the language style;
a conversion module 808, configured to convert the feedback sentence text into feedback voice of the tone corresponding to the language style, so as to interact with the interactive object using the feedback voice.
Optionally, the first determining module 804 is configured to determine the language style of the interactive object according to the feature data of the voice in the interaction data, and to determine the language style corresponding to the feedback voice of the current interaction according to the interactive-environment feature data, the language style of the interactive object, and the emotion data in the interaction data.
Optionally, the second determining module 806 is configured to determine, from the candidate feedback sentence texts corresponding to the language style, a candidate feedback sentence text matching the semantics of the voice of the interactive object, and to determine the feedback sentence text according to the candidate feedback sentence text.
Optionally, when determining the feedback sentence text according to the candidate feedback sentence text, the second determining module 806 is configured to acquire an insertion word if the language style is an auditory-feedback style, and to determine the feedback sentence text according to the insertion word and the candidate feedback sentence text.
Optionally, the conversion module 808 is configured to input the feedback sentence text into a neural network model to obtain candidate feedback voices output by the model, and to determine, from the candidate feedback voices, the feedback voice matching the tone corresponding to the language style.
Optionally, when inputting the feedback sentence text into the neural network model, the conversion module 808 is configured to obtain candidate feedback voices of tones corresponding to a plurality of different language styles output by the model; and when determining the feedback voice matching the tone corresponding to the language style, it is configured to select, from those candidates, the candidate feedback voice corresponding to the language style of the current interaction as the target feedback voice, and to process the target feedback voice, using the processed target feedback voice as the feedback voice matching the tone corresponding to the language style.
Optionally, when processing the target feedback voice, the conversion module 808 is configured to perform at least one of fundamental-frequency adjustment, energy transfer, vowel elongation and formant adjustment on the target feedback voice according to at least one of the interactive-environment feature data and the semantics of the feedback sentence text.
The apparatus of this embodiment is used to implement the corresponding method in the foregoing method embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described herein again. In addition, the functional implementation of each module in the apparatus of this embodiment can refer to the description of the corresponding part in the foregoing method embodiment, and is not described herein again.
Example Three
This embodiment provides an intelligent speech device, including: a loudspeaker and a processor, wherein the processor is configured to acquire interaction data of the current interaction; determine, according to the interaction data, the language style corresponding to the feedback voice of the current interaction; determine a feedback sentence text according to the language style; convert the feedback sentence text into feedback voice of the tone corresponding to the language style; and send the feedback voice to the loudspeaker, the loudspeaker being configured to play the feedback voice so as to interact with the interactive object.
The intelligent speech device may be any device equipped with a loudspeaker and a processor, such as a smart speaker, smart watch, smart television or smart projector. The processor performs the data processing; for example, it may receive the speech of the interactive object, environmental sound and other multimodal data, and from these obtain the interaction data of the current interaction. The interaction data includes at least one of the following: feature data of the voice of the interactive object, emotion data, and interactive-environment feature data.
The processor determines the language style of the current interaction, such as normal, whisper or auditory feedback, based on the interaction data, and then determines the corresponding feedback sentence text based on that style. For example, if the language style is auditory feedback, the feedback sentence text may contain insertion words such as "uh..." or a correction of the previous feedback sentence such as "sorry, I meant...". The determined feedback sentence text is converted into feedback voice of the tone corresponding to the language style, so that the tone and wording of the feedback voice are more varied and richer, and the interaction is more anthropomorphic.
The loudspeaker plays the feedback voice to interact with the interactive object. The smart device can thus interact with the interactive object in a more anthropomorphic and intelligent way, improving the interaction effect.
Example Four
Referring to FIG. 9, there is shown a schematic structural diagram of an electronic device according to the fourth embodiment of the present application; the embodiments of the present application do not limit the specific implementation of the electronic device.
As shown in fig. 9, the electronic device may include: a processor (processor)902, a communication Interface 904, a memory 906, and a communication bus 908.
Wherein:
the processor 902, communication interface 904, and memory 906 communicate with one another via a communication bus 908.
A communication interface 904 for communicating with other electronic devices or servers.
The processor 902 is configured to execute the program 910, and may specifically perform the relevant steps in the above method embodiments.
In particular, the program 910 may include program code that includes computer operating instructions.
The processor 902 may be a CPU, an Application-Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application. The smart device includes one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
A memory 906 for storing a program 910. The memory 906 may comprise high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
The program 910 may be specifically configured to enable the processor 902 to execute operations corresponding to the above-described methods.
For the specific implementation of each step in the program 910, reference may be made to the corresponding steps and unit descriptions in the foregoing method embodiments, which are not repeated here. Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the devices and modules described above may likewise refer to the corresponding process descriptions in the foregoing method embodiments.
The embodiment of the present application further provides a computer program product, which includes computer instructions for instructing a computing device to execute an operation corresponding to any one of the methods in the foregoing method embodiments.
Embodiments of the present application also provide a computer storage medium, on which a computer program is stored, which when executed by a processor implements the method as described above.
It should be noted that, according to the implementation requirement, each component/step described in the embodiment of the present application may be divided into more components/steps, and two or more components/steps or partial operations of the components/steps may also be combined into a new component/step to achieve the purpose of the embodiment of the present application.
The above-described methods according to the embodiments of the present application may be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium such as a CD-ROM, RAM, floppy disk, hard disk or magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded through a network, and stored in a local recording medium, so that the methods described herein can be processed by such software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or FPGA. It will be appreciated that a computer, processor, microprocessor controller or programmable hardware includes storage components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code which, when accessed and executed by the computer, processor or hardware, implements the methods described herein. Further, when a general-purpose computer accesses code for implementing the methods shown herein, execution of the code transforms the general-purpose computer into a special-purpose computer for performing those methods.
Those of ordinary skill in the art will appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
The above embodiments are only used for illustrating the embodiments of the present application, and not for limiting the embodiments of the present application, and those skilled in the relevant art can make various changes and modifications without departing from the spirit and scope of the embodiments of the present application, so that all equivalent technical solutions also belong to the scope of the embodiments of the present application, and the scope of patent protection of the embodiments of the present application should be defined by the claims.

Claims (11)

1. A data processing method, comprising:
acquiring interaction data of a current interaction, wherein the interaction data comprises at least one of the following: feature data of the voice of an interactive object, emotion data, and interactive-environment feature data;
determining, according to the interaction data, a language style corresponding to the feedback voice of the current interaction;
determining a feedback sentence text according to the language style; and
converting the feedback sentence text into feedback voice of a tone corresponding to the language style, so as to interact with the interactive object using the feedback voice.
2. The method of claim 1, wherein the determining, according to the interaction data, a language style corresponding to the feedback voice of the current interaction comprises:
determining the language style of the interactive object according to the feature data of the voice in the interaction data; and
determining the language style corresponding to the feedback voice of the current interaction according to the interactive-environment feature data, the language style of the interactive object, and the emotion data in the interaction data.
3. The method of claim 1, wherein the determining a feedback sentence text according to the language style comprises:
determining, from candidate feedback sentence texts corresponding to the language style, a candidate feedback sentence text matching the semantics of the voice of the interactive object; and
determining the feedback sentence text according to the candidate feedback sentence text.
4. The method of claim 3, wherein the determining the feedback sentence text according to the candidate feedback sentence text comprises:
if the language style is an auditory-feedback style, acquiring an insertion word; and
determining the feedback sentence text according to the insertion word and the candidate feedback sentence text.
5. The method of claim 1, wherein the converting the feedback sentence text into feedback voice of a tone corresponding to the language style comprises:
inputting the feedback sentence text into a neural network model to obtain candidate feedback voices output by the neural network model; and
determining, according to the candidate feedback voices, the feedback voice that matches the tone corresponding to the language style.
6. The method of claim 5, wherein the inputting the feedback sentence text into a neural network model to obtain candidate feedback voices output by the neural network model comprises:
inputting the feedback sentence text into the neural network model to obtain candidate feedback voices of tones corresponding to a plurality of different language styles output by the neural network model;
and the determining, according to the candidate feedback voices, the feedback voice that matches the tone corresponding to the language style comprises:
selecting, from the candidate feedback voices corresponding to the plurality of different language styles, the candidate feedback voice corresponding to the language style of the current interaction as a target feedback voice, and processing the target feedback voice, so that the processed target feedback voice is used as the feedback voice matching the tone corresponding to the language style.
7. The method of claim 6, wherein the processing the target feedback voice comprises:
performing at least one of fundamental-frequency adjustment, energy transfer, vowel elongation and formant adjustment on the target feedback voice according to at least one of the interactive-environment feature data and the semantics of the feedback sentence text.
8. An intelligent speech device, comprising: a loudspeaker and a processor, wherein the processor is configured to acquire interaction data of a current interaction, the interaction data comprising at least one of the following: feature data of the voice of an interactive object, emotion data, and interactive-environment feature data; determine, according to the interaction data, a language style corresponding to the feedback voice of the current interaction; determine a feedback sentence text according to the language style; convert the feedback sentence text into feedback voice of a tone corresponding to the language style; and send the feedback voice to the loudspeaker, wherein the loudspeaker is configured to play the feedback voice so as to interact with the interactive object.
9. An electronic device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another via the communication bus;
the memory is configured to store at least one executable instruction that causes the processor to perform the operations corresponding to the method of any one of claims 1-7.
10. A computer storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any one of claims 1-7.
11. A computer program product comprising computer instructions that instruct a computing device to perform operations corresponding to the method of any one of claims 1-7.
CN202111428251.6A 2021-11-29 2021-11-29 Data processing method, electronic device and computer program product Pending CN114154636A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111428251.6A CN114154636A (en) 2021-11-29 2021-11-29 Data processing method, electronic device and computer program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111428251.6A CN114154636A (en) 2021-11-29 2021-11-29 Data processing method, electronic device and computer program product

Publications (1)

Publication Number Publication Date
CN114154636A 2022-03-08

Family

ID=80454392

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111428251.6A Pending CN114154636A (en) 2021-11-29 2021-11-29 Data processing method, electronic device and computer program product

Country Status (1)

Country Link
CN (1) CN114154636A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115101048A (en) * 2022-08-24 2022-09-23 深圳市人马互动科技有限公司 Science popularization information interaction method, device, system, interaction equipment and storage medium
CN115101048B (en) * 2022-08-24 2022-11-11 深圳市人马互动科技有限公司 Science popularization information interaction method, device, system, interaction equipment and storage medium

Similar Documents

Publication Publication Date Title
CN108962217B (en) Speech synthesis method and related equipment
KR102582291B1 (en) Emotion information-based voice synthesis method and device
WO2017168870A1 (en) Information processing device and information processing method
US10789937B2 (en) Speech synthesis device and method
CN108922525B (en) Voice processing method, device, storage medium and electronic equipment
US20230206897A1 (en) Electronic apparatus and method for controlling thereof
CN112562681B (en) Speech recognition method and apparatus, and storage medium
CN110874137A (en) Interaction method and device
KR20190005103A (en) Electronic device-awakening method and apparatus, device and computer-readable storage medium
WO2022057759A1 (en) Voice conversion method and related device
CN111105776A (en) Audio playing device and playing method thereof
CN114154636A (en) Data processing method, electronic device and computer program product
CN110767233A (en) Voice conversion system and method
CN114283820A (en) Multi-character voice interaction method, electronic equipment and storage medium
CN113345407B (en) Style speech synthesis method and device, electronic equipment and storage medium
JP7222354B2 (en) Information processing device, information processing terminal, information processing method, and program
CN114120979A (en) Optimization method, training method, device and medium of voice recognition model
CN108922523B (en) Position prompting method and device, storage medium and electronic equipment
JP2021117371A (en) Information processor, information processing method and information processing program
CN116597858A (en) Voice mouth shape matching method and device, storage medium and electronic equipment
CN114664303A (en) Continuous voice instruction rapid recognition control system
CN113948062A (en) Data conversion method and computer storage medium
CN114495981A (en) Method, device, equipment, storage medium and product for judging voice endpoint
WO2021102647A1 (en) Data processing method and apparatus, and storage medium
JP2016186646A (en) Voice translation apparatus, voice translation method and voice translation program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination