CN113689881A - Method, device and storage medium for audio interaction based on a voice portrait - Google Patents

Method, device and storage medium for audio interaction based on a voice portrait

Info

Publication number
CN113689881A
CN113689881A (application CN202010420263.3A)
Authority
CN
China
Prior art keywords
audio
interactive object
voice
determining
style
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010420263.3A
Other languages
Chinese (zh)
Inventor
白安琪
蒋宁
赵立军
陈燕丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongguancun Kejin Technology Co Ltd
Original Assignee
Beijing Zhongguancun Kejin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongguancun Kejin Technology Co Ltd filed Critical Beijing Zhongguancun Kejin Technology Co Ltd
Priority to CN202010420263.3A
Publication of CN113689881A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application discloses a method, a device and a storage medium for audio interaction based on voice style. The method comprises the following steps: receiving first audio generated by an interactive object in an interactive process; determining a voice portrait corresponding to the interactive object according to the first audio, wherein the voice portrait is used for describing audio characteristics of the interactive object; determining a voice style corresponding to the interactive object according to the voice portrait; and determining a second audio for replying to the first audio based on the voice style.

Description

Method, device and storage medium for audio interaction based on a voice portrait
Technical Field
The present application relates to the field of internet artificial intelligence technologies, and in particular, to a method, an apparatus, and a storage medium for audio interaction based on a voice portrait.
Background
TTS (text-to-speech) is the process by which a machine converts language from a text carrier into a sound carrier, and it is a key module in systems such as man-machine dialogue and intelligent broadcasting. Related research concerns its two ends: the text end requires that language units be properly segmented and combined, with accurate characterization and high intelligibility, while the speech end requires strong simulation capability; the focus of these aspects lies in the social and physiological attributes of speech. Theoretical schemes and practical techniques such as language unit segmentation, qualitative analysis, syntactic analysis and prosodic boundary prediction have therefore received wide attention from researchers. With the development of speech synthesis technology, speech with a single style can no longer meet users' needs, so software and hardware products offering speech synthesis differentiated by gender, age, regional dialect and the like have emerged; further, customized schemes allow users to select a speech style autonomously or to form speech mash-ups, all of which focus on the physical attributes of speech. However, existing audio interaction systems do not analyze the audio of the interactive object and then formulate reply audio that matches the audio characteristics of that object. Moreover, although different voice styles can be selected in the interaction process according to a user's requirements in the prior art, the corresponding style cannot be determined automatically from the audio style of the other party, which degrades the experience of the interactive object.
For the technical problem in the prior art that the audio characteristics of an interactive object are not analyzed during human-computer interaction and a suitable audio interaction style cannot be actively selected according to those characteristics, which degrades the experience of the interactive object, no effective solution has yet been proposed.
Disclosure of Invention
The embodiments of the disclosure provide a method, a device and a storage medium for audio interaction based on a voice portrait, so as to at least solve the technical problem in the prior art that the audio characteristics of an interactive object are not analyzed during human-computer interaction and a suitable audio interaction style cannot be actively selected according to those characteristics, which degrades the experience of the interactive object.
According to an aspect of the embodiments of the present disclosure, there is provided a method for audio interaction based on voice style, including: receiving first audio generated by an interactive object in an interactive process; determining a voice portrait corresponding to the interactive object according to the first audio, wherein the voice portrait is used for describing audio characteristics of the interactive object; determining a voice style corresponding to the interactive object according to the voice portrait; and determining a second audio for replying to the first audio based on the voice style.
According to another aspect of the embodiments of the present disclosure, there is also provided a storage medium including a stored program, wherein the method of any one of the above is performed by a processor when the program is executed.
According to another aspect of the embodiments of the present disclosure, there is also provided an apparatus for audio interaction based on voice style, including: an audio receiving module for receiving first audio generated by the interactive object in the interactive process; a portrait determining module for determining a voice portrait corresponding to the interactive object according to the first audio, wherein the voice portrait is used for describing audio characteristics of the interactive object; a style determining module for determining a voice style corresponding to the interactive object according to the voice portrait; and an audio determining module for determining a second audio for replying to the first audio based on the voice style.
According to another aspect of the embodiments of the present disclosure, there is also provided an apparatus for audio interaction based on voice style, including: a processor; and a memory connected to the processor and used for providing the processor with instructions for processing the following processing steps: receiving first audio generated by an interactive object in an interactive process; determining a voice portrait corresponding to the interactive object according to the first audio, wherein the voice portrait is used for describing audio characteristics of the interactive object; determining a voice style corresponding to the interactive object according to the voice portrait; and determining a second audio for replying to the first audio based on the voice style.
In the embodiments of the disclosure, the server can analyze the voice portrait of the interactive object from the audio of that object, select a suitable voice style for the voice portrait, and then continue to interact with the interactive object using dialogue scripts in that voice style. Compared with the prior art, the solution analyzes, predicts and selects the voice style automatically, without requiring the interactive object to choose a script style itself. In this way the distance to the interactive object can be shortened without the interactive object perceiving it. This solves the technical problem in the prior art that the audio characteristics of an interactive object are not analyzed during human-computer interaction and a suitable audio interaction style cannot be actively selected according to those characteristics, which degrades the experience of the interactive object.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the disclosure and together with the description serve to explain the disclosure and not to limit the disclosure. In the drawings:
fig. 1 is a hardware block diagram of a computing device for implementing the method according to embodiment 1 of the present disclosure;
fig. 2 is a schematic flowchart of a method for audio interaction for voice styles according to embodiment 1 of the present disclosure;
FIG. 3 is a flow chart of a speech analysis process according to embodiment 1 of the present disclosure;
FIG. 4 is a flow chart of model training according to embodiment 1 of the present disclosure;
fig. 5 is a flowchart of speech style prediction according to embodiment 1 of the present disclosure;
fig. 6 is an overall flowchart of a method for audio interaction for voice styles according to embodiment 1 of the present disclosure;
fig. 7 is a schematic diagram of an apparatus for audio interaction for speech styles according to embodiment 2 of the present disclosure; and
fig. 8 is a schematic diagram of an apparatus for audio interaction for voice styles according to embodiment 3 of the present disclosure.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the drawings in the embodiments of the present disclosure. The described embodiments are merely some, rather than all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the disclosed embodiments without creative effort shall fall within the protection scope of the present disclosure.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
According to the present embodiment, an embodiment of a method for audio interaction based on a voice portrait is provided. It should be noted that the steps illustrated in the flowcharts of the drawings may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one presented herein.
The method embodiments provided by the present embodiment may be executed in a server or a similar computing device. Fig. 1 shows a hardware block diagram of a computing device for implementing the method for audio interaction based on a voice portrait. As shown in fig. 1, the computing device may include one or more processors (which may include, but are not limited to, processing devices such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory for storing data, and a transmission device for communication functions. In addition, the computing device may also include: a display, an input/output interface (I/O interface), a universal serial bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power supply and/or a camera. It will be understood by those skilled in the art that the structure shown in fig. 1 is only illustrative and does not limit the structure of the electronic device. For example, the computing device may also include more or fewer components than shown in fig. 1, or have a different configuration from that shown in fig. 1.
It should be noted that the one or more processors and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied, in whole or in part, in software, hardware, firmware or any combination thereof. Further, the data processing circuitry may be a single stand-alone processing module, or may be incorporated, in whole or in part, into any of the other elements of the computing device. As referred to in the embodiments of the present disclosure, the data processing circuitry serves as a control for the processor (for example, the selection of a variable-resistance termination path connected to an interface).
The memory may be used to store software programs and modules of application software, such as the program instructions/data storage devices corresponding to the method for audio interaction based on a voice portrait in the embodiments of the present disclosure. The processor executes various functional applications and data processing, that is, implements the above method for audio interaction based on a voice portrait, by running the software programs and modules stored in the memory. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory may further include memory located remotely from the processor, which may be connected to the computing device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device is used for receiving or transmitting data via a network. Specific examples of such networks may include wireless networks provided by communication providers of the computing devices. In one example, the transmission device includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computing device.
It should be noted here that, in some alternative embodiments, the computing device shown in fig. 1 may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both hardware and software elements. It should also be noted that fig. 1 is only one specific example and is intended to illustrate the types of components that may be present in the computing device described above.
In the above operating environment, according to the first aspect of the present embodiment, a method for audio interaction based on a voice portrait is provided. The method may be applied, for example, to a robot customer service system, whose server can automatically select an adapted voice style for replying according to the voice characteristics of the interactive object. Fig. 2 shows a flow diagram of the method. Referring to fig. 2, the method comprises:
s202: receiving first audio generated by an interactive object in an interactive process;
s204: determining a voice portrait corresponding to the interactive object according to the first audio, wherein the voice portrait is used for describing audio characteristics of the interactive object;
s206: determining a voice style corresponding to the interactive object according to the voice image; and
s208: determining a second audio for replying to the first audio based on the speech style
As described in the background, with the development of speech synthesis technology, speech with a single style can no longer meet users' needs, so software and hardware products offering speech synthesis differentiated by gender, age, regional dialect and the like have emerged; further, customized schemes allow users to select a speech style autonomously or to form speech mash-ups, all of which focus on the physical attributes of speech. However, existing audio interaction systems do not analyze the audio of the interactive object and then formulate reply audio that matches the audio characteristics of that object. Moreover, although different voice styles can be selected in the interaction process according to a user's requirements in the prior art, the corresponding style cannot be determined automatically from the audio style of the other party, which degrades the experience of the interactive object.
To solve the technical problem described in the background, referring to fig. 3, in step S202 of the technical solution of the present embodiment, the server first receives first audio generated by the interactive object during the interaction. In one specific example, the first audio may be, for example, the audio of a telephone complaint made by the interactive object.
Further, in step S204, the server analyzes the first audio to determine a voice portrait corresponding to the interactive object, wherein the voice portrait is used to describe the audio characteristics of the interactive object. For example, the gender of the interactive object, the age of the interactive object, the dialect of the interactive object, the speech rate of the interactive object, the style of the interactive object, the emotion of the interactive object, the volume and the like together form the voice portrait corresponding to the interactive object. Other person-related voice characteristics may also be included, which is not specifically limited here. In one embodiment, the voice portrait of the interactive object is: an older man who speaks the Sichuan dialect at a slow speech rate.
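For illustration only, the voice portrait described above can be represented as a simple structured record. The following sketch is not part of the patent; the field names, encodings and example values are assumptions chosen to mirror the attributes listed in this paragraph.

```python
from dataclasses import dataclass

@dataclass
class VoicePortrait:
    """Hypothetical container for the audio characteristics of an interactive object."""
    gender: str         # e.g. "male" / "female"
    age_group: str      # e.g. "young" / "middle-aged" / "elderly"
    dialect: str        # e.g. "sichuan", "cantonese", "mandarin"
    speech_rate: float  # e.g. syllables per second measured from the first audio
    emotion: str        # e.g. "calm", "anxious", "angry"
    volume_db: float    # average loudness of the first audio

# Example matching the embodiment: an older man speaking Sichuan dialect slowly.
example_portrait = VoicePortrait(
    gender="male", age_group="elderly", dialect="sichuan",
    speech_rate=3.0, emotion="calm", volume_db=60.0,
)
```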
Further, in step S206, the server determines a voice style corresponding to the interactive object according to the voice portrait. For example, for the above voice portrait (an older man speaking the Sichuan dialect at a slow speech rate), the determined voice style is slow-spoken Sichuan dialect.
Finally, in step S208, the server determines, based on the voice style, a second audio for replying to the first audio. For example, the server determines that second audio in slow-spoken Sichuan dialect is used to interact with the interactive object; that is, through the voice robot, the system replies to the complaint call of the interactive object in the slow-spoken Sichuan-dialect voice style.
Therefore, the server can analyze the voice portrait of the interactive object from the audio of that object, select a suitable voice style for the voice portrait, and then continue to interact with the interactive object using dialogue scripts in that voice style. Compared with the prior art, the solution analyzes, predicts and selects the voice style automatically, without requiring the interactive object to choose a script style itself. In this way the distance to the interactive object can be shortened without the interactive object perceiving it. This solves the technical problem in the prior art that the audio characteristics of an interactive object are not analyzed during human-computer interaction and a suitable audio interaction style cannot be actively selected according to those characteristics, which degrades the experience of the interactive object.
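Taken together, steps S202 to S208 amount to a short pipeline. The sketch below is a non-authoritative outline of that flow; the four callables passed in (build_voice_portrait, predict_voice_style, plan_reply_text, render_reply) are hypothetical stand-ins for the components described in the following paragraphs, not names defined by the patent.

```python
def reply_to_audio(first_audio, build_voice_portrait, predict_voice_style,
                   plan_reply_text, render_reply):
    """Outline of steps S202-S208; the injected callables are hypothetical components."""
    portrait = build_voice_portrait(first_audio)   # S204: first audio -> voice portrait
    style = predict_voice_style(portrait)          # S206: voice portrait -> voice style
    reply_text = plan_reply_text(first_audio)      # reply content (e.g. Mandarin text, see Fig. 6)
    return render_reply(reply_text, style)         # S208: second audio in the chosen style
```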
Optionally, determining the voice portrait corresponding to the interactive object according to the first audio comprises: obtaining first sound information related to sound elements from the first audio; obtaining second sound information related to the interactive object from the first audio, wherein the second sound information is used for describing sound characteristic attributes corresponding to the interactive object; and determining the voice portrait corresponding to the interactive object according to the first sound information and the second sound information.
Specifically, referring to fig. 3, in the operation of determining the voice portrait corresponding to the interactive object according to the first audio, the server obtains, from the first audio, first sound information related to sound elements, namely elements of the sound itself, such as a speech rate value, a volume value and other related factors. The server further obtains, from the first audio, second sound information related to the interactive object, which describes the sound characteristic attributes corresponding to the interactive object. In a specific example, the sound characteristic attributes include at least one of the following: the gender of the interactive object, the age of the interactive object, the dialect of the interactive object, the timbre of the interactive object, and the emotion of the interactive object. Finally, the server determines the voice portrait corresponding to the interactive object according to the first sound information and the second sound information. In this way, the server can analyze the first audio of the interactive object from multiple dimensions and finally determine the voice portrait of the interactive object, so that the sound analysis of the interactive object is more comprehensive.
Optionally, obtaining the second sound information related to the interactive object from the first audio comprises: determining an audio feature corresponding to the first audio; recognizing the audio feature by using a pre-trained decision tree model for predicting voice characteristics of a user, and determining attribute values corresponding to the sound characteristic attributes; and determining the second sound information according to the attribute values corresponding to the sound characteristic attributes.
Specifically, referring to fig. 3, in the operation of obtaining the second sound information related to the interactive object from the first audio, the server first determines an audio feature corresponding to the first audio. The audio feature may be, for example, an MFCC feature or an FBank feature, or a feature extracted in another manner, which is not limited here. The server then recognizes the audio feature by using pre-trained decision tree models for predicting the voice characteristics of a user, and determines the attribute values corresponding to the sound characteristic attributes, where the sound characteristic attributes include: a dialect value (in practice, different numerical values are set for different dialects), a gender value (different numerical values correspond to men and women), an age value, a timbre value and an emotion value (different numerical values correspond to different emotions). In actual operation, a plurality of decision tree models can be trained to predict the attribute value corresponding to each sound characteristic attribute separately. The decision tree model may also be based on, for example, a convolutional neural network or another machine learning algorithm. In this way the second sound information can be determined by the models, which is more convenient and produces more accurate results.
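As a concrete illustration of this step, the sketch below extracts MFCC features and trains one decision tree per sound characteristic attribute. It is a minimal example assuming the librosa and scikit-learn packages; the 13-coefficient MFCC summary, the tree depth and the attribute names are illustrative choices, not values given in the patent.

```python
import numpy as np
import librosa
from sklearn.tree import DecisionTreeClassifier

def extract_audio_features(wav_path: str) -> np.ndarray:
    """Summarize one utterance as mean/std MFCCs (FBank or other features also work)."""
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def train_attribute_models(features: np.ndarray, labels_by_attribute: dict) -> dict:
    """Train one decision tree per attribute, e.g.
    labels_by_attribute = {"dialect": y_dialect, "gender": y_gender,
                           "age": y_age, "timbre": y_timbre, "emotion": y_emotion}."""
    return {name: DecisionTreeClassifier(max_depth=8).fit(features, labels)
            for name, labels in labels_by_attribute.items()}

def predict_attribute_values(models: dict, wav_path: str) -> dict:
    """Second sound information: one numeric attribute value per sound characteristic."""
    feats = extract_audio_features(wav_path).reshape(1, -1)
    return {name: int(model.predict(feats)[0]) for name, model in models.items()}
```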
Optionally, determining the voice style corresponding to the interactive object according to the voice portrait comprises: determining a feature vector corresponding to the audio feature according to the voice portrait; and calculating the feature vector by using a pre-trained model for predicting a user's voice style, and determining the voice style corresponding to the interactive object.
Specifically, in the operation of determining the voice style corresponding to the interactive object according to the voice portrait, the server first determines the feature vector corresponding to the audio feature according to the voice portrait, combining the dialect value, the gender value, the age value, the timbre value and the emotion value into the feature vector shown in fig. 3. The server then calculates the feature vector by using a pre-trained model for predicting a user's voice style, and determines the voice style corresponding to the interactive object. Fig. 4 shows a flowchart of training the voice style model. Referring to fig. 4, in the training process, feature vectors of users and the corresponding voice style labels (i.e., the voice styles of those users) are used as training data, and the model may perform multi-class training on this data using, for example, the XGBoost algorithm, finally obtaining the voice style model. In practical applications, as shown in fig. 5, the feature vector of the target object is input into the model, and the voice style corresponding to the interactive object is output.
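To make the style-prediction step concrete, the sketch below trains a multi-class XGBoost classifier on feature vectors of the form [dialect value, gender value, age value, timbre value, emotion value]. It assumes the xgboost Python package and integer style labels indexing a fixed style inventory; the hyperparameters are illustrative, not taken from the patent.

```python
import numpy as np
from xgboost import XGBClassifier

def train_style_model(portrait_vectors: np.ndarray, style_labels: np.ndarray) -> XGBClassifier:
    """portrait_vectors: shape (n_users, 5); style_labels: integer style id (0..n-1) per user."""
    model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
    model.fit(portrait_vectors, style_labels)  # multi-class objective is selected automatically
    return model

def predict_voice_style(model: XGBClassifier, portrait_vector: list) -> int:
    """Map one interactive object's feature vector to the id of the preferred voice style."""
    return int(model.predict(np.asarray([portrait_vector], dtype=float))[0])
```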
Optionally, the method further comprises: determining text information for replying to the first audio, wherein determining, based on the voice style, the second audio for replying to the first audio comprises: determining the second audio based on the voice style and the text information.
Specifically, referring to fig. 6, the server also determines the text information used for replying to the first audio (corresponding to the Mandarin text of the reply in fig. 6), and then determines the second audio based on the voice style and the text information. That is, the server generates the second audio based on the determined voice style, so that the reply to the interactive object is more accurate.
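A minimal sketch of this generation path is given below. The tts_engine object and its synthesize(...) interface are placeholders for whatever style-conditioned speech synthesizer is actually deployed, and the style-to-parameter table is an assumption used only to show how the predicted style and the reply text come together.

```python
# Hypothetical mapping from predicted style ids to synthesis parameters.
STYLE_PARAMS = {
    0: {"voice": "mandarin_neutral", "rate": 1.0},
    1: {"voice": "sichuan_male", "rate": 0.8},  # slow-spoken Sichuan dialect
}

def synthesize_reply(reply_text: str, style_id: int, tts_engine) -> bytes:
    """Generate the second audio from the reply text in the predicted voice style."""
    params = STYLE_PARAMS[style_id]
    # `tts_engine.synthesize` is an assumed interface, not a specific library API.
    return tts_engine.synthesize(text=reply_text, **params)
```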
Optionally, determining the second audio for replying to the first audio comprises: selecting, from a preset voice library, the second audio for replying to the first audio.
Specifically, as shown in fig. 6, in the operation of determining the second audio for replying to the first audio, the server selects the second audio from a preset voice library. That is, a speech segment (the second audio) that expresses the same semantics as the determined text and is in the style preferred by the user (the voice style determined for the interactive object) is selected from the multi-style voice library and used as the reply. The voice library contains audio segments in various voice styles, and a speech segment that matches the voice style of the interactive object and corresponds to the first audio can be retrieved from it and output as the reply. For example, speech in the corresponding style (slow-spoken Sichuan dialect) is selected from the voice library to soothe the interactive object's emotions. In this way, the corresponding second audio can be selected from an existing voice library without being generated, which saves computing resources and improves efficiency.
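The library-lookup alternative can be sketched as follows. The in-memory library, the exact-text match (standing in for the semantic match described above) and the fallback to synthesis are all simplifying assumptions for illustration, not structures specified by the patent.

```python
from typing import Optional

# Toy stand-in for the multi-style voice library: text expressed, style id, stored clip.
VOICE_LIBRARY = [
    {"text": "Please don't worry, we will handle it right away.", "style_id": 1, "audio": b"<sichuan_slow.wav>"},
    {"text": "Please don't worry, we will handle it right away.", "style_id": 0, "audio": b"<mandarin.wav>"},
]

def select_reply_clip(reply_text: str, style_id: int) -> Optional[bytes]:
    """Return a stored clip whose content matches the reply text and whose style matches
    the interactive object's predicted style; None means fall back to synthesis."""
    for entry in VOICE_LIBRARY:
        if entry["style_id"] == style_id and entry["text"] == reply_text:
            return entry["audio"]
    return None
```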
Further, referring to fig. 1, according to a second aspect of the present embodiment, there is provided a storage medium. The storage medium comprises a stored program, wherein the method of any of the above is performed by a processor when the program is run.
Therefore, according to the embodiment, the server can analyze the voice portrait of the interactive object from the audio of that object, select a suitable voice style for the voice portrait, and then continue to interact with the interactive object using dialogue scripts in that voice style. Compared with the prior art, the solution analyzes, predicts and selects the voice style automatically, without requiring the interactive object to choose a script style itself. In this way the distance to the interactive object can be shortened without the interactive object perceiving it. This solves the technical problem in the prior art that the audio characteristics of an interactive object are not analyzed during human-computer interaction and a suitable audio interaction style cannot be actively selected according to those characteristics, which degrades the experience of the interactive object.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
Fig. 7 shows an apparatus 700 for audio interaction based on voice style according to the present embodiment, and the apparatus 700 corresponds to the method according to the first aspect of embodiment 1. Referring to fig. 7, the apparatus 700 includes: an audio receiving module 710, configured to receive first audio generated by the interactive object during the interaction; a portrait determining module 720, configured to determine, according to the first audio, a voice portrait corresponding to the interactive object, wherein the voice portrait is used for describing audio characteristics of the interactive object; a style determining module 730, configured to determine a voice style corresponding to the interactive object according to the voice portrait; and an audio determining module 740, configured to determine, based on the voice style, a second audio for replying to the first audio.
Optionally, portrait determination module 720 includes: the first determining submodule is used for acquiring first sound information related to the sound elements from the first audio; the second determining submodule is used for acquiring second sound information related to the interactive object from the first audio, wherein the second sound information is used for describing a sound characteristic attribute corresponding to the interactive object; and the portrait determining submodule is used for determining the voice portrait corresponding to the interactive object according to the first sound information and the second sound information.
Optionally, the second determining sub-module includes: a feature determination unit configured to determine an audio feature corresponding to the first audio; the characteristic identification unit is used for identifying the audio characteristics by utilizing a pre-trained decision tree model for predicting the voice characteristics of the user and determining an attribute value corresponding to the voice characteristic attribute; and a sound information determination unit configured to determine second sound information based on the attribute value corresponding to the sound characteristic attribute.
Optionally, the style determining module 730 includes: the vector generation submodule is used for determining a feature vector corresponding to the audio characteristic according to the voice portrait; and the style determining submodule is used for calculating the feature vector by utilizing a pre-trained model for predicting the voice style of the user and determining the voice style corresponding to the interactive object.
Optionally, the apparatus 700 further comprises: a text determination module to determine text information for replying to the first audio, and the audio determination module 740 comprising: a first audio determination sub-module for determining a second audio based on the speech style and the text information.
Optionally, the audio determining module 740 further includes: and the audio selection sub-module is used for selecting a second audio for replying the first audio from a preset voice library.
Optionally, the sound characteristic properties comprise at least one of: gender of the interactive object, age of the interactive object, dialect of the interactive object, tone quality of the interactive object, and mood of the interactive object.
Therefore, according to the embodiment, the apparatus 700 can analyze the voice portrait of the interactive object from the audio of that object, select a suitable voice style for the voice portrait, and then continue to interact with the interactive object using dialogue scripts in that voice style. Compared with the prior art, the solution analyzes, predicts and selects the voice style automatically, without requiring the interactive object to choose a script style itself. In this way the distance to the interactive object can be shortened without the interactive object perceiving it. This solves the technical problem in the prior art that the audio characteristics of an interactive object are not analyzed during human-computer interaction and a suitable audio interaction style cannot be actively selected according to those characteristics, which degrades the experience of the interactive object.
Example 3
Fig. 8 shows an apparatus 800 for audio interaction based on voice style according to the present embodiment, and the apparatus 800 corresponds to the method according to the first aspect of embodiment 1. Referring to fig. 8, the apparatus 800 includes: a processor 810; and a memory 820 connected to the processor 810 and configured to provide the processor 810 with instructions for the following processing steps: receiving first audio generated by an interactive object in an interactive process; determining a voice portrait corresponding to the interactive object according to the first audio, wherein the voice portrait is used for describing audio characteristics of the interactive object; determining a voice style corresponding to the interactive object according to the voice portrait; and determining a second audio for replying to the first audio based on the voice style.
Optionally, determining the voice portrait corresponding to the interactive object according to the first audio comprises: obtaining first sound information related to sound elements from the first audio; obtaining second sound information related to the interactive object from the first audio, wherein the second sound information is used for describing sound characteristic attributes corresponding to the interactive object; and determining the voice portrait corresponding to the interactive object according to the first sound information and the second sound information.
Optionally, obtaining the second sound information related to the interactive object from the first audio comprises: determining an audio feature corresponding to the first audio; recognizing the audio feature by using a pre-trained decision tree model for predicting voice characteristics of a user, and determining attribute values corresponding to the sound characteristic attributes; and determining the second sound information according to the attribute values corresponding to the sound characteristic attributes.
Optionally, determining the voice style corresponding to the interactive object according to the voice portrait comprises: determining a feature vector corresponding to the audio feature according to the voice portrait; and calculating the feature vector by using a pre-trained model for predicting a user's voice style, and determining the voice style corresponding to the interactive object.
Optionally, the memory 820 is further configured to provide the processor 810 with instructions for processing the following processing steps: text information for replying to the first audio is determined, and the second audio is determined based on the voice style and the text information.
Optionally, determining a second audio for replying to the first audio comprises: and selecting a second audio for replying to the first audio from a preset voice library.
Optionally, the sound characteristic properties comprise at least one of: gender of the interactive object, age of the interactive object, dialect of the interactive object, tone quality of the interactive object, and mood of the interactive object.
Therefore, according to the embodiment, the apparatus 800 can analyze the voice portrait of the interactive object from the audio of that object, select a suitable voice style for the voice portrait, and then continue to interact with the interactive object using dialogue scripts in that voice style. Compared with the prior art, the solution analyzes, predicts and selects the voice style automatically, without requiring the interactive object to choose a script style itself. In this way the distance to the interactive object can be shortened without the interactive object perceiving it. This solves the technical problem in the prior art that the audio characteristics of an interactive object are not analyzed during human-computer interaction and a suitable audio interaction style cannot be actively selected according to those characteristics, which degrades the experience of the interactive object.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk, which can store program codes.
The foregoing is only a preferred embodiment of the present invention. It should be noted that, for those of ordinary skill in the art, various improvements and modifications may be made without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A method for audio interaction based on voice style, comprising:
receiving first audio generated by an interactive object in an interactive process;
determining a voice portrait corresponding to the interactive object according to the first audio, wherein the voice portrait is used for describing audio characteristics of the interactive object;
determining a voice style corresponding to the interactive object according to the voice portrait; and
determining a second audio for replying to the first audio based on the speech style.
2. The method of claim 1, wherein determining the voice portrait corresponding to the interactive object according to the first audio comprises:
acquiring first sound information related to sound elements from the first audio;
acquiring second sound information related to the interactive object from the first audio, wherein the second sound information is used for describing a sound characteristic attribute corresponding to the interactive object; and
determining the voice portrait corresponding to the interactive object according to the first sound information and the second sound information.
3. The method of claim 2, wherein obtaining the second sound information related to the interactive object from the first audio comprises:
determining an audio feature corresponding to the first audio;
recognizing the audio feature by using a pre-trained decision tree model for predicting voice characteristics of a user, and determining attribute values corresponding to the sound characteristic attributes; and
determining the second sound information according to the attribute values corresponding to the sound characteristic attributes.
4. The method of claim 1, wherein determining the voice style corresponding to the interactive object according to the voice portrait comprises:
determining a feature vector corresponding to the audio feature according to the voice portrait; and
calculating the feature vector by using a pre-trained model for predicting a user's voice style, and determining the voice style corresponding to the interactive object.
5. The method of claim 1, further comprising: determining text information for replying to the first audio, wherein determining, based on the voice style, the second audio for replying to the first audio comprises:
determining the second audio based on the speech style and the text information.
6. The method of claim 1, wherein determining the second audio for replying to the first audio comprises:
selecting, from a preset voice library, the second audio for replying to the first audio.
7. The method according to claim 2 or 3, wherein the sound characteristic attributes comprise at least one of:
the gender of the interactive object, the age of the interactive object, the dialect of the interactive object, the timbre of the interactive object, and the emotion of the interactive object.
8. A storage medium comprising a stored program, wherein the method of any one of claims 1 to 7 is performed by a processor when the program is run.
9. An apparatus for audio interaction based on voice style, comprising:
the audio receiving module is used for receiving first audio generated by the interactive object in the interactive process;
the portrait determining module is used for determining a voice portrait corresponding to the interactive object according to the first audio, wherein the voice portrait is used for describing audio characteristics of the interactive object;
the style determining module is used for determining a voice style corresponding to the interactive object according to the voice portrait; and
an audio determination module to determine a second audio for replying to the first audio based on the voice style.
10. An apparatus for audio interaction based on voice style, comprising:
a processor; and
a memory coupled to the processor for providing instructions to the processor for processing the following processing steps:
receiving first audio generated by an interactive object in an interactive process;
determining a voice portrait corresponding to the interactive object according to the first audio, wherein the voice portrait is used for describing audio characteristics of the interactive object;
determining a voice style corresponding to the interactive object according to the voice portrait; and
determining a second audio for replying to the first audio based on the speech style.
CN202010420263.3A 2020-05-18 2020-05-18 Method, device and storage medium for audio interaction based on a voice portrait Pending CN113689881A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010420263.3A CN113689881A (en) 2020-05-18 2020-05-18 Method, device and storage medium for audio interaction based on a voice portrait

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010420263.3A CN113689881A (en) 2020-05-18 2020-05-18 Method, device and storage medium for audio interaction based on a voice portrait

Publications (1)

Publication Number Publication Date
CN113689881A true CN113689881A (en) 2021-11-23

Family

ID=78575575

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010420263.3A Pending CN113689881A (en) 2020-05-18 2020-05-18 Method, device and storage medium for audio interaction aiming at voice image

Country Status (1)

Country Link
CN (1) CN113689881A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108711423A (en) * 2018-03-30 2018-10-26 百度在线网络技术(北京)有限公司 Intelligent sound interacts implementation method, device, computer equipment and storage medium
CN110265021A (en) * 2019-07-22 2019-09-20 深圳前海微众银行股份有限公司 Personalized speech exchange method, robot terminal, device and readable storage medium storing program for executing


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination