CN112185344A - Voice interaction method and device, computer readable storage medium and processor


Info

Publication number
CN112185344A
CN112185344A (application CN202011034411.4A)
Authority
CN
China
Prior art keywords: voice, voiceprint, determining, tone, preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011034411.4A
Other languages
Chinese (zh)
Inventor
焦金珂
李健
武卫东
陈明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sinovoice Technology Co Ltd
Original Assignee
Beijing Sinovoice Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sinovoice Technology Co Ltd filed Critical Beijing Sinovoice Technology Co Ltd
Priority to CN202011034411.4A
Publication of CN112185344A
Legal status: Pending


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/047 Architecture of speech synthesisers
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L17/22 Interactive procedures; Man-machine interfaces


Abstract

The application provides a voice interaction method, an apparatus, a computer-readable storage medium, and a processor. The voice interaction method includes: acquiring voice data of a speaker; determining, according to the voice data, a predetermined voice timbre for broadcasting; and broadcasting with the predetermined voice timbre. Because the method determines the broadcast timbre from the acquired voice data, it can intelligently recommend and switch the broadcast timbre when interacting with different users, so that different speakers are answered with different voice timbres. The method does not depend on dividing and defining voice styles (such as active, steady, humorous, lovely, or lifelike), can accurately predict the predetermined voice timbre corresponding to a speaker, meets the need to recommend different timbres to different groups of people, and noticeably improves the interest and personalization of the user experience.

Description

Voice interaction method and device, computer readable storage medium and processor
Technical Field
The present application relates to the field of voice interaction, and in particular, to a voice interaction method, apparatus, computer-readable storage medium, processor, and voice interaction system.
Background
Voice interaction technology is now widely used in intelligent robots, smart speakers, intelligent in-vehicle devices, smart homes, and similar fields: people can control a device or system through voice dialogue to execute commands or hold question-and-answer conversations. However, when such a device performs voice interaction, it usually broadcasts synthesized speech in a single timbre preset by the system, which is monotonous.
To make interaction more interesting and personalized, some devices provide multiple timbre libraries, but the user must switch between them manually in the system settings; such devices can neither switch timbres automatically during real-time voice interaction nor recommend different timbre libraries to different people.
Intelligent recommendation in current voice interaction generally focuses on content: personalized content such as music, stories, or question-and-answer material is recommended to different people. In current voice interaction systems, a user who talks with a device hears a fixed broadcast timbre preset by the system; facing different users, the machine synthesizes and broadcasts with the same timbre, so it cannot intelligently recommend different broadcast timbres to different users or provide more personalized service.
The information disclosed in this background section is only for enhancement of understanding of the background of the technology described herein and therefore may include information that does not constitute prior art already known in this country to a person of ordinary skill in the art.
Disclosure of Invention
The present application mainly aims to provide a voice interaction method, a voice interaction device, a computer-readable storage medium, a processor, and a voice interaction system, so as to solve the problem that it is difficult to perform voice broadcast of different timbres for different speakers in the prior art.
According to an aspect of an embodiment of the present invention, there is provided a voice interaction method, including: acquiring voice data of a speaker; determining, according to the voice data, a predetermined voice timbre for broadcasting; and broadcasting with the predetermined voice timbre.
Optionally, determining the predetermined voice timbre for broadcasting according to the voice data includes: extracting a voiceprint feature of the voice data; and determining the predetermined voice timbre according to the voiceprint feature.
Optionally, determining the predetermined voice timbre according to the voiceprint feature includes: determining a voiceprint feature in a voiceprint database that matches the voiceprint feature of the voice data as a target voiceprint feature; determining the person corresponding to the target voiceprint feature as a target person; and determining the preset voice timbre corresponding to the target person as the predetermined voice timbre.
Optionally, determining a voiceprint feature in the voiceprint database that matches the voiceprint feature of the voice data as the target voiceprint feature includes: obtaining the voiceprint similarity between the voiceprint feature of the voice data and each voiceprint feature in the voiceprint database; determining whether the voiceprint similarity is greater than a voiceprint similarity threshold; and, when the voiceprint similarity is greater than the voiceprint similarity threshold, determining the voiceprint feature in the voiceprint database corresponding to the maximum voiceprint similarity as the target voiceprint feature.
Optionally, determining the preset voice timbre corresponding to the target person as the predetermined voice timbre includes: searching a timbre library for the preset voice timbre corresponding to the target person; and determining that preset voice timbre as the predetermined voice timbre.
Optionally, determining the predetermined voice timbre for broadcasting according to the voice data includes: extracting a speech feature of the voice data; and determining the predetermined voice timbre according to the speech feature of the voice data.
Optionally, determining the predetermined voice timbre according to the speech feature of the voice data includes: obtaining the speech similarity between the speech feature of the voice data and each speech feature in a speech feature library; determining the speech feature in the speech feature library corresponding to the maximum speech similarity as a target speech feature; obtaining target identity attribute information corresponding to the target speech feature, where the target identity attribute information includes at least one of gender, age group, and language; and determining the preset voice timbre corresponding to the target identity attribute information as the predetermined voice timbre.
According to another aspect of the embodiments of the present invention, a voice interaction apparatus is provided, which includes an obtaining unit, a determining unit, and a broadcasting unit. The obtaining unit is configured to obtain voice data of a speaker; the determining unit is configured to determine, according to the voice data, a predetermined voice timbre for broadcasting; and the broadcasting unit is configured to broadcast with the predetermined voice timbre.
According to still another aspect of embodiments of the present invention, there is provided a computer-readable storage medium including a stored program, wherein the program performs any one of the methods described above.
According to a further aspect of the embodiments of the present invention, there is provided a processor for executing a program, where the program executes to perform any one of the methods described above.
According to another aspect of embodiments of the present invention, there is also provided a voice interaction system, comprising one or more processors, memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing the method of any of the above.
In the embodiment of the invention, the voice interaction method determines the predetermined voice timbre for broadcasting according to the acquired voice data and broadcasts with that timbre, so that the broadcast timbre can be intelligently recommended and switched when interacting with different users, and different speakers can be answered with different voice timbres.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application. In the drawings:
FIG. 1 shows a schematic flowchart of a voice interaction method according to an embodiment of the present application;
fig. 2 is a schematic diagram illustrating components of a voice interaction apparatus according to an embodiment of the present application.
Wherein the figures include the following reference numerals:
10. an acquisition unit; 20. a determination unit; 30. and a broadcasting unit.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that the embodiments of the application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For convenience of description, some terms or expressions referred to in the embodiments of the present application are explained below:
Broadcast timbre: in voice interaction, the user talks with a machine and the machine answers, generally by broadcasting with a preset speaker timbre using speech synthesis technology. Different timbres can be configured for synthesized broadcasting, such as male and female voices, a deep male voice, a sweet female voice, English, or Cantonese.
Voice classification: classifying audio into specified categories by extracting audio features. Gender recognition, age-group recognition, and language (dialect) recognition all belong to voice classification; that is, features are extracted from the speaker's voice data and the speaker is assigned to a group.
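As a concrete illustration of the "voice classification" notion above, the toy sketch below maps an estimated pitch value to coarse gender and age-group labels. The 165 Hz and 260 Hz boundaries are rough, assumed heuristics chosen for illustration only; they are not values from the patent.

```python
def classify_gender_by_pitch(pitch_hz: float) -> str:
    """Typical adult male pitch is roughly 85-180 Hz and adult female
    pitch roughly 165-255 Hz, so 165 Hz serves as a crude boundary."""
    return "male" if pitch_hz < 165 else "female"


def classify_age_group_by_pitch(pitch_hz: float) -> str:
    """Children tend to speak at higher pitch than adults; 260 Hz is
    again an assumed boundary purely for illustration."""
    return "child" if pitch_hz > 260 else "adult"


print(classify_gender_by_pitch(120))     # typical adult male pitch
print(classify_age_group_by_pitch(300))  # typical child pitch
```

A real classifier would of course be trained on extracted audio features rather than use fixed thresholds; the point is only that classification assigns the speaker to a group from properties of the voice signal.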
As mentioned in the background, it is difficult to perform voice broadcasting with different timbres for different speakers in the prior art, and in order to solve the above problems, in an exemplary embodiment of the present application, a voice interaction method, an apparatus, a computer-readable storage medium, a processor, and a voice interaction system are provided.
According to an embodiment of the present application, a voice interaction method is provided.
Fig. 1 is a schematic flowchart of a voice interaction method according to an embodiment of the present application. As shown in Fig. 1, the method includes the following steps:
step S101, acquiring voice data of a speaker;
step S102, determining a preset voice tone to be broadcasted according to the voice data;
and step S103, broadcasting by adopting the preset voice timbre.
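The three steps above can be sketched as a minimal pipeline. All function names are hypothetical, and the timbre decision is a deliberately simplistic stand-in (keyed on average signal amplitude) for the voiceprint/feature matching the patent describes later.

```python
def acquire_voice_data(source):
    """S101: obtain the speaker's voice data (here: raw samples)."""
    return source


def determine_timbre(voice_data, timbre_table, default="system_default"):
    """S102: pick the broadcast timbre from a property of the data.
    A real system would match voiceprint or speech features; this toy
    version keys on average absolute amplitude as a stand-in."""
    energy = sum(abs(x) for x in voice_data) / max(len(voice_data), 1)
    return timbre_table.get(energy > 0.5, default)


def broadcast(text, timbre):
    """S103: synthesize the reply in the chosen timbre (stubbed as a tag)."""
    return f"[{timbre}] {text}"


samples = acquire_voice_data([0.9, 0.8, 0.7])
timbre = determine_timbre(samples, {True: "deep_male", False: "sweet_female"})
print(broadcast("Hello", timbre))
```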
According to this voice interaction method, the predetermined voice timbre for broadcasting is determined from the acquired voice data and used for broadcasting, so that the broadcast timbre can be intelligently recommended and switched when interacting with different users, and different speakers are answered with different voice timbres. At the same time, the method does not need to rely on dividing and defining voice styles such as active, steady, humorous, lovely, or lifelike; it can accurately predict the predetermined voice timbre corresponding to a speaker, meets the need to recommend different timbres to different groups of people, and noticeably improves the interest and personalization of the user experience.
In a specific embodiment of the present application, determining the predetermined voice timbre for broadcasting according to the voice data includes: extracting a voiceprint feature of the voice data; and determining the predetermined voice timbre according to the voiceprint feature. Selecting the broadcast timbre from the extracted voiceprint feature avoids the monotony of always synthesizing with a single system-preset timbre, and makes the interaction more interesting and personalized.
In another specific embodiment of the present application, determining the predetermined voice timbre according to the voiceprint feature includes: determining a voiceprint feature in the voiceprint database that matches the voiceprint feature of the voice data as a target voiceprint feature; determining the person corresponding to the target voiceprint feature as a target person; and determining the preset voice timbre corresponding to the target person as the predetermined voice timbre. By acquiring the speaker's voiceprint feature and matching it against the voiceprint features in the voiceprint database, the target voiceprint feature is found, the corresponding person is identified as the target person, and the preset voice timbre corresponding to the target person is used for broadcasting as the predetermined voice timbre. This further enables intelligent recommendation of the broadcast timbre, makes voice interaction more interesting for the user, and improves satisfaction.
According to an embodiment of the present application, determining a voiceprint feature in the voiceprint database that matches the voiceprint feature of the voice data as the target voiceprint feature includes: obtaining the voiceprint similarity between the voiceprint feature of the voice data and each voiceprint feature in the voiceprint database; determining whether the voiceprint similarity is greater than a voiceprint similarity threshold; and, when it is, determining the voiceprint feature in the voiceprint database corresponding to the maximum voiceprint similarity as the target voiceprint feature. Determining the target voiceprint feature by comparing voiceprint similarities against a threshold ensures that the intelligently recommended broadcast timbre matches the user's voice data, further guaranteeing the user's experience and satisfaction during voice interaction.
Specifically, when the obtained voiceprint features are matched with each voiceprint feature in the voiceprint database to obtain the voiceprint similarity, only when the voiceprint similarity reaches the set voiceprint similarity threshold, determining a target voiceprint feature and broadcasting the corresponding preset voice timbre; when a plurality of voiceprint similarities reaching the voiceprint similarity threshold are obtained through matching, determining the voiceprint feature corresponding to the maximum voiceprint similarity as the target voiceprint feature and broadcasting the preset voice timbre corresponding to the target voiceprint feature; and when the voiceprint similarity obtained through matching does not reach the voiceprint similarity threshold value, broadcasting by using default synthesized timbre.
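The threshold-and-maximum logic just described can be sketched as follows. The voiceprint vectors, names, and the 0.8 threshold are invented for illustration, and cosine similarity stands in for whatever similarity measure a real voiceprint system would use; returning `None` signals the fall-back to the default synthesized timbre.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def match_voiceprint(query, database, threshold=0.8):
    """Return (person, similarity) for the best match above the
    threshold, or (None, best_similarity) to signal that the default
    synthesized timbre should be used instead."""
    best_person, best_sim = None, -1.0
    for person, print_vec in database.items():
        sim = cosine(query, print_vec)
        if sim > best_sim:
            best_person, best_sim = person, sim
    if best_sim > threshold:
        return best_person, best_sim
    return None, best_sim  # no match: fall back to the default timbre


db = {"alice": [1.0, 0.0, 0.2], "bob": [0.1, 1.0, 0.0]}
person, sim = match_voiceprint([0.9, 0.1, 0.2], db)
print(person)  # best match above the threshold
```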
In order to further ensure that the broadcasted predetermined voice tone has a high matching degree with the voice data of the speaker, and increase the interest and personalization of the interaction process, according to another embodiment of the present application, determining the preset voice tone corresponding to the target person as the predetermined voice tone includes: searching the preset voice timbre corresponding to the target person in a timbre library; and determining the preset voice tone as the preset voice tone.
In another exemplary embodiment of the present application, determining the predetermined voice timbre for broadcasting according to the voice data includes: extracting a speech feature of the voice data; and determining the predetermined voice timbre according to the speech feature of the voice data. Determining the predetermined voice timbre by extracting speech features does not require dividing voice styles such as active, steady, humorous, lovely, or lifelike; it is highly practicable, can accurately predict and judge the speaker's speech features, and thus meets the need to intelligently recommend corresponding timbres to different people.
Specifically, the speech features include pitch frequency and/or formant bandwidth, and may also include features such as MFCCs (mel-frequency cepstral coefficients), LPCs (linear prediction coefficients), LPCCs (linear prediction cepstral coefficients), and/or LSFs (line spectral frequencies).
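To make two of these features concrete, the sketch below estimates the pitch period of a frame via a brute-force autocorrelation peak and computes its zero-crossing rate. Real systems would use a DSP library (and features like MFCCs); this pure-Python version exists only to show what "extracting a speech feature" can mean at the signal level.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def pitch_period(frame):
    """Lag (in samples) of the strongest autocorrelation peak past lag 0;
    for voiced speech this approximates the pitch period."""
    n = len(frame)
    best_lag, best_val = 0, float("-inf")
    for lag in range(1, n // 2):
        val = sum(frame[i] * frame[i + lag] for i in range(n - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag


# A pure sine with a period of 8 samples: the estimator should find lag 8.
frame = [math.sin(2 * math.pi * i / 8) for i in range(64)]
period = pitch_period(frame)
print(period)
```

Dividing the sampling rate by the recovered period would give the pitch frequency in Hz.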
In another specific embodiment of the present application, the determining the predetermined voice tone according to the voice feature of the voice data includes: acquiring voice similarity between the voice features of the voice data and each voice feature in a voice feature library; determining the voice features in the voice feature library corresponding to the maximum voice similarity as target voice features; acquiring target identity attribute information corresponding to the target voice feature, wherein the target identity attribute information comprises at least one of the following information: gender, age group, language; and determining the preset voice tone corresponding to the target identity attribute information as the preset voice tone. The preset voice tone is determined by determining the voice feature corresponding to the maximum voice similarity as a target voice feature and acquiring the identity attribute information such as gender, age group, language and the like corresponding to the target voice feature, so that the voice tone intelligently recommended and switched in the voice interaction process is more fit with the identity attribute information of a speaker, and the interestingness and satisfaction of a user in the using process are further improved. Of course, the target identity attribute information may include at least one of gender, age group, and language, and other attribute information such as a speech rate and a speaking rhythm.
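The final lookup step above, from classified identity attributes to a preset timbre, can be sketched as a table keyed on (gender, age group, language). The attribute values and timbre names below are invented for illustration; a real system would populate the table from its timbre library.

```python
# Hypothetical mapping from classified identity attributes to a preset
# timbre; every key and timbre name here is an invented example.
TIMBRE_BY_ATTRIBUTES = {
    ("female", "adult", "mandarin"): "sweet_female",
    ("male", "adult", "mandarin"): "deep_male",
    ("female", "child", "mandarin"): "lively_child",
}


def timbre_for(gender, age_group, language, default="system_default"):
    """Look up the preset timbre for the classified attributes, falling
    back to the system default when no entry exists."""
    return TIMBRE_BY_ATTRIBUTES.get((gender, age_group, language), default)


print(timbre_for("female", "adult", "mandarin"))
print(timbre_for("male", "senior", "cantonese"))  # no entry: default
```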
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.
The embodiment of the present application further provides a voice interaction apparatus, and it should be noted that the voice interaction apparatus according to the embodiment of the present application may be used to execute the voice interaction method according to the embodiment of the present application. The following describes a voice interaction apparatus provided in an embodiment of the present application.
Fig. 2 is a schematic composition diagram of a voice interaction apparatus according to an embodiment of the present application. As shown in Fig. 2, the apparatus includes an obtaining unit 10, a determining unit 20, and a broadcasting unit 30. The obtaining unit is configured to obtain voice data of a speaker; the determining unit is configured to determine, according to the voice data, a predetermined voice timbre for broadcasting; and the broadcasting unit is configured to broadcast with the predetermined voice timbre.
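The three-unit structure can be sketched as plain classes wired together; the class and method names are assumptions, and the determining unit's energy-threshold rule is a deliberately simplistic stand-in for the voiceprint/feature matching described elsewhere in the document.

```python
class ObtainingUnit:
    """Unit 10: obtains the speaker's voice data (here: raw samples)."""
    def obtain(self, source):
        return source


class DeterminingUnit:
    """Unit 20: stand-in for real voiceprint/feature matching — the
    timbre is keyed on average absolute amplitude for illustration."""
    def __init__(self, timbre_table, default="system_default"):
        self.timbre_table = timbre_table
        self.default = default

    def determine(self, voice_data):
        energy = sum(abs(x) for x in voice_data) / max(len(voice_data), 1)
        return self.timbre_table.get(energy > 0.5, self.default)


class BroadcastingUnit:
    """Unit 30: synthesis is stubbed as tagging the reply text."""
    def broadcast(self, text, timbre):
        return f"<{timbre}> {text}"


class VoiceInteractionApparatus:
    def __init__(self):
        self.obtaining = ObtainingUnit()
        self.determining = DeterminingUnit(
            {True: "deep_male", False: "sweet_female"}
        )
        self.broadcasting = BroadcastingUnit()

    def respond(self, source, reply_text):
        data = self.obtaining.obtain(source)
        timbre = self.determining.determine(data)
        return self.broadcasting.broadcast(reply_text, timbre)


apparatus = VoiceInteractionApparatus()
result = apparatus.respond([0.1, 0.2, 0.3], "Hello")
print(result)
```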
In this voice interaction apparatus, the determining unit determines the predetermined voice timbre for broadcasting according to the voice data acquired by the obtaining unit, and the broadcasting unit broadcasts with that timbre. This realizes intelligent recommendation of the voice timbre: the speaker's identity attribute information can be accurately predicted, the recommended voice timbre is guaranteed to match that information, and the need to recommend different timbres to different groups is met with a high degree of personalization, noticeably improving the user's experience and the interest of the interaction.
According to an exemplary embodiment of the present application, the determining unit includes a first extracting module and a first determining module, wherein the first extracting module is configured to extract a voiceprint feature of the voice data; the first determining module is configured to determine the predetermined voice timbre according to the voiceprint feature. Through extracting the voiceprint characteristics of the voice data and determining the preset voice timbre to broadcast according to the voiceprint characteristics, the problem that synthesis and broadcast are monotonous by using the timbre preset by a system is avoided, and the interestingness and individuation of the interactive process are increased.
According to another embodiment of the present application, the first determining module includes a first determining submodule, a second determining submodule, and a third determining submodule. The first determining submodule is configured to determine a voiceprint feature in a voiceprint database that matches the voiceprint feature of the voice data as a target voiceprint feature; the second determining submodule is configured to determine the person corresponding to the target voiceprint feature as a target person; and the third determining submodule is configured to determine the preset voice timbre corresponding to the target person as the predetermined voice timbre. By acquiring the speaker's voiceprint feature and matching it against the voiceprint features in the voiceprint database, the target voiceprint feature is found, the corresponding person is identified as the target person, and the preset voice timbre corresponding to the target person is used for broadcasting as the predetermined voice timbre, further enabling intelligent recommendation of the broadcast timbre, making voice interaction more interesting for the user, and improving satisfaction.
In another specific embodiment of the present application, the first determining sub-module is further configured to obtain voiceprint similarities between voiceprint features of the voice data and each voiceprint feature in the voiceprint database; determining whether the voiceprint similarity is greater than a voiceprint similarity threshold; and under the condition that the voiceprint similarity is greater than the voiceprint similarity threshold, determining the voiceprint feature in the voiceprint database corresponding to the maximum voiceprint similarity as the target voiceprint feature. The device determines the target voiceprint characteristics by acquiring the voiceprint similarity and comparing the voiceprint similarity with the voiceprint similarity threshold, so that the intelligently recommended broadcast tone is matched with the voice data of the user, and the experience and satisfaction of the user in the voice interaction process are further ensured.
Specifically, when the obtained voiceprint features are matched with each voiceprint feature in the voiceprint database to obtain the voiceprint similarity, only when the voiceprint similarity reaches the set voiceprint similarity threshold, determining a target voiceprint feature and broadcasting the corresponding preset voice timbre; when a plurality of voiceprint similarities reaching the voiceprint similarity threshold are obtained through matching, determining the voiceprint feature corresponding to the maximum voiceprint similarity as the target voiceprint feature and broadcasting the preset voice timbre corresponding to the target voiceprint feature; and when the voiceprint similarity obtained through matching does not reach the voiceprint similarity threshold value, broadcasting by using default synthesized timbre.
In order to further ensure that the broadcasted predetermined voice tone has a high matching degree with the voice data of the speaker, and increase the interest and personalization of the interaction process, according to an embodiment of the present application, the third determining sub-module is further configured to search the preset voice tone corresponding to the target person in a tone library; and determining the preset voice tone as the preset voice tone.
According to another specific embodiment of the present application, the determining unit further includes a second extracting module and a second determining module, where the second extracting module is configured to extract a speech feature of the voice data, and the second determining module is configured to determine the predetermined voice timbre according to the speech feature of the voice data. Determining the predetermined voice timbre by extracting speech features does not require dividing voice styles such as active, steady, humorous, lovely, or lifelike; it is highly practicable, can accurately predict and judge the speaker's speech features, and thus meets the need to intelligently recommend corresponding timbres to different groups of people.
In another embodiment of the present application, the second determining module includes a first obtaining sub-module, a fourth determining sub-module, a second obtaining sub-module, and a fifth determining sub-module, where the first obtaining sub-module is configured to obtain a voice similarity between a voice feature of the voice data and each voice feature in a voice feature library; the fourth determining submodule is used for determining the voice feature in the voice feature library corresponding to the maximum voice similarity as a target voice feature; the second obtaining sub-module is configured to obtain target identity attribute information corresponding to the target voice feature, where the target identity attribute information includes at least one of: gender, age group, language; the fifth determining submodule is configured to determine that the preset voice tone corresponding to the target identity attribute information is the predetermined voice tone. The preset voice tone is determined by obtaining the identity attribute information such as gender, age group, language and the like, so that the voice tone intelligently recommended and switched in the voice interaction process is more fit with the identity attribute information of the speaker, and the interestingness and satisfaction of a user in the using process are further improved. Of course, the target identity attribute information may include at least one of gender, age group, and language, and other attribute information such as a speech rate and a speaking rhythm.
The voice interaction device includes a processor and a memory. The acquiring unit 10, the determining unit 20, the broadcasting unit 30, and so on are stored in the memory as program units, and the processor executes the program units stored in the memory to implement the corresponding functions.
The processor includes a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided. By adjusting kernel parameters, the problem in the prior art that it is difficult to broadcast with different timbres for different speakers is addressed.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
An embodiment of the present invention provides a storage medium, on which a program is stored, and the program, when executed by a processor, implements the above-mentioned voice interaction method.
The embodiment of the invention provides a processor, which is used for running a program, wherein the voice interaction method is executed when the program runs.
An embodiment of the invention provides a device, which includes a processor, a memory, and a program stored on the memory and executable on the processor, where the processor, when executing the program, implements at least the following steps:
step S101, acquiring voice data of a speaker;
step S102, determining a preset voice tone to be broadcasted according to the voice data;
and step S103, broadcasting with the preset voice tone.
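A minimal skeleton of steps S101 to S103, with each unit stood in for by an injected callable; all names and stub values here are hypothetical:

```python
def voice_interact(acquire, determine_timbre, broadcast):
    # Step S101: acquire the speaker's voice data.
    voice_data = acquire()
    # Step S102: determine the preset voice tone to be broadcasted from the voice data.
    timbre = determine_timbre(voice_data)
    # Step S103: broadcast using the determined timbre.
    return broadcast(timbre)

# Stub usage: pretend PCM bytes in, a fixed timbre out.
result = voice_interact(
    acquire=lambda: b"\x00\x01",
    determine_timbre=lambda data: "timbre_A",
    broadcast=lambda t: "broadcast using " + t,
)
```

In practice, `determine_timbre` would encapsulate either the voiceprint-matching path or the voice-feature path described above, and `broadcast` would drive a speech synthesizer.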
The device herein may be a server, a PC, a tablet (PAD), a mobile phone, or the like.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program that initializes at least the following method steps:
step S101, acquiring voice data of a speaker;
step S102, determining a preset voice tone to be broadcasted according to the voice data;
and step S103, broadcasting with the preset voice tone.
In the above embodiments of the present invention, each embodiment is described with its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the above-described division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
In order to make the technical solutions of the present application more clearly understood by those skilled in the art, the following description will be given with reference to specific embodiments.
Example 1
In private settings such as the home or a vehicle, a user can conveniently register voiceprint features on his or her own smart device running the voice interaction method, and the method can then identify which member the speaker is from the voiceprint features. A preset voice timbre matched with each voiceprint feature is configured as the predetermined voice timbre: for example, the male owner is set to broadcast with preset voice timbre A, the female owner with preset voice timbre B, the boy with preset voice timbre C, and the default is preset voice timbre D. After the voice interaction method collects the voice data, the target person can be determined through the voiceprint features, and the broadcast is then automatically switched to and synthesized in the predetermined voice timbre. In particular, if a stranger outside the family, i.e., a person who has not registered voiceprint features in advance and has no personalized timbre configured, performs voice interaction, the method cannot identify the user identity, and the default timbre D is used for broadcasting.
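The household scenario above, registered members plus a default timbre for strangers, can be sketched as follows; the registry contents, similarity function, and threshold value are all hypothetical:

```python
# Hypothetical household registry: member id -> registered voiceprint and preset timbre.
REGISTERED = {
    "male_owner":   {"voiceprint": "vp_a", "timbre": "timbre_A"},
    "female_owner": {"voiceprint": "vp_b", "timbre": "timbre_B"},
    "boy":          {"voiceprint": "vp_c", "timbre": "timbre_C"},
}
DEFAULT_TIMBRE = "timbre_D"   # broadcast timbre for unregistered speakers
SIMILARITY_THRESHOLD = 0.8    # assumed voiceprint similarity threshold

def pick_timbre(voiceprint, similarity):
    """Return the preset timbre of the best-matching registered member;
    fall back to the default timbre when no match clears the threshold."""
    best_member, best_score = None, 0.0
    for member, entry in REGISTERED.items():
        score = similarity(voiceprint, entry["voiceprint"])
        if score > best_score:
            best_member, best_score = member, score
    if best_member is not None and best_score > SIMILARITY_THRESHOLD:
        return REGISTERED[best_member]["timbre"]
    return DEFAULT_TIMBRE
```

The threshold check mirrors the claimed behavior: only a match above the similarity threshold selects a personalized timbre, and any stranger falls through to timbre D.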
Example 2
In public settings such as a hall, users have not registered voiceprint features in advance, and a smart device running the voice interaction method cannot accurately identify user identity information. In this case, the correspondence between target identity attribute information and predetermined voice timbres is preset through the voice interaction method: for example, middle-aged man + Cantonese is set to broadcast with timbre A, and girl + Mandarin with timbre B. After acquiring voice data, the smart device extracts the voice features of the voice data, obtains the target identity attribute information from those features (for example, male, middle-aged, Cantonese; or female, child, Mandarin), determines the predetermined voice timbre, and automatically switches to it and synthesizes the broadcast in it. The attributes here include the following categories: age group (child, young, middle-aged, elderly); gender (male, female); and language (Mandarin, English, Cantonese, Sichuanese, Shanghainese). Of course, the target identity attribute information may further include other information, and the age group and language may include other categories; which user attributes can be identified through voice features depends on the development of current voice classification technology.
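The attribute-to-timbre correspondence in this example can be sketched as a lookup table; the attribute values, timbre names, and fallback are hypothetical:

```python
# Hypothetical mapping from (gender, age group, language) to a preset timbre,
# mirroring "middle-aged man + Cantonese -> timbre A, girl + Mandarin -> timbre B".
ATTRIBUTE_TIMBRES = {
    ("male", "middle-aged", "Cantonese"): "timbre_A",
    ("female", "child", "Mandarin"):      "timbre_B",
}
FALLBACK_TIMBRE = "timbre_default"  # assumed fallback when no rule matches

def timbre_for(gender, age_group, language):
    # Look up the preset timbre for the classified identity attributes.
    return ATTRIBUTE_TIMBRES.get((gender, age_group, language), FALLBACK_TIMBRE)
```

A table keyed on the full attribute tuple keeps the correspondence explicit and easy to extend; partial-match rules (e.g. ignoring language) would need a looser matching scheme.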
From the above description, it can be seen that the above-described embodiments of the present application achieve the following technical effects:
1) A predetermined voice timbre for broadcasting is determined according to the acquired voice data, so that the broadcast timbre can be intelligently recommended and switched when interacting with different users, achieving broadcasting with different voice timbres. At the same time, the method does not need to rely on dividing and defining voice styles such as a lively type, a steady type, a humorous type, a cute type, or an earnest type; it can accurately predict and judge the predetermined voice timbre corresponding to the speaker, meets the requirement of recommending different timbres to different crowds, and significantly improves the interest and personalized experience of users.
2) The voice interaction device includes an acquiring unit, a determining unit, and a broadcasting unit, where the determining unit determines the predetermined voice timbre for broadcasting according to the voice data acquired by the acquiring unit, and the broadcasting unit broadcasts with the predetermined voice timbre.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (11)

1. A method of voice interaction, comprising:
acquiring voice data of a speaker;
determining a preset voice tone to be broadcasted according to the voice data;
and broadcasting with the preset voice tone.
2. The method of claim 1, wherein determining the preset voice tone to be broadcasted according to the voice data comprises:
extracting voiceprint features of the voice data;
and determining the preset voice tone according to the voiceprint characteristics.
3. The method of claim 2, wherein determining the preset voice tone according to the voiceprint features comprises:
determining a voiceprint feature matched with the voiceprint feature of the voice data in a voiceprint database as a target voiceprint feature;
determining the person corresponding to the target voiceprint feature as a target person;
and determining the preset voice tone corresponding to the target person as the preset voice tone to be broadcasted.
4. The method of claim 3, wherein determining the voiceprint features in the voiceprint database that match the voiceprint features of the speech data as target voiceprint features comprises:
acquiring voiceprint similarity between the voiceprint features of the voice data and each voiceprint feature in the voiceprint database;
determining whether the voiceprint similarity is greater than a voiceprint similarity threshold;
and under the condition that the voiceprint similarity is greater than the voiceprint similarity threshold, determining the voiceprint feature in the voiceprint database corresponding to the maximum voiceprint similarity as the target voiceprint feature.
5. The method according to claim 3 or 4, wherein determining the preset voice tone corresponding to the target person as the preset voice tone to be broadcasted comprises:
searching the preset voice tone corresponding to the target person in a tone library;
and determining the found preset voice tone as the preset voice tone to be broadcasted.
6. The method of claim 1, wherein determining the preset voice tone to be broadcasted according to the voice data comprises:
extracting voice features of the voice data;
and determining the preset voice tone according to the voice characteristics of the voice data.
7. The method of claim 6, wherein determining the preset voice tone according to the voice features of the voice data comprises:
acquiring voice similarity between the voice features of the voice data and each voice feature in a voice feature library;
determining the voice features in the voice feature library corresponding to the maximum voice similarity as target voice features;
acquiring target identity attribute information corresponding to the target voice feature, wherein the target identity attribute information comprises at least one of the following information: gender, age group, language;
and determining the preset voice tone corresponding to the target identity attribute information as the preset voice tone to be broadcasted.
8. A voice interaction apparatus, comprising:
the acquisition unit is used for acquiring voice data of a speaker;
the determining unit is used for determining a preset voice tone to be broadcasted according to the voice data;
and the broadcasting unit is used for broadcasting with the preset voice tone.
9. A computer-readable storage medium, characterized in that the storage medium comprises a stored program, wherein the program performs the method of any one of claims 1 to 7.
10. A processor, characterized in that the processor is configured to run a program, wherein the program when running performs the method of any of claims 1 to 7.
11. A voice interaction system, comprising: one or more processors, memory, and one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing the method of any of claims 1-7.
CN202011034411.4A 2020-09-27 2020-09-27 Voice interaction method and device, computer readable storage medium and processor Pending CN112185344A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011034411.4A CN112185344A (en) 2020-09-27 2020-09-27 Voice interaction method and device, computer readable storage medium and processor


Publications (1)

Publication Number Publication Date
CN112185344A true CN112185344A (en) 2021-01-05

Family

ID=73943762


Country Status (1)

Country Link
CN (1) CN112185344A (en)


Citations (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1156871A (en) * 1995-11-17 1997-08-13 雅马哈株式会社 Personal information database system
JP2006263348A (en) * 2005-03-25 2006-10-05 Toshiba Corp Device, method, and program for identifying user
CN101467204A (en) * 2005-05-27 2009-06-24 普提克斯科技股份有限公司 Method and system for bio-metric voice print authentication
CN103236259A (en) * 2013-03-22 2013-08-07 乐金电子研发中心(上海)有限公司 Voice recognition processing and feedback system, voice response method
CN105425953A (en) * 2015-11-02 2016-03-23 小天才科技有限公司 Man-machine interaction method and system
CN106328139A (en) * 2016-09-14 2017-01-11 努比亚技术有限公司 Voice interaction method and voice interaction system
US20170287489A1 (en) * 2016-04-01 2017-10-05 Intel Corporation Synthetic oversampling to enhance speaker identification or verification
CN107357875A (en) * 2017-07-04 2017-11-17 北京奇艺世纪科技有限公司 A kind of voice search method, device and electronic equipment
CN107507620A (en) * 2017-09-25 2017-12-22 广东小天才科技有限公司 Voice broadcast sound setting method and device, mobile terminal and storage medium
CN108737872A (en) * 2018-06-08 2018-11-02 百度在线网络技术(北京)有限公司 Method and apparatus for output information
CN109065035A (en) * 2018-09-06 2018-12-21 珠海格力电器股份有限公司 information interaction method and device
CN109101801A (en) * 2018-07-12 2018-12-28 北京百度网讯科技有限公司 Method for identity verification, device, equipment and computer readable storage medium
CN109273002A (en) * 2018-10-26 2019-01-25 蔚来汽车有限公司 Vehicle configuration method, system, vehicle device and vehicle
CN109272984A (en) * 2018-10-17 2019-01-25 百度在线网络技术(北京)有限公司 Method and apparatus for interactive voice
CN109873907A (en) * 2019-03-29 2019-06-11 彭舒婷 Call processing method, device, computer equipment and storage medium
CN110047490A (en) * 2019-03-12 2019-07-23 平安科技(深圳)有限公司 Method for recognizing sound-groove, device, equipment and computer readable storage medium
CN110085225A (en) * 2019-04-24 2019-08-02 北京百度网讯科技有限公司 Voice interactive method, device, intelligent robot and computer readable storage medium
CN110248021A (en) * 2019-05-10 2019-09-17 百度在线网络技术(北京)有限公司 A kind of smart machine method for controlling volume and system
CN110379432A (en) * 2019-07-22 2019-10-25 嘉兴沐栗服饰有限公司 A kind of speech recognition system accurately judging talker according to wavelength and word speed
CN110797032A (en) * 2020-01-06 2020-02-14 深圳中创华安科技有限公司 Voiceprint database establishing method and voiceprint identification method
CN110838294A (en) * 2019-11-11 2020-02-25 效生软件科技(上海)有限公司 Voice verification method and device, computer equipment and storage medium
CN111210829A (en) * 2020-02-19 2020-05-29 腾讯科技(深圳)有限公司 Speech recognition method, apparatus, system, device and computer readable storage medium
CN111292734A (en) * 2018-12-06 2020-06-16 阿里巴巴集团控股有限公司 Voice interaction method and device
US20200227049A1 (en) * 2019-01-11 2020-07-16 Baidu Online Network Technology (Beijing) Co., Ltd. Method, apparatus and device for waking up voice interaction device, and storage medium
CN111554302A (en) * 2020-03-31 2020-08-18 深圳壹账通智能科技有限公司 Strategy adjusting method, device, terminal and storage medium based on voiceprint recognition


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114999438A (en) * 2021-05-08 2022-09-02 中移互联网有限公司 Audio playing method and device
CN114999438B (en) * 2021-05-08 2023-08-15 中移互联网有限公司 Audio playing method and device
WO2023185004A1 (en) * 2022-03-29 2023-10-05 青岛海尔空调器有限总公司 Tone switching method and apparatus
WO2023207472A1 (en) * 2022-04-28 2023-11-02 腾讯音乐娱乐科技(深圳)有限公司 Audio synthesis method, electronic device and readable storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination