CN113299275A - Method and system for realizing voice interaction, service end, client and intelligent sound box - Google Patents

Info

Publication number
CN113299275A
Authority
CN
China
Prior art keywords
voice
awakening
user
interaction
voice interaction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110556758.3A
Other languages
Chinese (zh)
Inventor
周光东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Innovation Co
Original Assignee
Alibaba Singapore Holdings Pte Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Singapore Holdings Pte Ltd filed Critical Alibaba Singapore Holdings Pte Ltd
Priority to CN202110556758.3A priority Critical patent/CN113299275A/en
Publication of CN113299275A publication Critical patent/CN113299275A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44: Arrangements for executing specific programs
    • G06F9/4401: Bootstrapping
    • G06F9/4418: Suspend and resume; Hibernate and awake
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223: Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

On one hand, in the model training stage, training on corpora from a large number of different speakers yields a plurality of speakers with different voice characteristics, so that a user of the voice interaction device can set a preferred voice as the speaker for voice interaction with the device. In addition, diversified wake-up words are generated in the model training stage, enriching the wake-up operations between the user and the voice interaction device. On the other hand, the user can select a desired wake-up word and a preferred voice as the speaker for subsequent voice interaction with the client, thereby realizing personalized settings for the voice interaction product and improving the user experience.

Description

Method and system for realizing voice interaction, service end, client and intelligent sound box
Technical Field
The present application relates to, but is not limited to, intelligent voice technology, and in particular to a method and system for implementing voice interaction, a server, a client, and a smart speaker.
Background
Most voice interaction products in the related art adopt a single-role implementation, that is, only one fixed wake-up mode is available. As a result, these products cannot be personalized for individual users.
Disclosure of Invention
The present application provides a method and system for implementing voice interaction, a server, a client, and a smart speaker, which enable personalized settings for voice interaction products and improve the user experience.
An embodiment of the invention provides a system for realizing voice interaction, comprising a server and a client, wherein:
the server comprises a voice synthesis processing module, a voice wake-up processing module, and a publishing module, wherein:
the voice synthesis processing module is configured to perform speech synthesis training using corpus information comprising real recordings of a plurality of speakers, to obtain a speech synthesis model;
the voice wake-up processing module is configured to train the defined wake-up words using audio data of the plurality of speakers synthesized by the speech synthesis model, to obtain a voice wake-up model;
the publishing module is configured to publish the audio training data corresponding to the different speakers obtained by training, together with the defined wake-up words;
the client comprises a settings processing module and an interaction module, wherein:
the settings processing module is configured to select, according to the user's requirements, the current wake-up word and the speaker for voice interaction from the published audio training data and wake-up words;
the interaction module is configured to receive a wake-up word uttered by the user and wake up the device, and to carry out voice interaction with the user using the speaker selected by the user.
An embodiment of the present application further provides a server, comprising a voice synthesis processing module, a voice wake-up processing module, and a publishing module, wherein:
the voice synthesis processing module is configured to perform speech synthesis training using corpus information comprising real recordings of a plurality of speakers, to obtain a speech synthesis model;
the voice wake-up processing module is configured to train the defined wake-up words using audio data of the plurality of speakers synthesized by the speech synthesis model, to obtain a voice wake-up model;
and the publishing module is configured to publish the audio training data corresponding to the different speakers obtained by training, together with the defined wake-up words.
An embodiment of the present application further provides a client, comprising a settings processing module and an interaction module, wherein:
the settings processing module is configured to select, according to the user's requirements, the current wake-up word and the speaker for voice interaction from the published audio training data and wake-up words;
the interaction module is configured to receive a wake-up word uttered by the user and wake up the device, and to carry out voice interaction with the user using the speaker selected by the user.
In one illustrative example, the settings processing module is further configured to:
download some or all of the published audio training data and wake-up words to the client.
In one illustrative example, the client further comprises an upload processing module configured to record a wake-up word set by the user according to the user's requirements, and to upload the recorded wake-up word to the server.
An embodiment of the present application further provides a method for implementing voice interaction, comprising:
performing speech synthesis training using corpus information comprising real recordings of a plurality of speakers, to obtain a speech synthesis model;
training the defined wake-up words using audio data of the plurality of speakers synthesized by the speech synthesis model, to obtain a voice wake-up model;
and publishing the audio training data corresponding to the different speakers obtained by training, together with the defined wake-up words.
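The three server-side steps above can be sketched as a minimal pipeline. This is an illustrative sketch only: the function names and data shapes are assumptions for the example, and the "training" is reduced to collecting speaker names and synthesizing (speaker, wake-up word) pairs.

```python
# Hypothetical sketch of the server-side flow: train a speech synthesis model,
# use it to synthesize wake-word audio per speaker, then publish the results.
# All names and data shapes here are illustrative, not from the patent itself.

def train_speech_synthesis(corpus):
    # "Train" a speech synthesis model from real recordings of many speakers;
    # here the model is just the set of distinct speaker names in the corpus.
    return {"speakers": sorted({rec["speaker"] for rec in corpus})}

def train_voice_wakeup(tts_model, wake_words):
    # Synthesize each defined wake-up word in every speaker's voice and use
    # that audio as training data for the voice wake-up model.
    audio = [(spk, w) for spk in tts_model["speakers"] for w in wake_words]
    return {"training_audio": audio, "wake_words": list(wake_words)}

def publish(tts_model, wake_model):
    # Publish the per-speaker audio training data and the defined wake words.
    return {
        "speakers": tts_model["speakers"],
        "wake_words": wake_model["wake_words"],
        "audio_items": len(wake_model["training_audio"]),
    }

corpus = [{"speaker": "xiaoyun", "text": "..."}, {"speaker": "xiaogang", "text": "..."}]
tts = train_speech_synthesis(corpus)
wake = train_voice_wakeup(tts, ["xiaoyun", "xiaogang"])
catalog = publish(tts, wake)
```

With 2 speakers and 2 wake-up words this yields 2 × 2 = 4 audio items, mirroring the patent's later example of 100 speakers × 100 recordings = 10,000 items.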
An embodiment of the present application further provides a method for implementing voice interaction, comprising:
selecting, according to the user's requirements, the current wake-up word and the speaker for voice interaction from the published audio training data and wake-up words;
waking up the device according to the received wake-up word; and performing voice interaction with the user using the selected speaker.
In one illustrative example, the method further comprises:
downloading some or all of the published audio training data and wake-up words.
In one illustrative example, the method further comprises:
recording a wake-up word set by the user according to the user's requirements; and uploading the recorded wake-up word to the server.
An embodiment of the present application further provides a method for implementing voice interaction, comprising:
displaying to the user the audio training data and wake-up words corresponding to the different speakers;
setting, according to the user's selection, a wake-up word for waking up the voice interaction device and a speaker for voice interaction with the user;
and downloading the corresponding speech synthesis model and voice wake-up model according to the set wake-up word and speaker.
In one illustrative example, the wake-up word is the name of the speaker.
In one illustrative example, the method further comprises:
receiving a recording instruction from the user, and recording a user-defined wake-up word;
and uploading the recorded wake-up word to the server, so that the new wake-up word is trained and the voice wake-up model is updated.
An embodiment of the present application further provides a method for implementing voice interaction, comprising:
receiving voice information from the user, and waking up the voice interaction device when the voice information is judged, according to the set wake-up word and the downloaded voice wake-up model, to match the voice wake-up model;
and using the downloaded speech synthesis model to interact with the user, with the set speaker acting as the voice assistant of the woken device.
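The two device-side steps above can be sketched as a simple wake-then-interact loop. In this hedged sketch, the voice wake-up model's matching is reduced to a string comparison, and text-to-speech is reduced to tagging replies with the selected speaker's name; the class and method names are assumptions.

```python
# Illustrative sketch of the device-side behavior: stay asleep until the
# configured wake-up word is heard, then answer in the selected speaker's
# voice. String matching stands in for the real wake-up model.

class VoiceAssistant:
    def __init__(self, wake_word, speaker):
        self.wake_word = wake_word  # chosen by the user from the wake-word list
        self.speaker = speaker      # voice used for synthesized responses
        self.awake = False

    def hear(self, utterance):
        # While asleep, react only to the configured wake-up word.
        if not self.awake:
            if utterance == self.wake_word:
                self.awake = True
                return f"[{self.speaker}] I'm here."
            return None
        # Once awake, answer every utterance in the selected speaker's voice.
        return f"[{self.speaker}] You said: {utterance}"

bot = VoiceAssistant(wake_word="xiaoyun", speaker="xiaoyun")
assert bot.hear("hello") is None   # ignored while asleep
greeting = bot.hear("xiaoyun")     # wake-up word matches, device wakes
reply = bot.hear("play music")     # normal interaction after waking
```

Setting `wake_word` equal to `speaker` reflects the wake/broadcast consistency the description emphasizes: the same voice role both wakes the device and replies.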
An embodiment of the present application further provides a method for implementing voice interaction, comprising:
displaying to the user a role list and a wake-up word list for the voice assistant that interacts with the voice interaction device;
determining, according to the user's selection, the wake-up word and the voice assistant currently interacting with the voice interaction device;
receiving a wake-up word from the user and waking up the voice interaction device;
and performing voice interaction with the voice assistant selected by the user.
In one illustrative example, the method further comprises:
receiving a recording instruction from the user, and recording a user-defined wake-up word;
and uploading the recorded wake-up word to the server, so that the new wake-up word is trained and the wake-up word list is updated.
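The custom wake-up word flow above can be sketched as follows. This is a minimal illustration, assuming hypothetical names: the server "retrains" simply by adding the uploaded wake-up word to its published list, standing in for actual model retraining.

```python
# Minimal sketch of the user-defined wake-word flow: the client records a
# custom wake-up word and uploads it; the server incorporates it and returns
# the updated wake-word list. Names and shapes are illustrative.

class WakeWordServer:
    def __init__(self):
        self.wake_words = ["xiaoyun"]  # initially published wake-word list

    def receive_recording(self, wake_word, audio):
        # Uploaded recordings extend the wake-word training set; after
        # (simulated) retraining, the updated list is published to clients.
        if wake_word not in self.wake_words:
            self.wake_words.append(wake_word)
        return list(self.wake_words)

server = WakeWordServer()
updated = server.receive_recording("hey robot", audio=b"\x00\x01")
```

In the full system, the uploaded audio would also feed back into the speech synthesis training as real-voice corpus data, as the description notes later.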
An embodiment of the present application further provides a smart speaker, comprising a memory and a processor, wherein the memory stores instructions executable by the processor for implementing the method for voice interaction provided by the embodiments of the present application.
An embodiment of the present application further provides a device for implementing voice interaction, comprising a memory and a processor, wherein the memory stores instructions executable by the processor for performing the steps of any of the above methods for implementing voice interaction.
In the embodiments of the present application, in the model training stage, training on corpora from a large number of different speakers yields a plurality of speakers with different voice characteristics, so that a user of the voice interaction device can set a preferred voice as the speaker for voice interaction with the device. In addition, diversified wake-up words are generated in the model training stage, enriching the wake-up operations between the user and the voice interaction device. With the system for realizing voice interaction provided by the embodiments of the present application, a user can select a desired wake-up word and a preferred voice as the speaker for subsequent voice interaction with the client, thereby realizing personalized settings for the voice interaction product and improving the user experience.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the claimed subject matter and are incorporated in and constitute a part of this specification, illustrate embodiments of the subject matter and together with the description serve to explain the principles of the subject matter and not to limit the subject matter.
FIG. 1 is a schematic structural diagram of a system for implementing voice interaction in an embodiment of the present application;
FIG. 2 is a flowchart of an embodiment of a method for implementing voice interaction;
FIG. 3 is a flow chart of another embodiment of a method for implementing voice interaction according to the present application;
FIG. 4 is a schematic view of an application scenario of the method for implementing voice interaction according to the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more apparent, embodiments of the present application will be described in detail below with reference to the accompanying drawings. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.
In one exemplary configuration of the present application, a computing device includes one or more processors (CPUs), input/output interfaces, a network interface, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.
The steps illustrated in the flowcharts of the figures may be performed in a computer system, for example as a set of computer-executable instructions. Also, while a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.
Fig. 1 is a schematic structural diagram of a system for implementing voice interaction in an embodiment of the present application. As shown in Fig. 1, the system includes a server and a client, wherein:
the server comprises a voice synthesis processing module, a voice wake-up processing module, and a publishing module, wherein:
the voice synthesis processing module is configured to perform speech synthesis training using corpus information comprising real recordings of a plurality of speakers, to obtain a speech synthesis model;
the voice wake-up processing module is configured to train the defined wake-up words using audio data of the plurality of speakers synthesized by the speech synthesis model, to obtain a voice wake-up model;
and the publishing module is configured to publish the audio training data corresponding to the different speakers obtained by training, together with the defined wake-up words.
For example, assuming there are 100 speakers, each generating 100 pieces of synthesized audio data, 10,000 pieces of audio training data are obtained in total.
The client comprises a settings processing module and an interaction module, wherein:
the settings processing module is configured to select, according to the user's requirements, the current wake-up word and the speaker for voice interaction from the published audio training data and wake-up words;
the interaction module is configured to receive a wake-up word uttered by the user and wake up the device, and to carry out voice interaction with the user using the speaker selected by the user.
In an exemplary embodiment, the settings processing module may be further configured to:
download to the client some or all of the published audio training data corresponding to the different speakers and wake-up words, as the client's locally selectable audio training data and wake-up words.
The published audio training data may include a speech synthesis model, a voice wake-up model, and a list of models covering different wake-up words and/or different speaker information, so that the client can select a corresponding model from the visible model list to use.
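One possible shape for such a published model list, and the client-side lookup over it, is sketched below. The field names and file paths are assumptions made for the example, not a format defined by the patent.

```python
# Hypothetical published model list: one entry per speaker, pairing a
# wake-up word with its speech synthesis and voice wake-up model files.
model_list = [
    {"speaker": "xiaoyun", "wake_word": "xiaoyun",
     "tts_model": "tts/xiaoyun.bin", "wake_model": "kws/xiaoyun.bin"},
    {"speaker": "xiaogang", "wake_word": "xiaogang",
     "tts_model": "tts/xiaogang.bin", "wake_model": "kws/xiaogang.bin"},
]

def select_models(model_list, speaker):
    # The client picks the entry for the chosen speaker and downloads both
    # the speech synthesis model and the voice wake-up model together.
    entry = next(e for e in model_list if e["speaker"] == speaker)
    return entry["tts_model"], entry["wake_model"]

tts_path, kws_path = select_models(model_list, "xiaoyun")
```

Keeping both model references in one entry is what lets the client download the synthesis and wake-up models in a single selection step, as the description requires.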
With the system for realizing voice interaction provided by the embodiments of the present application, in the model training stage, training on corpora from a large number of different speakers yields a variety of speakers with different voice characteristics, so that a user can set a preferred voice as the speaker for voice interaction with the voice interaction device. In addition, diversified wake-up words are generated in the model training stage, enriching the wake-up operations between the user and the voice interaction device. With this system, a user can select a desired wake-up word and a preferred voice as the speaker for subsequent voice interaction with the client, thereby realizing personalized settings for the voice interaction product and improving the user experience.
With the development of speech synthesis technology, the use of celebrity voices as speakers in voice interaction products has become very common, and there is a natural demand to use those speakers for wake-up as well. In one application scenario, the system provided by the embodiments of the present application achieves consistency between wake-up and broadcast; this consistency makes voice interaction more personalized and conforms to the habits of natural interaction, thereby increasing the user's interest in the voice interaction product. In this system, on the one hand, when the server confirms a speaking role, it generates the speech synthesis model and the voice wake-up model at the same time and publishes them online together; on the other hand, when the client selects a speaking role, it downloads the speech synthesis model and the voice wake-up model from the server at the same time, so that the user can use the name of the new speaker as a wake-up word to wake up the voice assistant (also called the broadcaster) provided by the voice interaction product, while the voice assistant broadcasts in the newly set speaker's voice. This changes the monotonous experience in the related art, where only one fixed wake-up mode is available, and achieves the effect that the role of the voice assistant can be changed.
In an illustrative example, the client may further include an upload processing module configured to record a wake-up word set by the user according to the user's requirements, and to upload the recorded wake-up word to the server. In this way, the voice wake-up processing module in the server can further train the new wake-up word to refine the voice wake-up model, and the audio uploaded by the user can also serve as corpus information of a speaker's real voice for the speech synthesis training of the voice synthesis processing module, further refining the speech synthesis model.
In the system for realizing voice interaction provided by the embodiments of the present application, by providing new wake-up word recording data to the server, the client lets the user set, according to personal preference, which wake-up word is used to wake up the client. Through such iterative refinement, the wake-up mode becomes changeable; compared with a fixed wake-up mode, the wake-up effect is better guaranteed and the user experience is better.
The present application further provides a server, comprising a voice synthesis processing module, a voice wake-up processing module, and a publishing module, wherein:
the voice synthesis processing module is configured to perform speech synthesis training using corpus information comprising real recordings of a plurality of speakers, to obtain a speech synthesis model;
the voice wake-up processing module is configured to train the defined wake-up words using audio data of the plurality of speakers synthesized by the speech synthesis model, to obtain a voice wake-up model;
and the publishing module is configured to publish the audio training data corresponding to the different speakers obtained by training, together with the defined wake-up words.
When the server provided by the present application confirms a speaking role, it generates the speech synthesis model and the voice wake-up model at the same time and publishes them online together. In the model training stage, a plurality of speakers with different voice characteristics can be obtained by training on corpora from a large number of different speakers, so that a user can set a preferred voice as the speaker for voice interaction with the voice interaction device. In addition, diversified wake-up words are generated in the model training stage, enriching the wake-up operations between the user and the voice interaction device.
The present application further provides a client, comprising a settings processing module and an interaction module, wherein:
the settings processing module is configured to select, according to the user's requirements, the current wake-up word and the speaker for voice interaction from the published audio training data and wake-up words;
the interaction module is configured to receive a wake-up word uttered by the user and wake up the device, and to carry out voice interaction with the user using the speaker selected by the user.
In an exemplary embodiment, the settings processing module may be further configured to:
download to the client some or all of the published audio training data and wake-up words, as the client's locally selectable audio training data and wake-up words.
When the client selects a speaking role, it can download the speech synthesis model and the voice wake-up model from the server at the same time, so that the user can use the name of the new speaker as a wake-up word to wake up the voice assistant (also called the broadcaster) provided by the voice interaction product, while the voice assistant broadcasts in the newly set speaker's voice. This changes the monotonous experience in the related art, where only one fixed wake-up mode is available, and achieves the effect that the role of the voice assistant can be changed. The user can select a desired wake-up word and a preferred voice as the speaker for subsequent voice interaction with the client, thereby realizing personalized settings for the voice interaction product and improving the user experience.
In an illustrative example, the client may further include an upload processing module configured to record a wake-up word set by the user according to the user's requirements, and to upload the recorded wake-up word to the server. In this way, the voice wake-up processing module in the server can further train the new wake-up word to refine the voice wake-up model, and the audio uploaded by the user can also serve as corpus information of a speaker's real voice for the speech synthesis training of the voice synthesis processing module, further refining the speech synthesis model.
With the client provided by the embodiments of the present application, by providing new wake-up word recording data to the server, the user can set, according to personal preference, which wake-up word is used to wake up the client. Through such iterative refinement, the wake-up mode becomes changeable; compared with a fixed wake-up mode, the wake-up effect is better guaranteed and the user experience is better.
Fig. 2 is a flowchart of an embodiment of a method for implementing voice interaction according to the present application. As shown in Fig. 2, for the server the method includes:
Step 200: performing speech synthesis training using corpus information comprising real recordings of a plurality of speakers, to obtain a speech synthesis model.
Step 201: training the defined wake-up words using audio data of the plurality of speakers synthesized by the speech synthesis model, to obtain a voice wake-up model.
Step 202: publishing the audio training data corresponding to the different speakers obtained by training, together with the defined wake-up words.
With the method for implementing voice interaction provided by the present application, in the model training stage the server can obtain a variety of speakers with different voice characteristics by training on corpora from a large number of different speakers, so that a user can set a preferred voice as the speaker for voice interaction with the voice interaction device. In addition, diversified wake-up words are generated in the model training stage, enriching the wake-up operations between the user and the voice interaction device.
Fig. 3 is a flowchart of another embodiment of a method for implementing voice interaction according to the present application. As shown in Fig. 3, for the client the method includes:
Step 300: selecting, according to the user's requirements, the current wake-up word and the speaker for voice interaction from the published audio training data and wake-up words.
Step 301: waking up the device according to the received wake-up word; and performing voice interaction with the user using the selected speaker.
With this method for implementing voice interaction, a user can select a desired wake-up word at the client and a preferred voice as the speaker for subsequent voice interaction with the client, thereby realizing personalized settings for the voice interaction product and improving the user experience.
In an illustrative example, the client-side method may further include:
downloading to the client some or all of the published audio training data and wake-up words, as the client's locally selectable audio training data and wake-up words.
In an illustrative example, the client-side method may further include:
recording a wake-up word set by the user according to the user's requirements, and uploading the recorded wake-up word to the server. In this way, the voice wake-up processing module in the server can further train the new wake-up word to refine the voice wake-up model, and the audio uploaded by the user can also serve as corpus information of a speaker's real voice for the speech synthesis training of the voice synthesis processing module, further refining the speech synthesis model.
In this embodiment, the client provides new wake-up word recording data to the server, and the user can set, according to personal preference, which wake-up word is used to wake up the client. Through such iterative refinement, the wake-up mode becomes changeable; compared with a fixed wake-up mode, the wake-up effect is better guaranteed and the user experience is better.
The present application also provides a computer-readable storage medium storing computer-executable instructions for performing any one of the above methods for implementing voice interaction.
The present application further provides a device for implementing voice interaction, comprising a memory and a processor, wherein the memory stores instructions executable by the processor for performing the steps of any one of the above methods for implementing voice interaction.
An embodiment of the present application further provides a method for implementing voice interaction, including:
displaying, to the user, audio training data and wake-up words corresponding to different speakers;
setting, according to the user's selection, a wake-up word for waking up the voice interaction device and a speaker for performing voice interaction with the user;
downloading the corresponding speech synthesis model and voice wake-up model according to the set wake-up word and speaker.
In this way, the user can select the desired wake-up word at the client according to his or her own needs, and can choose a preferred voice as the speaker for subsequent voice interaction with the client, thereby realizing personalized configuration of the voice interaction product and improving the user experience.
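The model-download step could be resolved against a published catalog along these lines; the catalog structure and all file names are illustrative assumptions, not part of the application:

```python
# Hedged sketch: given the user's chosen speaker and wake-up word, resolve
# which speech synthesis model and voice wake-up model to download.

MODEL_CATALOG = {
    # (speaker, wake-up word) -> model artifacts, as a server might publish them
    ("star_a", "xiao bao"): {
        "tts_model": "tts_star_a_v1.bin",
        "wake_model": "wake_xiaobao_star_a_v1.bin",
    },
}

def resolve_models(speaker, wake_word, catalog=MODEL_CATALOG):
    """Return the model artifacts to download for this selection."""
    try:
        return catalog[(speaker, wake_word)]
    except KeyError:
        raise LookupError(
            f"no published models for speaker={speaker!r}, wake word={wake_word!r}"
        )
```

Keying the catalog on the (speaker, wake-up word) pair reflects the description: the wake model is trained per speaker on synthesized audio, so each combination maps to its own artifacts.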
In one illustrative example, the wake-up word may be the name of the speaker. In this case, the voice assistant provided by the voice interaction product is woken up using the new speaker's name as the wake-up word, and the assistant then speaks in that newly set speaker's voice, so that wake-up and broadcast are consistent. This consistency makes the voice interaction more personalized and fits the habits of natural interaction, thereby increasing the user's interest in using the voice interaction product.
In one illustrative example, the method further includes:
receiving a recording instruction from the user, and recording a user-defined wake-up word;
uploading the recorded wake-up word to the server so that the new wake-up word is trained and the voice wake-up model is updated.
By providing new wake-up word recording data to the server, the client allows the user to choose, according to personal preference, which wake-up word is used to wake up the client. Iterating in this way makes the wake-up mode variable, so the wake-up effect is better guaranteed than with a fixed wake-up mode, and the user experience is improved.
An embodiment of the present application further provides a method for implementing voice interaction, including:
receiving voice information from a user, and waking up the voice interaction device when it is determined, according to the set wake-up word and the downloaded voice wake-up model, that the voice information from the user matches the voice wake-up model;
using the downloaded speech synthesis model, interacting with the user with the set speaker acting as the voice assistant of the woken device.
For example, when the voice interaction device receives voice information from the user and that voice information matches a wake-up word in the voice wake-up model, i.e., the user has used a wake-up word recognizable by the device, a wake-up request is initiated, the device is woken up, and it enters a voice interaction state with the user using the set speaker. The voice wake-up model is obtained by training the defined wake-up words on audio data synthesized for multiple speakers by the speech synthesis model, and the speech synthesis model is obtained by speech synthesis training on corpus information including real recordings of the speakers. In other words, the user can select the desired wake-up word at the client according to his or her own needs, and can choose a preferred voice as the speaker for subsequent voice interaction with the client, realizing personalized configuration of the voice interaction product and improving the user experience. For instance, the wake-up word may be the name of the speaker: the voice assistant provided by the product is woken up using the new speaker's name as the wake-up word, and the assistant then speaks in that newly set speaker's voice, so that wake-up and broadcast are consistent. This consistency makes the interaction more personalized, fits the habits of natural interaction, and increases the user's interest in the product. It thereby replaces the monotonous experience of the related art, in which the device can only ever be woken up in a single fixed way, with a variable voice assistant role.
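The wake-then-interact flow described above can be shown as a minimal sketch; a real voice wake-up model scores audio frames, whereas here a hypothetical exact text match stands in for the model so only the control flow is visible:

```python
# Minimal sketch of the wake-then-interact session: ignore audio while asleep,
# wake on the selected wake-up word, then answer in the set speaker's voice.

def run_session(utterances, wake_word, speaker):
    """Yield device responses for a sequence of recognized utterances."""
    awake = False
    for text in utterances:
        if not awake:
            if text == wake_word:  # stands in for voice wake-up model matching
                awake = True
                yield f"[{speaker}] I'm here."
            # non-matching audio is ignored while the device is asleep
        else:
            yield f"[{speaker}] you said: {text}"

replies = list(run_session(["hello", "xiao bao", "play music"], "xiao bao", "star_a"))
```

The session stays silent on "hello", wakes on "xiao bao", and then responds to "play music" in the selected speaker's voice, mirroring the paragraph above.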
Fig. 4 is a schematic diagram of an application scenario of the method for implementing voice interaction according to the present application. In this embodiment, the voice interaction device is a smart speaker. As shown in Fig. 4, the smart speaker displays to the user a role list and a wake-up word list of the voice assistants available for interacting with the device. The user selects the desired voice assistant from the role list, such as the voice of a favorite star, by tapping, by gesture, or by gaze; that star then becomes the current voice assistant. Before interacting with the smart speaker, the user also selects a preferred wake-up word from the wake-up word list to wake it up. When the smart speaker receives the previously selected wake-up word from the user, it is woken up and invokes the previously selected voice assistant for the subsequent voice interaction, so the user completes the voice interaction with the smart speaker using the preferred voice as the speaker. When the user needs to add or change a wake-up word, the recorded wake-up word can be uploaded to the server through the smart speaker; the server can then further train the new wake-up word to update the wake-up word list.
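The wake-up word list update loop at the end of this scenario could be sketched as follows (the class and its behavior are illustrative assumptions; in practice "training" is the server-side wake-word model update described earlier):

```python
# Illustrative sketch: after the server trains a user-uploaded wake-up word,
# the selectable wake-up word list shown on the smart speaker is refreshed.

class WakeWordList:
    def __init__(self, words):
        self.words = list(words)

    def add_trained(self, new_word):
        """Server-side: publish a newly trained wake-up word (idempotent)."""
        if new_word not in self.words:
            self.words.append(new_word)
        return self.words
```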
Although the embodiments disclosed in the present application are described above, these descriptions are only for the convenience of understanding the present application and are not intended to limit it. Those skilled in the art will understand that various changes in form and detail may be made without departing from the spirit and scope of the disclosure as defined by the appended claims.

Claims (12)

1. A system for enabling voice interaction, comprising a server and a client; wherein,
the server comprises: a speech synthesis processing module, a voice wake-up processing module, and a publishing module; wherein,
the speech synthesis processing module is configured to perform speech synthesis training using corpus information comprising real recordings of a plurality of speakers to obtain a speech synthesis model;
the voice wake-up processing module is configured to train defined wake-up words using audio data synthesized for the plurality of speakers by the speech synthesis model to obtain a voice wake-up model;
the publishing module is configured to publish the audio training data corresponding to different speakers obtained by training and the defined wake-up words;
the client comprises: a setting processing module and an interaction module; wherein,
the setting processing module is configured to select, according to the user's requirements, a current wake-up word and a speaker for voice interaction from the published audio training data and wake-up words;
the interaction module is configured to receive a wake-up word uttered by the user and wake up the device, and to perform voice interaction with the user using the speaker selected by the user.
2. A server, comprising: a speech synthesis processing module, a voice wake-up processing module, and a publishing module; wherein,
the speech synthesis processing module is configured to perform speech synthesis training using corpus information comprising real recordings of a plurality of speakers to obtain a speech synthesis model;
the voice wake-up processing module is configured to train defined wake-up words using audio data synthesized for the plurality of speakers by the speech synthesis model to obtain a voice wake-up model;
the publishing module is configured to publish the audio training data corresponding to different speakers obtained by training and the defined wake-up words.
3. A client, comprising: a setting processing module and an interaction module; wherein,
the setting processing module is configured to select, according to the user's requirements, a current wake-up word and a speaker for voice interaction from the published audio training data and wake-up words;
the interaction module is configured to receive a wake-up word uttered by the user and wake up the device, and to perform voice interaction with the user using the speaker selected by the user.
4. A method of enabling voice interaction, comprising:
performing speech synthesis training using corpus information comprising real recordings of a plurality of speakers to obtain a speech synthesis model;
training defined wake-up words using audio data synthesized for the plurality of speakers by the speech synthesis model to obtain a voice wake-up model;
publishing the audio training data corresponding to different speakers obtained by training and the defined wake-up words.
5. A method of enabling voice interaction, comprising:
selecting, according to the user's requirements, a current wake-up word and a speaker for voice interaction from the published audio training data and wake-up words;
waking up the device according to the received wake-up word, and performing voice interaction with the user using the selected speaker.
6. A method of enabling voice interaction, comprising:
displaying, to a user, audio training data and wake-up words corresponding to different speakers;
setting, according to the user's selection, a wake-up word for waking up the voice interaction device and a speaker for performing voice interaction with the user;
downloading the corresponding speech synthesis model and voice wake-up model according to the set wake-up word and speaker.
7. The method of claim 6, further comprising:
receiving a recording instruction from the user, and recording a user-defined wake-up word;
uploading the recorded wake-up word to a server so that the new wake-up word is trained and the voice wake-up model is updated.
8. A method of enabling voice interaction, comprising:
receiving voice information from a user, and waking up the voice interaction device when it is determined, according to the set wake-up word and the downloaded voice wake-up model, that the voice information from the user matches the voice wake-up model;
using the downloaded speech synthesis model, interacting with the user with the set speaker acting as the voice assistant of the woken device.
9. A method of enabling voice interaction, comprising:
displaying, to a user, a role list and a wake-up word list of voice assistants for interacting with a voice interaction device;
determining, according to the user's selection, the wake-up word and the voice assistant currently interacting with the voice interaction device;
receiving the wake-up word from the user and waking up the voice interaction device;
performing voice interaction with the voice assistant selected by the user.
10. The method of claim 9, further comprising:
receiving a recording instruction from the user, and recording a user-defined wake-up word;
uploading the recorded wake-up word to a server so that the new wake-up word is trained and the wake-up word list is updated.
11. A smart speaker comprising a memory and a processor, wherein the memory stores instructions executable by the processor for performing the steps of the method for enabling voice interaction of claim 9 or claim 10.
12. An apparatus for enabling voice interaction, comprising a memory and a processor, wherein the memory stores instructions executable by the processor for performing the steps of the method for enabling voice interaction of claim 4 or of claim 5.
CN202110556758.3A 2021-05-21 2021-05-21 Method and system for realizing voice interaction, service end, client and intelligent sound box Pending CN113299275A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110556758.3A CN113299275A (en) 2021-05-21 2021-05-21 Method and system for realizing voice interaction, service end, client and intelligent sound box


Publications (1)

Publication Number Publication Date
CN113299275A true CN113299275A (en) 2021-08-24

Family

ID=77323640

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110556758.3A Pending CN113299275A (en) 2021-05-21 2021-05-21 Method and system for realizing voice interaction, service end, client and intelligent sound box

Country Status (1)

Country Link
CN (1) CN113299275A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117012189A (en) * 2022-04-29 2023-11-07 荣耀终端有限公司 Voice recognition method and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106971719A (en) * 2017-05-16 2017-07-21 上海智觅智能科技有限公司 A kind of offline changeable nonspecific sound speech recognition awakening method for waking up word
CN109346083A (en) * 2018-11-28 2019-02-15 北京猎户星空科技有限公司 A kind of intelligent sound exchange method and device, relevant device and storage medium
CN109545194A (en) * 2018-12-26 2019-03-29 出门问问信息科技有限公司 Wake up word pre-training method, apparatus, equipment and storage medium
CN109584860A (en) * 2017-09-27 2019-04-05 九阳股份有限公司 A kind of voice wakes up word and defines method and system



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240305

Address after: # 03-06, Lai Zan Da Building 1, 51 Belarusian Road, Singapore

Applicant after: Alibaba Innovation Co.

Country or region after: Singapore

Address before: Room 01, 45th Floor, AXA Building, 8 Shanton Road, Singapore

Applicant before: Alibaba Singapore Holdings Ltd.

Country or region before: Singapore
