CN113299275A - Method and system for realizing voice interaction, service end, client and intelligent sound box - Google Patents

Info

Publication number
CN113299275A
Authority
CN
China
Prior art keywords
voice
awakening
user
interaction
voice interaction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110556758.3A
Other languages
Chinese (zh)
Inventor
周光东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Innovation Co
Original Assignee
Alibaba Singapore Holdings Pte Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Singapore Holdings Pte Ltd filed Critical Alibaba Singapore Holdings Pte Ltd
Priority to CN202110556758.3A priority Critical patent/CN113299275A/en
Publication of CN113299275A publication Critical patent/CN113299275A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44: Arrangements for executing specific programs
    • G06F9/4401: Bootstrapping
    • G06F9/4418: Suspend and resume; Hibernate and awake
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223: Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

On one hand, in the model training stage, training on corpora from a large number of different speakers yields a plurality of speakers with different voice characteristics, so that a user of the voice interaction device can set a preferred voice as the speaker for voice interaction with the device. In addition, diversified wake-up words are generated in the model training stage, enriching the wake-up operations between the user and the voice interaction device. On the other hand, the user can select a desired wake-up word and a preferred voice as the speaker for subsequent voice interaction with the client, thereby realizing personalized settings for the voice interaction product and improving the user experience.

Description

Method and system for realizing voice interaction, service end, client and intelligent sound box
Technical Field
The present application relates to, but is not limited to, intelligent voice technology, and in particular to a method and system for implementing voice interaction, a server, a client, and a smart speaker.
Background
Most voice interaction products in the related art adopt a single-role implementation, that is, only one fixed wake-up mode is available. As a result, these products cannot be personalized for individual users.
Disclosure of Invention
The present application provides a method and system for implementing voice interaction, a server, a client, and a smart speaker, which enable personalized settings for voice interaction products and improve the user experience.
An embodiment of the invention provides a system for realizing voice interaction, comprising a server and a client, wherein:
the server comprises a voice synthesis processing module, a voice wake-up processing module, and a publishing module, wherein:
the voice synthesis processing module is configured to perform speech synthesis training using corpus information comprising real recordings of a plurality of speakers, to obtain a speech synthesis model;
the voice wake-up processing module is configured to train the defined wake-up words using audio data of the plurality of speakers synthesized by the speech synthesis model, to obtain a voice wake-up model;
the publishing module is configured to publish the audio training data corresponding to the different speakers obtained by training, together with the defined wake-up words;
the client comprises a settings processing module and an interaction module, wherein:
the settings processing module is configured to select, according to the user's requirements, the current wake-up word and the speaker for voice interaction from the published audio training data and wake-up words;
the interaction module is configured to receive a wake-up word uttered by the user and wake up the device, and to carry out voice interaction with the user using the speaker selected by the user.
An embodiment of the present application further provides a server, comprising a voice synthesis processing module, a voice wake-up processing module, and a publishing module, wherein:
the voice synthesis processing module is configured to perform speech synthesis training using corpus information comprising real recordings of a plurality of speakers, to obtain a speech synthesis model;
the voice wake-up processing module is configured to train the defined wake-up words using audio data of the plurality of speakers synthesized by the speech synthesis model, to obtain a voice wake-up model;
and the publishing module is configured to publish the audio training data corresponding to the different speakers obtained by training, together with the defined wake-up words.
An embodiment of the present application further provides a client, comprising a settings processing module and an interaction module, wherein:
the settings processing module is configured to select, according to the user's requirements, the current wake-up word and the speaker for voice interaction from the published audio training data and wake-up words;
the interaction module is configured to receive a wake-up word uttered by the user and wake up the device, and to carry out voice interaction with the user using the speaker selected by the user.
In one illustrative example, the settings processing module is further configured to:
download some or all of the published audio training data and wake-up words to the client.
In one illustrative example, the client further comprises an upload processing module configured to record a wake-up word set by the user according to the user's requirements, and to upload the recorded wake-up word to the server.
An embodiment of the present application further provides a method for implementing voice interaction, comprising:
performing speech synthesis training using corpus information comprising real recordings of a plurality of speakers, to obtain a speech synthesis model;
training the defined wake-up words using audio data of the plurality of speakers synthesized by the speech synthesis model, to obtain a voice wake-up model;
and publishing the audio training data corresponding to the different speakers obtained by training, together with the defined wake-up words.
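The three server-side steps above can be sketched as a minimal pipeline. This is an illustrative sketch only: the function names and data shapes are assumptions for the example, and the "training" is reduced to collecting speaker names and synthesizing (speaker, wake-up word) pairs.

```python
# Hypothetical sketch of the server-side flow: train a speech synthesis model,
# use it to synthesize wake-word audio per speaker, then publish the results.
# All names and data shapes here are illustrative, not from the patent itself.

def train_speech_synthesis(corpus):
    # "Train" a speech synthesis model from real recordings of many speakers;
    # here the model is just the set of distinct speaker names in the corpus.
    return {"speakers": sorted({rec["speaker"] for rec in corpus})}

def train_voice_wakeup(tts_model, wake_words):
    # Synthesize each defined wake-up word in every speaker's voice and use
    # that audio as training data for the voice wake-up model.
    audio = [(spk, w) for spk in tts_model["speakers"] for w in wake_words]
    return {"training_audio": audio, "wake_words": list(wake_words)}

def publish(tts_model, wake_model):
    # Publish the per-speaker audio training data and the defined wake words.
    return {
        "speakers": tts_model["speakers"],
        "wake_words": wake_model["wake_words"],
        "audio_items": len(wake_model["training_audio"]),
    }

corpus = [{"speaker": "xiaoyun", "text": "..."}, {"speaker": "xiaogang", "text": "..."}]
tts = train_speech_synthesis(corpus)
wake = train_voice_wakeup(tts, ["xiaoyun", "xiaogang"])
catalog = publish(tts, wake)
```

With 2 speakers and 2 wake-up words this yields 2 × 2 = 4 audio items, mirroring the patent's later example of 100 speakers × 100 recordings = 10,000 items.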
An embodiment of the present application further provides a method for implementing voice interaction, comprising:
selecting, according to the user's requirements, the current wake-up word and the speaker for voice interaction from the published audio training data and wake-up words;
waking up the device according to the received wake-up word; and performing voice interaction with the user using the selected speaker.
In one illustrative example, the method further comprises:
downloading some or all of the published audio training data and wake-up words.
In one illustrative example, the method further comprises:
recording a wake-up word set by the user according to the user's requirements; and uploading the recorded wake-up word to the server.
An embodiment of the present application further provides a method for implementing voice interaction, comprising:
displaying to the user the audio training data and wake-up words corresponding to the different speakers;
setting, according to the user's selection, a wake-up word for waking up the voice interaction device and a speaker for voice interaction with the user;
and downloading the corresponding speech synthesis model and voice wake-up model according to the set wake-up word and speaker.
In one illustrative example, the wake-up word is the name of the speaker.
In one illustrative example, the method further comprises:
receiving a recording instruction from the user, and recording a user-defined wake-up word;
and uploading the recorded wake-up word to the server, so that the new wake-up word is trained and the voice wake-up model is updated.
An embodiment of the present application further provides a method for implementing voice interaction, comprising:
receiving voice information from the user, and waking up the voice interaction device when the voice information is judged, according to the set wake-up word and the downloaded voice wake-up model, to match the voice wake-up model;
and using the downloaded speech synthesis model to interact with the user, with the set speaker acting as the voice assistant of the woken device.
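The two device-side steps above can be sketched as a simple wake-then-interact loop. In this hedged sketch, the voice wake-up model's matching is reduced to a string comparison, and text-to-speech is reduced to tagging replies with the selected speaker's name; the class and method names are assumptions.

```python
# Illustrative sketch of the device-side behavior: stay asleep until the
# configured wake-up word is heard, then answer in the selected speaker's
# voice. String matching stands in for the real wake-up model.

class VoiceAssistant:
    def __init__(self, wake_word, speaker):
        self.wake_word = wake_word  # chosen by the user from the wake-word list
        self.speaker = speaker      # voice used for synthesized responses
        self.awake = False

    def hear(self, utterance):
        # While asleep, react only to the configured wake-up word.
        if not self.awake:
            if utterance == self.wake_word:
                self.awake = True
                return f"[{self.speaker}] I'm here."
            return None
        # Once awake, answer every utterance in the selected speaker's voice.
        return f"[{self.speaker}] You said: {utterance}"

bot = VoiceAssistant(wake_word="xiaoyun", speaker="xiaoyun")
assert bot.hear("hello") is None   # ignored while asleep
greeting = bot.hear("xiaoyun")     # wake-up word matches, device wakes
reply = bot.hear("play music")     # normal interaction after waking
```

Setting `wake_word` equal to `speaker` reflects the wake/broadcast consistency the description emphasizes: the same voice role both wakes the device and replies.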
An embodiment of the present application further provides a method for implementing voice interaction, comprising:
displaying to the user a role list and a wake-up word list for the voice assistant that interacts with the voice interaction device;
determining, according to the user's selection, the wake-up word and the voice assistant currently interacting with the voice interaction device;
receiving a wake-up word from the user and waking up the voice interaction device;
and performing voice interaction with the voice assistant selected by the user.
In one illustrative example, the method further comprises:
receiving a recording instruction from the user, and recording a user-defined wake-up word;
and uploading the recorded wake-up word to the server, so that the new wake-up word is trained and the wake-up word list is updated.
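The custom wake-up word flow above can be sketched as follows. This is a minimal illustration, assuming hypothetical names: the server "retrains" simply by adding the uploaded wake-up word to its published list, standing in for actual model retraining.

```python
# Minimal sketch of the user-defined wake-word flow: the client records a
# custom wake-up word and uploads it; the server incorporates it and returns
# the updated wake-word list. Names and shapes are illustrative.

class WakeWordServer:
    def __init__(self):
        self.wake_words = ["xiaoyun"]  # initially published wake-word list

    def receive_recording(self, wake_word, audio):
        # Uploaded recordings extend the wake-word training set; after
        # (simulated) retraining, the updated list is published to clients.
        if wake_word not in self.wake_words:
            self.wake_words.append(wake_word)
        return list(self.wake_words)

server = WakeWordServer()
updated = server.receive_recording("hey robot", audio=b"\x00\x01")
```

In the full system, the uploaded audio would also feed back into the speech synthesis training as real-voice corpus data, as the description notes later.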
An embodiment of the present application further provides a smart speaker, comprising a memory and a processor, wherein the memory stores instructions executable by the processor for implementing the method for voice interaction provided by the embodiments of the present application.
An embodiment of the present application further provides a device for implementing voice interaction, comprising a memory and a processor, wherein the memory stores instructions executable by the processor for performing the steps of any of the above methods for implementing voice interaction.
In the embodiments of the present application, in the model training stage, training on corpora from a large number of different speakers yields a plurality of speakers with different voice characteristics, so that a user of the voice interaction device can set a preferred voice as the speaker for voice interaction with the device. In addition, diversified wake-up words are generated in the model training stage, enriching the wake-up operations between the user and the voice interaction device. With the system for realizing voice interaction provided by the embodiments of the present application, a user can select a desired wake-up word and a preferred voice as the speaker for subsequent voice interaction with the client, thereby realizing personalized settings for the voice interaction product and improving the user experience.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the claimed subject matter and are incorporated in and constitute a part of this specification, illustrate embodiments of the subject matter and together with the description serve to explain the principles of the subject matter and not to limit the subject matter.
FIG. 1 is a schematic structural diagram of a system for implementing voice interaction in an embodiment of the present application;
FIG. 2 is a flowchart of an embodiment of a method for implementing voice interaction;
FIG. 3 is a flow chart of another embodiment of a method for implementing voice interaction according to the present application;
FIG. 4 is a schematic view of an application scenario of the method for implementing voice interaction according to the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more apparent, embodiments of the present application will be described in detail below with reference to the accompanying drawings. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.
In one exemplary configuration of the present application, a computing device includes one or more processors (CPUs), input/output interfaces, a network interface, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.
The steps illustrated in the flowcharts of the figures may be performed in a computer system, for example as a set of computer-executable instructions. Also, while a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.
Fig. 1 is a schematic structural diagram of a system for implementing voice interaction in an embodiment of the present application. As shown in Fig. 1, the system includes a server and a client, wherein:
the server comprises a voice synthesis processing module, a voice wake-up processing module, and a publishing module, wherein:
the voice synthesis processing module is configured to perform speech synthesis training using corpus information comprising real recordings of a plurality of speakers, to obtain a speech synthesis model;
the voice wake-up processing module is configured to train the defined wake-up words using audio data of the plurality of speakers synthesized by the speech synthesis model, to obtain a voice wake-up model;
and the publishing module is configured to publish the audio training data corresponding to the different speakers obtained by training, together with the defined wake-up words.
For example, assuming there are 100 speakers, each generating 100 pieces of synthesized audio data, 10,000 pieces of audio training data are obtained in total.
The client comprises a settings processing module and an interaction module, wherein:
the settings processing module is configured to select, according to the user's requirements, the current wake-up word and the speaker for voice interaction from the published audio training data and wake-up words;
the interaction module is configured to receive a wake-up word uttered by the user and wake up the device, and to carry out voice interaction with the user using the speaker selected by the user.
In an exemplary embodiment, the settings processing module may be further configured to:
download to the client some or all of the published audio training data corresponding to the different speakers and wake-up words, as the client's locally selectable audio training data and wake-up words.
The published audio training data may include a speech synthesis model, a voice wake-up model, and a list of models covering different wake-up words and/or different speaker information, so that the client can select a corresponding model from the visible model list to use.
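One possible shape for such a published model list, and the client-side lookup over it, is sketched below. The field names and file paths are assumptions made for the example, not a format defined by the patent.

```python
# Hypothetical published model list: one entry per speaker, pairing a
# wake-up word with its speech synthesis and voice wake-up model files.
model_list = [
    {"speaker": "xiaoyun", "wake_word": "xiaoyun",
     "tts_model": "tts/xiaoyun.bin", "wake_model": "kws/xiaoyun.bin"},
    {"speaker": "xiaogang", "wake_word": "xiaogang",
     "tts_model": "tts/xiaogang.bin", "wake_model": "kws/xiaogang.bin"},
]

def select_models(model_list, speaker):
    # The client picks the entry for the chosen speaker and downloads both
    # the speech synthesis model and the voice wake-up model together.
    entry = next(e for e in model_list if e["speaker"] == speaker)
    return entry["tts_model"], entry["wake_model"]

tts_path, kws_path = select_models(model_list, "xiaoyun")
```

Keeping both model references in one entry is what lets the client download the synthesis and wake-up models in a single selection step, as the description requires.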
With the system for realizing voice interaction provided by the embodiments of the present application, in the model training stage, training on corpora from a large number of different speakers yields a variety of speakers with different voice characteristics, so that a user can set a preferred voice as the speaker for voice interaction with the voice interaction device. In addition, diversified wake-up words are generated in the model training stage, enriching the wake-up operations between the user and the voice interaction device. With this system, a user can select a desired wake-up word and a preferred voice as the speaker for subsequent voice interaction with the client, thereby realizing personalized settings for the voice interaction product and improving the user experience.
With the development of speech synthesis technology, the use of celebrity voices as speakers in voice interaction products has become very common, and there is a natural demand to use those speakers for wake-up as well. In one application scenario, the system provided by the embodiments of the present application achieves consistency between wake-up and broadcast; this consistency makes voice interaction more personalized and conforms to the habits of natural interaction, thereby increasing the user's interest in the voice interaction product. In this system, on the one hand, when the server confirms a speaking role, it generates the speech synthesis model and the voice wake-up model at the same time and publishes them online together; on the other hand, when the client selects a speaking role, it downloads the speech synthesis model and the voice wake-up model from the server at the same time, so that the user can use the name of the new speaker as a wake-up word to wake up the voice assistant (also called the broadcaster) provided by the voice interaction product, while the voice assistant broadcasts in the newly set speaker's voice. This changes the monotonous experience in the related art, where only one fixed wake-up mode is available, and achieves the effect that the role of the voice assistant can be changed.
In an illustrative example, the client may further include an upload processing module configured to record a wake-up word set by the user according to the user's requirements, and to upload the recorded wake-up word to the server. In this way, the voice wake-up processing module in the server can further train the new wake-up word to refine the voice wake-up model, and the audio uploaded by the user can also serve as corpus information of a speaker's real voice for the speech synthesis training of the voice synthesis processing module, further refining the speech synthesis model.
In the system for realizing voice interaction provided by the embodiments of the present application, by providing new wake-up word recording data to the server, the client lets the user set, according to personal preference, which wake-up word is used to wake up the client. Through such iterative refinement, the wake-up mode becomes changeable; compared with a fixed wake-up mode, the wake-up effect is better guaranteed and the user experience is better.
The present application further provides a server, comprising a voice synthesis processing module, a voice wake-up processing module, and a publishing module, wherein:
the voice synthesis processing module is configured to perform speech synthesis training using corpus information comprising real recordings of a plurality of speakers, to obtain a speech synthesis model;
the voice wake-up processing module is configured to train the defined wake-up words using audio data of the plurality of speakers synthesized by the speech synthesis model, to obtain a voice wake-up model;
and the publishing module is configured to publish the audio training data corresponding to the different speakers obtained by training, together with the defined wake-up words.
When the server provided by the present application confirms a speaking role, it generates the speech synthesis model and the voice wake-up model at the same time and publishes them online together. In the model training stage, a plurality of speakers with different voice characteristics can be obtained by training on corpora from a large number of different speakers, so that a user can set a preferred voice as the speaker for voice interaction with the voice interaction device. In addition, diversified wake-up words are generated in the model training stage, enriching the wake-up operations between the user and the voice interaction device.
The present application further provides a client, comprising a settings processing module and an interaction module, wherein:
the settings processing module is configured to select, according to the user's requirements, the current wake-up word and the speaker for voice interaction from the published audio training data and wake-up words;
the interaction module is configured to receive a wake-up word uttered by the user and wake up the device, and to carry out voice interaction with the user using the speaker selected by the user.
In an exemplary embodiment, the settings processing module may be further configured to:
download to the client some or all of the published audio training data and wake-up words, as the client's locally selectable audio training data and wake-up words.
When the client selects a speaking role, it can download the speech synthesis model and the voice wake-up model from the server at the same time, so that the user can use the name of the new speaker as a wake-up word to wake up the voice assistant (also called the broadcaster) provided by the voice interaction product, while the voice assistant broadcasts in the newly set speaker's voice. This changes the monotonous experience in the related art, where only one fixed wake-up mode is available, and achieves the effect that the role of the voice assistant can be changed. The user can select a desired wake-up word and a preferred voice as the speaker for subsequent voice interaction with the client, thereby realizing personalized settings for the voice interaction product and improving the user experience.
In an illustrative example, the client may further include an upload processing module configured to record a wake-up word set by the user according to the user's requirements, and to upload the recorded wake-up word to the server. In this way, the voice wake-up processing module in the server can further train the new wake-up word to refine the voice wake-up model, and the audio uploaded by the user can also serve as corpus information of a speaker's real voice for the speech synthesis training of the voice synthesis processing module, further refining the speech synthesis model.
With the client provided by the embodiments of the present application, by providing new wake-up word recording data to the server, the user can set, according to personal preference, which wake-up word is used to wake up the client. Through such iterative refinement, the wake-up mode becomes changeable; compared with a fixed wake-up mode, the wake-up effect is better guaranteed and the user experience is better.
Fig. 2 is a flowchart of an embodiment of a method for implementing voice interaction according to the present application. As shown in Fig. 2, for the server the method includes:
Step 200: performing speech synthesis training using corpus information comprising real recordings of a plurality of speakers, to obtain a speech synthesis model.
Step 201: training the defined wake-up words using audio data of the plurality of speakers synthesized by the speech synthesis model, to obtain a voice wake-up model.
Step 202: publishing the audio training data corresponding to the different speakers obtained by training, together with the defined wake-up words.
With the method for implementing voice interaction provided by the present application, in the model training stage the server can obtain a variety of speakers with different voice characteristics by training on corpora from a large number of different speakers, so that a user can set a preferred voice as the speaker for voice interaction with the voice interaction device. In addition, diversified wake-up words are generated in the model training stage, enriching the wake-up operations between the user and the voice interaction device.
Fig. 3 is a flowchart of another embodiment of a method for implementing voice interaction according to the present application. As shown in Fig. 3, for the client the method includes:
Step 300: selecting, according to the user's requirements, the current wake-up word and the speaker for voice interaction from the published audio training data and wake-up words.
Step 301: waking up the device according to the received wake-up word; and performing voice interaction with the user using the selected speaker.
With this method for implementing voice interaction, a user can select a desired wake-up word at the client and a preferred voice as the speaker for subsequent voice interaction with the client, thereby realizing personalized settings for the voice interaction product and improving the user experience.
In an illustrative example, the client-side method may further include:
downloading to the client some or all of the published audio training data and wake-up words, as the client's locally selectable audio training data and wake-up words.
In an illustrative example, the client-side method may further include:
recording a wake-up word set by the user according to the user's requirements, and uploading the recorded wake-up word to the server. In this way, the voice wake-up processing module in the server can further train the new wake-up word to refine the voice wake-up model, and the audio uploaded by the user can also serve as corpus information of a speaker's real voice for the speech synthesis training of the voice synthesis processing module, further refining the speech synthesis model.
In this embodiment, the client provides new wake-up word recording data to the server, and the user can set, according to personal preference, which wake-up word is used to wake up the client. Through such iterative refinement, the wake-up mode becomes changeable; compared with a fixed wake-up mode, the wake-up effect is better guaranteed and the user experience is better.
The present application also provides a computer-readable storage medium storing computer-executable instructions for performing any one of the above methods for implementing voice interaction.
The present application further provides a device for implementing voice interaction, comprising a memory and a processor, wherein the memory stores instructions executable by the processor for performing the steps of any one of the above methods for implementing voice interaction.
An embodiment of the present application further provides a method for implementing voice interaction, including:
displaying, to the user, audio training data and wake-up words corresponding to different speakers;
setting, according to the user's selection, a wake-up word for waking up the voice interaction device and a speaker for performing voice interaction with the user;
downloading the corresponding speech synthesis model and voice wake-up model according to the set wake-up word and speaker.
In this way, the user can select the desired wake-up word at the client according to his or her own needs, and can choose a preferred voice as the speaker for subsequent voice interaction with the client, thereby realizing personalized configuration of the voice interaction product and improving the user experience.
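The model-download step could be resolved against a published catalog along these lines; the catalog structure and all file names are illustrative assumptions, not part of the application:

```python
# Hedged sketch: given the user's chosen speaker and wake-up word, resolve
# which speech synthesis model and voice wake-up model to download.

MODEL_CATALOG = {
    # (speaker, wake-up word) -> model artifacts, as a server might publish them
    ("star_a", "xiao bao"): {
        "tts_model": "tts_star_a_v1.bin",
        "wake_model": "wake_xiaobao_star_a_v1.bin",
    },
}

def resolve_models(speaker, wake_word, catalog=MODEL_CATALOG):
    """Return the model artifacts to download for this selection."""
    try:
        return catalog[(speaker, wake_word)]
    except KeyError:
        raise LookupError(
            f"no published models for speaker={speaker!r}, wake word={wake_word!r}"
        )
```

Keying the catalog on the (speaker, wake-up word) pair reflects the description: the wake model is trained per speaker on synthesized audio, so each combination maps to its own artifacts.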
In one illustrative example, the wake-up word may be the name of the speaker. In this case, the voice assistant provided by the voice interaction product is woken up using the new speaker's name as the wake-up word, and the assistant then speaks in that newly set speaker's voice, so that wake-up and broadcast are consistent. This consistency makes the voice interaction more personalized and fits the habits of natural interaction, thereby increasing the user's interest in using the voice interaction product.
In one illustrative example, the method further includes:
receiving a recording instruction from the user, and recording a user-defined wake-up word;
uploading the recorded wake-up word to the server so that the new wake-up word is trained and the voice wake-up model is updated.
By providing new wake-up word recording data to the server, the client allows the user to choose, according to personal preference, which wake-up word is used to wake up the client. Iterating in this way makes the wake-up mode variable, so the wake-up effect is better guaranteed than with a fixed wake-up mode, and the user experience is improved.
An embodiment of the present application further provides a method for implementing voice interaction, including:
receiving voice information from a user, and waking up the voice interaction device when it is determined, according to the set wake-up word and the downloaded voice wake-up model, that the voice information from the user matches the voice wake-up model;
using the downloaded speech synthesis model, interacting with the user with the set speaker acting as the voice assistant of the woken device.
For example, when the voice interaction device receives voice information from the user and that voice information matches a wake-up word in the voice wake-up model, i.e., the user has used a wake-up word recognizable by the device, a wake-up request is initiated, the device is woken up, and it enters a voice interaction state with the user using the set speaker. The voice wake-up model is obtained by training the defined wake-up words on audio data synthesized for multiple speakers by the speech synthesis model, and the speech synthesis model is obtained by speech synthesis training on corpus information including real recordings of the speakers. In other words, the user can select the desired wake-up word at the client according to his or her own needs, and can choose a preferred voice as the speaker for subsequent voice interaction with the client, realizing personalized configuration of the voice interaction product and improving the user experience. For instance, the wake-up word may be the name of the speaker: the voice assistant provided by the product is woken up using the new speaker's name as the wake-up word, and the assistant then speaks in that newly set speaker's voice, so that wake-up and broadcast are consistent. This consistency makes the interaction more personalized, fits the habits of natural interaction, and increases the user's interest in the product. It thereby replaces the monotonous experience of the related art, in which the device can only ever be woken up in a single fixed way, with a variable voice assistant role.
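The wake-then-interact flow described above can be shown as a minimal sketch; a real voice wake-up model scores audio frames, whereas here a hypothetical exact text match stands in for the model so only the control flow is visible:

```python
# Minimal sketch of the wake-then-interact session: ignore audio while asleep,
# wake on the selected wake-up word, then answer in the set speaker's voice.

def run_session(utterances, wake_word, speaker):
    """Yield device responses for a sequence of recognized utterances."""
    awake = False
    for text in utterances:
        if not awake:
            if text == wake_word:  # stands in for voice wake-up model matching
                awake = True
                yield f"[{speaker}] I'm here."
            # non-matching audio is ignored while the device is asleep
        else:
            yield f"[{speaker}] you said: {text}"

replies = list(run_session(["hello", "xiao bao", "play music"], "xiao bao", "star_a"))
```

The session stays silent on "hello", wakes on "xiao bao", and then responds to "play music" in the selected speaker's voice, mirroring the paragraph above.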
Fig. 4 is a schematic diagram of an application scenario of the method for implementing voice interaction according to the present application. In this embodiment, the voice interaction device is a smart speaker. As shown in Fig. 4, the smart speaker displays to the user a role list and a wake-up word list of the voice assistants available for interacting with the device. The user selects the desired voice assistant from the role list, such as the voice of a favorite star, by tapping, by gesture, or by gaze; that star then becomes the current voice assistant. Before interacting with the smart speaker, the user also selects a preferred wake-up word from the wake-up word list to wake it up. When the smart speaker receives the previously selected wake-up word from the user, it is woken up and invokes the previously selected voice assistant for the subsequent voice interaction, so the user completes the voice interaction with the smart speaker using the preferred voice as the speaker. When the user needs to add or change a wake-up word, the recorded wake-up word can be uploaded to the server through the smart speaker; the server can then further train the new wake-up word to update the wake-up word list.
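The wake-up word list update loop at the end of this scenario could be sketched as follows (the class and its behavior are illustrative assumptions; in practice "training" is the server-side wake-word model update described earlier):

```python
# Illustrative sketch: after the server trains a user-uploaded wake-up word,
# the selectable wake-up word list shown on the smart speaker is refreshed.

class WakeWordList:
    def __init__(self, words):
        self.words = list(words)

    def add_trained(self, new_word):
        """Server-side: publish a newly trained wake-up word (idempotent)."""
        if new_word not in self.words:
            self.words.append(new_word)
        return self.words
```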
Although the embodiments disclosed in the present application are described above, these descriptions are only for the convenience of understanding the present application and are not intended to limit it. Those skilled in the art will understand that various changes in form and detail may be made without departing from the spirit and scope of the disclosure as defined by the appended claims.

Claims (12)

1. A system for enabling voice interaction, comprising a server and a client; wherein,
the server comprises: a speech synthesis processing module, a voice wake-up processing module, and a publishing module; wherein,
the speech synthesis processing module is configured to perform speech synthesis training using corpus information comprising real recordings of a plurality of speakers to obtain a speech synthesis model;
the voice wake-up processing module is configured to train defined wake-up words using audio data synthesized for the plurality of speakers by the speech synthesis model to obtain a voice wake-up model;
the publishing module is configured to publish the audio training data corresponding to different speakers obtained by training and the defined wake-up words;
the client comprises: a setting processing module and an interaction module; wherein,
the setting processing module is configured to select, according to the user's requirements, a current wake-up word and a speaker for voice interaction from the published audio training data and wake-up words;
the interaction module is configured to receive a wake-up word uttered by the user and wake up the device, and to perform voice interaction with the user using the speaker selected by the user.
2. A server, comprising: a speech synthesis processing module, a voice wake-up processing module, and a publishing module; wherein,
the speech synthesis processing module is configured to perform speech synthesis training using corpus information comprising real recordings of a plurality of speakers to obtain a speech synthesis model;
the voice wake-up processing module is configured to train defined wake-up words using audio data synthesized for the plurality of speakers by the speech synthesis model to obtain a voice wake-up model;
the publishing module is configured to publish the audio training data corresponding to different speakers obtained by training and the defined wake-up words.
3. A client, comprising: a setting processing module and an interaction module; wherein,
the setting processing module is configured to select, according to the user's requirements, a current wake-up word and a speaker for voice interaction from the published audio training data and wake-up words;
the interaction module is configured to receive a wake-up word uttered by the user and wake up the device, and to perform voice interaction with the user using the speaker selected by the user.
4. A method of enabling voice interaction, comprising:
performing speech synthesis training using corpus information comprising real recordings of a plurality of speakers to obtain a speech synthesis model;
training defined wake-up words using audio data synthesized for the plurality of speakers by the speech synthesis model to obtain a voice wake-up model;
publishing the audio training data corresponding to different speakers obtained by training and the defined wake-up words.
5. A method of enabling voice interaction, comprising:
selecting, according to the user's requirements, a current wake-up word and a speaker for voice interaction from the published audio training data and wake-up words;
waking up the device according to the received wake-up word, and performing voice interaction with the user using the selected speaker.
6. A method of enabling voice interaction, comprising:
displaying, to a user, audio training data and wake-up words corresponding to different speakers;
setting, according to the user's selection, a wake-up word for waking up the voice interaction device and a speaker for performing voice interaction with the user;
downloading the corresponding speech synthesis model and voice wake-up model according to the set wake-up word and speaker.
7. The method of claim 6, further comprising:
receiving a recording instruction from the user, and recording a user-defined wake-up word;
uploading the recorded wake-up word to a server so that the new wake-up word is trained and the voice wake-up model is updated.
8. A method of enabling voice interaction, comprising:
receiving voice information from a user, and waking up the voice interaction device when it is determined, according to the set wake-up word and the downloaded voice wake-up model, that the voice information from the user matches the voice wake-up model;
using the downloaded speech synthesis model, interacting with the user with the set speaker acting as the voice assistant of the woken device.
9. A method of enabling voice interaction, comprising:
displaying, to a user, a role list and a wake-up word list of voice assistants for interacting with a voice interaction device;
determining, according to the user's selection, the wake-up word and the voice assistant currently interacting with the voice interaction device;
receiving the wake-up word from the user and waking up the voice interaction device;
performing voice interaction with the voice assistant selected by the user.
10. The method of claim 9, further comprising:
receiving a recording instruction from the user, and recording a user-defined wake-up word;
uploading the recorded wake-up word to a server so that the new wake-up word is trained and the wake-up word list is updated.
11. A smart speaker comprising a memory and a processor, wherein the memory stores instructions executable by the processor for performing the steps of the method for enabling voice interaction of claim 9 or claim 10.
12. An apparatus for enabling voice interaction, comprising a memory and a processor, wherein the memory stores instructions executable by the processor for performing the steps of the method for enabling voice interaction of claim 4 or of claim 5.
CN202110556758.3A 2021-05-21 2021-05-21 Method and system for realizing voice interaction, service end, client and intelligent sound box Pending CN113299275A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110556758.3A CN113299275A (en) 2021-05-21 2021-05-21 Method and system for realizing voice interaction, service end, client and intelligent sound box


Publications (1)

Publication Number Publication Date
CN113299275A true CN113299275A (en) 2021-08-24

Family

ID=77323640

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110556758.3A Pending CN113299275A (en) 2021-05-21 2021-05-21 Method and system for realizing voice interaction, service end, client and intelligent sound box

Country Status (1)

Country Link
CN (1) CN113299275A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117012189A (en) * 2022-04-29 2023-11-07 荣耀终端有限公司 Voice recognition method and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106971719A (en) * 2017-05-16 2017-07-21 上海智觅智能科技有限公司 A kind of offline changeable nonspecific sound speech recognition awakening method for waking up word
CN109346083A (en) * 2018-11-28 2019-02-15 北京猎户星空科技有限公司 A kind of intelligent sound exchange method and device, relevant device and storage medium
CN109545194A (en) * 2018-12-26 2019-03-29 出门问问信息科技有限公司 Wake up word pre-training method, apparatus, equipment and storage medium
CN109584860A (en) * 2017-09-27 2019-04-05 九阳股份有限公司 A kind of voice wakes up word and defines method and system



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240305

Address after: # 03-06, Lai Zan Da Building 1, 51 Belarusian Road, Singapore

Applicant after: Alibaba Innovation Co.

Country or region after: Singapore

Address before: Room 01, 45th Floor, AXA Building, 8 Shanton Road, Singapore

Applicant before: Alibaba Singapore Holdings Ltd.

Country or region before: Singapore
