CN113299275A - Method and system for realizing voice interaction, service end, client and intelligent sound box - Google Patents
- Publication number
- CN113299275A (application CN202110556758.3A)
- Authority
- CN
- China
- Prior art keywords
- voice
- awakening
- user
- interaction
- voice interaction
- Prior art date
- 2021-05-21
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/4401—Bootstrapping
- G06F9/4418—Suspend and resume; Hibernate and awake
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computer Security & Cryptography (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Telephonic Communication Services (AREA)
Abstract
On the one hand, in the model training stage, by training on corpora from a large number of different speakers, multiple speakers with different sound characteristics can be obtained, so that a user of the voice interaction device can set a favorite voice as the speaker for voice interaction with the device. In addition, diversified wake-up words are generated in the model training stage, enriching the wake-up operations between the user and the voice interaction device. On the other hand, the user can select the desired wake-up word according to his or her own needs, and can select a favorite voice as the speaker for subsequent voice interaction with the client, thereby realizing personalized setting of the voice interaction product and improving the user experience.
Description
Technical Field
The present application relates to, but is not limited to, intelligent voice technology, and in particular to a method and a system for implementing voice interaction, a server, a client, and a smart speaker.
Background
Voice interaction products in the related art mostly adopt a single-role implementation, that is, only one fixed wake-up mode is available for waking up the device. As a result, users of these voice interaction products cannot personalize them.
Disclosure of Invention
The present application provides a method and a system for implementing voice interaction, a server, a client, and a smart speaker, which enable personalized setting of voice interaction products and improve the user experience.
An embodiment of the present application provides a system for implementing voice interaction, comprising a server and a client; wherein
the server comprises a speech synthesis processing module, a voice wake-up processing module, and a publishing module; wherein
the speech synthesis processing module is configured to perform speech synthesis training using corpus information comprising real recordings of a plurality of speakers, to obtain a speech synthesis model;
the voice wake-up processing module is configured to train the defined wake-up words using audio data synthesized for the plurality of speakers by the speech synthesis model, to obtain a voice wake-up model;
the publishing module is configured to publish the audio training data corresponding to the different speakers obtained by training, together with the defined wake-up words;
the client comprises a setting processing module and an interaction module; wherein
the setting processing module is configured to select, according to the user's needs, the current wake-up word and the speaker for voice interaction from the published audio training data and wake-up words;
the interaction module is configured to receive a wake-up word uttered by the user and wake up the device, and to carry out voice interaction with the user using the speaker selected by the user.
An embodiment of the present application further provides a server, comprising a speech synthesis processing module, a voice wake-up processing module, and a publishing module; wherein
the speech synthesis processing module is configured to perform speech synthesis training using corpus information comprising real recordings of a plurality of speakers, to obtain a speech synthesis model;
the voice wake-up processing module is configured to train the defined wake-up words using audio data synthesized for the plurality of speakers by the speech synthesis model, to obtain a voice wake-up model;
the publishing module is configured to publish the audio training data corresponding to the different speakers obtained by training, together with the defined wake-up words.
An embodiment of the present application further provides a client, comprising a setting processing module and an interaction module; wherein
the setting processing module is configured to select, according to the user's needs, the current wake-up word and the speaker for voice interaction from the published audio training data and wake-up words;
the interaction module is configured to receive a wake-up word uttered by the user and wake up the device, and to carry out voice interaction with the user using the speaker selected by the user.
In one illustrative example, the setting processing module is further configured to:
download part or all of the published audio training data and wake-up words to the client.
In one illustrative example, the client further comprises an upload processing module, configured to record a wake-up word set by the user according to the user's needs, and to upload the recorded wake-up word to the server.
An embodiment of the present application further provides a method for implementing voice interaction, comprising:
performing speech synthesis training using corpus information comprising real recordings of a plurality of speakers, to obtain a speech synthesis model;
training the defined wake-up words using audio data synthesized for the plurality of speakers by the speech synthesis model, to obtain a voice wake-up model;
and publishing the audio training data corresponding to the different speakers obtained by training, together with the defined wake-up words.
An embodiment of the present application further provides a method for implementing voice interaction, comprising:
selecting, according to the user's needs, the current wake-up word and the speaker for voice interaction from the published audio training data and wake-up words;
waking up the device according to the received wake-up word; and carrying out voice interaction with the user using the selected speaker.
In one illustrative example, the method further comprises:
downloading part or all of the published audio training data and wake-up words.
In one illustrative example, the method further comprises:
recording a wake-up word set by the user according to the user's needs; and uploading the recorded wake-up word to the server.
An embodiment of the present application further provides a method for implementing voice interaction, comprising:
displaying to the user the audio training data and wake-up words corresponding to different speakers;
setting, according to the user's selection, a wake-up word for waking up the voice interaction device and a speaker for voice interaction with the user;
and downloading the corresponding speech synthesis model and voice wake-up model according to the set wake-up word and the set speaker.
In one illustrative example, the wake-up word is the name of the speaker.
In one illustrative example, the method further comprises:
receiving a recording instruction from the user, and recording a user-defined wake-up word;
and uploading the recorded wake-up word to the server, so that the new wake-up word can be trained and the voice wake-up model updated.
An embodiment of the present application further provides a method for implementing voice interaction, comprising:
receiving voice information from the user, and waking up the voice interaction device when, according to the set wake-up word and the downloaded voice wake-up model, the voice information from the user is judged to match the voice wake-up model;
and using the downloaded speech synthesis model to let the set speaker act as the voice assistant of the woken device and interact with the user.
An embodiment of the present application further provides a method for implementing voice interaction, comprising:
displaying to the user a role list of voice assistants for interacting with the voice interaction device and a wake-up word list;
determining, according to the user's selection, the wake-up word and the voice assistant currently interacting with the voice interaction device;
receiving a wake-up word from the user and waking up the voice interaction device;
and carrying out voice interaction with the voice assistant selected by the user.
In one illustrative example, the method further comprises:
receiving a recording instruction from the user, and recording a user-defined wake-up word;
and uploading the recorded wake-up word to the server, so that the new wake-up word can be trained and the wake-up word list updated.
An embodiment of the present application further provides a smart speaker, comprising a memory and a processor, wherein the memory stores instructions executable by the processor for performing the steps of the method for implementing voice interaction provided by the embodiments of the present application.
An embodiment of the present application further provides a device for implementing voice interaction, comprising a memory and a processor, wherein the memory stores instructions executable by the processor for performing the steps of any of the above methods for implementing voice interaction.
In the embodiments of the present application, in the model training stage, by training on corpora from a large number of different speakers, multiple speakers with different sound characteristics can be obtained, so that a user of the voice interaction device can set a favorite voice as the speaker for voice interaction with the device. In addition, diversified wake-up words are generated in the model training stage, enriching the wake-up operations between the user and the voice interaction device. Through the system for implementing voice interaction provided by the embodiments of the present application, the user can select the desired wake-up word according to his or her own needs, and can select a favorite voice as the speaker for subsequent voice interaction with the client, thereby realizing personalized setting of the voice interaction product and improving the user experience.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the claimed subject matter and are incorporated in and constitute a part of this specification, illustrate embodiments of the subject matter and together with the description serve to explain the principles of the subject matter and not to limit the subject matter.
FIG. 1 is a schematic structural diagram of a system for implementing voice interaction in an embodiment of the present application;
FIG. 2 is a flowchart of an embodiment of a method for implementing voice interaction;
FIG. 3 is a flow chart of another embodiment of a method for implementing voice interaction according to the present application;
FIG. 4 is a schematic view of an application scenario of the method for implementing voice interaction according to the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more apparent, embodiments of the present application will be described in detail below with reference to the accompanying drawings. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.
In one exemplary configuration of the present application, a computing device includes one or more processors (CPUs), input/output interfaces, a network interface, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
The steps illustrated in the flowcharts of the figures may be performed in a computer system, for example as a set of computer-executable instructions. Also, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.
FIG. 1 is a schematic structural diagram of a system for implementing voice interaction in an embodiment of the present application. As shown in FIG. 1, the system comprises a server and a client; wherein
the server comprises a speech synthesis processing module, a voice wake-up processing module, and a publishing module; wherein
the speech synthesis processing module is configured to perform speech synthesis training using corpus information comprising real recordings of a plurality of speakers, to obtain a speech synthesis model;
the voice wake-up processing module is configured to train the defined wake-up words using audio data synthesized for the plurality of speakers by the speech synthesis model, to obtain a voice wake-up model;
the publishing module is configured to publish the audio training data corresponding to the different speakers obtained by training, together with the defined wake-up words.
For example, assuming there are 100 speakers and each generates 100 pieces of synthesized audio data, 10,000 pieces of audio training data are obtained.
The client comprises a setting processing module and an interaction module; wherein
the setting processing module is configured to select, according to the user's needs, the current wake-up word and the speaker for voice interaction from the published audio training data and wake-up words;
the interaction module is configured to receive a wake-up word uttered by the user and wake up the device, and to carry out voice interaction with the user using the speaker selected by the user.
In an exemplary embodiment, the setting processing module may be further configured to:
download, from the published audio training data and wake-up words, part or all of the audio training data corresponding to different speakers and the wake-up words to the client, as the client's locally selectable audio training data and wake-up words.
The published audio training data may include a speech synthesis model, a voice wake-up model, and a model list covering the different wake-up words and/or different speakers, so that the client can select the corresponding models from the visible model list.
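As a concrete illustration only (the embodiments do not prescribe a wire format), the published model list might be represented as in the following Python sketch; every field name, identifier, and URL here is an assumption made for the example:

```python
from dataclasses import dataclass

@dataclass
class PublishedEntry:
    """One row of the hypothetical published model list."""
    speaker_id: str            # internal identifier (assumed)
    speaker_name: str          # display name shown in the client's role list
    wake_words: list[str]      # wake-up words trained for this speaker
    synthesis_model_url: str   # download location of the speech synthesis model
    wake_model_url: str        # download location of the voice wake-up model

# The client renders this list so the user can pick a wake-up word
# and a speaker for subsequent voice interaction.
catalog = [
    PublishedEntry(
        speaker_id="speaker_001",
        speaker_name="Xiao Ming",
        wake_words=["Xiao Ming", "Hello Xiao Ming"],
        synthesis_model_url="https://models.example.com/tts/speaker_001",
        wake_model_url="https://models.example.com/wake/speaker_001",
    ),
]
```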
With the system for implementing voice interaction provided by the embodiments of the present application, multiple speakers with different sound characteristics can be obtained in the model training stage by training on corpora from a large number of different speakers, so that a user of the voice interaction device can set a favorite voice as the speaker for voice interaction with the device. In addition, diversified wake-up words are generated in the model training stage, enriching the wake-up operations between the user and the voice interaction device. Through this system, the user can select the desired wake-up word according to his or her own needs, and can select a favorite voice as the speaker for subsequent voice interaction with the client, thereby realizing personalized setting of the voice interaction product and improving the user experience.
With the development of speech synthesis technology, using celebrity voices as speakers in voice interaction products has become very common, and there is a natural demand to use these speakers for wake-up as well. In one application scenario, the system provided by the embodiments of the present application keeps wake-up and broadcast consistent: interaction in which the wake-up voice matches the broadcast voice makes voice interaction more personalized and closer to the habits of natural interaction, thereby increasing the user's interest in the voice interaction product. In this system, on the one hand, when the server confirms a pronunciation role, it generates the speech synthesis model and the voice wake-up model at the same time and publishes them online together; on the other hand, when the client selects a pronunciation role, it downloads the speech synthesis model and the voice wake-up model from the server at the same time, so that the user can use the name of the new speaker as the wake-up word to wake up the voice assistant (also called the broadcaster) provided by the voice interaction product, while the voice assistant also broadcasts using the newly set speaker. This changes the monotonous experience in the related art of always waking up in a single fixed way, and achieves a changeable voice assistant role.
In an illustrative example, the client may further comprise an upload processing module, configured to record a wake-up word set by the user according to the user's needs, and to upload the recorded wake-up word to the server. In this way, the voice wake-up processing module in the server can further train the new wake-up word to refine the voice wake-up model, and the audio uploaded by the user can also serve as corpus information of a speaker's real voice for the speech synthesis training of the speech synthesis processing module, further refining the speech synthesis model.
With the system for implementing voice interaction provided by the embodiments of the present application, by providing new wake-up word recording data to the server, the client lets the user set which wake-up word is used to wake up the client according to his or her own preference. Through such iteration, the wake-up mode becomes changeable; compared with a fixed wake-up mode, the wake-up effect is better guaranteed and the user experience is better.
The present application further provides a server, comprising a speech synthesis processing module, a voice wake-up processing module, and a publishing module; wherein
the speech synthesis processing module is configured to perform speech synthesis training using corpus information comprising real recordings of a plurality of speakers, to obtain a speech synthesis model;
the voice wake-up processing module is configured to train the defined wake-up words using audio data synthesized for the plurality of speakers by the speech synthesis model, to obtain a voice wake-up model;
the publishing module is configured to publish the audio training data corresponding to the different speakers obtained by training, together with the defined wake-up words.
When the server provided by the present application confirms a pronunciation role, it generates the speech synthesis model and the voice wake-up model at the same time and publishes them online together. In the model training stage, multiple speakers with different sound characteristics can be obtained by training on corpora from a large number of different speakers, so that a user of the voice interaction device can set a favorite voice as the speaker for voice interaction with the device. In addition, diversified wake-up words are generated in the model training stage, enriching the wake-up operations between the user and the voice interaction device.
The present application further provides a client, comprising a setting processing module and an interaction module; wherein
the setting processing module is configured to select, according to the user's needs, the current wake-up word and the speaker for voice interaction from the published audio training data and wake-up words;
the interaction module is configured to receive a wake-up word uttered by the user and wake up the device, and to carry out voice interaction with the user using the speaker selected by the user.
In an exemplary embodiment, the setting processing module may be further configured to:
download part or all of the published audio training data and wake-up words to the client, as the client's locally selectable audio training data and wake-up words.
When the client selects a pronunciation role, it can download the speech synthesis model and the voice wake-up model from the server at the same time, so that the user can use the name of the new speaker as the wake-up word to wake up the voice assistant (also called the broadcaster) provided by the voice interaction product, while the voice assistant also broadcasts using the newly set speaker. This changes the monotonous experience in the related art of always waking up in a single fixed way, and achieves a changeable voice assistant role. The user can select the desired wake-up word according to his or her own needs, and can select a favorite voice as the speaker for subsequent voice interaction with the client, thereby realizing personalized setting of the voice interaction product and improving the user experience.
In an illustrative example, the client may further comprise an upload processing module, configured to record a wake-up word set by the user according to the user's needs, and to upload the recorded wake-up word to the server. In this way, the voice wake-up processing module in the server can further train the new wake-up word to refine the voice wake-up model, and the audio uploaded by the user can also serve as corpus information of a speaker's real voice for the speech synthesis training of the speech synthesis processing module, further refining the speech synthesis model.
With the client provided by the embodiments of the present application, by providing new wake-up word recording data to the server, the user can set which wake-up word is used to wake up the client according to his or her own preference. Through such iteration, the wake-up mode becomes changeable; compared with a fixed wake-up mode, the wake-up effect is better guaranteed and the user experience is better.
FIG. 2 is a flowchart of an embodiment of a method for implementing voice interaction according to the present application. As shown in FIG. 2, for the server, the method comprises:
Step 200: performing speech synthesis training using corpus information comprising real recordings of a plurality of speakers, to obtain a speech synthesis model.
Step 201: training the defined wake-up words using audio data synthesized for the plurality of speakers by the speech synthesis model, to obtain a voice wake-up model.
Step 202: publishing the audio training data corresponding to the different speakers obtained by training, together with the defined wake-up words.
With this method for implementing voice interaction, in the model training stage the server can obtain multiple speakers with different sound characteristics by training on corpora from a large number of different speakers, so that a user of the voice interaction device can set a favorite voice as the speaker for voice interaction with the device. In addition, diversified wake-up words are generated in the model training stage, enriching the wake-up operations between the user and the voice interaction device.
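The following Python sketch pictures the server-side flow of steps 200-202 under stated assumptions: the three functions are placeholders standing in for real speech synthesis training, wake-up word training, and publishing infrastructure, none of which the present application specifies.

```python
# Minimal sketch of steps 200-202; all names and data shapes are
# illustrative assumptions, not the application's actual implementation.

def train_synthesis_model(recordings):
    """Step 200: train a speech synthesis model on real recordings of many speakers."""
    return {"speakers": sorted({r["speaker"] for r in recordings})}

def train_wake_model(synthesis_model, wake_words, clips_per_speaker=100):
    """Step 201: synthesize each defined wake-up word in every speaker's voice
    and train the voice wake-up model on that synthetic audio."""
    samples = [(s, w, i)
               for s in synthesis_model["speakers"]
               for w in wake_words
               for i in range(clips_per_speaker)]
    return {"wake_words": list(wake_words), "num_training_samples": len(samples)}

def publish(synthesis_model, wake_model):
    """Step 202: publish the per-speaker audio training data and the wake-up words."""
    print(f"published {len(synthesis_model['speakers'])} speakers, "
          f"wake-up words {wake_model['wake_words']}, "
          f"{wake_model['num_training_samples']} training samples")

recordings = [{"speaker": "speaker_001", "audio": b"..."},
              {"speaker": "speaker_002", "audio": b"..."}]
tts = train_synthesis_model(recordings)
wake = train_wake_model(tts, wake_words=["Xiao Ming", "Hello Xiao Ming"])
publish(tts, wake)
```

Note how the sample count mirrors the arithmetic in the earlier example: with 100 speakers and 100 clips per speaker and wake-up word, the synthetic training set grows multiplicatively.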
FIG. 3 is a flowchart of another embodiment of a method for implementing voice interaction according to the present application. As shown in FIG. 3, for the client, the method comprises:
Step 300: selecting, according to the user's needs, the current wake-up word and the speaker for voice interaction from the published audio training data and wake-up words.
Step 301: waking up the device according to the received wake-up word; and carrying out voice interaction with the user using the selected speaker.
With this method for implementing voice interaction, the user can select the desired wake-up word at the client according to his or her own needs, and can select a favorite voice as the speaker for subsequent voice interaction with the client, thereby realizing personalized setting of the voice interaction product and improving the user experience.
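A minimal Python sketch of steps 300-301 follows; the catalog shape, the exact-match wake-up check, and the printed replies are simplifying assumptions standing in for the real wake-up and synthesis models:

```python
def select(catalog, wake_word, speaker_name):
    """Step 300: pick the current wake-up word and speaker from the published data."""
    entry = next(e for e in catalog
                 if e["speaker_name"] == speaker_name and wake_word in e["wake_words"])
    return {"wake_word": wake_word, "speaker": entry["speaker_name"]}

def run(settings, utterances):
    """Step 301: wake on the selected wake-up word, then reply in the selected
    speaker's voice (printing stands in for synthesized speech)."""
    awake = False
    for text in utterances:
        if not awake and text == settings["wake_word"]:
            awake = True
            print(f"[{settings['speaker']}] I'm here.")        # wake-up response
        elif awake:
            print(f"[{settings['speaker']}] replying to: {text!r}")

catalog = [{"speaker_name": "Xiao Ming", "wake_words": ["Xiao Ming"]}]
run(select(catalog, "Xiao Ming", "Xiao Ming"),
    ["Xiao Ming", "what's the weather today?"])
```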
In an illustrative example, the method for implementing voice interaction at the client may further comprise:
downloading part or all of the published audio training data and wake-up words to the client, as the client's locally selectable audio training data and wake-up words.
In an illustrative example, the method for implementing voice interaction at the client may further comprise:
recording a wake-up word set by the user according to the user's needs, and uploading the recorded wake-up word to the server. In this way, the voice wake-up processing module in the server can further train the new wake-up word to refine the voice wake-up model, and the audio uploaded by the user can also serve as corpus information of a speaker's real voice for the speech synthesis training of the speech synthesis processing module, further refining the speech synthesis model.
In this embodiment, the client provides new wake-up word recording data to the server, and the user can set which wake-up word is used to wake up the client according to his or her own preference. Through such iteration, the wake-up mode becomes changeable; compared with a fixed wake-up mode, the wake-up effect is better guaranteed and the user experience is better.
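As a sketch of the upload path only: the endpoint URL and JSON payload below are assumptions made for illustration, since the embodiments leave the transport unspecified:

```python
import json
import urllib.request

def upload_wake_word(audio_bytes: bytes, text: str,
                     url: str = "https://server.example.com/api/wake-words"):
    """Upload a recorded user-defined wake-up word. On the server it can be
    used both to train the new wake-up word (refining the voice wake-up model)
    and as real-voice corpus material for the speech synthesis training."""
    payload = json.dumps({"text": text, "audio_hex": audio_bytes.hex()}).encode()
    req = urllib.request.Request(url, data=payload,
                                 headers={"Content-Type": "application/json"})
    return urllib.request.urlopen(req)  # raises if the (hypothetical) endpoint is unreachable
```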
The present application also provides a computer-readable storage medium storing computer-executable instructions for performing any of the above methods for implementing voice interaction.
The present application further provides a device for implementing voice interaction, comprising a memory and a processor, wherein the memory stores instructions executable by the processor for performing the steps of any of the above methods for implementing voice interaction.
An embodiment of the present application further provides a method for implementing voice interaction, comprising:
displaying to the user the audio training data and wake-up words corresponding to different speakers;
setting, according to the user's selection, a wake-up word for waking up the voice interaction device and a speaker for voice interaction with the user;
and downloading the corresponding speech synthesis model and voice wake-up model according to the set wake-up word and the set speaker.
In this way, the user can select the desired wake-up word at the client according to his or her own needs, and can select a favorite voice as the speaker for subsequent voice interaction with the client, thereby realizing personalized setting of the voice interaction product and improving the user experience.
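Continuing the hypothetical PublishedEntry shape from the earlier sketch, downloading both models together (so that wake-up and broadcast use the same voice) might look like this; the URLs and file layout are assumptions:

```python
import urllib.request

def download_models(entry, dest_dir="."):
    """Fetch the speech synthesis model and the voice wake-up model for the
    speaker the user selected; `entry` follows the hypothetical PublishedEntry
    shape sketched above."""
    for kind, url in (("tts", entry.synthesis_model_url),
                      ("wake", entry.wake_model_url)):
        path = f"{dest_dir}/{entry.speaker_id}_{kind}.bin"
        urllib.request.urlretrieve(url, path)  # placeholder transport
        print(f"downloaded {kind} model to {path}")
```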
In one illustrative example, the wake-up word may be the name of a speaker. In this way, the name of the new speaker is used as the wake-up word to wake up the voice assistant provided by the voice interaction product, while the voice assistant also broadcasts using the newly set speaker, so that wake-up and broadcast are consistent. Such wake-up/broadcast consistency makes voice interaction more personalized and closer to the habits of natural interaction, thereby increasing the user's interest in the voice interaction product.
In one illustrative example, the method further comprises:
receiving a recording instruction from the user, and recording a user-defined wake-up word;
and uploading the recorded wake-up word to the server, so that the new wake-up word can be trained and the voice wake-up model updated.
By providing new wake-up word recording data to the server, the client lets the user set which wake-up word is used to wake up the client according to his or her own preference. Through such iteration, the wake-up mode becomes changeable; compared with a fixed wake-up mode, the wake-up effect is better guaranteed and the user experience is better.
An embodiment of the present application further provides a method for implementing voice interaction, comprising:
receiving voice information from the user, and waking up the voice interaction device when, according to the set wake-up word and the downloaded voice wake-up model, the voice information from the user is judged to match the voice wake-up model;
and using the downloaded speech synthesis model to let the set speaker act as the voice assistant of the woken device and interact with the user.
For example, when the voice interaction device receives voice information from the user, if the voice information matches a wake-up word in the voice wake-up model, that is, the user has used a wake-up word the device can recognize, a wake-up request is initiated to the device; the device wakes up and enters a state of voice interaction with the user using the set speaker. The voice wake-up model is obtained by training the defined wake-up words on audio data synthesized for multiple speakers by the speech synthesis model, and the speech synthesis model is obtained by speech synthesis training on corpus information comprising real recordings of multiple speakers. That is, the user can select the desired wake-up word at the client according to his or her own needs, and can select a favorite voice as the speaker for subsequent voice interaction with the client, realizing personalized setting of the voice interaction product and improving the user experience. For example, the wake-up word may be the name of the speaker: the name of the new speaker wakes up the voice assistant provided by the voice interaction product, while the voice assistant also broadcasts using the newly set speaker, achieving wake-up/broadcast consistency, which makes voice interaction more personalized and closer to the habits of natural interaction and increases the user's interest in the product. This changes the monotonous experience in the related art of always waking up in a single fixed way, and achieves a changeable voice assistant role.
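The matching step can be pictured as scoring incoming audio against the downloaded voice wake-up model and waking the device once the score clears a threshold; the trivial exact-match scorer below is an assumption standing in for the real model:

```python
def score_wake_word(heard_text: str, wake_word: str) -> float:
    """Stand-in for the voice wake-up model: a trivial exact-match score."""
    return 1.0 if heard_text.strip().lower() == wake_word.lower() else 0.0

def maybe_wake(heard_text: str, wake_word: str, threshold: float = 0.8) -> bool:
    """Wake the device when the score clears the threshold; afterwards the
    voice assistant replies using the speaker the user set."""
    if score_wake_word(heard_text, wake_word) >= threshold:
        print("device woken; voice assistant broadcasts with the selected speaker")
        return True
    return False

maybe_wake("Xiao Ming", "Xiao Ming")  # -> True, device wakes
```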
FIG. 4 is a schematic view of an application scenario of the method for implementing voice interaction according to the present application. In this embodiment, the voice interaction device is a smart speaker. As shown in FIG. 4, the smart speaker displays to the user a role list of voice assistants for interacting with the device and a wake-up word list. The user selects the desired voice assistant from the role list, such as the voice of a favorite star, by clicking, gesture, or gaze; on the one hand, that star can be selected as the current voice assistant, and on the other hand, the user selects a preferred wake-up word from the wake-up word list for waking up the smart speaker before interacting with it. When the smart speaker receives a wake-up word from the user that matches the previously selected wake-up word, it wakes up; the smart speaker can then invoke the previously selected voice assistant for subsequent voice interaction with the user. Thus the user, when using the smart speaker, sets a favorite voice as the speaker and completes voice interaction with the smart speaker. When the user needs to add or change a wake-up word, the recorded wake-up word can be uploaded to the server through the smart speaker; the server can then further train the new wake-up word and update the wake-up word list.
Although the embodiments disclosed in the present application are described above, the descriptions are only for the convenience of understanding the present application, and are not intended to limit the present application. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the disclosure as defined by the appended claims.
Claims (12)
1. A system for implementing voice interaction, comprising a server and a client; wherein
the server comprises a speech synthesis processing module, a voice wake-up processing module, and a publishing module; wherein
the speech synthesis processing module is configured to perform speech synthesis training using corpus information comprising real recordings of a plurality of speakers, to obtain a speech synthesis model;
the voice wake-up processing module is configured to train the defined wake-up words using audio data synthesized for the plurality of speakers by the speech synthesis model, to obtain a voice wake-up model;
the publishing module is configured to publish the audio training data corresponding to the different speakers obtained by training, together with the defined wake-up words;
the client comprises a setting processing module and an interaction module; wherein
the setting processing module is configured to select, according to the user's needs, the current wake-up word and the speaker for voice interaction from the published audio training data and wake-up words;
the interaction module is configured to receive a wake-up word uttered by the user and wake up the device, and to carry out voice interaction with the user using the speaker selected by the user.
2. A server, comprising a speech synthesis processing module, a voice wake-up processing module, and a publishing module; wherein
the speech synthesis processing module is configured to perform speech synthesis training using corpus information comprising real recordings of a plurality of speakers, to obtain a speech synthesis model;
the voice wake-up processing module is configured to train the defined wake-up words using audio data synthesized for the plurality of speakers by the speech synthesis model, to obtain a voice wake-up model;
the publishing module is configured to publish the audio training data corresponding to the different speakers obtained by training, together with the defined wake-up words.
3. A client, comprising a setting processing module and an interaction module; wherein
the setting processing module is configured to select, according to the user's needs, the current wake-up word and the speaker for voice interaction from the published audio training data and wake-up words;
the interaction module is configured to receive a wake-up word uttered by the user and wake up the device, and to carry out voice interaction with the user using the speaker selected by the user.
4. A method for implementing voice interaction, comprising:
performing speech synthesis training using corpus information comprising real recordings of a plurality of speakers, to obtain a speech synthesis model;
training the defined wake-up words using audio data synthesized for the plurality of speakers by the speech synthesis model, to obtain a voice wake-up model;
and publishing the audio training data corresponding to the different speakers obtained by training, together with the defined wake-up words.
5. A method for implementing voice interaction, comprising:
selecting, according to the user's needs, the current wake-up word and the speaker for voice interaction from the published audio training data and wake-up words;
waking up the device according to the received wake-up word; and carrying out voice interaction with the user using the selected speaker.
6. A method for implementing voice interaction, comprising:
displaying to the user the audio training data and wake-up words corresponding to different speakers;
setting, according to the user's selection, a wake-up word for waking up the voice interaction device and a speaker for voice interaction with the user;
and downloading the corresponding speech synthesis model and voice wake-up model according to the set wake-up word and the set speaker.
7. The method of claim 6, further comprising:
receiving a recording instruction from the user, and recording a user-defined wake-up word;
and uploading the recorded wake-up word to the server, so that the new wake-up word can be trained and the voice wake-up model updated.
8. A method for implementing voice interaction, comprising:
receiving voice information from the user, and waking up the voice interaction device when, according to the set wake-up word and the downloaded voice wake-up model, the voice information from the user is judged to match the voice wake-up model;
and using the downloaded speech synthesis model to let the set speaker act as the voice assistant of the woken device and interact with the user.
9. A method for implementing voice interaction, comprising:
displaying to the user a role list of voice assistants for interacting with the voice interaction device and a wake-up word list;
determining, according to the user's selection, the wake-up word and the voice assistant currently interacting with the voice interaction device;
receiving a wake-up word from the user and waking up the voice interaction device;
and carrying out voice interaction with the voice assistant selected by the user.
10. The method of claim 9, further comprising:
receiving a recording instruction from the user, and recording a user-defined wake-up word;
and uploading the recorded wake-up word to the server, so that the new wake-up word can be trained and the wake-up word list updated.
11. A smart speaker, comprising a memory and a processor, wherein the memory stores instructions executable by the processor for performing the steps of the method for implementing voice interaction of claim 9 or claim 10.
12. A device for implementing voice interaction, comprising a memory and a processor, wherein the memory stores instructions executable by the processor for performing the steps of the method for implementing voice interaction of claim 4 or of claim 5.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202110556758.3A | 2021-05-21 | 2021-05-21 | Method and system for realizing voice interaction, service end, client and intelligent sound box
Publications (1)

Publication Number | Publication Date
---|---
CN113299275A | 2021-08-24
Family
ID=77323640
Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202110556758.3A (pending) | Method and system for realizing voice interaction, service end, client and intelligent sound box | 2021-05-21 | 2021-05-21

Country Status (1)

Country | Link
---|---
CN | CN113299275A
Cited By (1)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN117012189A | 2022-04-29 | 2023-11-07 | Honor Device Co., Ltd. | Voice recognition method and electronic equipment
Citations (4)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN106971719A | 2017-05-16 | 2017-07-21 | | Offline speaker-independent speech recognition wake-up method with a changeable wake-up word
CN109346083A | 2018-11-28 | 2019-02-15 | | Intelligent voice interaction method and apparatus, related device, and storage medium
CN109545194A | 2018-12-26 | 2019-03-29 | | Wake-up word pre-training method, apparatus, device, and storage medium
CN109584860A | 2017-09-27 | 2019-04-05 | | Voice wake-up word definition method and system
2021
- 2021-05-21: CN application CN202110556758.3A filed; published as CN113299275A; status Pending
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
2024-03-05 | TA01 | Transfer of patent application right | Applicant after: Alibaba Innovation Co., # 03-06, Lai Zan Da Building 1, 51 Belarusian Road, Singapore. Applicant before: Alibaba Singapore Holdings Ltd., Room 01, 45th Floor, AXA Building, 8 Shanton Road, Singapore.