CN113012681B - Wake-up voice synthesis method based on a wake-up voice model, and application wake-up method

Info

Publication number
CN113012681B
CN113012681B (application CN202110190523.7A)
Authority
CN
China
Prior art keywords
voice
wake
parameters
matched
speech
Prior art date
Legal status
Active
Application number
CN202110190523.7A
Other languages
Chinese (zh)
Other versions
CN113012681A (en)
Inventor
彭金华
李牧之
陈潮涛
姜迪
Current Assignee
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date
Filing date
Publication date
Application filed by WeBank Co Ltd
Priority to CN202110190523.7A
Publication of CN113012681A
Application granted
Publication of CN113012681B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 - Architecture of speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application provides a wake-up voice synthesis method based on a wake-up voice model, an application wake-up method, a device, an electronic device, a computer-readable storage medium, and a computer program product. The wake-up voice model comprises a voiceprint extraction layer, a phoneme conversion layer, and a prediction layer, and the wake-up voice synthesis method based on the wake-up voice model comprises the following steps: extracting voiceprint features from the voices of different users through the voiceprint extraction layer to obtain user voiceprints; performing phoneme conversion on the wake-up text through the phoneme conversion layer to obtain a wake-up phoneme sequence; predicting voice parameters through the prediction layer based on the user voiceprints and the wake-up phoneme sequence to obtain corresponding predicted voice parameters; and performing voice synthesis based on the predicted voice parameters to obtain the corresponding wake-up voice. The wake-up voice is used to wake up the target program when it is successfully matched with a voice to be matched. The application can efficiently generate wake-up voices with an anthropomorphic effect, thereby saving time and labor costs.

Description

Wake-up voice synthesis method based on a wake-up voice model, and application wake-up method
Technical Field
The present application relates to voice wake-up technology, and in particular to a wake-up voice synthesis method based on a wake-up voice model and an application wake-up method.
Background
Voice wake-up means that a user wakes up an electronic device by speaking a wake-up word, so that the device enters a state of waiting for a voice instruction, or directly executes a predetermined voice instruction.
In the related art, one way of realizing voice wake-up is to manually record a large number of voices for the wake-up word and then perform voice matching based on the manually recorded voices. However, this approach requires a large number of people to record voices, so its time and labor costs are high.
Disclosure of Invention
The embodiment of the application provides a wake-up voice synthesis method based on a wake-up voice model, an application wake-up method, a device, electronic equipment, a computer readable storage medium and a computer program product, which can efficiently generate wake-up voice with anthropomorphic effect and save time and labor cost.
The technical scheme of the embodiment of the application is realized as follows:
The embodiment of the application provides a wake-up voice synthesis method based on a wake-up voice model, wherein the wake-up voice model comprises a voiceprint extraction layer, a phoneme conversion layer and a prediction layer, and the method comprises the following steps:
extracting voiceprint characteristics of voices of different users through the voiceprint extraction layer to obtain corresponding user voiceprints;
Performing phoneme conversion on the awakened text through the phoneme conversion layer to obtain a corresponding awakened phoneme sequence;
based on the user voiceprint and the wake-up phoneme sequence, carrying out voice parameter prediction through the prediction layer to obtain corresponding predicted voice parameters;
performing voice synthesis based on the predicted voice parameters to obtain corresponding wake-up voice;
the wake-up voice is used for waking up the target program when it is successfully matched with the voice to be matched.
The embodiment of the application provides a wake-up voice synthesis device based on a wake-up voice model, wherein the wake-up voice model comprises a voiceprint extraction layer, a phoneme conversion layer and a prediction layer, and the device comprises:
The voiceprint extraction module is used for extracting voiceprint characteristics of the user voice through the voiceprint extraction layer to obtain corresponding user voiceprints;
the phoneme conversion module is used for carrying out phoneme conversion on the awakened text through the phoneme conversion layer to obtain a corresponding awakened phoneme sequence;
the prediction module is used for predicting voice parameters based on the user voiceprint and the wake-up phoneme sequence through the prediction layer to obtain corresponding predicted voice parameters;
the voice synthesis module is used for carrying out voice synthesis based on the predicted voice parameters to obtain corresponding wake-up voice;
the wake-up voice is used for waking up the target program when it is successfully matched with the voice to be matched.
In the above scheme, the speech synthesis module is further configured to perform spectrum conversion based on the predicted speech parameter to obtain a corresponding speech spectrum; and performing voice coding based on the voice frequency spectrum to obtain corresponding wake-up voice.
In the above scheme, the wake-up voice synthesis device based on the wake-up voice model further includes: a voice matching module, configured to acquire a negative example phoneme sequence that does not match the wake-up text; perform voice synthesis based on the negative example phoneme sequence to obtain a negative example voice corresponding to the wake-up text; in response to a voice matching request for a voice to be matched, match the voice to be matched against the wake-up voice and the negative example voice respectively to obtain a matching result; and send the matching result, so that the target program is not woken up when the matching result indicates that the voice to be matched is successfully matched with the negative example voice.
In the above scheme, the phoneme conversion module is further configured to acquire at least one of a re-reading (stress) parameter and a pause parameter as pronunciation parameters, and to perform phoneme conversion on the wake-up text through the phoneme conversion layer based on the pronunciation parameters to obtain the corresponding wake-up phoneme sequence.
In the above scheme, the wake-up voice synthesis device based on the wake-up voice model further includes: the generalization module is used for acquiring at least one of a speech speed parameter, a pitch parameter and a volume parameter as a speech generalization parameter; and performing voice generalization processing on the wake-up voice based on the voice generalization parameters to obtain the generalized wake-up voice.
In the above scheme, the wake-up voice synthesis device based on the wake-up voice model further includes: the voice matching module is used for receiving a voice matching request aiming at the voice to be matched; responding to the voice matching request, and comparing the waveform characteristics of the voice to be matched with the waveform characteristics of the wake-up voice to determine the similarity of the voice to be matched and the wake-up voice; and sending the determined similarity to wake up the target program when the similarity reaches a similarity threshold.
In the above scheme, the wake-up voice synthesis device based on the wake-up voice model further includes: the voice classification model training module is used for acquiring wake-up voice carrying a first classification label and negative-example voice carrying a second classification label; the first classification label indicates that the wake-up voice is matched with the wake-up text, and the second classification label indicates that the negative example voice is not matched with the wake-up text; constructing a training sample set based on wake-up voice carrying a first classification label and negative-example voice carrying a second classification label, and training a voice classification model based on the training sample set; the voice classification model is used for classifying input voices to be matched and outputting classification results of whether the input voices to be matched are matched with the wake-up text or not.
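As an illustration of this scheme, the following is a minimal sketch of such a voice classification model, assuming a PyTorch binary classifier over time-pooled mel-spectrogram features; the architecture, the feature choice, and all names here are illustrative assumptions rather than the patent's concrete model.

```python
# A minimal sketch: wake-up voices carry the first tag (label 1), negative
# example voices carry the second tag (label 0); the classifier outputs whether
# an input voice matches the wake-up text. All shapes are illustrative.
import torch
import torch.nn as nn

class VoiceClassifier(nn.Module):
    def __init__(self, n_mels=80, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_mels, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2))  # matched / not matched

    def forward(self, mel):                 # mel: (batch, n_mels, frames)
        return self.net(mel.mean(dim=2))    # average over time, then classify

model = VoiceClassifier()
wake_mels = torch.rand(4, 80, 100)          # wake-up voices, first tag (label 1)
neg_mels = torch.rand(8, 80, 100)           # negative voices, second tag (label 0)
x = torch.cat([wake_mels, neg_mels])
y = torch.cat([torch.ones(4), torch.zeros(8)]).long()

loss = nn.CrossEntropyLoss()(model(x), y)   # one training step on the sample set
loss.backward()
```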
In the above scheme, the wake-up voice model further includes a voice synthesis layer, and the voice synthesis module is further configured to perform voice synthesis through the voice synthesis layer based on the predicted voice parameter, so as to obtain a corresponding wake-up voice.
In the above scheme, the device further includes a wake-up voice model training module, configured to perform voiceprint feature extraction on the sample voice through the voiceprint extraction layer to obtain a corresponding sample voiceprint, wherein the sample voice carries a voice parameter tag; perform phoneme conversion on the sample text corresponding to the sample voice through the phoneme conversion layer to obtain a corresponding sample phoneme sequence; perform voice parameter prediction through the prediction layer based on the sample voiceprint and the sample phoneme sequence to obtain corresponding predicted voice parameters; and update the model parameters of the wake-up voice model based on the differences between the predicted voice parameters and the voice parameter tags.
The embodiment of the application provides an application awakening method, which comprises the following steps:
Receiving an application wake-up instruction carrying a voice to be matched, wherein the application wake-up instruction is used for indicating a wake-up target program;
responding to the application wake-up instruction, and matching the voice to be matched with wake-up voice;
The wake-up voice is obtained by voice synthesis based on predicted voice parameters, and the predicted voice parameters are obtained by voice parameter prediction based on voices of different users and wake-up texts through a wake-up voice model;
and waking up the target program when the voice to be matched is successfully matched with the wake-up voice.
The embodiment of the application provides an application awakening device, which comprises:
The receiving module is used for receiving an application wake-up instruction carrying voice to be matched, wherein the application wake-up instruction is used for indicating a wake-up target program;
The matching module is used for responding to the application wake-up instruction and matching the voice to be matched with the wake-up voice;
The wake-up voice is obtained by voice synthesis based on predicted voice parameters, and the predicted voice parameters are obtained by voice parameter prediction based on voices of different users and wake-up texts through a wake-up voice model;
and the wake-up module is used for waking up the target program when the voice to be matched is successfully matched with the wake-up voice.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the wake-up voice synthesis method based on the wake-up voice model or applying the wake-up method when executing the executable instructions stored in the memory.
The embodiment of the application provides a computer readable storage medium which stores executable instructions for realizing the wake-up voice synthesis method based on the wake-up voice model or applying the wake-up method when being executed by a processor.
The embodiment of the application provides a computer program product, which comprises a computer program, wherein the computer program realizes the wake-up voice synthesis method based on the wake-up voice model or applies the wake-up method when being executed by a processor.
The embodiment of the application has the following beneficial effects:
Compared with the manual voice-recording approach used in the related art, the embodiment of the application extracts user voiceprints from the voices of different users, converts the wake-up text into a wake-up phoneme sequence, and synthesizes the wake-up voice used for voice matching based on the user voiceprints and the wake-up phoneme sequence. By combining the voiceprints of different users in voice synthesis, wake-up voices with anthropomorphic effects in different user timbres can be efficiently customized and synthesized, replacing the manual recording of voices for the wake-up text. This overcomes the high time and labor costs of manual recording in the related art, thereby saving time and labor costs.
Drawings
FIG. 1 is a schematic diagram of an alternative configuration of a wake-up speech synthesis system based on a wake-up speech model according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an alternative configuration of an electronic device according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an alternative configuration of a wake-up speech model provided by an embodiment of the present application;
FIG. 4 is a schematic flow chart of an alternative wake-up speech synthesis method based on a wake-up speech model according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an alternative configuration of a wake-up speech model provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of an alternative configuration of a wake-up speech model provided by an embodiment of the present application;
FIG. 7 is a schematic flow chart of an alternative wake-up speech synthesis method based on a wake-up speech model according to an embodiment of the present application;
FIG. 8 is a schematic flow chart of an alternative wake-up speech synthesis method based on a wake-up speech model according to an embodiment of the present application;
FIG. 9 is a schematic diagram of an alternative architecture of a speech classification model provided by an embodiment of the present application;
FIG. 10 is a schematic flow chart of an alternative wake-up method according to an embodiment of the present application;
FIG. 11 is a schematic flow chart of an alternative wake-up speech synthesis method based on a wake-up speech model according to an embodiment of the present application;
FIG. 12 is an alternative schematic diagram of a wake text setup interface provided by an embodiment of the present application;
FIG. 13 is an alternative schematic diagram of a wake text input interface provided by an embodiment of the present application;
FIG. 14 is an alternative schematic diagram of a target wakeup interface according to an embodiment of the present application;
FIG. 15 is a schematic diagram of an alternative configuration of a wake-up speech synthesis apparatus based on a wake-up speech model according to an embodiment of the present application;
FIG. 16 is a schematic diagram of an alternative configuration of a wake-up speech synthesis apparatus based on a wake-up speech model according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions, and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the following description, the terms "first", "second", "third", and the like are merely used to distinguish similar objects and do not imply a specific ordering; where permitted, "first", "second", and "third" may be interchanged in a specific order or sequence, so that the embodiments of the application described herein can be practiced in orders other than those illustrated or described.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
Before describing the embodiments of the present application in further detail, the terms involved in the embodiments are explained; these terms apply to the following interpretations.
1) Voiceprint: the spectrum of sound waves carrying speech information, displayed with electro-acoustic instruments. Voiceprints are both specific and relatively stable; based on these two features, a voiceprint can be used to identify a user. Voiceprints include: wideband voiceprints, narrowband voiceprints, amplitude voiceprints, contour voiceprints, time-spectrum voiceprints, and cross-section voiceprints (again wideband or narrowband). Wideband and narrowband voiceprints display how the frequency and intensity of the voice change over time; amplitude, contour, and time-spectrum voiceprints show how the voice intensity or sound pressure changes over time; a cross-section voiceprint shows the intensity and frequency characteristics of the sound wave at a certain point in time.
2) Speech rate: the presentation rate of words or language symbols expressing meaning per unit time. It may be expressed as the number of words presented per minute, in Words Per Minute (WPM), or as the number of syllables presented per minute, in Syllables Per Minute (SPM).
3) Pitch: the height of a sound, one of the basic characteristics by which sounds of different tonal heights are distinguished. Sound is in essence a mechanical wave, so pitch is determined by the frequency of the wave; since the speed of sound is fixed, frequency and wavelength are related: when the frequency is high and the wavelength short, the sound is "high", whereas when the frequency is low and the wavelength long, the sound is "low". The unit of pitch is Hertz (Hz).
4) Volume, also called sound intensity or loudness: the ear's subjective perception of the strength of the sound heard; the objective measure is the amplitude of the sound. The unit of volume is the decibel (dB).
5) Negative text: text that does not match the wake-up text. The phoneme sequence obtained by converting negative text is a negative example phoneme sequence, and the voice synthesized based on a negative example phoneme sequence is a negative example voice.
For example, if the wake-up text is "Xiaoming Xiaoming", the negative text may be "Xiaoming", "Mingming", "Xiaoming Xiao", "Daming Xiaoming", and so on.
6) Artificial Intelligence (AI): the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
7) Machine Learning (ML): a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and how to reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way of giving computers intelligence, and it is applied in all fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
In order to save labor and time costs, another approach in the related art matches keywords based on speech recognition: speech recognition is first performed on the voice to be matched to obtain its text, and the text is then matched with the wake-up text. However, the inventors found that this keyword-matching approach requires a very large amount of real-time computation, and since the device must collect voice in real time in a voice wake-up scenario, the computational pressure and power consumption of the device are very high and the device easily heats up.
Based on this, the embodiments of the present application provide a wake-up voice synthesis method and apparatus based on a wake-up voice model, an application wake-up method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product, which can efficiently generate wake-up voices with an anthropomorphic effect and save time and labor costs.
Referring to fig. 1, fig. 1 is a schematic diagram of an optional architecture of a wake-up speech synthesis system 100 based on a wake-up speech model according to an embodiment of the present application. A terminal 400 is connected to a server 200 through a network 300, where the network 300 may be a wide area network, a local area network, or a combination of the two, and uses wireless links for data transmission. In some embodiments, the terminal 400 may be, but is not limited to, a notebook computer, tablet computer, desktop computer, smart phone, dedicated messaging device, portable game device, smart speaker, smart watch, smart home device, smart car device, etc. The server 200 may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, big data, and artificial intelligence platforms. The terminal 400 and the server 200 may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiments of the present application.
The terminal 400 is configured to obtain the wake-up text and voices of different users, generate wake-up voice synthesis instructions based on the wake-up text and voices of different users, and send the wake-up voice synthesis instructions to the server 200.
The server 200 is configured to perform voiceprint feature extraction on the user voices through the voiceprint extraction layer of the wake-up voice model to obtain the corresponding user voiceprints; perform phoneme conversion on the wake-up text through the phoneme conversion layer of the wake-up voice model to obtain the corresponding wake-up phoneme sequence; perform voice parameter prediction through the prediction layer based on the user voiceprints and the wake-up phoneme sequence to obtain the corresponding predicted voice parameters; and perform voice synthesis based on the predicted voice parameters to obtain the corresponding wake-up voice, which is then transmitted to the terminal 400.
The terminal 400 is further configured to collect the voice to be matched, match the voice to be matched with the wake-up voice, and wake up the target program when the voice to be matched is successfully matched with the wake-up voice.
Referring to fig. 2, fig. 2 is a schematic diagram of an alternative structure of an electronic device 500 according to an embodiment of the present application, in which the electronic device 500 may be implemented as the terminal 400 or the server 200 in fig. 1, and an electronic device implementing a wake-up speech synthesis method based on a wake-up speech model according to an embodiment of the present application is described by taking the electronic device as the server 200 shown in fig. 1 as an example. The electronic device 500 shown in fig. 2 includes: at least one processor 510, a memory 550, at least one network interface 520, and a user interface 530. The various components in electronic device 500 are coupled together by bus system 540. It is appreciated that bus system 540 is used to facilitate connected communications between these components. The bus system 540 includes a power bus, a control bus, and a status signal bus in addition to the data bus. The various buses are labeled as bus system 540 in fig. 2 for clarity of illustration.
The processor 510 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor (for example, a microprocessor or any conventional processor), a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The user interface 530 includes one or more output devices 531 that enable presentation of media content, including one or more speakers and/or one or more visual displays. The user interface 530 also includes one or more input devices 532, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 550 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 550 may optionally include one or more storage devices physically located remote from processor 510.
Memory 550 includes volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile memory may be Read-Only Memory (ROM), and the volatile memory may be Random Access Memory (RAM). The memory 550 described in the embodiments of the present application is intended to comprise any suitable type of memory.
In some embodiments, memory 550 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 551 including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks;
Network communication module 552, used to reach other computing devices via one or more (wired or wireless) network interfaces 520; exemplary network interfaces 520 include Bluetooth, Wireless Fidelity (Wi-Fi), Universal Serial Bus (USB), etc.;
a presentation module 553 for enabling presentation of information (e.g., a user interface for operating a peripheral device and displaying content and information) via one or more output devices 531 (e.g., a display screen, speakers, etc.) associated with the user interface 530;
The input processing module 554 is configured to detect one or more user inputs or interactions from one of the one or more input devices 532 and translate the detected inputs or interactions.
In some embodiments, the wake-up speech synthesis device based on the wake-up speech model provided by the embodiments of the present application may be implemented in software. Fig. 2 shows the wake-up speech synthesis device 555 based on the wake-up speech model stored in the memory 550, which may be software in the form of a program, a plug-in, or the like, and includes the following software modules: a voiceprint extraction module 5551, a phoneme conversion module 5552, a prediction module 5553, and a speech synthesis module 5554. These modules are logical, and thus may be arbitrarily combined or further split depending on the functions implemented. The functions of the respective modules are described below.
In other embodiments, the wake-up speech synthesis apparatus based on a wake-up speech model provided by the embodiments of the present application may be implemented in hardware. By way of example, it may be a processor in the form of a hardware decoding processor programmed to perform the wake-up speech synthesis method based on a wake-up speech model provided by the embodiments of the present application; for example, the hardware decoding processor may employ one or more Application-Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field-Programmable Gate Arrays (FPGAs), or other electronic components.
The wake-up speech synthesis method based on the wake-up speech model provided by the embodiments of the present application is described below in connection with exemplary applications and implementations of the server provided by the embodiments of the present application. Referring to fig. 3, fig. 3 is an optional structural diagram of a wake-up speech model according to an embodiment of the present application, where the wake-up speech model includes a voiceprint extraction layer, a phoneme conversion layer, and a prediction layer. Here, the wake-up speech model is a Text-To-Speech (TTS) model for implementing text-to-speech conversion. Referring to fig. 4, fig. 4 is a schematic flow chart of an alternative wake-up speech synthesis method based on a wake-up speech model according to an embodiment of the present application, described with reference to the steps shown in fig. 4.
Step 101, the server extracts voiceprint characteristics of voices of different users through the voiceprint extraction layer to obtain corresponding user voiceprints.
Here, the user voices may be obtained from a public voice database, such as the open AISHELL voice dataset, which contains voice data from thousands of different users. In some embodiments, the user voices may also be collected by the server through microphones communicatively connected to it, so as to obtain the user voices of multiple different users. In practical implementation, the server extracts voiceprint features from each user voice through the voiceprint feature extraction layer to obtain the corresponding user voiceprint. Here, the user voiceprint extracted by the voiceprint feature extraction layer may be a voiceprint map, specifically a vector representation of the voiceprint map. The extracted voiceprint map includes the frequency values of the sound wave and characteristics such as how the sound wave varies with time, its duration, the sound intensity, and the waveform.
In practical implementation, the voiceprint feature extraction layer may have its parameters updated during training of the wake-up voice model, or may be obtained by training another model; for example, a voiceprint recognition model may be trained first, and the part of its structure that outputs voiceprint features may then be taken as the voiceprint feature extraction layer of the embodiment of the present application. In the latter case, it should be understood that the voiceprint feature extraction layer is trained separately from the wake-up voice model.
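As a concrete illustration of this step, the following is a minimal sketch, assuming the learned voiceprint extraction layer can be stood in for by a fixed MFCC-averaging embedding computed with librosa; the file names and function names are hypothetical, and the patent's actual layer is a trained network.

```python
# A minimal sketch of step 101: one fixed-length embedding per user voice.
import numpy as np
import librosa

def extract_voiceprint(wav_path: str, sr: int = 16000, n_mfcc: int = 40) -> np.ndarray:
    """Return a fixed-length vector standing in for a user voiceprint."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    return mfcc.mean(axis=1)                                # time-averaged embedding

# One voiceprint per user voice, e.g. drawn from a public corpus such as AISHELL.
user_voiceprints = [extract_voiceprint(p) for p in ["user_a.wav", "user_b.wav"]]
```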
And 102, performing phoneme conversion on the wake-up text through the phoneme conversion layer to obtain a corresponding wake-up phoneme sequence.
Here, the wake-up text may be any customized content. In actual implementation, the wake-up text is bound to an application, and one application may correspond to multiple wake-up texts. For example, for a map application, the wake-up text may be set to "open the map", "open navigation", etc.; for a music application, the wake-up text may be set to "xx music", "listen to a song", etc.; and for an artificial intelligence speech application, a wake-up text of any content may be set, and when the user utters speech corresponding to the wake-up text, the application is woken up to hold a conversation.
In the embodiment of the application, the server performs phoneme conversion on the awakened text through the phoneme conversion layer of the awakened speech model to obtain a corresponding awakened phoneme sequence. Here, the phoneme conversion layer may be obtained as part of the wake speech model when training the wake speech model, or may be independent of other algorithms or model structures other than the wake speech model.
In some embodiments, based on fig. 4, step 102 may also be implemented as follows: the server acquires at least one of a re-reading (stress) parameter and a pause parameter as the pronunciation parameters, and performs phoneme conversion on the wake-up text through the phoneme conversion layer based on the pronunciation parameters to obtain the corresponding wake-up phoneme sequence.
Here, the re-reading parameter is used to indicate the stressed positions in the text. For the wake-up text ABAB, the re-reading parameter may indicate that the first and third characters are to be stressed, which is represented by adding re-reading marks to the wake-up text, for example by placing the symbol "·" under each stressed character. The re-reading mark indicates the re-read object in the wake-up text, i.e. the character or word to be stressed. The pause parameter is used to indicate the pause positions in the text. For example, for ABAB, the pause parameter may indicate a pause between the second and third characters, represented as AB/AB by adding a pause mark to the wake-up text, where the symbol "/" is the pause mark.
In practical implementation, different wake-up phoneme sequences can be obtained by setting different pronunciation parameters for the same wake-up text. For example, for the wake-up text ABAB, the pause positions may be set as ABAB, A/BAB, AB/AB, ABA/B, A/B/AB, A/BA/B, AB/A/B, A/B/A/B, etc., where the symbol "/" is the pause mark; the re-reading positions can likewise be set to any combination of the four characters by placing the "·" mark under them.
In practical implementation, when the pronunciation parameters include a re-reading parameter, the server performs phoneme conversion on the wake-up text based on the re-reading parameter to obtain a wake-up phoneme sequence carrying re-reading marks. When the pronunciation parameters include a pause parameter, the server performs phoneme conversion on the wake-up text based on the pause parameter to obtain a wake-up phoneme sequence carrying pause marks; for example, for the wake-up text "Xiaoming Xiaoming", the wake-up phoneme sequence generated based on the pause parameter may be "x iao3 m ing2 x iao3 m ing2", where the pause marks are set by inserting spaces into the wake-up phoneme sequence. In some embodiments, when the pronunciation parameters include both a re-reading parameter and a pause parameter, the server performs phoneme conversion on the wake-up text based on both, obtaining a wake-up phoneme sequence carrying both re-reading marks and pause marks; for example, the pause parameter may set a pause between the second and third characters while the re-reading parameter stresses the first and third characters.
In the embodiments of the present application, a set of pause positions and re-reading positions constitutes one group of pronunciation parameters. In actual implementation, the server acquires multiple groups of pronunciation parameters and performs a corresponding phoneme conversion based on each group, so as to obtain the corresponding wake-up phoneme sequences.
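The following is a minimal sketch of this conversion, assuming a toy grapheme-to-phoneme lexicon and using "*" for the re-reading mark and "|" for the pause mark; the markers, the lexicon, and the function names are illustrative assumptions (the patent marks stress with "·" under the character and pauses with spacing).

```python
# A minimal sketch of step 102: wake text plus one group of pronunciation
# parameters (stress positions, pause positions) -> marked phoneme sequence.
LEXICON = {"xiao": ["x", "iao3"], "ming": ["m", "ing2"]}  # hypothetical G2P table

def to_phoneme_sequence(syllables, stress_at=(), pause_after=()):
    out = []
    for i, syl in enumerate(syllables):
        phones = list(LEXICON[syl])
        if i in stress_at:
            phones = [p + "*" for p in phones]   # re-reading (stress) mark
        out.extend(phones)
        if i in pause_after:
            out.append("|")                      # pause mark
    return " ".join(out)

# "xiao ming xiao ming" stressed on syllables 0 and 2, pause after syllable 1:
print(to_phoneme_sequence(["xiao", "ming", "xiao", "ming"],
                          stress_at={0, 2}, pause_after={1}))
# -> x* iao3* m ing2 | x* iao3* m ing2
```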
And step 103, based on the voiceprint of the user and the wake-up phoneme sequence, carrying out voice parameter prediction through the prediction layer to obtain corresponding predicted voice parameters.
In actual implementation, the speech parameters include, but are not limited to, duration, pitch, and prosody. The server predicts speech parameters such as the duration, pitch, and prosody of the wake-up voice through the prediction layer of the wake-up voice model, based on the user voiceprint and the wake-up phoneme sequence. In some embodiments, referring to fig. 5, fig. 5 is an optional structural schematic diagram of a wake-up speech model provided by an embodiment of the present application, where the prediction layer of the wake-up speech model includes a duration prediction layer, a pitch prediction layer, a prosody prediction layer, and so on, through which the model predicts the corresponding duration, pitch, and prosody, respectively.
It should be noted that, in the embodiments of the present application, there are multiple user voices. The server extracts voiceprint features from each user voice to obtain the corresponding user voiceprints, and then, for each user voiceprint, performs voice parameter prediction in combination with the wake-up phoneme sequence corresponding to each group of pronunciation parameters, obtaining a corresponding group of predicted voice parameters. For example, if there are m user voices and n groups of pronunciation parameters, m x n groups of predicted voice parameters are obtained in total, where m and n are positive integers greater than or equal to 2.
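A sketch of how the m x n combinations can be enumerated follows, assuming a placeholder prediction_layer function; the real prediction layer is a trained network conditioned on both inputs, and all values below are dummies.

```python
# A sketch of step 103: each (user voiceprint, pronunciation-parameter variant)
# pair yields one group of predicted speech parameters, m x n groups in total.
from itertools import product

def prediction_layer(voiceprint, phoneme_seq):
    # Placeholder values; the real layer predicts duration, pitch and prosody.
    return {"duration": 0.3, "pitch": 180.0, "prosody": 0.5}

user_voiceprints = [[0.1, 0.2], [0.3, 0.4]]                   # m = 2 toy voiceprints
phoneme_variants = ["x iao3 m ing2 x iao3 m ing2",            # n = 2 variants
                    "x* iao3* m ing2 | x* iao3* m ing2"]

predicted = [prediction_layer(vp, seq)
             for vp, seq in product(user_voiceprints, phoneme_variants)]
assert len(predicted) == len(user_voiceprints) * len(phoneme_variants)  # m * n
```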
And step 104, performing voice synthesis based on the predicted voice parameters to obtain the corresponding wake-up voice. The wake-up voice is used for waking up the target program when it is successfully matched with the voice to be matched.
In some embodiments, based on fig. 4, step 104 may also be implemented by: the server performs frequency spectrum conversion based on the predicted voice parameters to obtain corresponding voice frequency spectrum; and performing voice coding based on the voice frequency spectrum to obtain corresponding wake-up voice.
It should be noted that, when the predicted speech parameters output by the prediction layer of the wake-up speech model are vector representations, referring to fig. 5, in actual implementation the server first decodes the vector-represented predicted speech parameters through the decoder to obtain the decoded speech parameters, and then draws a speech spectrum based on them. Here, the speech spectrum may be a mel spectrum: the server draws the corresponding mel spectrogram according to the predicted duration, pitch, prosody, and so on of the wake-up voice. Then, the server performs speech coding on the speech spectrum through a speech coder, converting the speech spectrum into a speech signal to obtain the wake-up voice. It should be understood that the wake-up voice obtained here is speech that incorporates the pronunciation parameters for the wake-up text. After obtaining the wake-up voice, the server also performs denoising and filtering on it. It should be noted that the server generates a corresponding wake-up voice for each group of predicted speech parameters; for example, if there are m x n groups of predicted speech parameters, m x n wake-up voices are generated.
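A minimal sketch of the spectrum-conversion and speech-coding stage follows, assuming the predicted parameters have already been rendered into a mel spectrogram; Griffin-Lim inversion stands in here for the speech coder, which in practice would typically be a neural vocoder, and the file name is hypothetical.

```python
# Inverting a mel spectrogram to a waveform; `mel` is a dummy stand-in for the
# spectrogram drawn from the predicted duration/pitch/prosody parameters.
import numpy as np
import librosa
import soundfile as sf

sr = 16000
mel = np.abs(np.random.randn(80, 200)).astype(np.float32)      # dummy mel spectrogram
wake_wave = librosa.feature.inverse.mel_to_audio(mel, sr=sr)   # Griffin-Lim inversion
sf.write("wake_voice.wav", wake_wave, sr)                      # one synthesized wake voice
```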
In some embodiments, based on fig. 4, it is also possible to perform: the server acquires at least one of a speech rate parameter, a pitch parameter and a volume parameter as a speech generalization parameter; and performing voice generalization processing on the wake-up voice based on the voice generalization parameters to obtain the generalized wake-up voice.
Here, the speech rate parameter is the speech presentation rate of the wake-up phoneme sequence, which may be the number of words presented per minute, the number of syllables presented per minute, and so on. In actual implementation, the server acquires at least one of the speech rate parameter, the pitch parameter, and the volume parameter as the speech generalization parameters, and converts at least one of the speech rate, pitch, and volume of the wake-up voice to the value of the corresponding speech generalization parameter to obtain the generalized wake-up voice. Illustratively, the speech rate parameter may be 200 words/minute, the pitch parameter may be 100 Hz, the volume parameter may be 60 dB, and so on.
In some embodiments, after synthesizing the wake-up voice, the server may also randomly adjust at least one of the speech rate, pitch, and volume of the wake-up voice. For example, the server may randomly increase or decrease the volume of the wake-up voice, randomly raise or lower its pitch, or randomly speed up or slow down its speech rate, so as to perform voice generalization on the wake-up voice. In some embodiments, the server may also perform voice generalization by randomly converting the format of the wake-up voice, for example converting it from the original format to a target format and then back again so as to change its storage size, or by randomly converting its channels, for example converting a two-channel wake-up voice to mono, and so on.
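A minimal sketch of this generalization step follows, assuming librosa effects for speed and pitch and simple amplitude scaling for volume; the perturbation ranges are illustrative assumptions, not values from the patent.

```python
import random
import numpy as np
import librosa

def generalize(wave: np.ndarray, sr: int) -> np.ndarray:
    wave = librosa.effects.time_stretch(wave, rate=random.uniform(0.9, 1.1))        # speech rate
    wave = librosa.effects.pitch_shift(wave, sr=sr, n_steps=random.uniform(-1, 1))  # pitch
    return wave * random.uniform(0.8, 1.2)                                          # volume

sr = 16000
wake_wave = np.random.randn(sr).astype(np.float32)   # stand-in for a synthesized wake voice
augmented = [generalize(wake_wave, sr) for _ in range(5)]
```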
In some embodiments, referring to fig. 6, fig. 6 is an optional structural schematic diagram of a wake-up voice model provided in an embodiment of the present application, where the wake-up voice model further includes a voice synthesis layer. Referring to fig. 5, the speech synthesis layer of the embodiment of the present application may include the decoder and the speech encoder shown in fig. 5. Based on fig. 4, step 104 may also be implemented as follows: and the server performs voice synthesis through the voice synthesis layer based on the predicted voice parameters to obtain corresponding wake-up voice.
In some embodiments, referring to fig. 7, fig. 7 is a schematic flow chart of an alternative wake-up speech synthesis method based on a wake-up speech model according to an embodiment of the present application, based on fig. 4, may further be executed:
In step 201, the server receives a voice matching request for a voice to be matched.
In actual implementation, the voice matching request may be sent to the server by the terminal. When the microphone of the terminal collects a voice to be matched, the terminal generates a voice matching request for that voice and sends the request to the server. It should be understood that the voice to be matched is any voice signal collected by the terminal; it may be speech uttered by a user, environmental sound, and the like.
Step 202, responding to the voice matching request, comparing the waveform characteristics of the voice to be matched with the waveform characteristics of the wake-up voice to determine the similarity of the voice to be matched and the wake-up voice.
In actual implementation, after receiving the voice matching request, the server responds to the voice matching request to analyze the voice matching request, so as to obtain the voice to be matched carried by the voice matching request. Then, the server extracts waveform characteristics of the voice to be matched and the wake-up voice respectively, and obtains the waveform characteristics of the voice to be matched and the waveform characteristics of the wake-up voice. Here, waveform characteristics include, but are not limited to, the frequency of the waveform, the spacing between peaks, the spacing between valleys, and the spacing between peaks and valleys, etc. In the embodiment of the application, the server directly compares the waveform characteristics of the voice to be matched with the waveform characteristics of the wake-up voice, and the similarity between the waveform characteristics is used as the similarity between the voice to be matched and the wake-up voice.
In some embodiments, the server may also perform spectral conversion on the voice to be matched to obtain its spectrogram, perform spectral conversion on the wake-up voice to obtain its spectrogram, and then compare the two spectrograms as images to obtain their similarity. Before comparing the two spectrograms, the server also scales the images in size, length, and width so that the voice waveforms of the two spectrograms fall within a similar size range, and then compares the similarity of the voice waveforms in the spectrograms on this basis.
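A minimal sketch of this comparison follows, reducing two voices to mel spectrograms, padding them to a common size, and taking a cosine similarity; the feature choice and the threshold value are illustrative assumptions, since the patent only requires some waveform or spectrum comparison that yields a similarity.

```python
import numpy as np
import librosa

def mel(wave, sr=16000, n_mels=80):
    return librosa.feature.melspectrogram(y=wave, sr=sr, n_mels=n_mels)

def similarity(wave_a, wave_b, sr=16000):
    a, b = mel(wave_a, sr), mel(wave_b, sr)
    frames = max(a.shape[1], b.shape[1])                 # pad to a common size
    a = np.pad(a, ((0, 0), (0, frames - a.shape[1])))
    b = np.pad(b, ((0, 0), (0, frames - b.shape[1])))
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

SIM_THRESHOLD = 0.85   # assumed value; the patent leaves the threshold unspecified
```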
And 203, transmitting the determined similarity to wake up the target program when the similarity reaches a similarity threshold.
In practical implementation, the server sends the similarity between the voice to be matched and the wake-up voice to the terminal, and when the similarity reaches a similarity threshold, the terminal wakes up the target program matched with the wake-up voice. It should be noted that, in the embodiment of the present application, a wake-up text is set for at least one application, if the wake-up speech is a speech generated based on the wake-up text set by the application a, when the similarity between the speech to be matched and the wake-up speech reaches a similarity threshold, the terminal wakes up the application a bound to the wake-up text.
In some embodiments, based on fig. 4, it is also possible to perform: the server acquires a negative case phoneme sequence which is not matched with the awakening text; based on the negative example phoneme sequence, performing voice synthesis to obtain negative example voice corresponding to the awakening text; responding to a voice matching request aiming at voice to be matched, and respectively matching the voice to be matched with the wake-up voice and the negative example voice to obtain a matching result; and sending the matching result so as not to wake up the target program when the matching result characterizes that the voice to be matched is successfully matched with the negative example voice.
In actual implementation, the server may generate the corresponding negative example phoneme sequences through the phoneme conversion layer. Specifically, the server may determine corresponding negative texts for the wake-up text, and then perform phoneme conversion on each negative text through the phoneme conversion layer to obtain the corresponding negative example phoneme sequence. The server may generate negative texts by randomly removing at least one character of the wake-up text; for example, if the wake-up text is ABAB, the negative texts may be A, B, AA, BB, AB, ABA, AAB, ABB. The server may also generate negative texts by randomly replacing at least one character of the wake-up text; for example, for ABAB, the negative texts may be CBAB, ADAB, ABEB, ABAF, ABGH, AIAJ, etc. The server may also randomly take text unrelated to the wake-up text as negative text, such as XYZ, CDGZU, TTVV. For each negative text, the server may randomly add pause positions and re-reading positions to generate the corresponding negative example phoneme sequences. In addition, the server may randomly replace phonemes in a wake-up phoneme sequence generated from the wake-up text to obtain negative example phoneme sequences. For example, for the wake-up text "Xiaoming Xiaoming", one wake-up phoneme sequence is "x iao3 m ing2 x iao3 m ing2"; its first phoneme "x" may be replaced with "j" to obtain the negative example phoneme sequence "j iao3 m ing2 x iao3 m ing2", or its second phoneme "iao3" may be replaced with "iong1" to obtain the negative example phoneme sequence "x iong1 m ing2 x iao3 m ing2".
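A minimal sketch of the negative-example generation strategies just described (character deletion, character substitution, and phoneme substitution) follows; the replacement pool and substitution table are illustrative assumptions.

```python
import random

def drop_one(text: str) -> str:
    """Randomly remove one character of the wake text, e.g. ABAB -> ABB."""
    i = random.randrange(len(text))
    return text[:i] + text[i + 1:]

def swap_one(text: str, pool: str = "CDEFGH") -> str:
    """Randomly replace one character, e.g. ABAB -> CBAB."""
    i = random.randrange(len(text))
    return text[:i] + random.choice(pool) + text[i + 1:]

def swap_phoneme(seq: str, table: dict = {"x": "j", "iao3": "iong1"}) -> str:
    """Randomly replace one phoneme of a wake-up phoneme sequence."""
    phones = seq.split()
    idxs = [i for i, p in enumerate(phones) if p in table]
    if idxs:
        i = random.choice(idxs)
        phones[i] = table[phones[i]]
    return " ".join(phones)

negatives = [drop_one("ABAB"), swap_one("ABAB"),
             swap_phoneme("x iao3 m ing2 x iao3 m ing2")]
```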
Then, the server performs voice synthesis based on each negative example phoneme sequence to obtain the corresponding negative example voice. It should be appreciated that the speech content of a negative example voice does not match the content of the wake-up text. At this point, the server has obtained both wake-up voices and negative example voices for the wake-up text. Here, the ratio of wake-up voices to negative example voices may be preset; for example, a ratio may be selected from the range 1:2 to 1:10 when generating the wake-up voices and negative example voices.
Then, in response to a voice matching request for a voice to be matched, the server matches the voice to be matched against all wake-up voices and negative example voices in turn to obtain the corresponding matching result. Here, the matching result may include the similarities with all voices, that is, with all wake-up voices and all negative example voices. The matching result may also be the single voice most similar to the voice to be matched, in which case the server sends that voice and its similarity to the terminal. In some embodiments, the server further compares the similarity with a similarity threshold and generates a matching result indicating whether the voice to be matched matches the wake-up text: if the voice with the highest similarity that reaches the threshold is a wake-up voice, the server generates a result indicating a successful match with the wake-up text; if it is a negative example voice, or if no voice reaches the threshold, the server generates a result indicating an unsuccessful match. After receiving the matching result, the terminal wakes up the corresponding target program only when the match is successful.
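Putting the pieces together, a sketch of the matching decision follows, reusing the similarity function sketched earlier; the decision rule (the highest-scoring voice must be a wake-up voice and reach the threshold) is one reasonable reading of the scheme above, not the patent's literal specification.

```python
def match(query_wave, wake_waves, negative_waves, threshold=0.85):
    """Return True only if the best match is a wake-up voice above threshold."""
    scored = [(similarity(query_wave, w), "wake") for w in wake_waves]
    scored += [(similarity(query_wave, w), "negative") for w in negative_waves]
    best_score, best_kind = max(scored)                      # highest-similarity voice
    return best_kind == "wake" and best_score >= threshold   # True -> wake the target
```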
In some embodiments, the server, after obtaining the negative example voice, also performs data enhancement on the negative example voice. In the embodiment of the present application, the data enhancement mode for the negative example voice is the same as the data enhancement mode for the wake-up voice, and will not be described here again.
In the embodiments of the present application, setting negative example voices avoids the mismatches that occur when only wake-up voices are used for matching: matching against negative example voices as well reduces the probability of spuriously matching a wake-up voice, thereby reducing the false wake-up rate of the target program.
In some embodiments, the server may also obtain ambient speech, such as nonsensical background noise. When matching is carried out on the voice to be matched, the server matches the voice to be matched based on the wake-up voice, the negative example voice and the environment voice, when the voice to be matched is matched with the wake-up voice, a matching result which is successfully matched with the wake-up voice is generated, and when the voice to be matched is matched with the negative example voice, the environment voice or is not successfully matched with any voice, a matching result which is not successfully matched with the wake-up voice is generated.
In some embodiments, based on fig. 4, the following may also be performed before step 101: the server performs voiceprint feature extraction on sample voice through the voiceprint extraction layer to obtain a corresponding sample voiceprint, where the sample voice carries a voice parameter label; performs phoneme conversion on the sample text corresponding to the sample voice through the phoneme conversion layer to obtain a corresponding sample phoneme sequence; performs voice parameter prediction through the prediction layer based on the sample voiceprint and the sample phoneme sequence to obtain corresponding predicted voice parameters; and updates the model parameters of the wake-up voice model based on the difference between the predicted voice parameters and the voice parameter label.
In the embodiment of the application, the voice parameter label carried by the sample voice is extracted from the sample voice itself, and the text content of the sample text is the text content corresponding to the sample voice. In actual implementation, the server updates the model parameters of the wake-up voice model based on the difference between the voice parameter label of each sample voice and the voice parameters predicted by the wake-up voice model; specifically, the server updates the parameters of the voiceprint extraction layer, the phoneme conversion layer, and the prediction layer. In some embodiments, if the voiceprint extraction layer of the wake-up voice model is trained from a voiceprint recognition model, and the phoneme conversion layer is a standalone phoneme conversion algorithm or is trained from a standalone phoneme conversion model, the server only updates the parameters of the prediction layer of the wake-up voice model.
In some embodiments, the server may implement training of the wake-up voice model in the following way: the server determines the difference between the voice parameter label and the voice parameters predicted by the wake-up voice model by computing the value of a loss function; when the value of the loss function reaches a first threshold, the server determines a corresponding error signal based on the loss function, back-propagates the error signal through the wake-up voice model, and updates the model parameters of each layer of the model during propagation.
To describe back propagation: a training sample is input to the input layer of a neural network model, passes through the hidden layers, and finally reaches the output layer, which outputs the result; this is the forward propagation of the neural network model. Because there is an error between the model's output and the actual result, the error between the output and the actual value is computed and propagated backwards from the output layer through the hidden layers toward the input layer; during this back propagation, the values of the model parameters are adjusted according to the error, and the process is iterated until convergence. Taking the first loss function as an example, the server determines a first error signal based on the first loss function; the first error signal is back-propagated layer by layer from the output layer of the neural network model; when the first error signal reaches a layer, the gradient (that is, the partial derivative of the loss function with respect to that layer's parameters) is computed from the propagated error signal, and that layer's parameters are updated with the corresponding gradient values.
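Purely as an illustration of this training loop, here is a minimal PyTorch-style sketch; the architecture, dimensions, and the use of mean squared error are assumptions, since the embodiment does not fix them:

```python
import torch
import torch.nn as nn

# Hypothetical wake-up voice model: maps a (voiceprint, phoneme embedding)
# pair to predicted voice parameters such as duration, pitch and rhythm.
class WakeSpeechModel(nn.Module):
    def __init__(self, voiceprint_dim=256, phoneme_dim=64, param_dim=3):
        super().__init__()
        self.prediction_layer = nn.Sequential(
            nn.Linear(voiceprint_dim + phoneme_dim, 128),
            nn.ReLU(),
            nn.Linear(128, param_dim),
        )

    def forward(self, voiceprint, phoneme_embedding):
        features = torch.cat([voiceprint, phoneme_embedding], dim=-1)
        return self.prediction_layer(features)

model = WakeSpeechModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # difference between predicted parameters and labels

def train_step(voiceprint, phoneme_embedding, param_label):
    optimizer.zero_grad()
    predicted = model(voiceprint, phoneme_embedding)
    loss = loss_fn(predicted, param_label)  # value of the loss function
    loss.backward()                         # back-propagate the error signal
    optimizer.step()                        # update each layer's parameters
    return loss.item()
```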
In some embodiments, when the wake-up voice model further includes a voice synthesis layer, the following may also be performed before step 101, based on fig. 4: the server performs voiceprint feature extraction on sample voice through the voiceprint extraction layer to obtain a corresponding sample voiceprint; performs phoneme conversion on the sample text corresponding to the sample voice through the phoneme conversion layer to obtain a corresponding sample phoneme sequence; performs voice parameter prediction through the prediction layer based on the sample voiceprint and the sample phoneme sequence to obtain corresponding predicted voice parameters; performs voice synthesis through the voice synthesis layer based on the predicted voice parameters to obtain corresponding predicted voice; and updates the parameters of the wake-up voice model based on the difference between the predicted voice and the sample voice.
In the embodiment of the application, the server compares the sample voice with the predicted voice generated by the wake-up voice model and updates the parameters of the wake-up voice model directly from their difference, so that the wake-up voice model can be trained more simply and efficiently.
In some embodiments, referring to fig. 8, which is a schematic flowchart of an optional wake-up voice synthesis method based on the wake-up voice model provided by an embodiment of the present application, the following may further be executed based on fig. 4:
In step 301, the server obtains wake-up voices carrying a first classification label and negative example voices carrying a second classification label. The first classification label indicates that a wake-up voice matches the wake-up text, and the second classification label indicates that a negative example voice does not match the wake-up text.
In step 302, a training sample set is constructed based on the wake-up voices carrying the first classification label and the negative example voices carrying the second classification label, and a voice classification model is trained on the training sample set. The voice classification model is used for classifying an input voice to be matched and outputting a classification result indicating whether that voice matches the wake-up text.
Referring to fig. 9, fig. 9 is an optional structural schematic diagram of a voice classification model provided by an embodiment of the present application. In practical implementation, after obtaining the wake-up voices and negative example voices, the server performs spectrum conversion on each of them to obtain corresponding wake-up voice spectra and negative example voice spectra, adds the first classification label to the wake-up voice spectra, and adds the second classification label to the negative example voice spectra. A training sample set is then constructed from the labeled spectra, and the voice classification model is trained on it. The voice classification model is a binary classification model: a convolution layer extracts the spectral features of the wake-up voice spectra and negative example voice spectra; after pooling by a pooling layer, a fully connected layer classifies them; and a softmax function at the output layer maps the output of the fully connected layer to the final classification result, yielding the classification results for the wake-up voices and the negative example voices. The parameters of each layer of the voice classification model are then updated based on the differences between these classification results and the first and second classification labels, so as to train the model.
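For illustration, a minimal sketch of such a spectrogram classifier follows; the layer sizes and names are hypothetical rather than taken from fig. 9:

```python
import torch
import torch.nn as nn

# Hypothetical binary classifier over voice spectrograms:
# convolution -> pooling -> fully connected -> softmax.
class SpeechClassifier(nn.Module):
    def __init__(self, num_classes=2):  # wake-up voice vs negative example voice
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # extract spectral features
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pooling layer
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.fc = nn.Linear(32 * 4 * 4, num_classes)     # fully connected layer

    def forward(self, spectrogram):  # shape: (batch, 1, freq_bins, time_frames)
        features = self.conv(spectrogram)
        logits = self.fc(features.flatten(1))
        return torch.softmax(logits, dim=-1)  # map to the final classification result
```

Setting num_classes to 3 would cover the three-class variant described below, which additionally recognizes environmental voice.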
In practical implementation, after training of the voice classification model is completed, the server, in response to a voice matching request for a voice to be matched, inputs the voice to be matched into the voice classification model, classifies it to obtain a corresponding classification result, and sends the result to the terminal. When the classification result indicates that the voice to be matched matches the wake-up text, the terminal wakes up the target program; otherwise, the terminal does not wake up the target program.
In some embodiments, the server also performs robustness enhancement on the wake-up voices and trains the wake-up voice model with the enhanced wake-up voices. Here, robustness enhancement may be implemented by adding noise to the wake-up voices, or by performing far-field synthesis on them to give them a far-field effect; adding noise or performing far-field synthesis increases the robustness of the wake-up voice model trained on these voices.
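A minimal sketch of what these two enhancements might look like, assuming 1-D float waveforms; the far-field approximation here is a crude synthetic impulse response, whereas a real implementation might use a room acoustics simulator:

```python
import numpy as np

def add_noise(speech, snr_db=20.0):
    """Mix Gaussian noise into the waveform at a given signal-to-noise ratio."""
    signal_power = np.mean(speech ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=speech.shape)
    return speech + noise

def far_field(speech, sr=16000, rt60=0.3):
    """Crudely approximate a far-field effect by convolving the waveform
    with an exponentially decaying synthetic impulse response."""
    ir_len = int(rt60 * sr)
    ir = np.random.randn(ir_len) * np.exp(-6.9 * np.arange(ir_len) / ir_len)
    ir[0] = 1.0  # keep the direct path dominant
    out = np.convolve(speech, ir)[: len(speech)]
    return out / (np.max(np.abs(out)) + 1e-8)  # normalize to avoid clipping
```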
In some embodiments, the server further obtains environmental voices carrying a third classification label, constructs a training sample set from the wake-up voices carrying the first classification label, the negative example voices carrying the second classification label, and the environmental voices carrying the third classification label, and trains the voice classification model on it. Here, the voice classification model is a three-class model whose three predicted results are: matching a wake-up voice, matching a negative example voice, and matching an environmental voice. After classifying the voice to be matched, the server sends the classification result to the terminal; the terminal wakes up the target program when the result indicates a match with a wake-up voice, and otherwise does not. In the embodiment of the application, introducing environmental voices into the matching further reduces the false wake-up rate of the target program.
In some embodiments, the server further obtains a small number of manually recorded wake-up voices and negative example voices, and constructs a training sample set from these together with the wake-up voices and negative example voices synthesized by the embodiments of the application, so as to train the voice classification model. Training with a small amount of manually recorded voice further improves the anthropomorphic effect of the voices generated by the wake-up voice model while requiring only a small amount of manual labor.
The application wake-up method provided by the embodiment of the present application is further described below. In practical applications, the method may be implemented by a terminal, by a server, or by the terminal and the server cooperatively; it will be described below in connection with an exemplary server-side implementation. Referring to fig. 10, fig. 10 is a schematic flowchart of an optional application wake-up method provided by an embodiment of the present application, and the description follows the steps shown in fig. 10.
In step 401, the server receives an application wake-up instruction carrying a voice to be matched, where the application wake-up instruction is used to instruct waking up a target program.
In actual implementation, the terminal may collect the voice to be matched through a built-in microphone; once the voice to be matched is collected, the terminal generates an application wake-up instruction carrying it and sends the instruction to the server. In some embodiments, the terminal may also collect the voice to be matched through an external microphone in communication with the terminal.
In step 402, in response to the application wake-up instruction, the voice to be matched is matched with the wake-up voice. The wake-up voice is obtained by voice synthesis based on predicted voice parameters, and the predicted voice parameters are obtained by voice parameter prediction through the wake-up voice model based on the voices of different users and the wake-up text.
After receiving the application wake-up instruction, the server responds to it by matching the voice to be matched with the wake-up voices stored locally on the server. Here, the server may match the voice to be matched with the wake-up voices directly in a hard matching manner, or may predict the degree of matching between them using the voice classification model provided by the embodiment of the present application.
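As one illustration of what hard matching could look like (the embodiment does not specify the features or distance used, so the MFCC-plus-DTW choice below, via librosa, is purely an assumption):

```python
import librosa
import numpy as np

def hard_match_score(wav_a, wav_b, sr=16000):
    """Return a similarity in (0, 1]: align the MFCC sequences of two
    waveforms with dynamic time warping and invert the alignment cost."""
    mfcc_a = librosa.feature.mfcc(y=wav_a, sr=sr, n_mfcc=13)
    mfcc_b = librosa.feature.mfcc(y=wav_b, sr=sr, n_mfcc=13)
    cost, path = librosa.sequence.dtw(X=mfcc_a, Y=mfcc_b, metric="euclidean")
    avg_cost = cost[-1, -1] / len(path)  # normalized alignment cost
    return 1.0 / (1.0 + avg_cost)        # higher means more similar
```

A score function like this could serve as the `similarity` placeholder in the matching sketch given earlier.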
In some embodiments, the server also stores negative example voices and environmental voices locally, and matches the voice to be matched against all locally stored voices in turn to obtain the matching result between the voice to be matched and each voice.
In step 403, the target program is woken up when the voice to be matched is successfully matched with the wake-up voice.
When the voice to be matched is successfully matched with a wake-up voice, the target program is woken up; if the matching with the wake-up voices fails, or the voice to be matched matches a negative example voice or an environmental voice, the target program is not woken up.
In the embodiment of the application, the voice to be matched is matched against wake-up voice obtained by voice synthesis from predicted voice parameters, and the target program is woken up when the matching succeeds. Waking up the target program requires only voice matching rather than voice recognition, so the computational load on the device is low; and because the wake-up voice is synthesized from voice parameters predicted by the wake-up voice model from user voices and the wake-up text, no manual recording is needed, saving time and labor cost.
Next, the wake-up voice synthesis method based on the wake-up voice model provided by the embodiment of the application is further described, here as cooperatively implemented by the terminal and the server. Referring to fig. 11, fig. 11 is an optional flowchart of the wake-up voice synthesis method based on the wake-up voice model provided by an embodiment of the present application; the method includes:
In step 501, the terminal presents a wake-up text setting interface, and presents on it a setting function item for setting the wake-up text.
In actual implementation, a user can trigger a start operation for the wake-up text setting interface through the terminal's human-computer interaction interface, and the terminal presents the wake-up text setting interface in response. Illustratively, referring to fig. 12, fig. 12 is an optional schematic diagram of a wake-up text setting interface provided by an embodiment of the present application. The interface provides a wake-up text setting function item for at least one application, for example, the five applications, application 1 through application 5, shown in the figure.
In step 502, the terminal obtains the wake-up text in response to a setting operation triggered on the setting function item.
It should be noted that the wake-up text setting interface provides a setting function item for the wake-up text. In response to a setting operation triggered on the setting function item, the terminal presents a wake-up text input interface containing a wake-up text input function item, and in response to a trigger operation on that input function item, acquires the input wake-up text. Illustratively, referring to fig. 13, fig. 13 is an optional schematic diagram of a wake-up text input interface provided by an embodiment of the present application. The user triggers the setting operation for "application 1" by clicking the corresponding setting function item in the wake-up text setting interface; in response, the terminal presents the wake-up text input interface for "application 1" with the wake-up text input function item 130. The user can input the corresponding wake-up text through this function item, such as "hello, xiaoqi" as shown in fig. 13, and the terminal acquires the input wake-up text in response to the input operation.
In some embodiments, the terminal may also present selection function items for a plurality of wake-up texts within the wake-up text input interface, and obtain the selected wake-up text in response to a trigger operation on a selection function item; here, each selection function item corresponds to one wake-up text. In actual implementation, wake-up texts may be set for different applications, and one or more wake-up texts may be set for the same application. For example, referring to fig. 13, the terminal also presents an add function item 131 for wake-up texts in the wake-up text input interface; by clicking it, the user triggers an add operation, in response to which the terminal adds a new wake-up text input function item 132 to the interface.
The applications here include all programs in the terminal, such as the terminal's screen wake-up program; that is, by matching a wake-up voice, the terminal can be woken from a standby black-screen state to a lit-screen state. For example, referring to fig. 14, fig. 14 is an optional schematic diagram of a wake-up interface of a target program provided by an embodiment of the present application; where the target program is an intelligent robot program, when the voice to be matched matches the wake-up text, the terminal presents the human-computer interaction interface of the intelligent robot as the wake-up interface of the target program.
In step 503, the terminal sends a wake-up text to the server.
In step 504, the server obtains the voices of different users and performs voice synthesis through the wake-up voice model based on those voices and the wake-up text, obtaining corresponding wake-up voices and negative example voices.
In practical implementation, after receiving the wake-up text sent by the terminal, the server triggers a voice synthesis instruction for the wake-up voice and, in response to it, crawls a user voice library from web pages to obtain a large number of user voices. The server then synthesizes the wake-up voices and negative example voices based on the user voices and the wake-up text. For the synthesis process, refer to the wake-up voice synthesis method based on the wake-up voice model provided in the foregoing embodiments; it is not repeated here.
In step 505, the server constructs a training sample set based on the wake-up speech and the negative example speech, and trains a speech classification model based on the training sample set.
In some embodiments, the server also obtains environmental voices and builds a training sample set from the environmental voices, wake-up voices, and negative example voices to train the voice classification model. Here, for the wake-up voice synthesis process based on the wake-up voice model, refer to the method provided in the foregoing embodiments of the present application; it is not repeated here.
In step 506, the server sends the speech classification model to the terminal.
In step 507, the terminal collects the voice to be matched, and performs voice classification on the voice to be matched through a voice classification model so as to match the voice to be matched with the wake-up voice.
In practical implementation, the terminal collects voice in real time through its built-in microphone; after collecting a voice to be matched, it classifies the voice through the voice classification model so as to match it with the wake-up voice.
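As an illustration of this terminal-side loop only, a minimal sketch using the sounddevice package follows; the embodiment names no capture API, and `classifier` and `wake_target_program` are hypothetical placeholders:

```python
import numpy as np
import sounddevice as sd

SR = 16000
WINDOW_SECONDS = 2.0

def capture_window():
    """Record one fixed-length window from the default microphone."""
    frames = int(SR * WINDOW_SECONDS)
    audio = sd.rec(frames, samplerate=SR, channels=1, dtype="float32")
    sd.wait()  # block until recording finishes
    return audio[:, 0]

def listen(classifier, wake_target_program):
    """Classify each captured window and wake the target on a positive result."""
    while True:
        window = capture_window()
        if classifier(window) == "wake":
            wake_target_program()
            break
```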
In step 508, when the voice to be matched is successfully matched with the wake-up voice, the target program is woken up.
In actual implementation, when the classification result of the voice classification model indicates that the voice to be matched matches the target wake-up text, the matching with the wake-up voice is considered successful, and the terminal wakes up the target program corresponding to that wake-up text. For example, if the target program is the terminal's screen wake-up program, the terminal wakes the screen from the black state to the lit state; if the target program is an intelligent robot application, the terminal wakes up that application and switches from the current interface to the application's human-computer interaction interface. The embodiment of the application does not limit the type of the target program or the form in which it is woken up.
As in the foregoing embodiment, waking up the target program requires only voice matching rather than voice recognition, keeping the device's computational load low, and the wake-up voice is synthesized from voice parameters predicted by the wake-up voice model without manual recording, saving time and labor cost.
The following continues with an exemplary structure of the wake-up voice synthesis device 555 provided by the embodiment of the present application, implemented as software modules. In some embodiments, referring to fig. 15, a schematic structural diagram of an optional wake-up voice synthesis device provided by an embodiment of the present application, with the wake-up voice model including a voiceprint extraction layer, a phoneme conversion layer, and a prediction layer, the software modules of the wake-up voice synthesis device 555 based on the wake-up voice model stored in the memory 540 may include:
the voiceprint extraction module 5551 is configured to perform voiceprint feature extraction on voices of different users through the voiceprint extraction layer, so as to obtain corresponding user voiceprints;
The phoneme conversion module 5552 is configured to perform phoneme conversion on the wake-up text through the phoneme conversion layer to obtain a corresponding wake-up phoneme sequence;
A prediction module 5553, configured to predict a speech parameter through the prediction layer based on the user voiceprint and the wake-up phoneme sequence, so as to obtain a corresponding predicted speech parameter;
The speech synthesis module 5554 is configured to perform speech synthesis based on the predicted speech parameters, so as to obtain a corresponding wake-up speech;
the wake-up voice is used for waking up the target program when it is successfully matched with the voice to be matched.
In some embodiments, the speech synthesis module 5554 is further configured to perform spectrum conversion based on the predicted voice parameters to obtain a corresponding voice spectrum, and to perform voice coding based on the voice spectrum to obtain the corresponding wake-up voice.
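For illustration, a minimal sketch of this spectrum-to-waveform step follows, assuming the voice spectrum is a magnitude spectrogram and using Griffin-Lim phase reconstruction from librosa; an actual implementation might instead use a neural vocoder:

```python
import librosa
import numpy as np

def decode_speech(magnitude_spectrum):
    """Voice coding step: recover a waveform from a magnitude spectrogram
    via Griffin-Lim iterative phase estimation."""
    return librosa.griffinlim(magnitude_spectrum, n_iter=32)

# Round-trip example: take the spectrum of an existing waveform, then decode it.
y, sr = librosa.load(librosa.ex("trumpet"))  # bundled example, fetched on first use
S = np.abs(librosa.stft(y))                  # the "voice spectrum" (magnitude only)
reconstructed = decode_speech(S)
```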
In some embodiments, the wake-up voice synthesis apparatus based on the wake-up voice model further includes: a voice matching module, configured to acquire a negative example phoneme sequence that does not match the wake-up text; perform voice synthesis based on the negative example phoneme sequence to obtain negative example voice corresponding to the wake-up text; in response to a voice matching request for a voice to be matched, match the voice to be matched against the wake-up voice and the negative example voice respectively to obtain a matching result; and send the matching result, so that the target program is not woken up when the matching result indicates that the voice to be matched matches the negative example voice.
In some embodiments, the phoneme conversion module is further configured to acquire at least one of a stress parameter and a pause parameter as the pronunciation parameter, and to perform phoneme conversion on the wake-up text through the phoneme conversion layer based on the pronunciation parameter, obtaining the corresponding wake-up phoneme sequence.
In some embodiments, the wake-up voice synthesis apparatus based on the wake-up voice model further includes: a generalization module, configured to acquire at least one of a speech rate parameter, a pitch parameter, and a volume parameter as a voice generalization parameter, and to perform voice generalization processing on the wake-up voice based on the voice generalization parameter to obtain generalized wake-up voice.
In some embodiments, the wake speech synthesis apparatus based on the wake speech model further comprises: the voice matching module is used for receiving a voice matching request aiming at the voice to be matched; responding to the voice matching request, and comparing the waveform characteristics of the voice to be matched with the waveform characteristics of the wake-up voice to determine the similarity of the voice to be matched and the wake-up voice; and sending the determined similarity to wake up the target program when the similarity reaches a similarity threshold.
In some embodiments, the wake speech synthesis apparatus based on the wake speech model further comprises: the voice classification model training module is used for acquiring wake-up voice carrying a first classification label and negative-example voice carrying a second classification label; the first classification label indicates that the wake-up voice is matched with the wake-up text, and the second classification label indicates that the negative example voice is not matched with the wake-up text; constructing a training sample set based on wake-up voice carrying a first classification label and negative-example voice carrying a second classification label, and training a voice classification model based on the training sample set; the voice classification model is used for classifying input voices to be matched and outputting classification results of whether the input voices to be matched are matched with the wake-up text or not.
In some embodiments, the wake-up speech model further includes a speech synthesis layer, and the speech synthesis module 5554 is further configured to perform speech synthesis through the speech synthesis layer based on the predicted speech parameters, so as to obtain a corresponding wake-up speech.
In some embodiments, the wake-up voice model training module is configured to perform voiceprint feature extraction on sample voice through the voiceprint extraction layer to obtain a corresponding sample voiceprint, where the sample voice carries a voice parameter label; perform phoneme conversion on the sample text corresponding to the sample voice through the phoneme conversion layer to obtain a corresponding sample phoneme sequence; perform voice parameter prediction through the prediction layer based on the sample voiceprint and the sample phoneme sequence to obtain corresponding predicted voice parameters; and update the model parameters of the wake-up voice model based on the difference between the predicted voice parameters and the voice parameter label.
Continuing to describe the exemplary structure of the application wake-up device provided by the embodiment of the present application implemented as a software module, referring to fig. 16, fig. 16 is a schematic diagram of an alternative structure of the application wake-up device provided by the embodiment of the present application, where the application wake-up device 16 provided by the embodiment of the present application includes:
A receiving module 161, configured to receive an application wake-up instruction carrying a voice to be matched, where the application wake-up instruction is used to instruct to wake up a target program;
The matching module 162 is configured to match the voice to be matched with the wake-up voice in response to the application wake-up instruction;
The wake-up voice is obtained by voice synthesis based on predicted voice parameters, and the predicted voice parameters are obtained by voice parameter prediction based on voices of different users and wake-up texts through a wake-up voice model;
A wake-up module 163, configured to wake up the target program when the voice to be matched is successfully matched with the wake-up voice.
It should be noted that, the description of the apparatus according to the embodiment of the present application is similar to the description of the embodiment of the method described above, and has similar beneficial effects as the embodiment of the method, so that a detailed description is omitted.
An embodiment of the application provides a computer program product comprising a computer program which, when executed by a processor, implements the wake-up voice synthesis method based on the wake-up voice model or the application wake-up method provided by the embodiments of the application.
Embodiments of the present application provide a computer-readable storage medium storing executable instructions that, when executed by a processor, cause the processor to perform a method provided by the embodiments of the present application, for example, the wake-up voice synthesis method based on the wake-up voice model or the application wake-up method shown in fig. 3.
In some embodiments, the computer readable storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; but may be a variety of devices including one or any combination of the above memories.
In some embodiments, the executable instructions may be in the form of programs, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, executable instructions may, but need not, correspond to files in a file system, and may be stored as part of a file that holds other programs or data, such as in one or more scripts stored in a hypertext markup language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code).
As an example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices located at one site or distributed across multiple sites and interconnected by a communication network.
In summary, through the embodiment of the application, the wake-up voice with the anthropomorphic effect can be efficiently generated, and the time and labor cost are saved.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (14)

1. A wake-up voice synthesis method based on a wake-up voice model, wherein the wake-up voice model comprises a voiceprint extraction layer, a phoneme conversion layer, and a prediction layer, the method comprising the following steps:
extracting voiceprint characteristics of voices of different users through the voiceprint extraction layer to obtain corresponding user voiceprints;
performing phoneme conversion on the wake-up text through the phoneme conversion layer based on pronunciation parameters to obtain a corresponding wake-up phoneme sequence, wherein the pronunciation parameters comprise at least one of a stress parameter and a pause parameter;
based on the user voiceprint and the wake-up phoneme sequence, performing voice parameter prediction through the prediction layer to obtain corresponding predicted voice parameters; wherein the user voices are multiple in number, voiceprint feature extraction is performed on each user voice to obtain a plurality of corresponding user voiceprints, and for each user voiceprint, voice parameter prediction is performed in combination with the wake-up phoneme sequence corresponding to each group of pronunciation parameters to obtain a corresponding group of predicted voice parameters; the predicted voice parameters comprise duration, pitch, and rhythm;
performing voice synthesis based on the predicted voice parameters to obtain corresponding wake-up voice, wherein the wake-up voice is a voice for the wake-up text that incorporates the pronunciation parameters;
wherein the wake-up voice is used for waking up a target program when it is successfully matched with a voice to be matched.
2. The method of claim 1, wherein performing speech synthesis based on the predicted speech parameters to obtain corresponding wake-up speech comprises:
Performing frequency spectrum conversion based on the predicted voice parameters to obtain corresponding voice frequency spectrum;
and performing voice coding based on the voice frequency spectrum to obtain corresponding wake-up voice.
3. The method according to claim 1, wherein the method further comprises: acquiring a negative example phoneme sequence which does not match the wake-up text;
performing voice synthesis based on the negative example phoneme sequence to obtain negative example voice corresponding to the wake-up text;
responding to a voice matching request aiming at voice to be matched, and respectively matching the voice to be matched with the wake-up voice and the negative example voice to obtain a matching result;
and sending the matching result, so that the target program is not woken up when the matching result indicates that the voice to be matched is successfully matched with the negative example voice.
4. The method according to claim 1, wherein the method further comprises:
Acquiring at least one of a speech rate parameter, a pitch parameter and a volume parameter as a speech generalization parameter;
And performing voice generalization processing on the wake-up voice based on the voice generalization parameters to obtain the generalized wake-up voice.
5. The method according to claim 1, wherein the method further comprises:
Receiving a voice matching request aiming at voices to be matched;
Responding to the voice matching request, and comparing the waveform characteristics of the voice to be matched with the waveform characteristics of the wake-up voice to determine the similarity of the voice to be matched and the wake-up voice;
and sending the determined similarity to wake up the target program when the similarity reaches a similarity threshold.
6. The method according to claim 1, wherein the method further comprises:
Obtaining wake-up voice carrying a first classification label and negative example voice carrying a second classification label;
the first classification label indicates that the wake-up voice is matched with the wake-up text, and the second classification label indicates that the negative example voice is not matched with the wake-up text;
Constructing a training sample set based on wake-up voice carrying a first classification label and negative-example voice carrying a second classification label, and training a voice classification model based on the training sample set;
the voice classification model is used for classifying input voices to be matched and outputting classification results of whether the input voices to be matched are matched with the wake-up text or not.
7. The method of claim 1, wherein the wake-up speech model further comprises a speech synthesis layer, wherein the performing speech synthesis based on the predicted speech parameters to obtain corresponding wake-up speech comprises:
and based on the predicted voice parameters, performing voice synthesis through the voice synthesis layer to obtain corresponding wake-up voice.
8. The method of claim 1, wherein prior to voiceprint feature extraction of user speech by the voiceprint extraction layer, the method further comprises:
performing voiceprint feature extraction on sample voice through the voiceprint extraction layer to obtain a corresponding sample voiceprint; wherein the sample voice carries a voice parameter label;
performing phoneme conversion on the sample text corresponding to the sample speech through the phoneme conversion layer to obtain a corresponding sample phoneme sequence;
Based on the sample voiceprint and the sample phoneme sequence, carrying out voice parameter prediction through the prediction layer to obtain corresponding predicted voice parameters;
updating model parameters of the wake-up voice model based on the difference between the predicted voice parameters and the voice parameter label.
9. An application wake-up method, the method comprising:
receiving an application wake-up instruction carrying a voice to be matched, wherein the application wake-up instruction is used for instructing to wake up a target program; and in response to the application wake-up instruction, matching the voice to be matched with wake-up voice;
The wake-up voice is obtained by voice synthesis based on predicted voice parameters, and the predicted voice parameters are obtained by voice parameter prediction based on voices of different users and wake-up texts through a wake-up voice model;
waking up the target program when the voice to be matched is successfully matched with the wake-up voice;
the wake-up speech model comprises a voiceprint extraction layer, a phoneme conversion layer and a prediction layer, wherein
the voiceprint extraction layer is used for performing voiceprint feature extraction on the voices of different users to obtain corresponding user voiceprints; the phoneme conversion layer is used for performing phoneme conversion on the wake-up text based on pronunciation parameters to obtain a corresponding wake-up phoneme sequence, wherein the pronunciation parameters comprise at least one of a stress parameter and a pause parameter;
the prediction layer is used for performing voice parameter prediction based on the user voiceprint and the wake-up phoneme sequence to obtain corresponding predicted voice parameters; wherein the user voices are multiple in number, voiceprint feature extraction is performed on each user voice to obtain a plurality of corresponding user voiceprints, and for each user voiceprint, voice parameter prediction is performed in combination with the wake-up phoneme sequence corresponding to each group of pronunciation parameters to obtain a corresponding group of predicted voice parameters; the predicted voice parameters comprise duration, pitch, and rhythm;
the prediction layer is also used for performing voice synthesis based on the predicted voice parameters to obtain corresponding wake-up voice; the wake-up voice is used for waking up the target program when the matching with the voice to be matched is successful, and the wake-up voice is a voice for the wake-up text that incorporates the pronunciation parameters.
10. A wake-up speech synthesis apparatus based on a wake-up speech model, wherein the wake-up speech model comprises a voiceprint extraction layer, a phoneme conversion layer and a prediction layer, the apparatus comprising:
The voiceprint extraction module is used for extracting voiceprint characteristics of voices of different users through the voiceprint extraction layer to obtain corresponding user voiceprints;
a phoneme conversion module, configured to perform phoneme conversion on the wake-up text through the phoneme conversion layer based on pronunciation parameters to obtain a corresponding wake-up phoneme sequence, wherein the pronunciation parameters comprise at least one of a stress parameter and a pause parameter;
a prediction module, configured to perform voice parameter prediction through the prediction layer based on the user voiceprint and the wake-up phoneme sequence to obtain corresponding predicted voice parameters; wherein the user voices are multiple in number, voiceprint feature extraction is performed on each user voice to obtain a plurality of corresponding user voiceprints, and for each user voiceprint, voice parameter prediction is performed in combination with the wake-up phoneme sequence corresponding to each group of pronunciation parameters to obtain a corresponding group of predicted voice parameters; the predicted voice parameters comprise duration, pitch, and rhythm;
a voice synthesis module, configured to perform voice synthesis based on the predicted voice parameters to obtain corresponding wake-up voice; the wake-up voice is used for waking up the target program when the matching with the voice to be matched is successful, and the wake-up voice is a voice for the wake-up text that incorporates the pronunciation parameters.
11. An application wake-up device, the device comprising:
a receiving module, configured to receive an application wake-up instruction carrying a voice to be matched, wherein the application wake-up instruction is used for instructing to wake up a target program;
a matching module, configured to match the voice to be matched with wake-up voice in response to the application wake-up instruction; the wake-up voice is obtained by voice synthesis based on predicted voice parameters, and the predicted voice parameters are obtained through a wake-up voice model by voice parameter prediction based on the voices of different users and a wake-up text; wherein the user voices are multiple in number, voiceprint feature extraction is performed on each user voice to obtain a plurality of corresponding user voiceprints, and for each user voiceprint, voice parameter prediction is performed in combination with the wake-up phoneme sequence corresponding to each group of pronunciation parameters to obtain a corresponding group of predicted voice parameters; the predicted voice parameters comprise duration, pitch, and rhythm;
a wake-up module, configured to wake up the target program when the voice to be matched is successfully matched with the wake-up voice, wherein the wake-up voice is a voice for the wake-up text that incorporates the pronunciation parameters;
wherein the wake-up voice model comprises a voiceprint extraction layer, a phoneme conversion layer, and a prediction layer, the phoneme conversion layer being used for performing phoneme conversion on the wake-up text based on pronunciation parameters to obtain a corresponding wake-up phoneme sequence, and the pronunciation parameters comprising at least one of a stress parameter and a pause parameter.
12. An electronic device, comprising:
a memory for storing executable instructions;
a processor for implementing the method of any one of claims 1 to 9 when executing executable instructions stored in said memory.
13. A computer readable storage medium storing executable instructions for implementing the method of any one of claims 1 to 9 when executed by a processor.
14. A computer program product storing a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 9.
CN202110190523.7A 2021-02-18 2021-02-18 Awakening voice synthesis method based on awakening voice model and application awakening method Active CN113012681B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110190523.7A CN113012681B (en) 2021-02-18 2021-02-18 Awakening voice synthesis method based on awakening voice model and application awakening method


Publications (2)

Publication Number Publication Date
CN113012681A CN113012681A (en) 2021-06-22
CN113012681B true CN113012681B (en) 2024-05-17

Family

ID=76403545

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110190523.7A Active CN113012681B (en) 2021-02-18 2021-02-18 Awakening voice synthesis method based on awakening voice model and application awakening method

Country Status (1)

Country Link
CN (1) CN113012681B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117012189A (en) * 2022-04-29 2023-11-07 荣耀终端有限公司 Voice recognition method and electronic equipment

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109545194A (en) * 2018-12-26 2019-03-29 出门问问信息科技有限公司 Wake up word pre-training method, apparatus, equipment and storage medium
JP2019179257A (en) * 2019-06-19 2019-10-17 日本電信電話株式会社 Acoustic model learning device, voice synthesizer, acoustic model learning method, voice synthesis method, and program
CN110890093A (en) * 2019-11-22 2020-03-17 腾讯科技(深圳)有限公司 Intelligent device awakening method and device based on artificial intelligence
CN111508469A (en) * 2020-04-26 2020-08-07 北京声智科技有限公司 Text-to-speech conversion method and device
CN111833857A (en) * 2019-04-16 2020-10-27 阿里巴巴集团控股有限公司 Voice processing method and device and distributed system
CN112164379A (en) * 2020-10-16 2021-01-01 腾讯科技(深圳)有限公司 Audio file generation method, device, equipment and computer readable storage medium
CN112185382A (en) * 2020-09-30 2021-01-05 北京猎户星空科技有限公司 Method, device, equipment and medium for generating and updating wake-up model
CN112233665A (en) * 2020-10-16 2021-01-15 珠海格力电器股份有限公司 Model training method and device, electronic equipment and storage medium
CN112365880A (en) * 2020-11-05 2021-02-12 北京百度网讯科技有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110428808B (en) * 2018-10-25 2022-08-19 腾讯科技(深圳)有限公司 Voice recognition method and device
WO2020231181A1 (en) * 2019-05-16 2020-11-19 Samsung Electronics Co., Ltd. Method and device for providing voice recognition service
US11282500B2 (en) * 2019-07-19 2022-03-22 Cisco Technology, Inc. Generating and training new wake words


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于用户场景理念下的语音交互设计研究 [Research on Voice Interaction Design Based on the User Scenario Concept]; Huang Sheng; Jin Wenkui; Design (设计); 2020-05-11 (09); full text *

Also Published As

Publication number Publication date
CN113012681A (en) 2021-06-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant