CN110534117B - Method, apparatus, device and computer medium for optimizing a speech generation model - Google Patents

Method, apparatus, device and computer medium for optimizing a speech generation model

Info

Publication number
CN110534117B
Authority
CN
China
Prior art keywords
voice
user
instruction
generation model
voiceprint
Prior art date
Legal status
Active
Application number
CN201910853333.1A
Other languages
Chinese (zh)
Other versions
CN110534117A (en)
Inventor
欧阳能钧
Current Assignee
Apollo Zhilian Beijing Technology Co Ltd
Original Assignee
Apollo Zhilian Beijing Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Apollo Zhilian Beijing Technology Co Ltd
Priority to CN201910853333.1A
Publication of CN110534117A
Application granted
Publication of CN110534117B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/22 Interactive procedures; Man-machine interfaces

Abstract

Methods, apparatus, devices, and computer media for optimizing a speech generation model are disclosed. One embodiment of the method comprises: acquiring a voice instruction issued by a user during voice interaction between the user and a smart device; extracting a voiceprint feature value of the user from the voice instruction and adding it to a feature value set; and, if the number of voiceprint feature values in the feature value set reaches a preset threshold, optimizing the existing speech generation model based on the voiceprint feature values in the set. This embodiment iteratively updates the speech generation model without requiring a dedicated voiceprint-collection session, so that the speech broadcast by the smart device grows increasingly similar to the user's own voice.

Description

Method, apparatus, device and computer medium for optimizing a speech generation model
Technical Field
Embodiments of the present application relate to the field of computer technology, and in particular, to a method, apparatus, device, and computer medium for optimizing a speech generation model.
Background
When a traditional vehicle-mounted voice system performs voice broadcasting, the broadcast role is a specific voice role built in before the product leaves the factory. Thus, when the in-vehicle voice system interacts with the user, the timbre of the voice announcement is fixed, and the system communicates with the user in one unchanging voice (e.g., a standard voice, or the voice of the XX actor).
Conventional in-vehicle voice systems therefore tend to give the user a monotonous experience.
Disclosure of Invention
Embodiments of the present application propose methods, apparatuses, electronic devices and computer-readable media for optimizing a speech generation model.
In a first aspect, an embodiment of the present application provides a method for optimizing a speech generation model, the method comprising: acquiring a voice instruction issued by a user during voice interaction between the user and a smart device; extracting a voiceprint feature value of the user from the voice instruction and adding the voiceprint feature value to a feature value set; and, if the number of voiceprint feature values in the feature value set reaches a preset threshold, optimizing the existing speech generation model based on the voiceprint feature values in the feature value set.
In some embodiments, extracting the voiceprint feature value of the user from the voice instruction comprises: determining whether the voice instruction meets a preset semantic requirement; and if the voice instruction meets the preset semantic requirement, extracting the voiceprint feature value from the voice instruction.
In some embodiments, determining whether the voice instruction meets the preset semantic requirement comprises: converting the voice instruction into a text instruction; performing semantic analysis on the text instruction to obtain a semantic analysis result; and determining, based on the semantic analysis result, whether the voice instruction meets the preset semantic requirement.
In some embodiments, acquiring the voice instruction issued by the user comprises: acquiring audio data received by a voice input device of the smart device; and cleaning the audio data to remove non-human-voice audio data.
In some embodiments, optimizing the existing speech generation model based on the voiceprint feature values in the feature value set comprises: vectorizing the voiceprint feature values in the feature value set so that each voiceprint feature value corresponds to its text; and merging the vectorized voiceprint feature values into the model dictionary of the existing speech generation model for incremental fitting, to obtain an optimized speech generation model.
In a second aspect, an embodiment of the present application provides a method for broadcasting voice, the method comprising: acquiring a voice instruction issued by a user and converting the voice instruction into a text instruction; inputting the text instruction into a speech generation model optimized by the method described in any implementation of the first aspect, to obtain the voice content to be broadcast; and broadcasting the voice content.
In a third aspect, an embodiment of the present application provides an apparatus for optimizing a speech generation model, the apparatus comprising: a voice acquisition unit configured to acquire a voice instruction issued by a user during voice interaction between the user and a smart device; a feature extraction unit configured to extract a voiceprint feature value of the user from the voice instruction and add the voiceprint feature value to a feature value set; and an optimization unit configured to optimize the existing speech generation model based on the voiceprint feature values in the feature value set if the number of voiceprint feature values in the feature value set reaches a preset threshold.
In some embodiments, the feature extraction unit comprises: a semantic determination module configured to determine whether the voice instruction meets a preset semantic requirement; and a feature extraction module configured to extract the voiceprint feature value from the voice instruction if the voice instruction meets the preset semantic requirement.
In some embodiments, the semantic determination module comprises: a text conversion module configured to convert the voice instruction into a text instruction; a semantic analysis module configured to perform semantic analysis on the text instruction to obtain a semantic analysis result; and a determination module configured to determine, based on the semantic analysis result, whether the voice instruction meets the preset semantic requirement.
In some embodiments, the voice acquisition unit comprises: an audio acquisition module configured to acquire audio data received by a voice input device of the smart device; and a cleaning module configured to clean the audio data and remove non-human-voice audio data.
In some embodiments, the optimization unit comprises: a vectorization module configured to vectorize the voiceprint feature values in the feature value set so that each voiceprint feature value corresponds to its text; and a fitting module configured to merge the vectorized voiceprint feature values into the model dictionary of the existing speech generation model for incremental fitting, to obtain an optimized speech generation model.
In a fourth aspect, an embodiment of the present application provides a device for broadcasting voice, the device comprising: an instruction acquisition unit configured to acquire a voice instruction issued by a user and convert the voice instruction into a text instruction; a voice generation unit configured to input the text instruction into a speech generation model optimized by the method described in any implementation of the first aspect, to obtain the voice content to be broadcast; and a broadcasting unit configured to broadcast the voice content.
In a fifth aspect, an embodiment of the present application provides an electronic device, comprising: one or more processors; and a storage device having one or more programs stored thereon, which, when executed by the one or more processors, cause the one or more processors to implement the method described in any implementation of the first or second aspect.
In a sixth aspect, an embodiment of the present application provides a computer-readable medium on which a computer program is stored, the program, when executed by a processor, implementing the method described in any implementation of the first or second aspect.
According to the method, apparatus, electronic device, and computer-readable medium for optimizing a speech generation model provided by embodiments of the present application, a voice instruction issued by the user is acquired during interaction between the user and the smart device, a voiceprint feature value of the user is then extracted from the voice instruction, and, once the number of collected voiceprint feature values reaches a preset number, the existing speech generation model is optimized using the collected values. The speech generation model can thus be updated iteratively without a dedicated collection of the user's voiceprint, and the voice broadcast by the smart device grows ever closer to the user's own.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram to which one embodiment of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for optimizing a speech generation model according to the present application;
FIG. 3 is a schematic illustration of an application scenario of the method for optimizing a speech generation model according to the present application;
FIG. 4 is a flow diagram of one embodiment of a method for broadcasting voice in accordance with the present application;
FIG. 5 is a schematic diagram of an embodiment of an apparatus for optimizing a speech generation model according to the present application;
FIG. 6 is a schematic structural diagram of one embodiment of a device for broadcasting voice according to the present application;
FIG. 7 is a block diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
Detailed Description
The present application will be described in further detail below with reference to the accompanying drawings and embodiments. It is to be understood that the specific embodiments described herein merely illustrate the relevant invention and do not limit it. It should also be noted that, for convenience of description, only the portions related to the relevant invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the method for optimizing a speech generation model, the method for announcing speech, the apparatus for optimizing a speech generation model, or the apparatus for announcing speech of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include smart devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium to provide communication links between the smart devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user may operate the smart devices 101, 102, 103 in a natural-language dialogue to interact with the server 105 over the network 104, for example to receive or send messages. Various communication client applications may be installed on the smart devices 101, 102, 103, such as voice applications, shopping applications, search applications, instant messaging tools, email clients, and social platform software.
The smart devices 101, 102, 103 may be hardware or software. When they are hardware, they may be various electronic devices supporting conversational voice interaction, including but not limited to a vehicle-mounted voice terminal, a smart speaker, a smart refrigerator, a smart television, a smart phone, a tablet computer, and so on. When they are software, they may be installed in the electronic devices listed above and may be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module, which is not specifically limited herein.
The server 105 may be a server providing various services, for example a backend server supporting the voice applications running on the smart devices 101, 102, 103. The server 105 may analyze and otherwise process received data such as voice instructions, and feed the processing result (e.g., the voice content to be broadcast) back to the smart devices 101, 102, 103.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed cluster of multiple servers or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module, which is not specifically limited herein.
It should be noted that the method for optimizing the speech generation model provided in the embodiments of the present application may be executed by the server 105; accordingly, the apparatus for optimizing the speech generation model may be disposed in the server 105. The method for broadcasting voice provided in the embodiments of the present application may be executed by the smart devices 101, 102, 103; accordingly, the device for broadcasting voice is generally disposed in the smart devices 101, 102, 103.
It should be understood that the number of intelligent devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of smart devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for optimizing a speech generation model according to the present application is shown. The method for optimizing a speech generating model may comprise the steps of:
Step 201: during voice interaction between the user and the smart device, acquire a voice instruction issued by the user.
In this embodiment, the user may carry out voice interaction with the smart device in a natural-language dialogue. For example, when the user is a vehicle owner and the smart device is a vehicle-mounted voice terminal, the owner can issue voice instructions such as "turn on the air conditioner", "play road-condition information", or "answer the call" to the vehicle-mounted voice terminal. For another example, when the smart device is a smart speaker, the user may talk to the speaker to ask about the weather or the temperature, or to play a game.
The execution body on which the method for optimizing a speech generation model runs (e.g., server 105 of FIG. 1) may acquire the voice instructions uttered by the user while the user interacts with the smart device. For example, when the vehicle owner speaks the instruction "turn on the air conditioner" to the vehicle-mounted voice terminal, the server may obtain that voice instruction from the data submitted by the terminal over a wired or wireless connection.
Here, the voice instruction may include an instruction issued in a natural-language dialogue manner, for example, "How is the weather today?", "Turn on the air conditioner", "Play the traffic and arts radio", and so on. Natural language generally refers to a language that evolves naturally with culture, e.g., Chinese, English, Japanese, etc.
It should be noted that the wireless connection may include, but is not limited to, 3G/4G/5G communication connections, Wi-Fi (Wireless Fidelity) connections, Bluetooth connections, WiMAX (Worldwide Interoperability for Microwave Access) connections, Zigbee connections, UWB (Ultra-Wideband) connections, and other wireless connection manners now known or developed in the future.
In some optional implementations of this embodiment, step 201 may specifically include the following.
First, audio data collected by a voice input device (e.g., a microphone) of the smart device are acquired.
Then, the audio data are cleaned to remove non-human-voice audio data. Here, non-human voice means sound not produced by a person, for example, a door slam or vehicle noise.
Cleaning the audio data collected by the smart device makes the voiceprint feature value subsequently extracted for the user more accurate, and the optimization of the speech generation model more effective.
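The following is a minimal sketch of this cleaning step, assuming 16 kHz, 16-bit mono PCM input and the third-party webrtcvad package for voice-activity detection; the patent does not prescribe any particular cleaning algorithm, so this is illustrative only.

    import webrtcvad

    SAMPLE_RATE = 16000
    FRAME_MS = 30  # webrtcvad accepts only 10, 20, or 30 ms frames
    FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit samples, 2 bytes each

    def clean_audio(pcm: bytes, aggressiveness: int = 2) -> bytes:
        """Keep only frames classified as human speech, dropping door slams,
        engine noise, and other non-human audio."""
        vad = webrtcvad.Vad(aggressiveness)  # 0 = least aggressive, 3 = most
        voiced = []
        for offset in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
            frame = pcm[offset:offset + FRAME_BYTES]
            if vad.is_speech(frame, SAMPLE_RATE):
                voiced.append(frame)
        return b"".join(voiced)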
Step 202: extract the voiceprint feature value of the user from the voice instruction and add the voiceprint feature value to the feature value set.
In this embodiment, the execution body on which the method for optimizing a speech generation model runs (e.g., server 105 of FIG. 1) may extract a voiceprint feature value of the user (e.g., the vehicle owner) from the voice instruction received in step 201 and add the extracted value to a feature value set (which may also be called a feature pool). A voiceprint is the spectrum of the sound waves carrying speech information, as displayed by an electro-acoustic instrument. Modern research shows that a voiceprint is not only distinctive to a speaker but also relatively stable over time. Here, the voiceprint feature value may be a biometric value that uniquely identifies the user, for example, a Mel-frequency cepstral coefficient (MFCC) value.
In some optional implementations of this embodiment, the voiceprint feature value of the user may be extracted through spectral analysis or by using a preset filter.
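A minimal sketch of MFCC-based voiceprint extraction follows, assuming the librosa library; the patent requires only some MFCC-style feature value, not this exact computation.

    import numpy as np
    import librosa

    def extract_mfcc(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
        """Return one fixed-length MFCC vector summarizing the utterance."""
        y, sr = librosa.load(wav_path, sr=16000)                # mono waveform at 16 kHz
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, frames)
        return mfcc.mean(axis=1)                                # average over time frames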
In some optional implementations of this embodiment, step 202 may specifically include the following steps.
First, it is determined whether the voice instruction acquired in step 201 meets the preset semantic requirement, for example, whether it has clear semantics, or whether its semantics are null or erroneous.
Then, if the voice instruction meets the preset semantic requirement, the voiceprint feature value of the user is extracted from the voice instruction.
Optionally, the following manner may be adopted to determine whether the voice instruction acquired in step 201 meets the preset semantic requirement.
First, the voice instruction obtained in step 201 is converted into a text instruction. For example, ASR (Automatic Speech Recognition), a technique for converting speech into text, may be performed on the voice instruction to obtain an instruction in text form.
Second, semantic analysis is performed on the text instruction to obtain a semantic analysis result. For example, NLP (Natural Language Processing) semantic understanding is applied to the text instruction recognized by ASR, yielding a semantic understanding result for the text instruction.
Third, the semantic analysis result is used to determine whether the voice instruction acquired in step 201 meets the preset semantic requirement.
Take the voice instruction "Xiaodu, help me turn on the air conditioner" as an example. First, ASR recognition yields the text instruction "Xiaodu, help me turn on the air conditioner". Then, NLP semantic understanding is performed on this text instruction, yielding the semantic analysis result "domain: system_control, intent: open, param: air_controller". This result shows that the voice instruction has clear semantics and meets the preset semantic requirement.
Extracting voiceprint features only when the voice instruction meets the preset semantic requirement prevents invalid semantics, such as null semantics, from degrading the voiceprint feature values, and thus ensures the quality of the extracted features.
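A sketch of this semantic gate follows. asr_recognize and nlp_parse are hypothetical stand-ins for whatever ASR and NLP services a deployment actually uses; extract_mfcc is the function from the earlier sketch.

    def extract_if_semantically_valid(wav_path):
        text = asr_recognize(wav_path)  # speech -> text (ASR); hypothetical helper
        semantics = nlp_parse(text)     # text -> {"domain": ..., "intent": ..., "param": ...}
        # Reject null or erroneous semantics so low-quality utterances
        # never enter the feature pool.
        if not semantics or not semantics.get("intent"):
            return None
        return extract_mfcc(wav_path)   # see the MFCC sketch above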
Step 203: if the number of voiceprint feature values in the feature value set reaches a preset threshold, optimize the existing speech generation model based on the voiceprint feature values in the feature value set.
In this embodiment, the execution body on which the method for optimizing a speech generation model runs (e.g., server 105 of FIG. 1) may determine whether the number of voiceprint feature values in the feature value set has reached a preset threshold and, when it has, optimize the existing speech generation model using the voiceprint feature values in the set. As an example, when the number of voiceprint feature values in the set reaches 100, the existing speech generation model is optimized using those 100 values.
Here, the existing speech generation model may be a TTS (Text-To-Speech) model. It can be obtained by a machine learning method, trained with text as input and the user's voiceprint feature values as output. When no speech generation model exists yet (for example, when the user has just started using the smart device), the execution body may use the voiceprint feature values in the feature value set (and the text corresponding to them) as training samples and train a speech generation model, which then serves as the optimized speech generation model.
The preset threshold may be set according to actual needs, for example, 500 or 10000. In this embodiment, iterative updating may follow a "small steps, fast iteration" strategy, in which case the preset threshold may be a relatively small value, for example, 100 or 150.
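A sketch of the feature pool and its threshold trigger, using the example value of 100 from the text; clearing the pool after each optimization round is one plausible policy, not something the patent prescribes. optimize_speech_model is defined in the incremental-fitting sketch below.

    import numpy as np

    PRESET_THRESHOLD = 100
    feature_pool = []  # the feature value set ("feature pool")

    def add_feature(mfcc_vector: np.ndarray) -> None:
        feature_pool.append(mfcc_vector)
        if len(feature_pool) >= PRESET_THRESHOLD:
            optimize_speech_model(feature_pool)  # step 203
            feature_pool.clear()                 # start collecting the next batch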
In some optional implementations of this embodiment, step 203 may specifically include: vectorizing the voiceprint feature values in the feature value set so that each voiceprint feature value corresponds to its text; and merging the vectorized voiceprint feature values into the model dictionary of the existing speech generation model for incremental fitting, to obtain an optimized speech generation model.
Here, the model dictionary of the existing speech generation model may be the model dictionary with which the existing speech generation model was trained.
By incorporating the voiceprint feature values of the feature value set into the existing model dictionary and fitting incrementally, the speech output by the resulting model comes ever closer to the voice of the user (e.g., the vehicle owner).
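A schematic sketch of this optimization step. The patent does not disclose a concrete TTS framework, so TTSModel, align_with_text, and the methods called on them are hypothetical placeholders that merely mirror the described flow: vectorize, merge into the model dictionary, fit incrementally.

    def optimize_speech_model(pool):
        model = TTSModel.load("existing_model")     # hypothetical: the pre-trained voice model
        for vector, text in align_with_text(pool):  # hypothetical: pair each vector with its characters
            model.dictionary.merge(text, vector)    # extend the model dictionary
        model.fit_incremental(epochs=1)             # small-step, fast-iteration update
        model.save("optimized_model")
        return model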
In this embodiment, when the smart device carries out voice interaction with the user, the user's voice instructions are necessarily submitted to the server to obtain the reply to the user. The method for optimizing the speech generation model provided by this embodiment therefore requires no dedicated collection of the user's voice: as long as the user keeps using the smart device, enough of the user's voiceprints can be gathered for iterative updating.
Because the iterated speech generation model comes ever closer to the user's voice, the user obtains a well-personalized experience from the optimized model. Moreover, the optimized speech generation model is portable: a vehicle owner, for example, can let the owner's spouse use the owner's personalized voice role, greatly improving user experience and satisfaction.
With continued reference to FIG. 3, FIG. 3 is a schematic diagram of an application scenario of the method for optimizing a speech generation model according to this embodiment. In this scenario, the vehicle owner 301 issues the voice instruction 303 "Xiaodu, help me turn on the air conditioner" to the vehicle-mounted terminal 302. The terminal 302 sends the voice instruction 303 to the server 304 for analysis, controls the air conditioner to turn on according to the analysis result, and replies to the owner 301 that the air conditioner has been turned on. The server 304 thus acquires the voice instruction 303 issued by the user 301 during this voice interaction. The server 304 can then perform ASR recognition and NLP understanding on the voice instruction 303 to obtain the semantic analysis result 305. Since the semantic analysis result 305 has explicit semantics, the voice instruction 303 can be determined to have valid semantics. The server 304 then extracts the MFCC value 306 of the owner 301 from the voice instruction 303 and adds it to the feature pool 307. When the number of MFCC values in the feature pool 307 reaches 100, the server 304 may incorporate the MFCC values in the feature pool 307 into the existing model dictionary for incremental fitting, finally obtaining the optimized TTS model.
According to the method provided by this embodiment of the application, a voice instruction issued by the user is acquired during interaction between the user and the smart device, a voiceprint feature value of the user is then extracted from the voice instruction, and, once the number of collected voiceprint feature values reaches a preset number, the existing speech generation model is optimized using the collected values. The speech generation model can thus be updated iteratively without a dedicated collection of the user's voiceprint, and the voice broadcast by the smart device grows ever closer to the user's own.
Continuing to refer to FIG. 4, a flow 400 of one embodiment of a method for broadcasting voice according to the present application is shown. The method for broadcasting voice may comprise the following steps:
Step 401: acquire a voice instruction issued by a user and convert the voice instruction into a text instruction.
In this embodiment, the execution body on which the method for broadcasting voice runs (e.g., the smart devices 101, 102, 103 of FIG. 1) may acquire a voice instruction issued by a user in a natural-language dialogue and then convert the voice instruction into a text instruction.
Step 402: input the text instruction into the optimized speech generation model to obtain the voice content to be broadcast.
In this embodiment, the execution body on which the method for broadcasting voice runs (e.g., the smart devices 101, 102, 103 of FIG. 1) may input the text instruction obtained in step 401 into the optimized speech generation model to obtain the voice content to be broadcast. The optimized speech generation model may be delivered to the execution body after a server (e.g., server 105 of FIG. 1) optimizes the existing speech generation model in the manner described in the embodiment of FIG. 2.
Step 403: broadcast the voice content.
In this embodiment, the execution body on which the method for broadcasting voice runs (e.g., the smart devices 101, 102, 103 of FIG. 1) may play the voice content obtained in step 402.
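A sketch of the broadcast flow (steps 401 to 403) on the smart device. asr_recognize, load_optimized_model, and play_audio are hypothetical stand-ins for the device's actual ASR, model-loading, and audio-output interfaces.

    def broadcast_reply(wav_path):
        text = asr_recognize(wav_path)  # step 401: voice instruction -> text instruction
        model = load_optimized_model()  # model delivered by the server (see the FIG. 2 flow)
        audio = model.synthesize(text)  # step 402: text -> voice content to broadcast
        play_audio(audio)               # step 403: broadcast the voice content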
According to the method provided by this embodiment of the application, the user's voice instruction is acquired and converted into a text instruction, the text instruction is input into the optimized speech generation model to obtain the content to be broadcast, and the voice content is finally broadcast, so that the broadcast voice is closer to the user's own.
With further reference to FIG. 5, as an implementation of the method shown in FIG. 2, the present application provides an embodiment of an apparatus for optimizing a speech generation model. This apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus may specifically be applied in a server.
As shown in FIG. 5, the apparatus 500 for optimizing a speech generation model of this embodiment may include a voice acquisition unit 501, a feature extraction unit 502, and an optimization unit 503. The voice acquisition unit 501 may be configured to acquire a voice instruction issued by a user during voice interaction between the user and a smart device. The feature extraction unit 502 may be configured to extract a voiceprint feature value of the user from the voice instruction and add the extracted value to a feature value set. The optimization unit 503 may be configured to optimize the existing speech generation model based on the voiceprint feature values in the feature value set if the number of voiceprint feature values in the set reaches a preset threshold.
In this embodiment, the user may carry out voice interaction with the smart device in a natural-language dialogue. The voice acquisition unit 501 of the apparatus 500 for optimizing a speech generation model acquires a voice instruction issued by the user when the user interacts with the smart device. For example, when the vehicle owner speaks the instruction "turn on the air conditioner" to the vehicle-mounted voice terminal, the server may obtain that voice instruction from the data submitted by the terminal over a wired or wireless connection.
In some optional implementations of this embodiment, the voice acquisition unit 501 may include an audio acquisition module and a cleaning module. The audio acquisition module may be configured to acquire audio data received by a voice input device of the smart device. The cleaning module may be configured to clean the audio data, removing non-human-voice audio data.
In this embodiment, the feature extraction unit 502 may extract a voiceprint feature value of the user (e.g., the vehicle owner) from the voice instruction received by the voice acquisition unit 501 and add the extracted value to a feature value set (which may also be called a feature pool). A voiceprint is the spectrum of the sound waves carrying speech information, as displayed by an electro-acoustic instrument. Modern research shows that a voiceprint is not only distinctive to a speaker but also relatively stable over time. Here, the voiceprint feature value may be a biometric value that uniquely identifies the user, for example, a Mel-frequency cepstral coefficient (MFCC) value.
In some optional implementations of the present embodiment, the feature extraction unit 502 may include a semantic determination module and a feature extraction module. Wherein the semantic determination module may be configured to determine whether the voice instruction satisfies a preset semantic requirement. The feature extraction module may be configured to extract a voiceprint feature value from the voice instruction if the voice instruction meets a predetermined semantic requirement.
In some optional implementations of this embodiment, the semantic determination module may include a text conversion module, a semantic analysis module, and a determination module. Wherein the text conversion module may be configured to convert the voice instruction into a text instruction. The semantic analysis module may be configured to perform semantic analysis on the text instructions to obtain semantic analysis results. The determination module may be configured to determine whether the voice instruction satisfies a preset semantic requirement based on a semantic analysis result.
In this embodiment, the optimization unit 503 may determine whether the number of the voiceprint feature values in the feature value set reaches a preset threshold, and optimize the existing speech generation model by using the voiceprint feature values in the feature value set when the number of the voiceprint feature values in the feature value set reaches the preset threshold. As an example, when the number of voiceprint feature values in the set of feature values reaches 100, the existing speech generation model is optimized using the 100 voiceprint feature values.
In some optional implementations of this embodiment, the optimization unit 503 may include a vectorization module and a fitting module. Wherein the vectorization module may be configured to vectorize the voiceprint feature values in the feature value set, so that the voiceprint feature values correspond to the corresponding text. The fitting module may be configured to incorporate the vectorized voiceprint feature values into a model dictionary of the existing speech generation model for incremental fitting to obtain an optimized speech generation model. Here, the model dictionary of the existing speech generating model may be a model dictionary in which the existing speech generating model is trained.
In this embodiment, when the smart device carries out voice interaction with the user, the user's voice instructions are necessarily submitted to the server to obtain the reply to the user. The apparatus for optimizing the speech generation model provided by this embodiment therefore requires no dedicated collection of the user's voice: as long as the user keeps using the smart device, enough of the user's voiceprints can be gathered for iterative updating.
Because the iterated speech generation model comes ever closer to the user's voice, the user obtains a well-personalized experience from the optimized model. Moreover, the optimized speech generation model is portable: a vehicle owner, for example, can let the owner's spouse use the owner's personalized voice role, greatly improving user experience and satisfaction.
According to the apparatus provided by the above embodiment of the application, a voice instruction issued by the user is acquired during interaction between the user and the smart device, a voiceprint feature value of the user is then extracted from the voice instruction, and, once the number of collected voiceprint feature values reaches a preset number, the existing speech generation model is optimized using the collected values. The speech generation model can thus be updated iteratively without a dedicated collection of the user's voiceprint, and the voice broadcast by the smart device grows ever closer to the user's own.
With continued reference to FIG. 6, as an implementation of the method shown in FIG. 4, the present application provides an embodiment of a device for broadcasting voice. This device embodiment corresponds to the method embodiment shown in FIG. 4, and the device may specifically be applied in a smart device.
As shown in FIG. 6, the device 600 for broadcasting voice of this embodiment may include an instruction acquisition unit 601, a voice generation unit 602, and a broadcasting unit 603. The instruction acquisition unit 601 may be configured to acquire a voice instruction issued by a user and convert the voice instruction into a text instruction. The voice generation unit 602 may be configured to input the text instruction into the optimized speech generation model to obtain the voice content to be broadcast. The broadcasting unit 603 may be configured to broadcast the voice content.
In this embodiment, the instruction acquisition unit 601 of the device 600 for broadcasting voice may acquire a voice instruction issued by a user in a natural-language dialogue and then convert the voice instruction into a text instruction.
In this embodiment, the voice generation unit 602 may input the text instruction obtained by the instruction acquisition unit 601 into the optimized speech generation model to obtain the voice content to be broadcast. The optimized speech generation model may be delivered to the device after a server (e.g., server 105 of FIG. 1) optimizes the existing speech generation model in the manner described in the embodiment of FIG. 2.
In this embodiment, the broadcasting unit 603 may broadcast the voice content obtained by the voice generation unit 602.
According to the device provided by this embodiment of the application, the user's voice instruction is acquired and converted into a text instruction, the text instruction is input into the optimized speech generation model to obtain the content to be broadcast, and the voice content is finally broadcast, so that the broadcast voice is closer to the user's own.
Referring now to FIG. 7, there is shown a schematic structural diagram of an electronic device 700 (e.g., the smart devices 101, 102, 103 or the server 105 shown in FIG. 1) suitable for implementing embodiments of the present application. The electronic device 700 shown in FIG. 7 is merely an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.
As shown in FIG. 7, the electronic device 700 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 701 that can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage means 708 into a random-access memory (RAM) 703. The RAM 703 also stores various programs and data necessary for the operation of the electronic device 700. The processing means 701, the ROM 702, and the RAM 703 are connected to one another by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Generally, the following means may be connected to the I/O interface 705: input means 706 including, for example, a mouse, a keyboard, a microphone, etc.; output means 707 including, for example, a liquid crystal display (LCD), a speaker, and the like; storage means 708 including, for example, a memory card; and communication means 709. The communication means 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. While FIG. 7 illustrates an electronic device 700 having various means, it is to be understood that not all illustrated means are required to be implemented or provided; more or fewer means may alternatively be implemented or provided. Each block shown in FIG. 7 may represent one means or multiple means, as needed.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication means 709, or may be installed from the storage means 708, or may be installed from the ROM 702. The computer program, when executed by the processing device 701, performs the above-described functions defined in the methods of embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of embodiments of the present application may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes a speech acquisition unit, a feature extraction unit, and an optimization unit. The names of the units do not constitute a limitation to the units themselves in some cases, for example, the voice acquiring unit may also be described as a unit for acquiring voice instructions uttered by a user during voice interaction with the smart device.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the smart device or the server described in the above embodiments; or may exist separately without being assembled into the smart device or server. The computer readable medium carries one or more programs which, when executed by the server, cause the server to: acquiring a voice instruction sent by a user during voice interaction between the user and a terminal; extracting a voiceprint characteristic value of a user from the voice instruction, and adding the voiceprint characteristic value into the characteristic value set; and if the number of the voiceprint characteristic values in the characteristic value set reaches a preset threshold value, optimizing the existing voice generation model based on the voiceprint characteristic values in the characteristic value set. The one or more programs, when executed by the smart device, cause the smart device to: acquiring a voice instruction sent by a user and converting the voice instruction into a text instruction; inputting the text command into the optimized voice generation model to obtain voice content to be broadcasted; and broadcasting the voice content.
The foregoing description is merely a description of preferred embodiments of the present disclosure and of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the invention in the present disclosure is not limited to technical solutions formed by the specific combination of the above features, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the inventive concept, for example, technical solutions in which the above features are interchanged with features of similar function disclosed (but not limited to those disclosed) in the present disclosure.

Claims (9)

1. A method for optimizing a speech generation model, comprising:
acquiring a voice instruction issued by a user during voice interaction between the user and a smart device;
extracting a voiceprint feature value of the user from the voice instruction, and adding the voiceprint feature value to a feature value set, wherein the voice instruction is an instruction meeting a preset semantic requirement, and wherein the voiceprint feature values comprise Mel-frequency cepstral coefficient values;
if the number of voiceprint feature values in the feature value set reaches a preset threshold, optimizing an existing speech generation model based on the voiceprint feature values in the feature value set, the optimizing comprising: if the number of Mel-frequency cepstral coefficient values in the feature value set reaches the preset threshold, vectorizing the voiceprint feature values in the feature value set so that each voiceprint feature value corresponds to its text; and merging the vectorized voiceprint feature values into a model dictionary of the existing speech generation model for incremental fitting to obtain an optimized speech generation model, so that speech output by the optimized speech generation model is close to the voice of the user; wherein the optimized speech generation model can be transferred to users other than the user for use.
2. The method of claim 1, wherein said extracting the voiceprint feature value of the user from the voice instruction comprises:
determining whether the voice instruction meets the preset semantic requirement;
and if the voice instruction meets the preset semantic requirement, extracting the voiceprint feature value from the voice instruction.
3. The method of claim 2, wherein said determining whether the voice instruction meets the preset semantic requirement comprises:
converting the voice instruction into a text instruction;
performing semantic analysis on the text instruction to obtain a semantic analysis result;
and determining, based on the semantic analysis result, whether the voice instruction meets the preset semantic requirement.
4. The method of claim 1, wherein said acquiring the voice instruction issued by the user comprises:
acquiring audio data received by a voice input device of the smart device;
and cleaning the audio data to remove non-human-voice audio data.
5. A method for broadcasting voice, comprising:
acquiring a voice instruction issued by a user and converting the voice instruction into a text instruction;
inputting the text instruction into a speech generation model optimized by the method of any one of claims 1 to 4 to obtain voice content to be broadcast;
and broadcasting the voice content.
6. An apparatus for optimizing a speech generation model, comprising:
a voice acquisition unit configured to acquire a voice instruction issued by a user during voice interaction between the user and a smart device;
a feature extraction unit configured to extract a voiceprint feature value of the user from the voice instruction and add the voiceprint feature value to a feature value set, wherein the voice instruction is an instruction meeting a preset semantic requirement, and wherein the voiceprint feature values comprise Mel-frequency cepstral coefficient values;
an optimization unit configured to optimize an existing speech generation model based on the voiceprint feature values in the feature value set if the number of voiceprint feature values in the feature value set reaches a preset threshold;
wherein the optimization unit comprises a vectorization module and a fitting module, the vectorization module being configured to vectorize the voiceprint feature values in the feature value set if the number of Mel-frequency cepstral coefficient values in the feature value set reaches the preset threshold, so that each voiceprint feature value corresponds to its text, and the fitting module being configured to merge the vectorized voiceprint feature values into a model dictionary of the existing speech generation model for incremental fitting to obtain an optimized speech generation model, so that speech output by the optimized speech generation model is close to the voice of the user; wherein the optimized speech generation model can be transferred to users other than the user for use.
7. A device for broadcasting voice, comprising:
an instruction acquisition unit configured to acquire a voice instruction issued by a user and convert the voice instruction into a text instruction;
a voice generation unit configured to input the text instruction into a speech generation model optimized by the method of any one of claims 1 to 4 to obtain voice content to be broadcast;
a broadcasting unit configured to broadcast the voice content.
8. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-5.
9. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-5.
CN201910853333.1A 2019-09-10 2019-09-10 Method, apparatus, device and computer medium for optimizing a speech generation model Active CN110534117B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910853333.1A CN110534117B (en) 2019-09-10 2019-09-10 Method, apparatus, device and computer medium for optimizing a speech generation model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910853333.1A CN110534117B (en) 2019-09-10 2019-09-10 Method, apparatus, device and computer medium for optimizing a speech generation model

Publications (2)

Publication Number Publication Date
CN110534117A (en) 2019-12-03
CN110534117B (en) 2022-11-25

Family

ID=68667965

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910853333.1A Active CN110534117B (en) 2019-09-10 2019-09-10 Method, apparatus, device and computer medium for optimizing a speech generation model

Country Status (1)

Country Link
CN (1) CN110534117B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109346057A (en) * 2018-10-29 2019-02-15 深圳市友杰智新科技有限公司 A kind of speech processing system of intelligence toy for children
WO2019140689A1 (en) * 2018-01-22 2019-07-25 Nokia Technologies Oy Privacy-preserving voiceprint authentication apparatus and method

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105448289A (en) * 2015-11-16 2016-03-30 努比亚技术有限公司 Speech synthesis method, speech synthesis device, speech deletion method, speech deletion device and speech deletion and synthesis method
CN106782564B (en) * 2016-11-18 2018-09-11 百度在线网络技术(北京)有限公司 Method and apparatus for handling voice data
CN107564532A (en) * 2017-07-05 2018-01-09 百度在线网络技术(北京)有限公司 Awakening method, device, equipment and the computer-readable recording medium of electronic equipment
CN109218269A (en) * 2017-07-05 2019-01-15 阿里巴巴集团控股有限公司 Identity authentication method, device, equipment and data processing method
CN107623614B (en) * 2017-09-19 2020-12-08 百度在线网络技术(北京)有限公司 Method and device for pushing information
CN107749296A (en) * 2017-10-12 2018-03-02 深圳市沃特沃德股份有限公司 Voice translation method and device
CN108305630A (en) * 2018-02-01 2018-07-20 中科边缘智慧信息科技(苏州)有限公司 Language transmission method under low-bandwidth condition and speech transmission index
CN108766428A (en) * 2018-06-01 2018-11-06 安徽江淮汽车集团股份有限公司 A kind of voice broadcast control method and system
CN110069608B (en) * 2018-07-24 2022-05-27 百度在线网络技术(北京)有限公司 Voice interaction method, device, equipment and computer storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019140689A1 (en) * 2018-01-22 2019-07-25 Nokia Technologies Oy Privacy-preserving voiceprint authentication apparatus and method
CN109346057A (en) * 2018-10-29 2019-02-15 深圳市友杰智新科技有限公司 A kind of speech processing system of intelligence toy for children

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MDLF-Mavg: A new speech feature with a voice print; Awais Mahmood; 2013 7th IEEE GCC Conference and Exhibition (GCC); 2014-01-09; full text *
Research and implementation of an intelligent remote control for video conferencing based on voice interaction; Hu Sai; China Master's Theses Full-text Database; 2017-02-15 (No. 2); full text *

Also Published As

Publication number Publication date
CN110534117A (en) 2019-12-03

Similar Documents

Publication Publication Date Title
CN107657017B (en) Method and apparatus for providing voice service
CN109785828B (en) Natural language generation based on user speech styles
US9558745B2 (en) Service oriented speech recognition for in-vehicle automated interaction and in-vehicle user interfaces requiring minimal cognitive driver processing for same
CN110232912B (en) Speech recognition arbitration logic
US9552815B2 (en) Speech understanding method and system
US9202465B2 (en) Speech recognition dependent on text message content
CN110047481B (en) Method and apparatus for speech recognition
US9570066B2 (en) Sender-responsive text-to-speech processing
CN107623614A (en) Method and apparatus for pushed information
CN111899719A (en) Method, apparatus, device and medium for generating audio
CN103177721A (en) Voice recognition method and system
CN111916053B (en) Voice generation method, device, equipment and computer readable medium
CN113362828B (en) Method and apparatus for recognizing speech
CN109697978B (en) Method and apparatus for generating a model
CN111883135A (en) Voice transcription method and device and electronic equipment
US9159315B1 (en) Environmentally aware speech recognition
CN110534117B (en) Method, apparatus, device and computer medium for optimizing a speech generation model
CN112382266A (en) Voice synthesis method and device, electronic equipment and storage medium
CN108877795B (en) Method and apparatus for presenting information
CN112242143A (en) Voice interaction method and device, terminal equipment and storage medium
WO2020208972A1 (en) Response generation device and response generation method
CN111755015B (en) User portrait construction method and device
JP2003122395A (en) Voice recognition system, terminal and program, and voice recognition method
CN110634475B (en) Speech recognition method, speech recognition device, electronic equipment and computer-readable storage medium
KR102441066B1 (en) Voice formation system of vehicle and method of thereof

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20211013

Address after: 100176 101, floor 1, building 1, yard 7, Ruihe West 2nd Road, Beijing Economic and Technological Development Zone, Daxing District, Beijing

Applicant after: Apollo Zhilian (Beijing) Technology Co.,Ltd.

Address before: 100085 Baidu Building, 10 Shangdi Tenth Street, Haidian District, Beijing

Applicant before: BAIDU ONLINE NETWORK TECHNOLOGY (BEIJING) Co.,Ltd.

GR01 Patent grant
GR01 Patent grant