CN111210809B

CN111210809B - Voice training data adaptation method and device, voice data conversion method and electronic equipment

Info

Publication number: CN111210809B
Application number: CN201811400134.7A
Authority: CN
Inventors: 张平
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2018-11-22
Filing date: 2018-11-22
Publication date: 2024-03-19
Anticipated expiration: 2038-11-22
Also published as: CN111210809A

Abstract

Embodiments of the invention provide a voice training data adaptation method and device, a voice data conversion method, and electronic equipment. The voice training data adaptation method comprises: acquiring original voice data for data conversion, wherein the original voice data has audio data information in various directions; and converting the original voice data through a channel conversion algorithm to obtain training data applicable to different channels. Because existing original voice data is converted through the channel conversion algorithm to obtain training data adapted to different channels, a large amount of voice data need not be collected for training each time a new voice recognition product appears; training data adapted to the product can be obtained simply by updating and maintaining the channel conversion algorithm, which improves the modeling efficiency of the new voice matching model and saves labor cost.

Description

Voice training data adaptation method and device, voice data conversion method and electronic equipment

Technical Field

The invention relates to the technical field of smart homes, and in particular to a voice training data adaptation method and device, a voice data conversion method and electronic equipment.

Background

A smart speaker is an upgraded speaker product: a tool through which household consumers obtain songs, weather forecasts, news and the like from the cloud by voice input. It can also control other smart home devices by voice input, for example opening curtains, setting the temperature of a refrigerator, or preheating a water heater.

Different smart speaker products differ in microphone configuration and voice signal processing technology. A service provider (which provides services such as songs, weather and news) needs to set up voice databases matched to the different smart speaker models and use the voice data in those databases as training data to train a matching model for each model of smart speaker. After a user inputs voice through a smart speaker of a given model, matching operations on aspects such as voiceprint and speech are carried out through the corresponding matching model, thereby achieving voiceprint recognition or speech recognition.

In the process of implementing the present invention, the inventors found that the prior art has at least the following problem: as technology is upgraded and developed, new speech recognition products are continuously introduced to the market. After a new product is released, the stock voice data in the existing voice database does not match the new product, so the service provider needs to collect a large amount of voice data for the new product in order to obtain voice training data suitable for modeling that model of voice recognition product, and this collection is very inefficient.

Disclosure of Invention

The embodiment of the invention provides a voice training data adaptation method and device, a voice data conversion method and electronic equipment, and aims to overcome the defect of low training data acquisition efficiency in the prior art.

To achieve the above objective, an embodiment of the present invention provides a method for adapting voice training data, including:

acquiring original voice data for data conversion, wherein the original voice data has audio data information in various directions;

and converting the original voice data through a channel conversion algorithm to obtain training data applicable to different channels.

The embodiment of the invention also provides a voice data conversion method, which comprises the following steps:

converting original voice data through a channel conversion algorithm matched with playing equipment to obtain training data suitable for the playing equipment, wherein the original voice data has audio data information in all directions;

model training is carried out according to the training data, and a data conversion model is obtained;

and converting the data to be output of the playing equipment according to the data conversion model so as to obtain the playing data suitable for the playing equipment.

The embodiment of the invention also provides a voice training data adapting device, which comprises:

an original voice data acquisition module for acquiring original voice data for data conversion, the original voice data having audio data information in various directions;

and the data conversion module is used for carrying out conversion processing on the original voice data through a channel conversion algorithm so as to obtain training data applicable to different channels.

The embodiment of the invention also provides electronic equipment, which comprises:

a memory for storing a program;

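a processor for running the program stored in the memory for:

acquiring original voice data for data conversion, the original voice data having audio data information in various directions;

and converting the acquired original voice data through a channel conversion algorithm to obtain training data applicable to different channels.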

According to the voice training data adaptation method and device, the voice data conversion method and the electronic equipment provided by the embodiments of the invention, existing original voice data is converted through the channel conversion algorithm to obtain training data adapted to different channels. A large amount of voice data therefore need not be collected for training each time a new voice recognition product appears; training data adapted to the product can be obtained simply by updating and maintaining the channel conversion algorithm, which improves the modeling efficiency of the new voice matching model and saves labor cost.

The foregoing is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features and advantages of the invention are more readily apparent, specific embodiments of the invention are set out below.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:

FIG. 1 is a system block diagram of a service system provided in an embodiment of the present invention;

FIG. 2 is a flowchart of one embodiment of a method for adapting speech training data provided by the present invention;

FIG. 3 is a flowchart of another embodiment of a method for adapting speech training data according to the present invention;

FIG. 4 is a schematic structural diagram of an embodiment of a voice training data adaptation apparatus according to the present invention;

FIG. 5 is a schematic structural diagram of another embodiment of a voice training data adaptation apparatus provided by the present invention;

FIG. 6 is a flowchart of an embodiment of a voice data conversion method according to the present invention;

FIG. 7 is a schematic structural diagram of an embodiment of an electronic device according to the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

In the prior art, different speech recognition products (e.g., smart speaker products) differ in microphone configuration and speech signal processing techniques. The service provider needs to provide voice databases matched to the different smart speaker models, and trains a matching model for each type of voice recognition product using the voice data in those databases as training data. After a user inputs voice through a voice recognition product of a given model, matching operations on aspects such as voiceprint and speech can be performed through the corresponding matching model, thereby achieving voiceprint recognition or speech recognition. When a new speech recognition product is introduced, the stock speech data in the existing speech database does not match the new product, so the service provider needs to collect a large amount of speech data for the new product in order to obtain training data suitable for modeling that model of speech recognition product, and this collection is very inefficient. The application therefore proposes a voice training data adaptation scheme whose main principle is as follows: existing or pre-acquired original voice data (i.e., voice data with audio data information in all directions, for example noise-removed voice data with more complete channel information and richer high-frequency information) is converted through a channel conversion algorithm to obtain training data applicable to different channels (such as two-microphone, four-microphone and six-microphone channels). A large amount of voice data therefore need not be collected for training each time a new voice recognition product appears; training data adapted to the product can be obtained simply by updating and maintaining the channel conversion algorithm, which improves the modeling efficiency of the matching model for the new voice recognition product and saves labor cost.

The method provided by the embodiment of the invention can be applied to any service system with voice data processing capability. FIG. 1 is a system block diagram of a service system provided by an embodiment of the present invention; the structure shown in FIG. 1 is only one example of a service system to which the technical solution of the present invention can be applied. As shown in FIG. 1, the service system includes a training data adaptation device. The device comprises an original voice data acquisition module and a data conversion module, and may be used to perform the process flows shown in FIG. 2 and FIG. 3 described below.

In the service system, original voice data for data conversion, which has audio data information in various directions, is first acquired; the acquired original voice data is then converted by a channel conversion algorithm to obtain training data applicable to different channels. Specifically, existing original voice data (that is, high-quality voice data from which noise has been removed, with complete channel information and rich high-frequency information) can be obtained directly; existing stock data can be re-recorded with high-fidelity equipment to obtain original voice data; and, for content not covered by the existing data, the voices of recording personnel can be recorded with high-fidelity recording equipment to supplement it. After conversion by the channel conversion algorithm, training data suitable for different channels (such as two-microphone data, four-microphone data and six-microphone data) is obtained and used to train the respective matching models (such as two-microphone, four-microphone and six-microphone models).

The foregoing describes the technical principles and an exemplary application framework of the embodiments of the present invention. The specific technical solutions of the embodiments are described in further detail below through several embodiments.

Example 1

FIG. 2 is a flowchart of an embodiment of a voice training data adaptation method provided by the present invention. The method may be executed by the service system described above, by any server device with voice data processing capability, or by a device or chip integrated in such a server device. As shown in FIG. 2, the voice training data adaptation method includes the following steps:

S201, acquiring original voice data for data conversion.

In an embodiment of the present invention, the original voice data has audio data information in various directions. Existing original voice data can be obtained from the first database, original voice data obtained by recording existing stock data with a high-fidelity recording device can be obtained from the second database, and original voice data obtained by recording the recording personnel with a high-fidelity recording device can be obtained from the third database.

S202, converting the original voice data through a channel conversion algorithm to obtain training data applicable to different channels.

In the embodiment of the present invention, step S201, i.e., the acquisition of the original voice data, is independent of the data conversion process. The original voice data serves as the input of the channel conversion algorithm, and its acquisition is a preliminary data preparation step. Step S202, i.e., the data conversion process, may be performed whenever the corresponding training data is required.

According to the voice training data adaptation method provided by the embodiment of the invention, existing original voice data is converted through the channel conversion algorithm to obtain training data adapted to different channels. A large amount of voice data therefore need not be collected for training each time a new voice recognition product appears; training data adapted to the product can be obtained simply by updating and maintaining the channel conversion algorithm, which improves the modeling efficiency of the new voice matching model and saves labor cost.
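The separation between S201 (one-time preparation) and S202 (on-demand conversion) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the sample values, the channel names `two_mic` and `four_mic`, and the simple gain functions that stand in for real channel conversion algorithms, which the description does not specify.

```python
from typing import Callable, Dict, List

Signal = List[float]  # one mono utterance as a list of samples (assumption)
ChannelConverter = Callable[[Signal], Signal]


def acquire_original_voice_data() -> List[Signal]:
    """S201: return prepared original voice data (toy samples for illustration)."""
    return [[0.0, 0.5, -0.5, 0.25], [0.1, -0.2, 0.3, -0.4]]


def convert_for_channels(
    originals: List[Signal], converters: Dict[str, ChannelConverter]
) -> Dict[str, List[Signal]]:
    """S202: apply each channel conversion to every original utterance."""
    return {
        channel: [convert(utterance) for utterance in originals]
        for channel, convert in converters.items()
    }


# Hypothetical placeholder conversions: plain gains stand in for real algorithms.
converters: Dict[str, ChannelConverter] = {
    "two_mic": lambda s: [0.5 * x for x in s],
    "four_mic": lambda s: [0.8 * x for x in s],
}

originals = acquire_original_voice_data()  # prepared once, ahead of time
training_data = convert_for_channels(originals, converters)  # run on demand
```

Because the originals are produced independently of the converters, adding a channel only requires a new entry in `converters`; the prepared data is reused unchanged.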

Example 2

Fig. 3 is a flowchart of another embodiment of a voice training data adaptation method provided by the present invention. As shown in fig. 3, on the basis of the embodiment shown in fig. 2, the voice training data adaptation method provided in this embodiment may further include the following steps:

S301, acquiring the existing original voice data in a first database.

S302, obtaining original voice data obtained by recording the existing stock data through high-fidelity recording equipment in a second database.

S303, obtaining, in the third database, the original voice data obtained by recording the recording personnel through the high-fidelity recording equipment.

In the embodiment of the present invention, steps S301 to S303 need not be executed in any particular order: they may be performed simultaneously or sequentially in any order, and only one or two of the three steps may be performed.
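As an illustrative sketch of the statement above, the following Python snippet gathers original voice data from any subset of the three sources, in any order. The in-memory dictionary `DATABASES` and the file names in it are hypothetical stand-ins for the first, second and third databases.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical in-memory stand-ins for the first, second and third databases.
DATABASES: Dict[str, List[str]] = {
    "first": ["existing_original_001.wav"],
    "second": ["rerecorded_stock_001.wav", "rerecorded_stock_002.wav"],
    "third": ["speaker_session_001.wav"],
}


def collect_original_voice_data(sources: Optional[Iterable[str]] = None) -> List[str]:
    """Gather original voice data from any subset of the three databases.

    The order of `sources` does not change the result, mirroring the
    statement that S301 to S303 may run in any order or be partly omitted.
    """
    selected = list(DATABASES) if sources is None else list(sources)
    collected: List[str] = []
    for name in selected:
        collected.extend(DATABASES[name])
    # Sort and de-duplicate so the result is independent of source order.
    return sorted(set(collected))
```

Calling `collect_original_voice_data(["third", "first"])` returns the same items as `collect_original_voice_data(["first", "third"])`, and omitting the argument uses all three sources.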

In addition, in the voice training data adaptation method provided by the embodiment of the present invention, an acquisition step of a channel conversion algorithm may be further included, as shown in the following steps S304 to S305.

S304, acquiring recording data for a fixed text under different channels.

In the embodiment of the invention, a passage of fixed text can be set first. When the channel conversion algorithm is to be acquired, the fixed text is recorded under different channels, for example under two-microphone, four-microphone and six-microphone channel environments and under the high-fidelity (original voice) channel environment, so as to acquire different recording data.

Further, for the same channel environment, data can be collected at different distances, so as to obtain recording data for the fixed text at different distances.

S305, obtaining a channel conversion algorithm according to the difference parameter distribution function of the different recording data.

In the embodiment of the invention, for recording data under different channels, a channel conversion algorithm can be obtained according to the Gaussian distribution function of the recording data; for recording data at different distances, a channel conversion algorithm can be obtained according to the energy distribution function of the recording data. The channel conversion algorithm usable for data conversion is thereby finally obtained.
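S305 is described only at the level of distribution functions, without formulas. One possible reading, offered as an illustrative Python sketch rather than as the claimed algorithm, is to fit a Gaussian (mean and standard deviation) to the sample values of the high-fidelity and target-channel recordings of the same fixed text and map one onto the other, and to describe distance by a ratio of mean signal energies. The function names, the affine mapping and the toy recordings are all assumptions.

```python
import math
import statistics
from typing import Callable, List, Sequence, Tuple

Signal = List[float]


def fit_gaussian(samples: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, population standard deviation) of the sample values."""
    return statistics.fmean(samples), statistics.pstdev(samples)


def gaussian_channel_conversion(
    hifi: Sequence[float], target: Sequence[float]
) -> Callable[[Signal], Signal]:
    """Build an affine map that moves hi-fi statistics onto the target channel's.

    Assumption: matching mean and standard deviation is enough for this sketch.
    The hi-fi recording must not be constant (non-zero standard deviation).
    """
    mu_src, sd_src = fit_gaussian(hifi)
    mu_tgt, sd_tgt = fit_gaussian(target)
    scale = sd_tgt / sd_src

    def convert(signal: Signal) -> Signal:
        return [(x - mu_src) * scale + mu_tgt for x in signal]

    return convert


def mean_energy(samples: Sequence[float]) -> float:
    """Mean squared amplitude of a recording."""
    return sum(x * x for x in samples) / len(samples)


def distance_gain(near: Sequence[float], far: Sequence[float]) -> float:
    """Amplitude gain that maps the near recording's energy to the far one's."""
    return math.sqrt(mean_energy(far) / mean_energy(near))


# Toy recordings of the same fixed text (hypothetical values).
hifi_recording = [-1.0, 0.0, 1.0, 0.0]
two_mic_recording = [-0.5, 0.0, 0.5, 0.0]
to_two_mic = gaussian_channel_conversion(hifi_recording, two_mic_recording)
far_gain = distance_gain(near=[0.5, -0.5, 0.5, -0.5], far=[0.25, -0.25, 0.25, -0.25])
```

In this toy case the target recording is the hi-fi recording at half amplitude, so `to_two_mic` halves its input and `far_gain` is 0.5. A practical system would more plausibly model spectral features than raw samples; that choice is left open here.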

S306, converting the original voice data through a channel conversion algorithm to obtain training data applicable to different channels.

In the embodiment of the present invention, steps S301 to S303 (i.e., the acquisition of the original voice data) are independent of steps S304 to S305 (i.e., the acquisition of the channel conversion algorithm). The original voice data serves as the input of the channel conversion algorithm, and its acquisition can be regarded as a preliminary data preparation process; the acquisition of the channel conversion algorithm needs to be performed each time a new smart speaker is released, so as to update and maintain the old channel conversion algorithm.

Example 3

Fig. 4 is a schematic structural diagram of an embodiment of a voice training data adapting device according to the present invention, which may be used to perform the method steps shown in fig. 2. As shown in fig. 4, the voice training data adaptation apparatus may include: the raw speech data acquisition module 41 and the data conversion module 42.

Wherein, the original voice data obtaining module 41 may be used for obtaining original voice data for data conversion; the data conversion module 42 may be configured to perform conversion processing on the original voice data acquired by the original voice data acquisition module 41 through a channel conversion algorithm, so as to obtain training data applicable to different channels.

In an embodiment of the present invention, the original voice data has audio data information in various directions. After the original voice data is acquired by the original voice data acquisition module 41, the data conversion module 42 may convert it through a channel conversion algorithm to obtain training data applicable to different channels. The acquisition of the original voice data by the original voice data acquisition module 41 is independent of the data conversion performed by the data conversion module 42. The original voice data serves as the input of the channel conversion algorithm, and its acquisition is a preliminary data preparation step. The data conversion process can be carried out whenever the corresponding training data is needed.

According to the voice training data adaptation device provided by the embodiment of the invention, existing original voice data is converted through the channel conversion algorithm to obtain training data adapted to different channels. A large amount of voice data therefore need not be collected for training each time a new voice recognition product appears; training data adapted to the product can be obtained simply by updating and maintaining the channel conversion algorithm, which improves the modeling efficiency of the new voice matching model and saves labor cost.

Example 4

Fig. 5 is a schematic structural diagram of another embodiment of the voice training data adapting apparatus provided in the present invention, which may be used to perform the method steps shown in fig. 3. As shown in fig. 5, on the basis of the embodiment shown in fig. 4, the voice training data adapting device provided by the embodiment of the present invention may further include: the algorithm acquisition module 51. The algorithm obtaining module 51 may be configured to obtain recording data for a fixed text under different channels, and obtain a channel conversion algorithm according to a difference parameter distribution function of the different recording data.

In the embodiment of the present invention, a passage of fixed text may be set first. When the channel conversion algorithm is to be acquired, the algorithm acquisition module 51 may record the fixed text under different channels, for example under two-microphone, four-microphone and six-microphone channel environments and under the high-fidelity channel environment, so as to acquire different recording data.

Further, the algorithm acquisition module 51 may also be used to acquire recording data for the fixed text at different distances in the same channel environment.

In the embodiment of the present invention, for recording data under different channels, the algorithm acquisition module 51 may obtain a channel conversion algorithm according to the Gaussian distribution function of the recording data; for recording data at different distances, it may obtain a channel conversion algorithm according to the energy distribution function of the recording data. The channel conversion algorithm usable for data conversion is thereby finally obtained.

In the embodiment of the present invention, the process by which the original voice data acquisition module 41 acquires the original voice data is independent of the process by which the algorithm acquisition module 51 acquires the channel conversion algorithm. The original voice data serves as the input of the channel conversion algorithm, and its acquisition can be regarded as a preliminary data preparation process; the acquisition of the channel conversion algorithm needs to be performed each time a new smart speaker is released, so as to update and maintain the old channel conversion algorithm.

Still further, the original voice data acquisition module 41 may include: a first acquisition unit 411, the first acquisition unit 411 may be configured to acquire existing original voice data in a first database.

The original voice data acquisition module 41 may further include: a second obtaining unit 412, where the second obtaining unit 412 may be configured to obtain, in a second database, original voice data obtained by recording existing stock data by a high-fidelity recording device.

The original voice data acquisition module 41 may further include: a third obtaining unit 413, where the third obtaining unit 413 may be configured to obtain, in a third database, original voice data obtained by recording a recording person by a hi-fi recording device.

In the embodiment of the present invention, the first acquisition unit 411, the second acquisition unit 412 and the third acquisition unit 413 need not operate in any particular order: they may operate simultaneously or sequentially in any order, and only one or two of the three units may be used.
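The module structure of this embodiment can be pictured as three small cooperating classes. The Python sketch below is an assumption-laden illustration: the class names mirror modules 41, 51 and 42, the acquisition units are modeled as plain callables, and the peak-amplitude ratio used inside `AlgorithmAcquisitionModule` is only a placeholder for whatever derivation a real device would use.

```python
from typing import Callable, List

Signal = List[float]
Algorithm = Callable[[Signal], Signal]


class OriginalVoiceDataAcquisitionModule:
    """Module 41: wraps the acquisition units (411 to 413) as callables."""

    def __init__(self, units: List[Callable[[], List[Signal]]]) -> None:
        self.units = units

    def acquire(self) -> List[Signal]:
        data: List[Signal] = []
        for unit in self.units:  # any subset of the units may be configured
            data.extend(unit())
        return data


class AlgorithmAcquisitionModule:
    """Module 51: derives a conversion from recordings of the fixed text."""

    def acquire(self, hifi: Signal, target: Signal) -> Algorithm:
        # Placeholder derivation (assumption): match peak amplitudes.
        gain = max(abs(x) for x in target) / max(abs(x) for x in hifi)
        return lambda signal: [gain * x for x in signal]


class DataConversionModule:
    """Module 42: applies the conversion to the original voice data."""

    def convert(self, originals: List[Signal], algorithm: Algorithm) -> List[Signal]:
        return [algorithm(signal) for signal in originals]


module_41 = OriginalVoiceDataAcquisitionModule(
    units=[lambda: [[1.0, -0.5]], lambda: [[0.25, 0.5]]]
)
algorithm = AlgorithmAcquisitionModule().acquire(hifi=[1.0, -1.0], target=[0.5, -0.25])
training = DataConversionModule().convert(module_41.acquire(), algorithm)
```

Module 41 and module 51 never call each other, which reflects the independence of data acquisition and algorithm acquisition described above; only module 42 consumes both outputs.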

Example 5

FIG. 6 is a flowchart of an embodiment of a voice data conversion method according to the present invention. The method may be executed by any server device with voice data processing capability, or by a device or chip integrated in such a server device. As shown in FIG. 6, the voice data conversion method includes the following steps:

S601, converting the original voice data through a channel conversion algorithm matched with the playing device to obtain training data suitable for the playing device.

In the embodiment of the present invention, the original voice data refers to voice data having audio data information in various directions.

Regarding the acquisition of the original voice data, the existing original voice data may be acquired in the first database, the original voice data obtained by recording the existing stock data by the hi-fi recording device may be acquired in the second database, and the original voice data obtained by recording the recording person by the hi-fi recording device may be acquired in the third database.

When playing TTS (Text To Speech) output, a voice playing device needs to play voice according to its configured voice database, and playing devices of different models need voice databases for different channels. With the voice data conversion method provided by the embodiment of the invention, when a new playing device is released, the server supporting that playing device can acquire the channel conversion algorithm matched with the playing device according to the channel type of the playing device, so as to obtain training data suitable for the playing device.

Specifically, the channel conversion algorithm matched with the playing device may be acquired as follows: recording data for a fixed text is acquired under different channels, including recording data for the fixed text recorded through the playing device; a channel conversion algorithm is then obtained according to the difference parameter distribution function of the different recording data.

Aiming at recording data under different channels, a channel conversion algorithm can be obtained according to a Gaussian distribution function of the recording data.

S602, performing model training according to the training data to obtain a data conversion model.

S603, converting the data to be output of the playing device according to the data conversion model so as to obtain playing data suitable for the playing device.

In the embodiment of the invention, after obtaining the training data suitable for the playing device, the server performs model training, so as to obtain a data conversion model.

When the playing device plays the voice, the data to be output can be sent to the server, the server inputs the data to be output into the data conversion model, and the model automatically outputs the playing data suitable for the playing device. When the playing device receives the playing data from the server, the playing device can play the playing data.
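The three steps S601 to S603 can be sketched end to end in Python. The description does not state what kind of model the data conversion model is, so this sketch assumes the simplest possible one, a least-squares line `y = a * x + b` over sample values; `channel_convert`, its coefficients and the toy signals are hypothetical.

```python
from typing import List, Sequence, Tuple

Signal = List[float]
LinearModel = Tuple[float, float]  # (slope a, intercept b)


def channel_convert(signal: Signal) -> Signal:
    """Placeholder for the conversion matched to the playing device (S601)."""
    return [0.5 * x + 0.25 for x in signal]


def train_linear_model(pairs: Sequence[Tuple[Signal, Signal]]) -> LinearModel:
    """S602: least-squares fit of y = a * x + b over all sample pairs."""
    xs = [x for source, _ in pairs for x in source]
    ys = [y for _, converted in pairs for y in converted]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x  # requires at least two distinct input values
    return a, mean_y - a * mean_x


def apply_model(model: LinearModel, to_output: Signal) -> Signal:
    """S603: convert the data to be output into playing data."""
    a, b = model
    return [a * x + b for x in to_output]


originals = [[0.0, 1.0, -1.0], [0.5, -0.5, 2.0]]
training_pairs = [(s, channel_convert(s)) for s in originals]  # S601
model = train_linear_model(training_pairs)                     # S602
playing_data = apply_model(model, [1.0, 0.0])                  # S603
```

On the server side, `apply_model` corresponds to feeding the device's data to be output into the trained model and returning the result for playback. Since the placeholder conversion is itself linear, the fitted model recovers it up to floating-point error.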

According to the voice data conversion method provided by the embodiment of the invention, existing original voice data is converted through the channel conversion algorithm matched with the playing device to obtain training data matched with the playing device. A large amount of voice data therefore need not be collected each time a new voice playing product appears; training data matched with the product can be obtained simply by updating and maintaining the channel conversion algorithm. A data conversion model is then trained on that data and used to convert the data to be played by the new product, which improves voice playback quality and saves the labor cost of data collection.

Example 6

The internal functions and structures of a speech training data adaptation apparatus are described above, which may be implemented as an electronic device. Fig. 7 is a schematic structural diagram of an embodiment of an electronic device according to the present invention. As shown in fig. 7, the electronic device includes a memory 71 and a processor 72.

A memory 71 for storing a program. In addition to the programs described above, the memory 71 may also be configured to store other various data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and the like.

The memory 71 may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.

A processor 72 coupled to the memory 71, executing a program stored in the memory 71 for:

acquiring original voice data for data conversion, the original voice data having audio data information in various directions;

and converting the acquired original voice data through a channel conversion algorithm to acquire training data applicable to different channels.

Further, as shown in fig. 7, the electronic device may further include: communication component 73, power component 74, audio component 75, display 76, and the like. Only some of the components are schematically shown in fig. 7, which does not mean that the electronic device only comprises the components shown in fig. 7.

The communication component 73 is configured to facilitate communication between the electronic device and other devices, either wired or wireless. The electronic device may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 73 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 73 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

A power component 74 provides power to the various components of the electronic device. The power component 74 can include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device.

The audio component 75 is configured to output and/or input audio signals. For example, the audio component 75 includes a Microphone (MIC) configured to receive external audio signals when the electronic device is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 71 or transmitted via the communication component 73. In some embodiments, the audio component 75 further comprises a speaker for outputting audio signals.

The display 76 includes a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation.

Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the method embodiments described above may be performed by hardware associated with program instructions. The foregoing program may be stored in a computer readable storage medium. The program, when executed, performs steps including the method embodiments described above; and the aforementioned storage medium includes: various media that can store program code, such as ROM, RAM, magnetic or optical disks.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims

1. A method for adapting speech training data, comprising:

acquiring recording data for a fixed text under different channels, wherein the channels comprise a two-microphone channel, a four-microphone channel, a six-microphone channel and a high-fidelity channel;

obtaining a channel conversion algorithm according to different difference parameter distribution functions of the recording data, wherein the difference parameter distribution functions comprise Gaussian distribution functions;

and converting the original voice data through the channel conversion algorithm to obtain training data suitable for different channels, wherein the training data are used for model training of a data conversion model, so that data to be output by a playing device are converted according to the trained data conversion model to obtain playing data suitable for the playing device, and the data conversion model comprises a two-microphone model, a four-microphone model and/or a six-microphone model.

2. The method for adapting speech training data according to claim 1, further comprising:

acquiring recording data for the fixed text at different distances.

3. The method for adapting speech training data according to claim 1, wherein the difference parameter distribution function of the recording data under different channels is a Gaussian distribution function.

4. The method for adapting speech training data according to claim 2, wherein the difference parameter distribution function of the recording data at different distances is an energy distribution function.

5. The method for adapting speech training data according to any one of claims 1 to 4, wherein the obtaining the original speech data for data conversion comprises:

acquiring existing original voice data from a first database.

6. The method for adapting speech training data according to any one of claims 1 to 4, wherein the obtaining the original speech data for data conversion comprises:

acquiring, from a second database, the original voice data obtained by recording existing stock data with high-fidelity recording equipment.

7. The method for adapting speech training data according to any one of claims 1 to 4, wherein the obtaining the original speech data for data conversion comprises:

acquiring, from a third database, the original voice data obtained by recording a voice recorder with high-fidelity recording equipment.

8. A method for converting voice data, comprising:

performing model training according to the training data to obtain a data conversion model, wherein the data conversion model comprises a two-microphone model, a four-microphone model and/or a six-microphone model;

9. The voice data conversion method of claim 8, wherein the recording data comprises recording data of the playback device for the fixed text.

10. A speech training data adaptation apparatus, comprising:

a recording data acquisition module, configured to acquire recording data for a fixed text under different channels, wherein the channels comprise a two-microphone channel, a four-microphone channel, a six-microphone channel and a high-fidelity channel;

a channel conversion algorithm acquisition module, configured to obtain a channel conversion algorithm according to different difference parameter distribution functions of the recording data, wherein the difference parameter distribution functions comprise Gaussian distribution functions;

a data conversion module, configured to convert the original voice data through the channel conversion algorithm to obtain training data suitable for different channels, wherein the training data are used for model training of a data conversion model, so that data to be output by a playing device are converted according to the trained data conversion model to obtain playing data suitable for the playing device, and the data conversion model comprises a two-microphone model, a four-microphone model and/or a six-microphone model.

11. An electronic device, comprising:

a memory for storing a program;

a processor for running the program stored in the memory for: