CN111243573A - Voice training method and device - Google Patents

Voice training method and device

Info

Publication number
CN111243573A
Authority
CN
China
Prior art keywords
waveform
voice
audio
information
voice data
Prior art date
Legal status
Granted
Application number
CN201911414858.1A
Other languages
Chinese (zh)
Other versions
CN111243573B (en)
Inventor
张景平
Current Assignee
Shenzhen Ruixun Cloud Technology Co ltd
Original Assignee
Shenzhen Ruixun Cloud Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Ruixun Cloud Technology Co ltd
Priority to CN201911414858.1A
Publication of CN111243573A
Application granted
Publication of CN111243573B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L17/00 Speaker identification or verification
    • G10L17/26 Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination

Abstract

The embodiment of the invention provides a voice training method and a voice training device, applied to an artificial intelligence system. The method includes: acquiring current context information when the artificial intelligence system acquires voice data; acquiring an audio waveform corresponding to the voice data according to the context information; judging whether the audio waveform is a preset audio waveform; and, if it is, performing voice training using the audio data. The method can extract individual language features and distinct voice data from the acquired voice data, mix them with preset voice data, and train on the mixture, which improves the recognition capability of the artificial intelligence system and the accuracy of its matching and recognition, so that user instructions are recognized accurately and the user experience improves, while the computation required during recognition is small and system power consumption is reduced.

Description

Voice training method and device
Technical Field
The present invention relates to the field of internet technologies, and in particular, to a voice training method and a voice training apparatus.
Background
With the continuous popularization of the internet, artificial intelligence systems have gradually become part of people's lives and bring convenience to daily life.
By recognizing a user's voice data, an artificial intelligence system can execute the operation corresponding to that voice data, providing convenience to the user.
Current artificial intelligence systems execute the corresponding operation after acquiring voice data. To improve recognition accuracy, the voice data used to train the underlying deep neural network is generally obtained through manual recording and manual labeling.
However, this conventional approach can only label human voices, or a single kind of voice, and cannot distinguish between different sources of voice data. When other sounds occur, external voice data is easily confused with the target voice data, so the executed operation deviates from the user's expectation.
Disclosure of Invention
In view of the above, embodiments of the present invention are proposed to provide a speech training method and a speech training apparatus that overcome or at least partially solve the above-mentioned problems.
In order to solve the above problem, an embodiment of the present invention discloses a speech training method, which is applied to an artificial intelligence system, and the method includes:
when the artificial intelligence system acquires voice data, acquiring current context information;
acquiring an audio waveform corresponding to the voice data according to the context information;
judging whether the audio waveform is a preset audio waveform or not;
and if the audio waveform is a preset audio waveform, performing voice training by using the audio data.
Optionally, the context information includes person information, environment information, and animal information, and the voice data includes person voice data, environment voice data, and animal voice data;
the obtaining of the audio waveform corresponding to the voice data according to the context information includes:
determining whether the context information is person information;
if the context information is person information, obtaining the person waveform corresponding to the person voice data, and judging whether the context information is environment information;
if the context information is environment information, acquiring the environment waveform corresponding to the environment voice data, and judging whether the context information is animal information;
and if the context information is animal information, acquiring the animal waveform corresponding to the animal voice data.
Optionally, the determining whether the audio waveform is a preset audio waveform includes:
judging whether the person waveform is a preset waveform;
if the person waveform is a preset waveform, judging whether the environment waveform is a preset waveform;
if the environment waveform is a preset waveform, judging whether the animal waveform is a preset waveform;
and the performing voice training by using the audio data if the audio waveform is a preset audio waveform includes:
if the animal waveform is a preset waveform, synthesizing training data from the person voice data, the environment voice data, and the animal voice data;
and performing voice training by using the training data.
Optionally, the method comprises:
if the audio waveform is not a preset audio waveform, extracting audio voice features from the audio data;
combining preset person voice features with the audio voice features to generate training voice features;
and performing voice training by adopting the training voice characteristics.
The embodiment of the invention also discloses a voice training device, which is applied to an artificial intelligence system, and the device comprises:
the acquisition module is used for acquiring current context information when the artificial intelligence system acquires voice data;
the audio waveform module is used for acquiring an audio waveform corresponding to the voice data according to the context information;
the judging module is used for judging whether the audio waveform is a preset audio waveform;
and the training module is used for performing voice training by adopting the audio data if the audio waveform is a preset audio waveform.
Optionally, the context information includes person information, environment information, and animal information, and the voice data includes person voice data, environment voice data, and animal voice data;
the audio waveform module comprises:
a determination module for determining whether the context information is person information;
the environment judgment module is used for acquiring the person waveform corresponding to the person voice data if the context information is person information, and judging whether the context information is environment information;
the animal judgment module is used for acquiring the environment waveform corresponding to the environment voice data if the context information is environment information, and judging whether the context information is animal information;
and the animal waveform module is used for acquiring the animal waveform corresponding to the animal voice data if the context information is animal information.
Optionally, the determining module includes:
the person waveform judging module is used for judging whether the person waveform is a preset waveform;
the environment waveform judging module is used for judging whether the environment waveform is a preset waveform if the person waveform is a preset waveform;
the animal waveform judging module is used for judging whether the animal waveform is a preset waveform if the environment waveform is a preset waveform;
the training module comprises:
the synthesis module is used for synthesizing training data from the person voice data, the environment voice data, and the animal voice data if the animal waveform is a preset waveform;
and the data module is used for carrying out voice training by adopting the training data.
Optionally, the apparatus comprises:
the extraction module is used for extracting audio voice features from the audio data if the audio waveform is not a preset audio waveform;
the merging module is used for merging preset person voice features with the audio voice features to generate training voice features;
and the training voice characteristic module is used for carrying out voice training by adopting the training voice characteristics.
The embodiment of the invention also discloses a device, which comprises:
one or more processors; and
one or more machine-readable media having instructions stored thereon, which when executed by the one or more processors, cause the apparatus to perform one or more methods as described in the embodiments above.
The embodiment of the invention also discloses a computer readable storage medium, which stores a computer program for enabling a processor to execute any one of the methods described in the above embodiments.
The embodiment of the invention has the following advantages: when the artificial intelligence system acquires voice data, current context information is acquired; an audio waveform corresponding to the voice data is acquired according to the context information; it is judged whether the audio waveform is a preset audio waveform; and, if so, voice training is performed using the audio data. The method is simple to operate: it can extract individual language features and distinct voice data from the acquired voice data, mix them with preset voice data, and train on the mixture. This improves the recognition capability of the artificial intelligence system and the accuracy of its matching and recognition, so that user instructions are recognized accurately and the user experience improves, while the computation required during recognition is small and system power consumption is reduced.
Drawings
FIG. 1 is a flowchart illustrating the steps of a first embodiment of a speech training method according to the present invention;
FIG. 2 is a flowchart illustrating the steps of a second embodiment of the speech training method of the present invention;
FIG. 3 is a schematic structural diagram of a first speech training apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Referring to fig. 1, a flowchart of the steps of the first embodiment of the speech training method of the present invention is shown. In this embodiment, the method may be applied to an artificial intelligence system, which may be an application system developed with artificial intelligence or knowledge engineering technology, a knowledge-based software engineering assistance system, an intelligent operating system combining operating-system research with artificial intelligence and cognitive science, or a mobile terminal, computer terminal, or similar computing device. In a particular implementation, the artificial intelligence system may be a voice intelligence system, which may include a voice receiving device for receiving voice data, a recognition device for recognizing the voice data, an infrared sensor, a heat source detector, one or more processors (which may include, but are not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), and a memory for storing data.
The memory may be used to store a computer program, for example a software program and modules of application software such as the computer program corresponding to the voice training method in the embodiment of the present invention; by running the computer program stored in the memory, the processor executes the various functional applications and data processing, that is, implements the method described above. The memory may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory located remotely from the processor; such remote memories may be connected to the mobile terminal through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
In this embodiment, the method may include:
step 101, when the artificial intelligence system acquires voice data, acquiring current context information.
In the present embodiment, the voice data may be voice data input by the user, instruction information from the user, or the like. The current context information may be current environment information, weather information, time information, geographic information, and the like, such as the current geographic location, air humidity, weather conditions, number of persons present, current time, and voice objects. It may also be environmental sound data of the current environment, such as animal sounds, ambient sounds, and object sounds, for example animal sound data, car sound data, or the sound of a pendulum clock.
In alternative embodiments, the artificial intelligence system may be provided with sensing devices that may include heat source sensors, humidity sensors, communication devices, and the like. When the artificial intelligence system obtains the voice data input by the user, the sensing device can be immediately called to obtain the context information.
In this embodiment, the current context information is acquired so that it can be mixed with the user's voice data for training, which improves the recognition accuracy of the artificial intelligence system.
And 102, acquiring an audio waveform corresponding to the voice data according to the contextual information.
In this embodiment, the context information may include current environment information, weather information, time information, geographic information, and the like, such as the current geographic location, air humidity, weather conditions, number of persons present, current time, and voice objects. It may also be environmental sound data of the current environment, such as the speech of multiple users, animal sounds, ambient sounds, and object sounds, for example animal sound data, car sound data, or the sound of a pendulum clock. Different kinds of information correspond to different audio waveforms, and different audio data can be integrated through their waveforms, so the artificial intelligence system can be trained on varied audio data, which enhances its practicality.
Specifically, the different sound data in the context information can be collected separately and the audio waveform corresponding to each obtained; each of these waveforms is then mixed with the audio waveform of the user's voice data, and the mixed audio is used for voice training.
For example, suppose the context information includes a car sound and a dog bark. The audio waveform corresponding to the car sound, the audio waveform corresponding to the dog bark, and the audio waveform corresponding to the voice data input by the user are obtained. The car-sound waveform is then mixed with the waveform of the user's voice data, the dog-bark waveform is likewise mixed with the waveform of the user's voice data, and the two mixed audio signals are each used for voice training, as sketched below.
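As a minimal sketch of this mixing step (assuming mono floating-point signals at a common sample rate; the gain value and the commented loader are illustrative assumptions, not part of the patent):

import numpy as np

def mix_waveforms(user_wave: np.ndarray, context_wave: np.ndarray,
                  context_gain: float = 0.5) -> np.ndarray:
    """Mix a context waveform (e.g. car sound, dog bark) into the user's
    voice waveform to build one training sample."""
    n = min(len(user_wave), len(context_wave))   # align lengths
    mixed = user_wave[:n] + context_gain * context_wave[:n]
    return mixed / np.max(np.abs(mixed))         # normalize to [-1, 1]

# One mixed sample per kind of context sound, as in the example above:
# training_samples = [mix_waveforms(user, car), mix_waveforms(user, dog)]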
In an optional embodiment, the context information includes person information, environment information, and animal information, and the voice data includes person voice data, environment voice data, and animal voice data.
Specifically, the person information may be the number of persons; the environment information may be the geographic position and the number of objects; the animal information may be the number of animals.
And 103, judging whether the audio waveform is a preset audio waveform or not.
In the present embodiment, the preset audio waveform may be the audio waveform of a noisy speech sample, a noiseless speech sample, a background noise sample, a human voice, an animal sound, an object sound, or the like.
After the audio waveform is acquired, it can be judged whether it is the same as the preset audio waveform; if so, the acquired waveform can be mixed with the waveform corresponding to the user's voice data, and the artificial intelligence system performs voice training on the mixture.
And step 104, if the audio waveform is a preset audio waveform, performing voice training by using the audio data.
In this embodiment, when the audio waveform is the same as the preset audio waveform, it can be determined that the waveform corresponding to the acquired voice data is the preset audio waveform; the voice data can then be input directly into a voice training model preset in the artificial intelligence system, which iterates over the data to obtain a training result.
This embodiment thus provides a speech training method in which, when the artificial intelligence system acquires voice data, current context information is acquired; an audio waveform corresponding to the voice data is acquired according to the context information; it is judged whether the audio waveform is a preset audio waveform; and, if so, voice training is performed using the audio data. The method is simple to operate: it can extract individual language features and distinct voice data from the acquired voice data, mix them with preset voice data, and train on the mixture. This improves the recognition capability of the artificial intelligence system and the accuracy of its matching and recognition, so that user instructions are recognized accurately and the user experience improves, while the computation required during recognition is small and system power consumption is reduced.
Referring to fig. 2, a flowchart of the steps of the second embodiment of the speech training method of the present invention is shown. The method may be applied to an artificial intelligence system, which may be an application system developed with artificial intelligence or knowledge engineering technology, a knowledge-based software engineering assistance system, an intelligent operating system combining operating-system research with artificial intelligence and cognitive science, or a mobile terminal, computer terminal, or similar computing device. In a particular implementation, the artificial intelligence system may be a voice intelligence system, which may include a voice receiving device for receiving voice data, a recognition device for recognizing the voice data, an infrared sensor, a heat source detector, one or more processors (which may include, but are not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), and a memory for storing data.
The memory may be used to store a computer program, for example a software program and modules of application software such as the computer program corresponding to the voice training method in the embodiment of the present invention; by running the computer program stored in the memory, the processor executes the various functional applications and data processing, that is, implements the method described above. The memory may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory located remotely from the processor; such remote memories may be connected to the mobile terminal through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
In this embodiment, the method may include:
step 201, when the artificial intelligence system acquires voice data, current context information is acquired.
In the present embodiment, the voice data may be voice data input by the user, instruction information from the user, or the like. The current context information may be current environment information, weather information, time information, geographic information, and the like, such as the current geographic location, air humidity, weather conditions, number of persons present, current time, and voice objects. It may also be environmental sound data of the current environment, such as animal sounds, ambient sounds, and object sounds, for example animal sound data, car sound data, or the sound of a pendulum clock.
In alternative embodiments, the artificial intelligence system may be provided with sensing devices that may include heat source sensors, humidity sensors, communication devices, and the like. When the artificial intelligence system obtains the voice data input by the user, the sensing device can be immediately called to obtain the context information.
In this embodiment, the current context information is acquired so that it can be mixed with the user's voice data for training, which improves the recognition accuracy of the artificial intelligence system.
Step 202, obtaining an audio waveform corresponding to the voice data according to the context information.
In this embodiment, the context information may include current environment information, weather information, time information, geographic information, and the like, such as the current geographic location, air humidity, weather conditions, number of persons present, current time, and voice objects. It may also be environmental sound data of the current environment, such as the speech of multiple users, animal sounds, ambient sounds, and object sounds, for example animal sound data, car sound data, or the sound of a pendulum clock. Different kinds of information correspond to different audio waveforms, and different audio data can be integrated through their waveforms, so the artificial intelligence system can be trained on varied audio data, which enhances its practicality.
Specifically, the different sound data in the context information can be collected separately and the audio waveform corresponding to each obtained; each of these waveforms is then mixed with the audio waveform of the user's voice data, and the mixed audio is used for voice training.
For example, suppose the context information includes a car sound and a dog bark. The audio waveform corresponding to the car sound, the audio waveform corresponding to the dog bark, and the audio waveform corresponding to the voice data input by the user are obtained. The car-sound waveform is then mixed with the waveform of the user's voice data, the dog-bark waveform is likewise mixed with the waveform of the user's voice data, and the two mixed audio signals are each used for voice training.
In an optional embodiment, the context information includes person information, environment information, and animal information, and the voice data includes person voice data, environment voice data, and animal voice data.
Specifically, the person information may be the number of persons; the environment information may be the geographic position and the number of objects; the animal information may be the number of animals.
Optionally, step 202 may include the following sub-steps:
sub-step 2021, determining whether the context information is human information.
In this embodiment, the number of people may be determined. Specifically, the artificial intelligence system can be provided with a heat sensor, the heat sensor can be adopted to acquire heat sources around the artificial intelligence system in a detectable radius range, and the number of the current people can be determined according to the numerical value of the heat sources.
For example, people may be determined to be the heat source at about 37 degrees celsius, and the number of people at about 37 degrees celsius may be calculated to obtain the current number of people.
By acquiring the number of persons in the current period, it is possible to determine whether the acquired voice data includes sound data of the persons. When the number of persons is greater than or equal to one, it may be determined that the acquired voice data includes person voice data; when the number of persons is less than one, it may be determined that the acquired voice data does not include person voice data.
In the substep 2022, if the context information is the personal information, the personal waveform corresponding to the personal voice data is obtained, and whether the context information is the environmental information is determined.
In this embodiment, when the artificial intelligence system determines that the context information is the person information through the recognized heat source, the artificial intelligence system may determine the current number of persons, may determine that the voice data includes person voice data, and may acquire a person waveform corresponding to the person voice data. The human waveform may be an audio waveform corresponding to a human voice. And whether the context information is environment information can be judged after the audio waveform corresponding to the human voice is acquired.
Specifically, the environment information may be object number information or position information of the object. In this embodiment, the artificial intelligence system may be provided with an infrared sensor, and may transmit infrared rays to the surroundings within a set radius by using the infrared sensor, and if the infrared rays bounce, it may be determined that there is an object. The number of rebounded infrared rays can be counted to determine the number of objects.
Whether a specific object exists in the periphery of the artificial intelligence system can be determined through the infrared sensor, so that whether the acquired voice data includes the object voice data can be determined.
In an alternative embodiment, the artificial intelligence system may perform a Fourier transform on the acquired speech data, converting the time-domain speech data into its corresponding frequency-domain spectrum, for example as sketched below.
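A minimal sketch of that transform step, assuming mono speech samples in a NumPy array (the window choice and sample rate are illustrative assumptions):

import numpy as np

def to_spectrum(frame: np.ndarray, sample_rate: int = 16000):
    """Convert one time-domain speech frame to its magnitude spectrum."""
    windowed = frame * np.hanning(len(frame))   # window to reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(windowed))    # one-sided FFT magnitudes
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return freqs, spectrum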
Sub-step 2023, if the context information is environment information, acquiring the environment waveform corresponding to the environment voice data, and judging whether the context information is animal information.
In this embodiment, when the artificial intelligence system determines from the returned infrared rays that the context information is environment information, it can determine the current number of objects, conclude that the voice data includes environment voice data, and acquire the environment waveform corresponding to that data. The environment waveform may be the audio waveform corresponding to the sounds of objects in the environment. After this waveform is acquired, it can be judged whether the context information is also animal information.
Specifically, the animal information may be information on the number of animals or the positions of animals. In this embodiment, the artificial intelligence system may again use the heat sensor to detect heat sources within the detectable radius and determine the current number of animals from their readings.
For example, a heat source above 38 degrees Celsius and below 45 degrees Celsius may be determined to be a domestic animal, and counting such heat sources gives the current number of animals; a sketch of this classification follows.
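The heat-source classification used in sub-steps 2021 and 2023 amounts to simple threshold tests. In the sketch below, the temperature bands follow the text (about 37 degrees Celsius for a person, 38-45 degrees for an animal), while the tolerance around 37 degrees is an assumption:

def classify_heat_sources(temps_celsius):
    """Count persons (~37 C) and animals (38-45 C) among detected heat sources."""
    counts = {"persons": 0, "animals": 0}
    for t in temps_celsius:
        if abs(t - 37.0) <= 0.5:      # "about 37 degrees" -> person (tolerance assumed)
            counts["persons"] += 1
        elif 38.0 < t < 45.0:         # the text's band for animals
            counts["animals"] += 1
    return counts

print(classify_heat_sources([36.8, 39.2, 41.0, 25.0]))  # {'persons': 1, 'animals': 2}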
Sub-step 2024, if the context information is animal information, acquiring the animal waveform corresponding to the animal voice data.
In this embodiment, when the artificial intelligence system determines from the identified heat sources that the context information is animal information, it can determine the current number of animals, conclude that the voice data includes animal voice data, and acquire the animal waveform corresponding to that data. The animal waveform may be the audio waveform corresponding to an animal sound.
Specifically, the artificial intelligence system may also perform a Fourier transform on the acquired voice data, converting the time-domain voice data into its corresponding frequency-domain spectrum.
In this implementation, after the person waveform, the environment waveform, and the animal waveform are acquired, each of the three can be mixed with the audio waveform of the user's voice data, yielding mixed waveforms on which the artificial intelligence system performs voice training. This improves the system's voice training capacity and expands its voice training range.
Step 203, determining whether the audio waveform is a preset audio waveform.
In the present embodiment, the preset audio waveform may be the audio waveform of a noisy speech sample, a noiseless speech sample, a background noise sample, a human voice, an animal sound, an object sound, or the like.
After the audio waveform is acquired, it can be judged whether it is the same as the preset audio waveform; if so, the acquired waveform can be mixed with the waveform corresponding to the user's voice data, and the artificial intelligence system performs voice training on the mixture.
Optionally, step 203 may comprise the sub-steps of:
substep 2031, determining whether the human waveform is a preset waveform.
Substep 2032, if the human waveform is a preset waveform, determining whether the environment waveform is a preset waveform.
Substep 2033, if the environment waveform is a preset waveform, determining whether the animal waveform is a preset waveform.
In this embodiment, the artificial intelligence system may obtain the waveform features of a waveform, such as amplitude, period, wavelength, decibel level, sound power, and sound intensity, together with the waveform features of the preset audio waveform, and then compare whether the person waveform, the environment waveform, and the animal waveform are each the same as the preset waveform.
Specifically, the waveform features of the person waveform (amplitude, period, wavelength, decibel level, sound power, and sound intensity) may be obtained first, followed by the corresponding features of the preset waveform. The two sets of features are then compared; if the person waveform's features are the same as the preset waveform's, the environment waveform's features are obtained and compared against the preset waveform, and the animal waveform's features are then obtained and compared in the same way.
By judging whether each acquired audio waveform is the same as the preset audio waveform, the acquired waveforms can be mixed with the waveform corresponding to the user's voice data and used by the artificial intelligence system for voice training, which improves the practicality of that training and lets the system train on different kinds of voice data.
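A sketch of the feature-by-feature comparison in sub-steps 2031 to 2033 follows; the feature set mirrors the text, while the relative tolerance and the data layout are assumptions made for illustration:

from dataclasses import dataclass, fields
import math

@dataclass
class WaveformFeatures:
    amplitude: float
    period: float
    decibels: float
    sound_power: float

def matches_preset(w: WaveformFeatures, preset: WaveformFeatures,
                   rel_tol: float = 0.05) -> bool:
    """True if every waveform feature is within a relative tolerance of the preset."""
    return all(
        math.isclose(getattr(w, f.name), getattr(preset, f.name), rel_tol=rel_tol)
        for f in fields(WaveformFeatures)
    )

def all_waveforms_preset(person, environment, animal, preset) -> bool:
    """Chain the checks in order: person, then environment, then animal."""
    return (matches_preset(person, preset)
            and matches_preset(environment, preset)
            and matches_preset(animal, preset))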
In another alternative embodiment, step 203 may further include the sub-steps of:
sub-step 2034, if the audio waveform is not a preset audio waveform, extracting audio speech features from the audio data.
In this embodiment, the audio speech features may be text features, unvoiced features, voiced features, and the like.
When the audio waveform differs from the preset audio waveform, features such as text features, unvoiced features, and voiced features can be extracted from the audio data, and each feature can be fed into the artificial intelligence system for voice training.
Specifically, when extracting the voiced feature, voiced sound detection may be performed on the speech signal based on the zero-crossing rate: voiced segments are extracted from the speech signal and taken as the voiced feature, where the zero-crossing threshold is a second threshold greater than the first threshold.
Optionally, when performing voiced detection based on the zero-crossing rate, for two adjacent sampling points tmp1 and tmp2 in a speech frame of the speech signal, the frame is counted as crossing zero once when tmp1 × tmp2 < 0 and |tmp1 - tmp2| > T2 hold simultaneously, where T2 is the second threshold, and the zero-crossing rate of the frame is counted accordingly. The speech frames whose zero-crossing rate exceeds a preset value are then extracted from the signal to form voiced segments. The preset value can be set according to actual needs, and the second threshold T2 is greater than the first threshold T1, preferably 8%-15% (e.g., 10%) of the average magnitude of the speech signal.
Optionally, when performing voiced detection based on the zero-crossing rate, for each adjacent sampling-point pair tmp1 and tmp2 in the speech signal, the zero-crossing indicator is set to 1 when tmp1 × tmp2 < 0 and |tmp1 - tmp2| > T2 hold simultaneously, and to 0 otherwise, where T2 is the second threshold; the electronic device then extracts all sampling-point pairs whose indicator is 1 from the speech signal to form the voiced segments of the corresponding data segments.
For example, voiced detection may be performed using the following formulas:
signs=(tmp1.*tmp2)<0;
diffs=|tmp1-tmp2|>T2;
zcr=(signs.*diffs);
where signs marks the positions at which a zero crossing occurs: tmp1 and tmp2 are the vectors of adjacent sampling points of the speech signal, .* denotes element-wise multiplication of the two vectors, and signs is 1 where the product is less than 0 and 0 otherwise; diffs marks the positions at which the amplitude difference is large enough, being 1 where the absolute difference between tmp1 and tmp2 exceeds the second threshold T2 and 0 otherwise; and zcr is the point-wise zero-crossing rate, which is 1 only where signs and diffs are both 1. In this way the zero-crossing contributions of unvoiced sound and noise are zeroed out entirely, while only the zero-crossing rate of voiced speech is preserved.
The second threshold T2 may be 8%-20% of the average amplitude of the detected speech signal. For example, if the average amplitude is 0.2, then T2 = 0.2 × 10% = 0.02.
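The three formulas above translate almost directly into NumPy. This sketch assumes a mono floating-point signal, and the 10% factor for T2 is the example value from the text:

import numpy as np

def voiced_mask(signal: np.ndarray, t2_ratio: float = 0.10) -> np.ndarray:
    """Point-wise voiced detection mirroring signs, diffs, and zcr above."""
    t2 = t2_ratio * np.mean(np.abs(signal))   # second threshold T2
    tmp1, tmp2 = signal[:-1], signal[1:]      # adjacent sampling-point pairs
    signs = (tmp1 * tmp2) < 0                 # a zero crossing occurred
    diffs = np.abs(tmp1 - tmp2) > t2          # amplitude difference exceeds T2
    return signs & diffs                      # zcr: True only for voiced crossings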
Similarly, the above method may also be used to extract unvoiced sound, and is not described herein again to avoid repetition.
In a specific implementation, the text feature may be an audio feature corresponding to a keyword, such as the audio waveform of the keyword. Specifically, the artificial intelligence system may extract characteristic parameters from the keyword's audio waveform and then perform keyword matching with those parameters, such as Linear Prediction Coefficients (LPC), Perceptual Linear Prediction coefficients (PLP), Linear Prediction Cepstral Coefficients (LPCC), or Mel-Frequency Cepstral Coefficients (MFCC). The audio waveform of the corresponding keyword can be computed from these coefficients, yielding the audio feature corresponding to the keyword.
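As one concrete possibility for the MFCC option named above, the following sketch uses the librosa library; the patent itself names no library, and the file path, sample rate, and coefficient count here are illustrative assumptions:

import librosa

def keyword_mfcc(path: str, n_mfcc: int = 13):
    """Extract MFCCs from a recorded keyword for later matching."""
    y, sr = librosa.load(path, sr=16000, mono=True)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, frames)

# template = keyword_mfcc("keyword.wav")  # hypothetical file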
Sub-step 2035, combining preset person voice features with the audio voice features to generate training voice features.
In this embodiment, after the audio voice features are obtained, they may be combined with person voice features preset by the user to obtain the training voice features used to train the artificial intelligence system.
The artificial intelligence system can then perform voice training with the training voice features, improving its voice recognition capability according to the training result.
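The text does not fix how the two feature sets are combined; stacking them along the feature axis is one plausible reading, sketched here under that assumption:

import numpy as np

def merge_features(preset_person_feats: np.ndarray,
                   audio_feats: np.ndarray) -> np.ndarray:
    """Stack preset person voice features with extracted audio features
    (both assumed shaped (n_features, n_frames)) into one training matrix."""
    frames = min(preset_person_feats.shape[1], audio_feats.shape[1])
    return np.vstack([preset_person_feats[:, :frames], audio_feats[:, :frames]])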
Sub-step 2036, performing voice training using the training voice features.
In this embodiment, the artificial intelligence system can repeatedly perform voice training with the training voice features.
Specifically, the training voice features may be input into a voice training model preset in the artificial intelligence system, which iterates over them to obtain a training result.
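The patent does not specify the voice training model. Purely as an illustration of the "repeated calculation", a tiny gradient-descent loop over a linear stand-in model might look like this:

import numpy as np

def train(weights: np.ndarray, features: np.ndarray, labels: np.ndarray,
          lr: float = 0.01, epochs: int = 100) -> np.ndarray:
    """Repeatedly fit a linear stand-in model to the training features.
    features: (n_features, n_frames); labels: (n_frames,)."""
    w = weights.copy()
    for _ in range(epochs):                      # the repeated calculation
        pred = features.T @ w                    # per-frame predictions
        grad = features @ (pred - labels) / len(labels)
        w -= lr * grad
    return w                                     # the training result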
And 204, if the audio waveform is a preset audio waveform, performing voice training by using the audio data.
In this embodiment, when the audio waveform is the same as the preset audio waveform, it can be determined that the waveform corresponding to the acquired voice data is the preset audio waveform; the voice data can then be input directly into a voice training model preset in the artificial intelligence system, which iterates over the data to obtain a training result.
Optionally, step 204 may include the following sub-steps:
and a substep 2041 of synthesizing training data by using human voice data, environmental voice data and animal voice data if the animal waveform is a preset waveform.
Specifically, when the human waveform, the environment waveform and the animal waveform are the same as the preset waveforms, it can be determined that the human waveform, the environment waveform and the animal waveform are the preset waveforms, and the human voice data, the environment voice data and the animal voice data can be mixed to obtain mixed training data.
Specifically, the human voice data, the environmental voice data, and the animal voice data may be combined end to end, may be combined in any order, may be combined into a mixed data, and the like. The present invention can be modified according to the actual needs, and the present invention is not limited thereto.
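A sketch of the end-to-end option in sub-step 2041, assuming the three signals are mono arrays at a common sample rate (the ordering shown is just one of the orders the text allows):

import numpy as np

def synthesize_training_data(person: np.ndarray, environment: np.ndarray,
                             animal: np.ndarray) -> np.ndarray:
    """Join person, environment, and animal voice data end to end."""
    return np.concatenate([person, environment, animal])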
Sub-step 2042, performing voice training using the training data.
In this embodiment, after the mixed training data is obtained, it may be input directly into the voice training model preset in the artificial intelligence system for repeated calculation, yielding a training result. By training on mixed data, the artificial intelligence system improves its recognition capability and effectively improves its recognition accuracy.
Step 205, generating and storing the training result.
In this embodiment, after performing voice training using training data, the artificial intelligence system may generate a training result and store the training result.
Specifically, the artificial intelligence system can be connected to an external device, such as an external terminal, a server, or a smart device. Through the external device, the user can review the training result of the artificial intelligence system and adjust the training method, or the system itself, accordingly.
In this preferred embodiment of the invention, a voice training method is provided: when the artificial intelligence system acquires voice data, current context information is acquired; an audio waveform corresponding to the voice data is acquired according to the context information; it is judged whether the audio waveform is a preset audio waveform; if so, voice training is performed using the audio data; and a training result is generated and stored. The method is simple to operate: it can extract individual language features and distinct voice data from the acquired voice data, mix them with preset voice data, and train on the mixture. This improves the recognition capability of the artificial intelligence system and the accuracy of its matching and recognition, so that user instructions are recognized accurately and the user experience improves, while the computation required during recognition is small and system power consumption is reduced.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 3, a schematic structural diagram of one embodiment of the speech training apparatus of the present invention is shown, in this embodiment, the apparatus may be applied to an artificial intelligence system, and the apparatus includes:
an obtaining module 301, configured to obtain current context information when the artificial intelligence system obtains voice data;
an audio waveform module 302, configured to obtain an audio waveform corresponding to the voice data according to the context information;
a judging module 303, configured to judge whether the audio waveform is a preset audio waveform;
a training module 304, configured to perform voice training using the audio data if the audio waveform is a preset audio waveform.
Optionally, the context information includes person information, environment information, and animal information, and the voice data includes person voice data, environment voice data, and animal voice data;
the audio waveform module comprises:
a determination module for determining whether the context information is person information;
the environment judgment module is used for acquiring the person waveform corresponding to the person voice data if the context information is person information, and judging whether the context information is environment information;
the animal judgment module is used for acquiring the environment waveform corresponding to the environment voice data if the context information is environment information, and judging whether the context information is animal information;
and the animal waveform module is used for acquiring the animal waveform corresponding to the animal voice data if the context information is animal information.
Optionally, the determining module includes:
the person waveform judging module is used for judging whether the person waveform is a preset waveform;
the environment waveform judging module is used for judging whether the environment waveform is a preset waveform if the person waveform is a preset waveform;
the animal waveform judging module is used for judging whether the animal waveform is a preset waveform or not if the environment waveform is the preset waveform;
the training module comprises:
the synthesis module is used for synthesizing training data from the person voice data, the environment voice data, and the animal voice data if the animal waveform is a preset waveform;
and the data module is used for carrying out voice training by adopting the training data.
Optionally, the apparatus comprises:
the extraction module is used for extracting audio voice features from the audio data if the audio waveform is not a preset audio waveform;
the merging module is used for merging preset person voice features with the audio voice features to generate training voice features;
and the training voice characteristic module is used for carrying out voice training by adopting the training voice characteristics.
Optionally, the apparatus may further include:
and the generating module is used for generating and storing the training result.
This embodiment proposes a speech training apparatus, which may include: an obtaining module 301, configured to acquire current context information when the artificial intelligence system acquires voice data; an audio waveform module 302, configured to acquire the audio waveform corresponding to the voice data according to the context information; a judging module 303, configured to judge whether the audio waveform is a preset audio waveform; and a training module 304, configured to perform voice training using the audio data if the audio waveform is a preset audio waveform. The apparatus is simple to operate: it can extract individual language features and distinct voice data from the voice data, mix them with preset voice data, and train on the mixture, which improves the recognition capability of the artificial intelligence system and the accuracy of its matching and recognition, so that user instructions are recognized accurately and the user experience improves, while the computation required during recognition is small and system power consumption is reduced.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
An embodiment of the present invention further provides an apparatus, including:
the method comprises one or more processors, a memory and a machine-readable medium stored in the memory and capable of running on the processor, wherein the machine-readable medium is implemented by the processor to realize the processes of the method embodiments, and can achieve the same technical effects, and the details are not repeated here to avoid repetition.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when being executed by a processor, the computer program implements the processes of the foregoing method embodiments, and can achieve the same technical effects, and is not described herein again to avoid repetition.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The above detailed description is provided for a voice training method and a voice training device, and the principle and the implementation of the present invention are explained in the present document by applying specific examples, and the description of the above embodiments is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (10)

1. A speech training method is applied to an artificial intelligence system, and comprises the following steps:
when the artificial intelligence system acquires voice data, acquiring current context information;
acquiring an audio waveform corresponding to the voice data according to the context information;
judging whether the audio waveform is a preset audio waveform or not;
and if the audio waveform is a preset audio waveform, performing voice training by using the audio data.
2. The method of claim 1, wherein the context information comprises human information, environment information, and animal information, and the voice data comprises human voice data, environment voice data, and animal voice data;
the obtaining of the audio waveform corresponding to the voice data according to the context information includes:
judging whether the context information is human information;
if the context information is human information, acquiring a human waveform corresponding to the human voice data, and judging whether the context information is environment information;
if the context information is environment information, acquiring an environment waveform corresponding to the environment voice data, and judging whether the context information is animal information;
and if the context information is animal information, acquiring an animal waveform corresponding to the animal voice data.
3. The method of claim 2, wherein the judging whether the audio waveform is a preset audio waveform comprises:
judging whether the human waveform is a preset waveform;
if the human waveform is a preset waveform, judging whether the environment waveform is a preset waveform;
if the environment waveform is a preset waveform, judging whether the animal waveform is a preset waveform;
and the performing voice training by using the voice data if the audio waveform is a preset audio waveform comprises:
if the animal waveform is a preset waveform, synthesizing training data from the human voice data, the environment voice data, and the animal voice data;
and performing voice training by using the training data.
4. The method of claim 2, further comprising:
if the audio waveform is not a preset audio waveform, extracting audio voice features from the voice data;
combining preset human voice features with the audio voice features to generate training voice features;
and performing voice training by using the training voice features.
5. A voice training apparatus, applied to an artificial intelligence system, the apparatus comprising:
an acquisition module, configured to acquire current context information when the artificial intelligence system acquires voice data;
an audio waveform module, configured to acquire an audio waveform corresponding to the voice data according to the context information;
a judging module, configured to judge whether the audio waveform is a preset audio waveform;
and a training module, configured to perform voice training by using the voice data if the audio waveform is a preset audio waveform.
6. The apparatus of claim 5, wherein the context information comprises human information, environment information, and animal information, and the voice data comprises human voice data, environment voice data, and animal voice data;
and the audio waveform module comprises:
a determining module, configured to judge whether the context information is human information;
an environment judging module, configured to acquire a human waveform corresponding to the human voice data if the context information is human information, and to judge whether the context information is environment information;
an animal judging module, configured to acquire an environment waveform corresponding to the environment voice data if the context information is environment information, and to judge whether the context information is animal information;
and an animal waveform module, configured to acquire an animal waveform corresponding to the animal voice data if the context information is animal information.
7. The apparatus of claim 6, wherein the judging module comprises:
a human waveform judging module, configured to judge whether the human waveform is a preset waveform;
an environment waveform judging module, configured to judge whether the environment waveform is a preset waveform if the human waveform is a preset waveform;
and an animal waveform judging module, configured to judge whether the animal waveform is a preset waveform if the environment waveform is a preset waveform;
and the training module comprises:
a synthesizing module, configured to synthesize training data from the human voice data, the environment voice data, and the animal voice data if the animal waveform is a preset waveform;
and a data module, configured to perform voice training by using the training data.
8. The apparatus of claim 6, further comprising:
an extracting module, configured to extract audio voice features from the voice data if the audio waveform is not a preset audio waveform;
a merging module, configured to combine preset human voice features with the audio voice features to generate training voice features;
and a training voice feature module, configured to perform voice training by using the training voice features.
9. An apparatus, comprising:
one or more processors; and
one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform the method of any one of claims 1 to 4.
10. A computer-readable storage medium, characterized in that it stores a computer program for causing a processor to execute the method according to any one of claims 1 to 4.
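Continuing the illustrative sketch that precedes the claims (same caveats: hypothetical names, stand-in matching and feature rules), the fragment below mirrors claims 2 to 4: human, environment, and animal waveforms are judged in cascade; if all three match their presets, the voice streams are mixed into training data (claim 3); otherwise, features extracted from the audio are merged with preset human voice features (claim 4).

# Continuation of the earlier sketch; every name is hypothetical.
from typing import Dict, Optional

import numpy as np


def _matches(a: np.ndarray, b: np.ndarray, threshold: float = 0.8) -> bool:
    """Same stand-in correlation test as matches_preset in the first sketch."""
    n = min(len(a), len(b))
    denom = float(np.linalg.norm(a[:n]) * np.linalg.norm(b[:n]))
    return denom > 0 and float(np.dot(a[:n], b[:n])) / denom >= threshold


def synthesize_training_data(waveforms: Dict[str, np.ndarray],
                             presets: Dict[str, np.ndarray]) -> Optional[np.ndarray]:
    """Claims 2-3: judge human, then environment, then animal waveforms;
    if all three match their presets, mix the sources into training data."""
    for source in ("human", "environment", "animal"):  # cascade order of claim 2
        if source not in waveforms or not _matches(waveforms[source], presets[source]):
            return None
    n = min(len(w) for w in waveforms.values())
    # Naive equal-gain mix, standing in for the unspecified synthesis step.
    return sum(w[:n] for w in waveforms.values()) / 3.0


def fallback_training_features(audio: np.ndarray,
                               preset_human_features: np.ndarray,
                               frame: int = 512) -> np.ndarray:
    """Claim 4: when the waveform fails the preset test, extract features
    (frame energies stand in for real speech features) and merge them with
    preset human voice features to form the training voice features."""
    count = len(audio) // frame
    energies = np.array([float(np.mean(audio[i * frame:(i + 1) * frame] ** 2))
                         for i in range(count)])
    return np.concatenate([preset_human_features, energies])

A caller would try synthesize_training_data first and, when it returns None, fall back to fallback_training_features, matching the branch between claims 3 and 4.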
CN201911414858.1A 2019-12-31 2019-12-31 Voice training method and device Active CN111243573B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911414858.1A CN111243573B (en) 2019-12-31 2019-12-31 Voice training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911414858.1A CN111243573B (en) 2019-12-31 2019-12-31 Voice training method and device

Publications (2)

Publication Number Publication Date
CN111243573A true CN111243573A (en) 2020-06-05
CN111243573B CN111243573B (en) 2022-11-01

Family

ID=70875874

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911414858.1A Active CN111243573B (en) 2019-12-31 2019-12-31 Voice training method and device

Country Status (1)

Country Link
CN (1) CN111243573B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130231929A1 (en) * 2010-11-11 2013-09-05 Nec Corporation Speech recognition device, speech recognition method, and computer readable medium
US20150032451A1 (en) * 2013-07-23 2015-01-29 Motorola Mobility Llc Method and Device for Voice Recognition Training
CN105448303A (en) * 2015-11-27 2016-03-30 百度在线网络技术(北京)有限公司 Voice signal processing method and apparatus
CN108022591A (en) * 2017-12-30 2018-05-11 北京百度网讯科技有限公司 The processing method of speech recognition, device and electronic equipment in environment inside car
CN108335694A (en) * 2018-02-01 2018-07-27 北京百度网讯科技有限公司 Far field ambient noise processing method, device, equipment and storage medium
CN108922517A (en) * 2018-07-03 2018-11-30 百度在线网络技术(北京)有限公司 The method, apparatus and storage medium of training blind source separating model
US20180344968A1 (en) * 2017-06-05 2018-12-06 International Business Machines Corporation Context-sensitive soundscape generation
US20190043482A1 (en) * 2017-08-01 2019-02-07 Baidu Online Network Technology (Beijing) Co., Ltd. Far field speech acoustic model training method and system
CN110021292A (en) * 2019-04-23 2019-07-16 四川长虹空调有限公司 Method of speech processing, device and smart home device
CN110265011A (en) * 2019-06-10 2019-09-20 龙马智芯(珠海横琴)科技有限公司 The exchange method and its electronic equipment of a kind of electronic equipment
CN110544469A (en) * 2019-09-04 2019-12-06 秒针信息技术有限公司 Training method and device of voice recognition model, storage medium and electronic device
US20190392856A1 (en) * 2018-06-20 2019-12-26 International Business Machines Corporation Operating a voice response system based on non-human audio sources

Also Published As

Publication number Publication date
CN111243573B (en) 2022-11-01

Similar Documents

Publication Publication Date Title
US11636851B2 (en) Multi-assistant natural language input processing
Kumar et al. Deep learning based assistive technology on audio visual speech recognition for hearing impaired
US20200184966A1 (en) Wakeword detection
US20210090575A1 (en) Multi-assistant natural language input processing
US20190266998A1 (en) Speech recognition method and device, computer device and storage medium
CN110706690A (en) Speech recognition method and device
CN104700843A (en) Method and device for identifying ages
US11393477B2 (en) Multi-assistant natural language input processing to determine a voice model for synthesized speech
CN109697988B (en) Voice evaluation method and device
CN112614514B (en) Effective voice fragment detection method, related equipment and readable storage medium
US20230368796A1 (en) Speech processing
CN112581938A (en) Voice breakpoint detection method, device and equipment based on artificial intelligence
Izbassarova et al. Speech recognition application using deep learning neural network
JP2009294537A (en) Voice interval detection device and voice interval detection method
CN109697975B (en) Voice evaluation method and device
CN113744730A (en) Sound detection method and device
Yousfi et al. Holy Qur'an speech recognition system Imaalah checking rule for warsh recitation
CN111243573B (en) Voice training method and device
CN111210811B (en) Fundamental tone mixing method and device
CN114724589A (en) Voice quality inspection method and device, electronic equipment and storage medium
Yousfi et al. Isolated Iqlab checking rules based on speech recognition system
CN111899718A (en) Method, apparatus, device and medium for recognizing synthesized speech
Bansod et al. Speaker Recognition using Marathi (Varhadi) Language
CN107039046B (en) Voice sound effect mode detection method based on feature fusion
CN109102810A (en) Method for recognizing sound-groove and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant