CN114596873A - Voice enhancement method and device and robot - Google Patents

Voice enhancement method and device and robot

Info

Publication number
CN114596873A
CN114596873A (application CN202011398131.1A)
Authority
CN
China
Prior art keywords
audio information
probability
voice
generating
characteristic value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011398131.1A
Other languages
Chinese (zh)
Inventor
齐园蕾
李炯亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Mobile Software Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd filed Critical Beijing Xiaomi Mobile Software Co Ltd
Priority to CN202011398131.1A priority Critical patent/CN114596873A/en
Publication of CN114596873A publication Critical patent/CN114596873A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Manipulator (AREA)

Abstract

The application discloses a voice enhancement method, which comprises the following steps: acquiring audio information collected by a robot; generating, according to the audio information, a speech existence probability that speech is included in the audio information; and enhancing the audio information according to the speech existence probability to generate enhanced speech. According to the embodiments of the disclosure, the speech existence probability is generated from the audio information, and during voice enhancement the audio information can be enhanced according to this probability, improving the enhancement effect. The speech existence probability can thus improve the accuracy of the speech enhancement algorithm and, with it, the speech enhancement result.

Description

Voice enhancement method and device and robot
Technical Field
The present disclosure relates to the field of robotics, and in particular, to a method and an apparatus for speech enhancement, a robot, and a storage medium.
Background
With the continuous development of robotics, robotic pets are becoming more and more popular. However, a robotic pet, such as a legged robot, is constantly moving while it interacts with human speech. Unlike traditional fixed smart devices (such as smart speakers), the robotic pet generates considerable noise because of this constant movement, for example drive-motor noise and the mechanical transmission noise of its joints; such noise greatly interferes with speech recognition.
Therefore, the noise interference in the audio collected by the robot needs to be suppressed so that the speech can be enhanced. Because most of the noise in a robot is generated by the robot's own motion and is bursty, a conventional speech enhancement algorithm often cannot accurately estimate the speech existence probability; that is, it is difficult to detect the two endpoints (the start point and the end point) of the speech. Conventional speech enhancement algorithms therefore cannot enhance effectively in the presence of bursty noise.
How to perform speech enhancement against the noise generated by the robot itself, so that subsequent speech recognition becomes more accurate, is therefore a problem that urgently needs to be solved.
Disclosure of Invention
The present disclosure provides a method, an apparatus, and a robot for speech enhancement, and a storage medium, which are used to solve the problem that a user's speech cannot be accurately extracted.
According to an embodiment of an aspect of the present disclosure, there is provided a speech enhancement method, including: acquiring audio information of the robot; generating voice existence probability including voice in the audio information according to the audio information; and enhancing the audio information according to the voice existence probability to generate enhanced voice.
In an embodiment of the present disclosure, the enhancing the audio information according to the speech existence probability to generate an enhanced speech includes: adjusting a gain function of an enhancement algorithm according to the voice existence probability; and generating the audio information into enhanced voice through the adjusted enhancement algorithm.
In an embodiment of the disclosure, the generating, according to the audio information, a speech existence probability including speech in the audio information includes: performing noise elimination on the audio information to generate de-noised audio information; and generating the voice existence probability including voice in the audio information according to the audio information and the de-noising audio information.
In an embodiment of the disclosure, the denoising the audio information to generate denoised audio information includes: inputting the audio information into a neural network model to generate the de-noised audio information, wherein the neural network model is obtained according to noise training generated by the robot.
In an embodiment of the present disclosure, the generating, from the audio information and the de-noised audio information, a speech existence probability including speech in the audio information includes: generating a first characteristic value according to the audio information, and generating a second characteristic value according to the denoising audio information; and generating the voice existence probability according to the first characteristic value and the second characteristic value.
In an embodiment of the disclosure, the generating a first feature value according to the audio information and generating a second feature value according to the de-noised audio information includes: performing a root mean square operation on the audio information to generate the first characteristic value; and performing root mean square operation on the de-noised audio information to generate the second characteristic value.
In an embodiment of the disclosure, the generating the speech existence probability according to the first feature value and the second feature value includes: judging whether the first characteristic value and the second characteristic value are smaller than a first preset threshold value or not; and if the first characteristic value and the second characteristic value are both smaller than the first preset threshold value, judging that the audio information is noise.
In one embodiment of the present disclosure, the method further includes: if the first characteristic value or the second characteristic value is larger than or equal to the first preset threshold value, converting the first characteristic value and the second characteristic value into a first non-negative characteristic value and a second non-negative characteristic value; and generating the voice existence probability according to the first non-negative characteristic value and the second non-negative characteristic value.
In an embodiment of the disclosure, the generating the speech presence probability according to the first non-negative eigenvalue and the second non-negative eigenvalue includes: obtaining a difference value generated by subtracting the second non-negative characteristic value from the first non-negative characteristic value; if the difference value is smaller than a second preset threshold value, judging that the audio information comprises voice; and if the difference is greater than or equal to the second preset threshold, generating the voice existence probability according to the difference and the second non-negative characteristic value.
In an embodiment of the disclosure, the generating the speech existence probability according to the difference value and the second non-negative eigenvalue includes: dividing the difference by the second non-negative eigenvalue to generate a probability that the audio information is noise according to the difference; and generating the voice existence probability according to the probability that the audio information is noise.
According to still another aspect of the embodiments of the present disclosure, there is also provided a speech enhancement apparatus, including: the audio acquisition module is used for acquiring audio information acquired by the robot; the probability generation module is used for generating the voice existence probability including voice in the audio information according to the audio information; and the enhancement module is used for enhancing the audio information according to the voice existence probability so as to generate enhanced voice.
In one embodiment of the disclosure, the enhancement module includes: a gain function adjusting submodule for adjusting a gain function of an enhancement algorithm according to the voice existence probability; and the enhancement submodule is used for generating the audio information into enhanced voice through the adjusted enhancement algorithm.
In one embodiment of the disclosure, the probability generation module includes: a denoising submodule for denoising the audio information to generate denoised audio information; and a probability generation submodule, configured to generate a speech existence probability including speech in the audio information according to the audio information and the denoising audio information.
In an embodiment of the disclosure, the denoising submodule inputs the audio information into a neural network model to generate the denoised audio information, wherein the neural network model is obtained according to noise training of the robot.
In one embodiment of the present disclosure, the probability generation submodule includes: the characteristic value generating unit is used for generating a first characteristic value according to the audio information and generating a second characteristic value according to the denoising audio information; and a probability generating unit, configured to generate the speech existence probability according to the first feature value and the second feature value.
In an embodiment of the present disclosure, the eigenvalue generation unit performs a root mean square operation on the audio information to generate the first eigenvalue, and performs a root mean square operation on the denoised audio information to generate the second eigenvalue.
In an embodiment of the disclosure, the probability generating unit determines that the audio information is noise when both the first characteristic value and the second characteristic value are smaller than a first preset threshold.
In an embodiment of the present disclosure, the probability generating unit converts the first feature value and the second feature value into a first non-negative feature value and a second non-negative feature value when the first feature value or the second feature value is greater than or equal to the first preset threshold, and generates the voice presence probability according to the first non-negative feature value and the second non-negative feature value.
In an embodiment of the disclosure, the probability generating unit obtains a difference value generated by subtracting the second non-negative characteristic value from the first non-negative characteristic value, determines that the audio information includes a voice when the difference value is smaller than a second preset threshold, and generates the voice existence probability according to the difference value and the second non-negative characteristic value when the difference value is greater than or equal to the second preset threshold.
In an embodiment of the present disclosure, the probability generating unit generates the probability that the audio information is noise according to the dividing of the difference by the second non-negative eigenvalue, and generates the speech existence probability according to the probability that the audio information is noise.
According to yet another aspect of the embodiment of the present disclosure, there is also provided a robot including the apparatus as described above.
According to still another aspect of the present disclosure, there is also provided an electronic apparatus, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to yet another aspect of the present disclosure, there is also provided a non-transitory computer-readable storage medium having stored thereon computer instructions for causing the computer to perform the method as described above.
According to the embodiment of the disclosure, the voice existence probability of the voice is generated according to the audio information, and the audio information can be enhanced according to the voice existence probability when the voice enhancement is carried out, so that the enhancement effect is improved. In the embodiment of the disclosure, the accuracy of the speech enhancement algorithm can be improved through the speech existence probability, so that the speech enhancement effect is realized. In the embodiment of the present disclosure, since the effect of voice enhancement can be improved, a command for voice can be accurately extracted, thereby improving the response accuracy of the robot.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a flow diagram of a speech enhancement method according to one embodiment of the present disclosure;
FIG. 2 is a flow chart of a method for generating a probability of speech presence according to an embodiment of the present disclosure;
FIG. 3 is a flow chart of a method of generating a probability of speech presence according to one embodiment of the present disclosure;
FIG. 4 is a block diagram of a speech enhancement device according to an embodiment of the present disclosure;
FIG. 5 is a block diagram of an electronic device for a method of speech enhancement according to one embodiment of the present disclosure.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of those embodiments to aid understanding, and these details should be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Descriptions of well-known functions and constructions are likewise omitted from the following description for clarity and conciseness.
Fig. 1 is a flowchart of a speech enhancement method according to an embodiment of the present disclosure. In one embodiment of the present disclosure, the speech enhancement method may be run in a robot, such as a robotic pet, or a legged robot. Of course, in other embodiments of the present disclosure, the method may also be executed in the mobile terminal of the user or in the server. In one embodiment of the present disclosure, a legged robot includes multiple legs, such as a quadruped or biped robot, or other legged robots. The method comprises the following steps:
and step 110, acquiring audio information collected by the robot.
In one embodiment of the present disclosure, a microphone may be disposed on the body of a robot, such as a legged robot, and audio may be collected by the microphone to generate the audio information. In other embodiments of the present disclosure, a plurality of microphones may be disposed on the body of the robot, for example six microphones: two disposed at the front and rear of the body, and four disposed on the two sides of the body. In the embodiments of the present disclosure, robots, particularly legged robots, are constantly moving; a legged robot may therefore be far from the user and its position is not fixed, so providing a plurality of microphones with different orientations helps it better receive the user's voice commands. Depending on the signal-to-noise ratio of each microphone, the microphone with the highest signal-to-noise ratio may be selected as the speech input. A microphone with a high signal-to-noise ratio is likely to be relatively close to the user, or oriented toward the user, so the user's speech is received more clearly.
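As an illustrative sketch (not the patent's specified implementation), the microphone-selection step can be written as follows; estimating each channel's SNR from speech-active frames and noise-only frames is an assumption made for the example:

```python
import math

def snr_db(speech_frames, noise_frames):
    """Estimate a channel's SNR in dB from speech-active samples and noise-only samples."""
    def power(samples):
        return sum(x * x for x in samples) / len(samples)
    return 10.0 * math.log10(power(speech_frames) / power(noise_frames))

def pick_microphone(channels):
    """channels: one (speech_frames, noise_frames) pair per microphone.
    Returns the index of the microphone with the highest estimated SNR,
    i.e., the channel on which the user's speech is clearest."""
    return max(range(len(channels)),
               key=lambda i: snr_db(channels[i][0], channels[i][1]))
```

Because the robot moves relative to the user, a natural refinement would be to re-estimate the SNR per utterance rather than once.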
And step 130, generating a voice existence probability including voice in the audio information according to the audio information.
In one embodiment of the present disclosure, the speech existence probability is generated according to the audio information, so that a basis can be provided for subsequent speech enhancement.
And 150, enhancing the audio information according to the voice existence probability to generate enhanced voice.
In one embodiment of the present disclosure, the gain function of the speech enhancement algorithm may be adjusted by the speech presence probability, thereby improving the accuracy of the speech enhancement algorithm. Firstly, a gain function of an enhancement algorithm is adjusted according to the voice existence probability, and then the audio information is generated into enhanced voice through the adjusted voice enhancement algorithm.
In one embodiment of the present disclosure, the speech enhancement algorithm may be an OM-LSA (Optimally Modified Log-Spectral Amplitude) speech enhancement algorithm, although in other embodiments of the present disclosure, other speech enhancement algorithms may be used. Taking the OM-LSA speech enhancement algorithm as an example, its gain function can be updated by the following formula.
G(k,l) = [G_H1(k,l)]^p(k,l) · [G_min]^(1 − p(k,l))
where G(k,l) is the gain function for frequency bin k and frame l, p(k,l) is the speech presence probability, G_H1(k,l) is the gain under the hypothesis that speech is present, and G_min is the minimum gain floor applied when speech is absent.
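A minimal sketch of applying this probability-weighted gain per time-frequency bin follows; the floor G_min = 0.1 is an illustrative value, and G_H1 would be supplied by the underlying log-spectral-amplitude estimator, which is not shown here:

```python
def omlsa_gain(g_h1, p, g_min=0.1):
    """Probability-weighted gain G = G_H1**p * G_min**(1 - p).
    When p = 1 (speech surely present) the full gain g_h1 is applied;
    when p = 0 (noise only) the output is attenuated to the floor g_min."""
    return (g_h1 ** p) * (g_min ** (1.0 - p))

def enhance_bin(amplitude, g_h1, p, g_min=0.1):
    """Apply the gain to one spectral amplitude."""
    return omlsa_gain(g_h1, p, g_min) * amplitude
```

The geometric weighting means an inaccurate p pulls every bin toward the wrong extreme, which is why the patent focuses on estimating p robustly under bursty robot noise.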
Fig. 2 is a flowchart of a method for generating a speech existence probability according to an embodiment of the present disclosure. Through the following steps, speech can be recognized and the probability that speech is present in the collected audio information can be generated.
And step 210, acquiring audio information collected by the robot.
In one embodiment of the present disclosure, the audio collected by the robot may contain the user's voice as well as environmental noise, and because the robot is constantly moving, it may also contain noise generated by the robot's own movement, for example the noise of its drive motors, the noise of its moving joints, and the noise produced when its feet strike the ground. If the robot were in a quiet state, detection could be performed on features such as the flatness of the speech spectrum, for example by detecting the change from silence to the appearance of speech. For a robot, however, the self-generated noise is not stationary: because the robot's motion state is not fixed, its noise is bursty (i.e., transient interference), and such interference makes it very difficult, in a voice detection scenario, to determine whether the collected audio information includes speech.
Step 230, performing noise elimination on the audio information to generate de-noised audio information.
In one embodiment of the present disclosure, the audio information may be input into a neural network model to generate the de-noised audio information, wherein the neural network model is trained on the noise of the robot itself. The embodiments of the present disclosure do not limit the structure of the neural network model; structures such as a DNN (Deep Neural Network), an RNN (Recurrent Neural Network), or a CNN (Convolutional Neural Network) may be used.
In the embodiments of the present disclosure, although the robot's own noise is bursty as described above, its spectrum is fixed. For example, the noise of the motor drive in the robot, the noise of joint rotation, and the collision noise produced when the legged robot's feet strike the ground are basically fixed, so these noises can be sampled and input into the neural network model as training samples. In other embodiments of the present disclosure, although different floor surfaces may generate different noises, the range of floor materials is limited, so different floor materials can also be sampled to refine the neural network model.
In the embodiment of the present disclosure, performing noise elimination on the audio information to generate de-noised audio information means that the spectrum of the robot's noise, as learned by the neural network model, is filtered out of the audio information; because the noise has been filtered out, the result is referred to as the de-noised audio information.
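The patent performs this filtering with a trained neural network model; as a simple stand-in that illustrates the same idea of removing a known noise spectrum, the sketch below subtracts a pre-learned noise template per frequency bin (plain spectral subtraction, not the patent's model):

```python
def denoise_frame(magnitudes, noise_template, floor=0.0):
    """Remove a learned noise spectrum from one frame of magnitude values.
    Each bin is reduced by the corresponding template value and clamped
    at `floor` so magnitudes remain non-negative."""
    return [max(m - n, floor) for m, n in zip(magnitudes, noise_template)]
```

A frame containing only robot noise comes out close to zero, while a frame containing speech keeps most of its energy, which is exactly the contrast the next step exploits.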
And step 250, generating the voice existence probability including voice in the audio information according to the audio information and the de-noising audio information.
In one embodiment of the present disclosure, because the noise (the robot's own noise) is removed from the audio information by the neural network model, the de-noised audio information can be obtained, and comparing it with the original audio information yields the probability that speech is present. For example, if the audio information contains only noise, the de-noised audio information is very small relative to the audio information (because most of it, being noise, has been filtered out), so the probability that the audio information includes speech can be judged to be very low, even zero: it is all noise. Conversely, if the de-noised audio information does not differ much from the audio information, very little noise was removed and a great deal of speech remains, so the probability that the audio information includes speech is very high.
According to the embodiment of the disclosure, noise elimination can be performed on the audio information collected by the robot to generate de-noised audio information, and the speech existence probability of speech in the audio information can then be generated from the audio information and the de-noised audio information, improving the accuracy of speech extraction. In this way, the user's speech can be effectively and accurately extracted from the audio information collected by the robot, providing a basis for subsequent speech recognition and other processing.
Fig. 3 is a flowchart of a method for generating a speech existence probability according to an embodiment of the present disclosure. This embodiment is only one way of generating the speech existence probability; with reference to it, those skilled in the art may propose other methods, which should also fall within the protection scope of the present disclosure.
Step 310, generating a first characteristic value according to the audio information, and generating a second characteristic value according to the de-noised audio information.
In one embodiment of the present disclosure, a root mean square operation is performed on the audio information to generate a first eigenvalue, and a root mean square operation is performed on the denoised audio information to generate a second eigenvalue. In other embodiments of the present disclosure, the root mean square is only one implementation manner, and the measurement of the audio information and the denoised audio information may also be implemented by other operations.
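The feature extraction of step 310 can be sketched as a plain root mean square over one frame of samples (as the text notes, other measures could equally be used; the sample values below are placeholders for illustration):

```python
import math

def rms(frame):
    """Root mean square of one frame of audio samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

# The first characteristic value comes from the raw audio frame and the
# second from the corresponding de-noised frame.
first_value = rms([0.2, -0.4, 0.3, -0.1])
second_value = rms([0.1, -0.2, 0.15, -0.05])
```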
And step 330, generating a voice existence probability according to the first characteristic value and the second characteristic value.
In one embodiment of the present disclosure, it may be determined whether the first characteristic value and the second characteristic value are less than a first preset threshold. And if the first characteristic value and the second characteristic value are both smaller than a first preset threshold value, judging that the audio information is noise. As described above, if both the first eigenvalue and the second eigenvalue are smaller than the first preset threshold, it means that both the audio information and the de-noised audio information are very small, and therefore the audio information is noise, that is, the speech existence probability is 0.
In another embodiment of the present disclosure, if the first characteristic value or the second characteristic value is greater than or equal to the first preset threshold, speech may be present in the audio information, so a further determination is needed. Specifically, the first characteristic value and the second characteristic value are converted into a first non-negative characteristic value and a second non-negative characteristic value, and the speech existence probability is then generated from these two non-negative values. In one embodiment of the present disclosure, because the first and second characteristic values may be negative, both need to be mapped into a non-negative domain. In one embodiment of the present disclosure, a preset value may be added to the first characteristic value to generate the first non-negative characteristic value, and the same preset value may be added to the second characteristic value to generate the second non-negative characteristic value. In one embodiment of the present disclosure, the preset value may be 100; of course, in other embodiments of the present disclosure, a larger or smaller preset value may be chosen as needed.
In an embodiment of the disclosure, the second non-negative characteristic value may be subtracted from the first non-negative characteristic value to generate a difference value, and the determination is then made according to this difference. If the difference is smaller than a second preset threshold (for example, 0), the audio information is judged to include speech, that is, the speech existence probability is 1; in this case the second non-negative characteristic value is greater than the first, so the audio information is predominantly speech. Otherwise, if the difference is greater than or equal to the second preset threshold, the speech existence probability is generated from the difference and the second non-negative characteristic value. Specifically, the difference is divided by the second non-negative characteristic value to generate the probability that the audio information is noise, and the speech existence probability is then obtained by subtracting this noise probability from 1.
In one embodiment of the present disclosure, after the probability that the audio information is noise is generated, a bounds check is also required to limit this probability to the range 0 to 1. For example, if the computed probability is less than 0 it is set to 0, and if it is greater than 1 it is set to 1. Extreme cases are thereby filtered out and accuracy is improved.
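Putting steps 310 to 330 together, the decision logic described above can be sketched as follows; the feature scale, both thresholds, and the choice of 100 as the offset (the preset value mentioned earlier) are illustrative assumptions, and the offset is assumed large enough to make both shifted values positive:

```python
def speech_presence_probability(f1, f2, thresh1=-80.0, thresh2=0.0, offset=100.0):
    """Sketch of the decision logic described above.
    f1, f2: first/second characteristic values (e.g., RMS on a log scale)
    of the raw and de-noised audio. Returns a probability in [0, 1]."""
    # Both features below the first threshold: the frame is pure noise.
    if f1 < thresh1 and f2 < thresh1:
        return 0.0
    # Shift both features into the non-negative domain.
    n1, n2 = f1 + offset, f2 + offset
    diff = n1 - n2
    # De-noised energy at or above the raw energy: treat as speech.
    if diff < thresh2:
        return 1.0
    # Otherwise diff / n2 approximates the probability that the frame is
    # noise; clamp it to [0, 1] and take the complement.
    p_noise = min(max(diff / n2, 0.0), 1.0)
    return 1.0 - p_noise
```

The returned value can feed the gain-function update described earlier, with the clamping step guaranteeing a valid probability.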
According to the embodiment of the disclosure, noise elimination can be performed on the audio information collected by the robot to generate de-noised audio information, and the speech existence probability of speech in the audio information can then be generated from the audio information and the de-noised audio information, improving the accuracy of speech extraction. In this way, the user's speech can be effectively and accurately extracted from the audio information collected by the robot, providing a basis for subsequent speech recognition and other processing.
Fig. 4 is a block diagram of a speech enhancement apparatus according to an embodiment of the present disclosure. The speech enhancement apparatus 400 includes an audio acquisition module 410, a probability generation module 420, and an enhancement module 430. The audio obtaining module 410 is used for obtaining audio information collected by the robot. The probability generating module 420 is configured to generate a speech existence probability including speech in the audio information according to the audio information. The enhancement module 430 is configured to enhance the audio information according to the speech presence probability to generate an enhanced speech.
In one embodiment of the present disclosure, the enhancement module 430 includes a gain function adjustment submodule 431 and an enhancer module 432. The gain function adjustment submodule 431 is configured to adjust a gain function of the enhancement algorithm according to the speech presence probability. The enhancement submodule 432 is configured to generate an enhanced speech from the audio information by the adjusted enhancement algorithm.
In one embodiment of the present disclosure, the gain function of the speech enhancement algorithm may be adjusted by the speech existence probability, thereby improving the accuracy of the speech enhancement algorithm. First, the gain function of the enhancement algorithm is adjusted according to the speech existence probability, and then enhanced speech is generated from the audio information by the adjusted speech enhancement algorithm.
In one embodiment of the present disclosure, the speech enhancement algorithm may be an OM-LSA (Optimally-Modified Log-Spectral Amplitude) speech enhancement algorithm, although in other embodiments of the present disclosure other speech enhancement algorithms may be used. Taking the OM-LSA speech enhancement algorithm as an example, its gain function can be updated by the following formula.
G(k, l) = {G_H1(k, l)}^{p(k, l)} · {G_min}^{1 − p(k, l)}

Where G(k, l) is the gain function, p(k, l) is the speech existence probability, G_H1(k, l) is the conditional gain under the hypothesis that speech is present, and G_min is the lower bound (spectral floor) of the gain.
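As an illustration, this OM-LSA gain update can be written as a short function (a hedged sketch: the function name and the −25 dB default floor are assumptions, the floor being a common choice in the OM-LSA literature rather than a value given in this disclosure):

```python
import numpy as np

def update_gain(g_h1, p, g_min=10 ** (-25 / 20)):
    """OM-LSA style gain update: G = G_H1**p * G_min**(1 - p).

    g_h1: conditional gain under the speech-present hypothesis,
    p: speech existence probability, both per time-frequency bin.
    """
    g_h1 = np.asarray(g_h1, dtype=float)
    p = np.asarray(p, dtype=float)
    # Geometric interpolation between the full conditional gain (p = 1)
    # and the spectral floor (p = 0).
    return (g_h1 ** p) * (g_min ** (1.0 - p))
```

When p(k, l) = 1 the full conditional gain is applied; when p(k, l) = 0 the gain falls back to the floor G_min, which attenuates the bin rather than zeroing it out.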
In one embodiment of the present disclosure, the probability generation module 420 includes a denoising submodule 421 and a probability generation submodule 422. The denoising submodule 421 is configured to perform noise cancellation on the audio information to generate denoised audio information. The probability generation sub-module 422 is configured to generate a speech existence probability including speech in the audio information according to the audio information and the de-noised audio information.
In one embodiment of the present disclosure, the denoising submodule 421 inputs the audio information into a neural network model to generate the de-noised audio information, wherein the neural network model is trained on noise generated by the robot itself.
In one embodiment of the present disclosure, the probability generation submodule 422 includes a feature value generation unit and a probability generation unit. The characteristic value generating unit is used for generating a first characteristic value according to the audio information and generating a second characteristic value according to the de-noised audio information. The probability generating unit is used for generating the voice existence probability according to the first characteristic value and the second characteristic value.
In an embodiment of the present disclosure, the characteristic value generating unit performs a root mean square operation on the audio information to generate the first characteristic value, and performs a root mean square operation on the de-noised audio information to generate the second characteristic value.
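The root mean square operation on a frame of samples can be sketched as follows (the function name is illustrative; frame length and windowing are not specified in this disclosure):

```python
import numpy as np

def rms(frame) -> float:
    """Root-mean-square characteristic value of one audio frame."""
    frame = np.asarray(frame, dtype=float)
    return float(np.sqrt(np.mean(frame ** 2)))
```

Applied to the raw frame this yields the first characteristic value, and applied to the de-noised frame it yields the second.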
In an embodiment of the disclosure, the probability generation unit determines that the audio information is noise when both the first characteristic value and the second characteristic value are smaller than a first preset threshold.
In one embodiment of the present disclosure, the probability generation unit converts the first feature value and the second feature value into a first non-negative feature value and a second non-negative feature value when the first feature value or the second feature value is greater than or equal to a first preset threshold, and generates the voice presence probability according to the first non-negative feature value and the second non-negative feature value.
In an embodiment of the disclosure, the probability generating unit obtains a difference value generated by subtracting the second non-negative characteristic value from the first non-negative characteristic value, determines that the audio information includes voice when the difference value is smaller than a second preset threshold, and generates the voice existence probability according to the difference value and the second non-negative characteristic value when the difference value is greater than or equal to the second preset threshold.
In one embodiment of the present disclosure, the probability generation unit divides the difference value by the second non-negative characteristic value to generate the probability that the audio information is noise, and generates the speech existence probability according to the probability that the audio information is noise.
In one embodiment of the disclosure, a robot is also disclosed, comprising the apparatus as described above.
In an embodiment of the present disclosure, there is also disclosed an electronic apparatus including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above. The electronic device may be a robot, such as a legged robot, a mobile terminal of a user, a server, or other similar devices.
In one embodiment of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method as described above is also presented.
In addition, according to the embodiment of the disclosure, the speech existence probability is generated from the audio information, and the audio information can be enhanced according to this probability when speech enhancement is performed, so that the enhancement effect is improved. In the embodiment of the disclosure, the accuracy of the speech enhancement algorithm can be improved through the speech existence probability, thereby achieving a better speech enhancement effect. Since the effect of speech enhancement is improved, voice commands can be accurately extracted, which improves the response accuracy of the robot.
Fig. 5 is a block diagram of an electronic device according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 5, the electronic device includes: one or more processors Y01, a memory Y02, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, as desired, along with multiple memories. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 5, one processor Y01 is taken as an example.
The memory Y02 is a non-transitory computer-readable storage medium as provided herein. The memory stores instructions executable by at least one processor to cause the at least one processor to perform the speech enhancement method provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the speech enhancement method provided by the present application.
The memory Y02 is a non-transitory computer readable storage medium that can be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the speech enhancement methods in the embodiments of the present application. The processor Y01 executes various functional applications of the server and data processing, i.e., implements the voice enhancement method in the above-described method embodiments, by executing non-transitory software programs, instructions, and modules stored in the memory Y02.
The memory Y02 may include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required by at least one function, and the storage data area may store data created according to the use of the electronic device of the speech enhancement method, and the like. Additionally, the memory Y02 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory Y02 may optionally include memory located remotely from the processor Y01, and such remote memory may be connected to the electronic device of the speech enhancement method via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the speech enhancement method may further include: an input device Y03 and an output device Y04. The processor Y01, the memory Y02, the input device Y03 and the output device Y04 may be connected by a bus or other means, and fig. 5 illustrates the connection by a bus.
The input device Y03 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device of the speech enhancement method, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointer, one or more mouse buttons, a track ball, a joystick, or other input device. The output device Y04 may include a display device, an auxiliary lighting device (e.g., LED), a tactile feedback device (e.g., vibration motor), and the like. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and VPS ("Virtual Private Server") services.
It should be understood that the various forms of flow shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present application is not limited in this respect as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (23)

1. A method of speech enhancement, comprising:
acquiring audio information collected by a robot;
generating voice existence probability including voice in the audio information according to the audio information;
and enhancing the audio information according to the voice existence probability to generate enhanced voice.
2. The method of claim 1, wherein enhancing the audio information according to the voice existence probability to generate enhanced voice comprises:
adjusting a gain function of an enhancement algorithm according to the voice existence probability;
and generating the audio information into enhanced voice through the adjusted enhancement algorithm.
3. The method of claim 1, wherein the generating a probability of existence of speech, including speech, among the audio information from the audio information comprises:
performing noise elimination on the audio information to generate de-noised audio information; and
and generating the voice existence probability including voice in the audio information according to the audio information and the de-noising audio information.
4. The method of claim 3, wherein said denoising the audio information to generate denoised audio information, comprises:
inputting the audio information into a neural network model to generate the de-noised audio information, wherein the neural network model is obtained according to noise training generated by the robot.
5. The method of claim 3, wherein generating the probability of existence of speech, including speech, among the audio information from the audio information and the de-noised audio information comprises:
generating a first characteristic value according to the audio information, and generating a second characteristic value according to the de-noising audio information;
and generating the voice existence probability according to the first characteristic value and the second characteristic value.
6. The method of claim 5, wherein generating a first characteristic value according to the audio information and a second characteristic value according to the de-noised audio information comprises:
performing a root mean square operation on the audio information to generate the first characteristic value; and
performing a root mean square operation on the de-noised audio information to generate the second characteristic value.
7. The method of claim 5, wherein the generating the voice existence probability according to the first characteristic value and the second characteristic value comprises:
judging whether the first characteristic value and the second characteristic value are smaller than a first preset threshold value or not;
and if the first characteristic value and the second characteristic value are both smaller than the first preset threshold value, judging that the audio information is noise.
8. The method of claim 7, further comprising:
if the first characteristic value or the second characteristic value is larger than or equal to the first preset threshold value, converting the first characteristic value and the second characteristic value into a first non-negative characteristic value and a second non-negative characteristic value;
and generating the voice existence probability according to the first non-negative characteristic value and the second non-negative characteristic value.
9. The method of claim 8, wherein the generating the voice existence probability according to the first non-negative characteristic value and the second non-negative characteristic value comprises:
obtaining a difference value generated by subtracting the second non-negative characteristic value from the first non-negative characteristic value;
if the difference value is smaller than a second preset threshold value, judging that the audio information comprises voice;
and if the difference is greater than or equal to the second preset threshold, generating the voice existence probability according to the difference and the second non-negative characteristic value.
10. The method of claim 9, wherein generating the voice existence probability according to the difference value and the second non-negative characteristic value comprises:
dividing the difference value by the second non-negative characteristic value to generate a probability that the audio information is noise;
and generating the voice existence probability according to the probability that the audio information is noise.
11. A speech enhancement apparatus, comprising:
the audio acquisition module is used for acquiring audio information acquired by the robot;
the probability generation module is used for generating the voice existence probability including voice in the audio information according to the audio information;
and the enhancement module is used for enhancing the audio information according to the voice existence probability so as to generate enhanced voice.
12. The apparatus of claim 11, wherein the boost module comprises:
a gain function adjusting submodule for adjusting a gain function of an enhancement algorithm according to the voice existence probability;
and the enhancement submodule is used for generating the audio information into enhanced voice through the adjusted enhancement algorithm.
13. The apparatus of claim 11, wherein the probability generation module comprises:
a denoising submodule, configured to perform noise cancellation on the audio information to generate denoised audio information; and
and the probability generation submodule is used for generating the voice existence probability including voice in the audio information according to the audio information and the denoising audio information.
14. The apparatus of claim 13, wherein the de-noising sub-module inputs the audio information into a neural network model to generate the de-noised audio information, wherein the neural network model is derived from noise training of the robot itself.
15. The apparatus of claim 13, wherein the probability generation submodule comprises:
the characteristic value generating unit is used for generating a first characteristic value according to the audio information and generating a second characteristic value according to the denoising audio information;
and a probability generating unit, configured to generate the speech existence probability according to the first feature value and the second feature value.
16. The apparatus of claim 15, wherein the characteristic value generating unit performs a root mean square operation on the audio information to generate the first characteristic value, and performs a root mean square operation on the de-noised audio information to generate the second characteristic value.
17. The apparatus of claim 15, wherein the probability generating unit determines that the audio information is noise when both the first characteristic value and the second characteristic value are smaller than a first preset threshold.
18. The apparatus of claim 17, wherein the probability generating unit converts the first characteristic value and the second characteristic value into a first non-negative characteristic value and a second non-negative characteristic value when the first characteristic value or the second characteristic value is greater than or equal to the first preset threshold, and generates the voice existence probability according to the first non-negative characteristic value and the second non-negative characteristic value.
19. The apparatus of claim 18, wherein the probability generating unit obtains a difference value generated by subtracting the second non-negative characteristic value from the first non-negative characteristic value, determines that the audio information includes voice when the difference value is smaller than a second preset threshold, and generates the voice existence probability according to the difference value and the second non-negative characteristic value when the difference value is greater than or equal to the second preset threshold.
20. The apparatus of claim 19, wherein the probability generating unit divides the difference value by the second non-negative characteristic value to generate the probability that the audio information is noise, and generates the voice existence probability according to the probability that the audio information is noise.
21. A robot, characterized in that it comprises a device according to any of claims 11-20.
22. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
23. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-10.
CN202011398131.1A 2020-12-04 2020-12-04 Voice enhancement method and device and robot Pending CN114596873A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011398131.1A CN114596873A (en) 2020-12-04 2020-12-04 Voice enhancement method and device and robot


Publications (1)

Publication Number Publication Date
CN114596873A true CN114596873A (en) 2022-06-07

Family

ID=81803036

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011398131.1A Pending CN114596873A (en) 2020-12-04 2020-12-04 Voice enhancement method and device and robot

Country Status (1)

Country Link
CN (1) CN114596873A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination