CN110942779A - Noise processing method, device and system - Google Patents

Noise processing method, device and system

Info

Publication number
CN110942779A
Authority
CN
China
Prior art keywords
voice information
target user
information
voice
audio information
Prior art date
Legal status
Pending
Application number
CN201911106466.9A
Other languages
Chinese (zh)
Inventor
吴科苇
刘兵兵
刘如意
王峰
车洋
Current Assignee
Suning Cloud Computing Co Ltd
Original Assignee
Suning Cloud Computing Co Ltd
Priority date
Filing date
Publication date
Application filed by Suning Cloud Computing Co Ltd filed Critical Suning Cloud Computing Co Ltd
Priority to CN201911106466.9A priority Critical patent/CN110942779A/en
Publication of CN110942779A publication Critical patent/CN110942779A/en
Priority to CA3160740A priority patent/CA3160740A1/en
Priority to PCT/CN2020/105992 priority patent/WO2021093380A1/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L2021/02082: Noise filtering, the noise being echo or reverberation of the speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • User Interface Of Digital Computer (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The embodiments of the present application disclose a noise processing method, device and system. The method comprises: detecting collected audio information; when voice information is detected, filtering the voice information according to pre-stored audio information of a target user; judging whether voice information remains after the filtering; and if so, recognizing the filtered voice information and providing corresponding feedback according to the recognition result. The application uses previously acquired speech of the target user as prior information. When a non-target user issues an instruction, that instruction can be suppressed according to the prior information; when the target user issues an instruction amid other voice interference or environmental noise, interfering voices and noise from the same direction, nearby directions and distant positions can likewise be suppressed. An instruction free of other voices and environmental noise is thus obtained, improving the clarity of the target user's voice and the interaction experience.

Description

Noise processing method, device and system
Technical Field
The invention belongs to the field of acoustics, and particularly relates to a noise processing method, device and system.
Background
With the development of artificial intelligence, more and more living environments, such as vehicle-mounted, home, classroom and meeting-room environments, are becoming intelligent. Among the various smart devices deployed in these environments, intelligent voice interaction devices play an important role. They enable voice interaction between people and devices, so that a device can perform operations and controls on a person's behalf according to the person's intent, freeing the user's hands as much as possible; such devices will be indispensable in the future.
Since real living environments are often complicated, many noises and interfering sounds are present in addition to the target person's voice. These unwanted sounds can severely interfere with a person's interaction with a speech device and reduce the interaction experience. To avoid such interference, a microphone array is usually used to perform beamforming or blind source separation, enhancing sound from a specific direction while suppressing sound from other directions, or separating the voice of a specific target person.
However, conventional beamforming or blind source separation cannot effectively suppress interference or separate the target voice in every environment. When the interfering sound is also a human voice and comes from the same or a nearby direction as the target voice, or from far away, the performance of these methods drops sharply.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a noise processing method, device and system. The method can suppress not only the interference of environmental noise but also interfering human voices from the same direction, nearby directions and distant positions, improving the interaction experience between people and devices.
The embodiment of the invention provides the following specific technical scheme:
in a first aspect, the present invention provides a method for processing noise, the method comprising:
detecting the collected audio information;
when voice information is detected, filtering the voice information according to pre-stored audio information of a target user;
judging whether voice information exists after filtering processing;
and if so, recognizing the voice information after the filtering processing and performing corresponding feedback according to a recognition result.
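The four steps above can be sketched as a minimal processing pipeline; every function name here is an illustrative placeholder injected by the caller, not an API defined by this application:

```python
from typing import Callable, Optional

def process_audio(
    audio: bytes,
    detect_voice: Callable[[bytes], Optional[bytes]],
    filter_with_prior: Callable[[bytes], Optional[bytes]],
    recognize: Callable[[bytes], str],
    respond: Callable[[str], str],
) -> Optional[str]:
    voice = detect_voice(audio)          # step 1: detect voice in the collected audio
    if voice is None:
        return None                      # no voice information detected
    filtered = filter_with_prior(voice)  # step 2: filter using the target user's prior audio
    if filtered is None:
        return None                      # step 3: nothing remains, so it was non-target speech
    return respond(recognize(filtered))  # step 4: recognize and give corresponding feedback
```

The callbacks make the control flow explicit: only audio that survives both the detection and the prior-based filtering reaches the recognizer.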
Preferably, the filtering the voice information according to the pre-stored audio information of the target user specifically includes:
constructing an acoustic model; the acoustic model is a Gaussian mixture model, and the variable of the acoustic model is the voice information, and the initial value of the parameter is a covariance matrix obtained by calculating the audio information of the target user;
modifying parameters of the acoustic model according to an EM algorithm;
judging whether the iteration times of the EM algorithm reach a preset value or not;
when the preset value is reached, acquiring an output result of the acoustic model;
and filtering the voice information according to the output result.
Preferably, when the voice information is detected, the method further comprises:
and carrying out echo cancellation on the voice information.
Preferably, the method further comprises:
sending an operation instruction to the target user according to the received request sent by the target user;
receiving audio information sent by the target user according to the operation instruction;
and storing the audio information sent by the target user according to the operation instruction.
Preferably, the algorithm for detecting the collected audio information includes any one of a pitch detection algorithm, a double-threshold method, and an a posteriori SNR frequency-domain iterative algorithm.
In a second aspect, the present invention provides a noise processing apparatus, the apparatus comprising:
the detection module is used for detecting the acquired audio information;
the analysis module is used for carrying out filtering processing on the voice information according to pre-stored audio information of a target user when the voice information is detected;
the judging module is used for judging whether voice information exists after filtering processing;
and the recognition module is used for recognizing the filtered voice information and performing corresponding feedback according to a recognition result when the voice information exists.
Preferably, the analysis module specifically includes:
the construction module is used for constructing an acoustic model; the acoustic model is a Gaussian mixture model, and the variable of the acoustic model is the voice information, and the initial value of the parameter is a covariance matrix obtained by calculating the audio information of the target user;
the correction module is used for correcting the parameters of the acoustic model according to an EM algorithm;
the processing module is used for judging whether the iteration times of the EM algorithm reach a preset value or not; when the preset value is reached, acquiring an output result of the acoustic model; and filtering the voice information according to the output result.
Preferably, the analysis module further comprises:
and the echo cancellation module is used for performing echo cancellation on the voice information when the voice information is detected.
Preferably, the apparatus further comprises a storage module configured to:
sending an operation instruction to the target user according to the received request sent by the target user;
receiving audio information sent by the target user according to the operation instruction;
and storing the audio information sent by the target user according to the operation instruction.
In a third aspect, the present invention provides a computer system comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:
detecting the collected audio information;
when voice information is detected, filtering the voice information according to pre-stored audio information of a target user;
judging whether voice information exists after filtering processing;
and if so, recognizing the voice information after the filtering processing and performing corresponding feedback according to a recognition result.
The embodiment of the invention has the following beneficial effects:
according to the invention, the voice of the target person is firstly acquired as the prior information, so that when the non-target person sends an instruction, the instruction of the non-target person can be inhibited according to the prior information, and when other voice interference and environmental noise exist while the target person sends the instruction, the voice interference and the environmental noise in the same direction, the similar direction and the distant position can be inhibited according to the prior information, so that the instruction without other voice and environmental noise is obtained, the definition of the voice of the target person is improved, and the interactive experience is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is an application environment diagram of a noise processing method according to an embodiment of the present application;
fig. 2 is a flowchart of a noise processing method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a noise processing apparatus according to a second embodiment of the present application;
fig. 4 is a schematic position diagram of a noise processing apparatus and an experimental user according to a second embodiment of the present application;
fig. 5 is a diagram of a computer system architecture according to a third embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present application provides a noise processing method, which can be applied to the application environment shown in fig. 1. The server 12 communicates with the database 11 and the terminal 13 via a network. The terminal 13 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 12 may be implemented by an independent server or a server cluster formed by a plurality of servers.
Example one
As shown in fig. 2, the present application provides a noise processing method, which specifically includes the following steps:
and S21, detecting the collected audio information.
The detection algorithm may include any one of a pitch detection algorithm, a double-threshold method, and an a posteriori SNR frequency-domain iterative algorithm.
In addition, any other algorithm capable of voice endpoint detection may be used; this scheme does not limit the choice of algorithm.
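As a rough sketch of one of the listed options, a double-threshold endpoint detector can combine short-time energy with the zero-crossing rate. The frame length and thresholds below are arbitrary example values, not parameters specified by this application:

```python
import numpy as np

def frame_signal(x, frame_len):
    # Split a 1-D signal into non-overlapping frames, dropping the remainder
    n = len(x) // frame_len
    return x[: n * frame_len].reshape(n, frame_len)

def double_threshold_vad(x, frame_len=160, energy_thresh=0.01, zcr_thresh=0.3):
    frames = frame_signal(x, frame_len)
    energy = np.mean(frames ** 2, axis=1)  # short-time energy per frame
    # zero-crossing rate: fraction of adjacent sample pairs that change sign
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    # voiced speech: energetic frames with a moderate zero-crossing rate
    return (energy > energy_thresh) & (zcr < zcr_thresh)
```

A real detector would also smooth the frame decisions and use hysteresis between a high and a low energy threshold, which is where the method gets its name.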
And S22, when the voice information is detected, filtering the voice information according to the pre-stored audio information of the target user.
When voice information is detected, the method further comprises the following steps:
and carrying out echo cancellation on the voice information.
The echo in this embodiment refers to an acoustic echo, and when performing echo cancellation, the echo can be implemented by an acoustic echo cancellation method commonly used in the art, such as an echo suppression algorithm or an acoustic echo cancellation algorithm, which is not limited in this invention.
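The embodiment leaves the choice of echo canceller open; one common acoustic echo cancellation approach is a normalized LMS (NLMS) adaptive filter that estimates the loudspeaker echo contained in the microphone signal and subtracts it. A minimal sketch (tap count and step size are illustrative):

```python
import numpy as np

def nlms_echo_cancel(far_end, mic, taps=64, mu=0.5, eps=1e-8):
    """Adaptively cancel the far-end (loudspeaker) echo from the mic signal."""
    w = np.zeros(taps)        # adaptive filter estimating the echo path
    buf = np.zeros(taps)      # most recent far-end samples, newest first
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = far_end[n]
        echo_est = w @ buf            # predicted echo at time n
        e = mic[n] - echo_est         # error = echo-cancelled output sample
        w += mu * e * buf / (buf @ buf + eps)  # normalized LMS update
        out[n] = e
    return out
```

With a stationary echo path and no near-end speech, the residual converges toward zero; production cancellers add double-talk detection on top of this core loop.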
Wherein, the detected voice information comprises environmental noise and/or human voice interference noise.
The filtering processing of the voice information according to the pre-stored audio information of the target user specifically comprises the following steps:
1. constructing an acoustic model;
the acoustic model is a Gaussian mixture model, the variable of the acoustic model is voice information, and the initial value of the parameter is a covariance matrix obtained by calculating the audio information of the target user;
the Gaussian Mixture Model (GMM) can be represented by the following equation:
Figure BDA0002271463260000061
wherein x is voice information, N (x | mu)kΣ k) is the component of the kth in the model; pikIs the mixing coefficient, i.e. the weight of each component; pik、μkΣ k is a parameter of the gaussian mixture model, and the initial value thereof is a covariance matrix obtained by calculating the audio information of the target user;
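For a concrete illustration of this mixture density, the one-dimensional case is sketched below (scalar variances stand in for the covariance matrices, purely for clarity):

```python
import numpy as np

def gauss(x, mu, var):
    # N(x | mu, var): univariate Gaussian density
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def gmm_pdf(x, pi, mu, var):
    # p(x) = sum_k pi_k * N(x | mu_k, var_k)
    return sum(p * gauss(x, m, v) for p, m, v in zip(pi, mu, var))
```

Because the mixing coefficients sum to one, the mixture is itself a valid probability density.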
2. modifying parameters of the acoustic model according to an EM algorithm;
wherein the EM algorithm is a maximum expectation algorithm.
The step 2 specifically includes the following two substeps:
a. calculating posterior probability according to the initial value of the current parameter;
b. and correcting the parameters according to the posterior probability.
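Substeps a and b correspond to the E-step and M-step of the EM algorithm. A minimal scalar sketch (scalar variances replace the covariance matrices, only for brevity):

```python
import numpy as np

def gauss(x, mu, var):
    # N(x | mu, var): univariate Gaussian density
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_step(x, pi, mu, var, eps=1e-12):
    # Substep a (E-step): posterior probability of each component per sample
    dens = np.array([p * gauss(x, m, v) for p, m, v in zip(pi, mu, var)])  # (K, N)
    gamma = dens / (dens.sum(axis=0) + eps)
    # Substep b (M-step): correct the parameters using the posteriors
    nk = gamma.sum(axis=1)
    pi_new = nk / len(x)
    mu_new = (gamma @ x) / nk
    var_new = (gamma * (x[None, :] - mu_new[:, None]) ** 2).sum(axis=1) / nk
    return pi_new, mu_new, var_new
```

Repeating `em_step` for the preset number of iterations (step 3 below) yields the corrected model parameters.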
3. Judging whether the iteration times of the EM algorithm reach a preset value or not;
in the scheme, the iteration times are set according to an empirical value, and when the execution times of the EM algorithm (the execution times of the steps a and b) reach a preset value, the iteration is ended.
4. When the preset value is reached, obtaining an output result of the acoustic model;
the output result is the posterior probability calculated according to the parameters in the last iteration.
5. And filtering the voice information according to the output result.
In this way, environmental noise and interfering voices from the same direction, nearby directions and distant positions can be effectively suppressed.
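One simple way to apply the model's output is a hard mask over the signal frames: keep frames whose posterior probability under the target user's model exceeds a threshold and zero out the rest. The application does not fix a particular masking rule, so the threshold here is an illustrative assumption:

```python
import numpy as np

def filter_frames(frames, target_posterior, threshold=0.5):
    # frames: (num_frames, frame_len); target_posterior: (num_frames,)
    # Zero out frames that are unlikely to come from the target user.
    mask = target_posterior > threshold
    return frames * mask[:, None]
```

Softer variants multiply each frame by the posterior itself instead of a binary mask, trading suppression strength for fewer artifacts.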
And S23, judging whether voice information exists after the filtering processing.
If no voice information remains, the detected voice was uttered by a non-target user; if voice information remains, the detected voice information includes speech uttered by the target user.
And S24, if the voice information exists, recognizing the voice information after the filtering processing and feeding back the voice information correspondingly according to the recognition result.
Specifically, the filtered voice information is converted into text content, the user's intent is determined by means of word segmentation and similar techniques, corresponding feedback is given, and at the same time an evaluation index is output for evaluating the accuracy of the speech recognition process.
The evaluation index may be a Sentence Error Rate (SER), a sentence accuracy rate (s.corr), a word error rate (WER/CER), or the like.
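As a reference point for these indices, the word error rate (WER) is the Levenshtein (edit) distance between the reference and recognized word sequences divided by the number of reference words:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                                # delete all reference words
    for j in range(len(h) + 1):
        d[0][j] = j                                # insert all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(r)][len(h)] / max(len(r), 1)
```

Sentence error rate is coarser (a sentence is either entirely right or wrong), which is why WER is the index used in the experiments of Example two.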
In addition, the acquisition of the pre-stored audio information of the target user comprises the following steps:
1. sending an operation instruction to a target user according to a received request sent by the target user;
the method can be applied to intelligent voice interaction equipment, so that the request sent by the target user can be a reset request for the equipment. According to the request sent by the target user, sending an operation instruction to the target user, such as:
voice prompt: please adjust the sitting posture, please say that the head is Biu small Biu, please incline the head to the left by about 10cm, then say that the head is Biu small Biu, please incline the head to the right by about 10cm, then say that the head is Biu small Biu, please incline the body forward by about 10cm, then say that the head is Biu small Biu, etc.
2. Receiving audio information sent by a target user according to an operation instruction;
the target user sends corresponding audio information according to the operation instruction, for example, when receiving the operation instruction of 'please adjust sitting posture', replying: "sitting posture adjusted"; when an operation command of 'please say Xiao Biu Xiao Biu', replying to 'Xiao Biu Xiao Biu'; when an operation command of 'please incline the head to the left by about 10cm and then say a small Biu small Biu' is received, corresponding actions are continuously carried out according to the command and the reply is carried out.
When there are a plurality of operation commands, the operation commands are transmitted at set time intervals.
Such as: and sending an operation instruction every 2 s.
3. And storing the audio information sent by the target user according to the operation instruction.
For example: the replies returned by the target user, such as "sitting posture adjusted" and "Xiao Biu Xiao Biu", are stored.
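The enrollment procedure above can be sketched as a prompt-record-store loop. `record_fn` and `store_fn` are hypothetical injected callbacks, and the prompts and 2 s interval follow the examples in this description:

```python
import time

# Example prompts following the operation instructions described above
PROMPTS = [
    "Please adjust your sitting posture",
    "Please say 'Xiao Biu Xiao Biu'",
    "Please tilt your head about 10 cm to the left, then say 'Xiao Biu Xiao Biu'",
]

def enroll_target_user(record_fn, store_fn, interval_s=2.0):
    """Issue each operation instruction, record the target user's reply,
    and store it; instructions are spaced by the set interval."""
    for prompt in PROMPTS:
        print(prompt)              # on a real device this would be TTS playback
        clip = record_fn(prompt)   # record the user's audio reply
        store_fn(prompt, clip)     # persist the enrollment audio
        time.sleep(interval_s)
```

The stored clips are what later seed the covariance matrix used to initialize the Gaussian mixture model.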
According to the invention, the target user's voice is first acquired as prior information. When a non-target user issues an instruction, that instruction can be suppressed according to the prior information; when the target user issues an instruction amid other voice interference and environmental noise, interfering voices and noise from the same direction, nearby directions and distant positions can likewise be suppressed. An instruction free of other voices and environmental noise is thus obtained, improving the clarity of the target user's voice and the interaction experience.
Example two
As shown in fig. 3, the present application provides a noise processing apparatus, which specifically includes:
the detection module 31 is used for detecting the acquired audio information;
the analysis module 32 is configured to, when voice information is detected, perform filtering processing on the voice information according to pre-stored audio information of a target user;
a judging module 33, configured to judge whether there is voice information after the filtering processing;
and the recognition module 34 is configured to, when the voice information exists, recognize the filtered voice information and perform corresponding feedback according to a recognition result.
Preferably, the analysis module 32 specifically includes:
a construction module 321, configured to construct an acoustic model; the acoustic model is a Gaussian mixture model, the variable of the acoustic model is voice information, and the initial value of the parameter is a covariance matrix obtained by calculating the audio information of the target user;
a modification module 322, configured to modify parameters of the acoustic model according to an EM algorithm;
the processing module 323 is used for judging whether the iteration times of the EM algorithm reach a preset value; when the preset value is reached, obtaining an output result of the acoustic model; and filtering the voice information according to the output result.
Preferably, the analysis module 32 further includes:
and an echo cancellation module 324, configured to perform echo cancellation on the voice information when the voice information is detected.
The apparatus further comprises a storage module 35 configured to:
sending an operation instruction to a target user according to a received request sent by the target user;
receiving audio information sent by a target user according to an operation instruction;
and storing the audio information sent by the target user according to the operation instruction.
Preferably, the algorithm for detecting the collected audio information includes any one of a pitch detection algorithm, a double-threshold method, and an a posteriori SNR frequency-domain iterative algorithm.
When the noise processing apparatus is an intelligent interactive device, the intelligent interactive device includes a voice interactive system and a voice recognition system, wherein the voice interactive system includes the detection module 31, the analysis module 32, the determination module 33, and the storage module 35, and the voice recognition system includes the recognition module 34.
An interaction experiment is carried out with the intelligent interactive device, with the users arranged at preset positions.
Referring to fig. 4, fig. 4 includes 5 users, which are user number 1, user number 2, user number 3, user number 4, and user number 5, respectively.
The experimental process is as follows:
1. Users No. 1 and No. 2 speak simultaneously; No. 1 is the target user;
2. Users No. 1 and No. 3 speak simultaneously; No. 1 is the target user;
3. Users No. 1 and No. 4 speak simultaneously; No. 1 is the target user;
4. Users No. 1 and No. 5 speak simultaneously; No. 1 is the target user;
5. Users No. 1, No. 2 and No. 3 speak simultaneously; No. 1 is the target user;
6. Users No. 1, No. 3 and No. 4 speak simultaneously; No. 1 is the target user;
7. Users No. 1, No. 4 and No. 5 speak simultaneously; No. 1 is the target user;
8. All users speak simultaneously; No. 1 is the target user.
The recognition module 34 in the speech recognition system is configured to recognize the filtered speech information and perform corresponding feedback according to a recognition result; and the method is also used for outputting an evaluation index for evaluating the accuracy of the voice recognition process.
In this embodiment, the evaluation index is WER (word error rate).
The results of the above experiment are shown in table 1 below:
TABLE 1
[Table 1 is provided as an image in the original publication and reports the word error rate for each of the eight experiments; its cell values are not reproducible here.]
In addition, the Gaussian mixture model uses the EM algorithm to correct its parameters, and the optimal parameters are obtained through an adaptive algorithm during the correction.
According to the experimental result, the audio information of the target user is used as the prior information, so that the subsequent voice recognition effect can be improved, and the interactive experience is improved.
EXAMPLE III
As shown in fig. 5, a third embodiment of the present application provides a computer system, including:
one or more processors; and
a memory associated with the one or more processors, the memory for storing program instructions that, when read and executed by the one or more processors, perform the following:
detecting the collected audio information;
when the voice information is detected, filtering the voice information according to pre-stored audio information of a target user;
judging whether voice information exists after filtering processing;
and if so, recognizing the voice information after the filtering processing and performing corresponding feedback according to a recognition result.
Fig. 5 illustrates an architecture of a computer system that may include, in particular, a processor 52, a video display adapter 54, a disk drive 56, an input/output interface 58, a network interface 510, and a memory 512. The processor 52, video display adapter 54, disk drive 56, input/output interface 58, network interface 510, and memory 512 may be communicatively coupled via a communication bus 514.
The processor 52 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solution provided in the present Application.
The memory 512 may be implemented in the form of a ROM (Read-Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 512 may store an operating system 516 for controlling the operation of the computer system 50 and a Basic Input/Output System (BIOS) 518 for controlling low-level operations of the computer system. In addition, a web browser 520, a data storage management system 522, and the like may also be stored. When the technical solution provided by the present application is implemented in software or firmware, the relevant program code is stored in the memory 512 and invoked and executed by the processor 52.
The input/output interface 58 is used for connecting an input/output module to realize information input and output. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The network interface 510 is used for connecting a communication module (not shown in the figure) to realize communication interaction between the device and other devices. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
The communication bus 514 includes a path to transfer information between the various components of the device, such as the processor 52, the video display adapter 54, the disk drive 56, the input/output interface 58, the network interface 510, and the memory 512.
It should be noted that although the above-described device only shows the processor 52, the video display adapter 54, the disk drive 56, the input/output interface 58, the network interface 510, the memory 512, the communication bus 514, etc., in a specific implementation, the device may also include other components necessary for proper operation. Furthermore, it will be understood by those skilled in the art that the apparatus described above may also include only the components necessary to implement the solution of the present application, and not necessarily all of the components shown in the figures.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, or the like, and includes several instructions for enabling a computer device (which may be a personal computer, a cloud server, or a network device) to execute the method according to the embodiments or some parts of the embodiments of the present application.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention. In addition, the noise processing apparatus, the computer system and the noise processing method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments and are not described herein again.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A method of noise processing, the method comprising:
detecting the collected audio information;
when voice information is detected, filtering the voice information according to pre-stored audio information of a target user;
judging whether voice information exists after filtering processing;
and if so, recognizing the voice information after the filtering processing and performing corresponding feedback according to a recognition result.
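As an illustrative sketch only, not a statement of the claimed implementation, the four steps of claim 1 can be arranged as a simple pipeline. The functions `detect_speech`, `filter_by_target_profile`, and `recognize` are hypothetical placeholders invented here: a real system would use the detection algorithms of claim 5, the acoustic-model filtering of claim 2, and an actual speech recognizer.

```python
# Illustrative pipeline for the method of claim 1. The three stage
# functions are hypothetical stand-ins for the detection, filtering,
# and recognition components described in the claims.

def detect_speech(audio, energy_threshold=0.01):
    """Step 1: detect whether the collected audio contains voice
    information (here: a trivial short-term-energy check)."""
    energy = sum(x * x for x in audio) / max(len(audio), 1)
    return energy > energy_threshold

def filter_by_target_profile(audio, target_profile):
    """Step 2: filter the voice information according to the pre-stored
    target-user profile (here: a trivial mean-level subtraction stand-in)."""
    return [x - target_profile["mean"] for x in audio]

def recognize(audio):
    """Step 4: placeholder recognizer returning a feedback string."""
    return "recognized %d samples" % len(audio)

def process(audio, target_profile, energy_threshold=0.01):
    if not detect_speech(audio, energy_threshold):
        return None                       # no voice information detected
    residual = filter_by_target_profile(audio, target_profile)
    if not detect_speech(residual, energy_threshold):
        return None                       # no voice information remains after filtering
    return recognize(residual)            # recognize and feed back
```

A caller would invoke `process(audio, profile)` for each captured buffer; a `None` result means either that no voice was detected or that no voice information remained after the filtering step.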
2. The method according to claim 1, wherein the filtering the speech information according to the pre-stored audio information of the target user specifically comprises:
constructing an acoustic model, wherein the acoustic model is a Gaussian mixture model, the variable of the acoustic model is the voice information, and the initial values of the parameters are a covariance matrix calculated from the audio information of the target user;
modifying parameters of the acoustic model according to an EM algorithm;
judging whether the number of iterations of the EM algorithm reaches a preset value;
when the preset value is reached, acquiring an output result of the acoustic model;
and filtering the voice information according to the output result.
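Claim 2 describes fitting a Gaussian mixture model for a preset number of EM iterations and filtering by the model's output. The sketch below is a minimal one-dimensional illustration under stated assumptions: it takes initial means, variances, and weights directly (the claim instead derives the initial parameters from a covariance matrix computed on the target user's enrolled audio), and it "filters" by dropping samples whose posterior assigns them to a designated target component. The names `em_gmm_1d` and `filter_by_posterior` are invented here for illustration.

```python
import numpy as np

def em_gmm_1d(x, means, variances, weights, preset_iters=20):
    """Fit a 1-D Gaussian mixture to samples x with the EM algorithm,
    stopping when the preset iteration count is reached (claim 2)."""
    x = np.asarray(x, dtype=float)
    means, variances, weights = map(np.array, (means, variances, weights))
    for _ in range(preset_iters):          # iterate until the preset value is reached
        # E-step: responsibility of each component for each sample
        dens = np.exp(-0.5 * (x[:, None] - means) ** 2 / variances) \
               / np.sqrt(2 * np.pi * variances)
        resp = weights * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: correct (re-estimate) the model parameters
        nk = resp.sum(axis=0)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
        weights = nk / len(x)
    return means, variances, weights

def filter_by_posterior(x, means, variances, weights, target_component=0):
    """Output step: drop samples the fitted model attributes to the
    designated target component, keeping the rest."""
    x = np.asarray(x, dtype=float)
    dens = np.exp(-0.5 * (x[:, None] - means) ** 2 / variances) \
           / np.sqrt(2 * np.pi * variances)
    labels = (weights * dens).argmax(axis=1)
    return x[labels != target_component]
```

In practice the observations would be multi-dimensional acoustic feature vectors (hence the claim's covariance matrix) rather than scalar samples.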
3. The method of claim 1, wherein when voice information is detected, the method further comprises:
and carrying out echo cancellation on the voice information.
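Claim 3 adds echo cancellation to the detected voice information. A common approach, though not necessarily the one intended by the patent, is a normalized LMS (NLMS) adaptive filter that estimates the loudspeaker echo present in the microphone signal and subtracts it. The sketch below assumes access to the far-end (playback) signal and equal-length signals; all names are invented here.

```python
import numpy as np

def nlms_echo_cancel(far_end, mic, taps=16, mu=0.5, eps=1e-8):
    """Normalized LMS adaptive filter: estimate the echo of the far-end
    (loudspeaker) signal in the microphone signal and subtract it,
    returning the echo-reduced near-end signal. Assumes far_end and
    mic have equal length."""
    w = np.zeros(taps)                     # adaptive filter weights
    buf = np.zeros(taps)                   # most recent far-end samples
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = far_end[n]                # buf[k] = far_end[n - k]
        echo_est = w @ buf                 # predicted echo component
        e = mic[n] - echo_est              # error = near-end signal estimate
        w += mu * e * buf / (buf @ buf + eps)   # NLMS weight update
        out[n] = e
    return out
```

Production systems typically use a frequency-domain or double-talk-aware canceller, but the adaptation principle is the same.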
4. The method according to any one of claims 1 to 3, further comprising:
sending an operation instruction to the target user according to the received request sent by the target user;
receiving audio information sent by the target user according to the operation instruction;
and storing the audio information sent by the target user according to the operation instruction.
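The enrollment flow of claim 4 (request, operation instruction, audio upload, storage) might be sketched as below. `EnrollmentStore`, its method names, and the instruction text are all invented here for illustration; the claim does not prescribe a storage layout.

```python
class EnrollmentStore:
    """Sketch of the enrollment flow in claim 4: on a user request,
    send back an operation instruction (e.g. a phrase to read), then
    receive and store the audio the target user records in response."""

    def __init__(self):
        self._profiles = {}                # user id -> list of recordings

    def handle_request(self, user_id):
        # Step 1: respond to the target user's request with an instruction.
        return {"user": user_id, "instruction": "please read the enrollment phrase"}

    def receive_audio(self, user_id, audio):
        # Steps 2-3: receive and persist the audio sent per the instruction;
        # returns the number of recordings now stored for the user.
        self._profiles.setdefault(user_id, []).append(list(audio))
        return len(self._profiles[user_id])

    def audio_for(self, user_id):
        return self._profiles.get(user_id, [])
```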
5. The method according to any one of claims 1 to 3, wherein the algorithm for detecting the collected audio information comprises any one of a pitch detection algorithm, a double-threshold method, and an a posteriori SNR frequency-domain iterative algorithm.
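Of the detection algorithms listed in claim 5, the double-threshold method is the simplest to illustrate. Classical double-threshold endpoint detection combines short-term energy with zero-crossing rate; the sketch below is a simplified, energy-only variant operating on per-frame energies, with hypothetical `high` and `low` thresholds.

```python
def double_threshold_vad(frame_energies, high, low):
    """Double-threshold endpoint detection: a speech segment is triggered
    when a frame's energy crosses the high threshold, then extended
    backwards and forwards over frames whose energy stays above the low
    threshold. Returns (start, end) frame-index pairs, inclusive."""
    segments, n = [], len(frame_energies)
    i = 0
    while i < n:
        if frame_energies[i] > high:
            start = i
            while start > 0 and frame_energies[start - 1] > low:
                start -= 1                 # extend backwards over the low-energy onset
            end = i
            while end + 1 < n and frame_energies[end + 1] > low:
                end += 1                   # extend forwards until energy drops
            segments.append((start, end))
            i = end + 1
        else:
            i += 1
    return segments
```

Using two thresholds keeps weak segment edges (onsets and trailing sounds) that a single high threshold would clip, while still rejecting low-level background noise that never crosses the high threshold.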
6. A noise processing apparatus, characterized in that the apparatus comprises:
the detection module is used for detecting the acquired audio information;
the analysis module is used for carrying out filtering processing on the voice information according to pre-stored audio information of a target user when the voice information is detected;
the judging module is used for judging whether voice information exists after filtering processing;
and the recognition module is used for recognizing the filtered voice information and performing corresponding feedback according to a recognition result when the voice information exists.
7. The apparatus according to claim 6, wherein the analysis module specifically comprises:
the construction module is used for constructing an acoustic model, wherein the acoustic model is a Gaussian mixture model, the variable of the acoustic model is the voice information, and the initial values of the parameters are a covariance matrix calculated from the audio information of the target user;
the correction module is used for correcting the parameters of the acoustic model according to an EM algorithm;
the processing module is used for judging whether the number of iterations of the EM algorithm reaches a preset value, acquiring an output result of the acoustic model when the preset value is reached, and filtering the voice information according to the output result.
8. The apparatus of claim 6, wherein the analysis module further comprises:
and the echo cancellation module is used for performing echo cancellation on the voice information when the voice information is detected.
9. The apparatus according to any one of claims 6 to 8, further comprising a storage module configured to:
sending an operation instruction to the target user according to the received request sent by the target user;
receiving audio information sent by the target user according to the operation instruction;
and storing the audio information sent by the target user according to the operation instruction.
10. A computer system, comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:
detecting the collected audio information;
when voice information is detected, filtering the voice information according to pre-stored audio information of a target user;
judging whether voice information exists after filtering processing;
and if so, recognizing the voice information after the filtering processing and performing corresponding feedback according to a recognition result.
CN201911106466.9A 2019-11-13 2019-11-13 Noise processing method, device and system Pending CN110942779A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201911106466.9A CN110942779A (en) 2019-11-13 2019-11-13 Noise processing method, device and system
CA3160740A CA3160740A1 (en) 2019-11-13 2020-07-30 Noise processing method, device, and system
PCT/CN2020/105992 WO2021093380A1 (en) 2019-11-13 2020-07-30 Noise processing method and apparatus, and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911106466.9A CN110942779A (en) 2019-11-13 2019-11-13 Noise processing method, device and system

Publications (1)

Publication Number Publication Date
CN110942779A true CN110942779A (en) 2020-03-31

Family

ID=69907610

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911106466.9A Pending CN110942779A (en) 2019-11-13 2019-11-13 Noise processing method, device and system

Country Status (3)

Country Link
CN (1) CN110942779A (en)
CA (1) CA3160740A1 (en)
WO (1) WO2021093380A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111508473A (en) * 2020-06-12 2020-08-07 佛山科学技术学院 Speech recognition rate analysis device, method and storage medium
CN112202653A (en) * 2020-09-21 2021-01-08 海尔优家智能科技(北京)有限公司 Voice information output method, storage medium and electronic equipment
CN112700771A (en) * 2020-12-02 2021-04-23 珠海格力电器股份有限公司 Air conditioner, three-dimensional voice control identification method, computer equipment, storage medium and terminal
WO2021093380A1 (en) * 2019-11-13 2021-05-20 苏宁云计算有限公司 Noise processing method and apparatus, and system
CN112927691A (en) * 2021-02-23 2021-06-08 中国人民解放军陆军装甲兵学院 Voice recognition control device and method
CN114598922A (en) * 2022-03-07 2022-06-07 深圳创维-Rgb电子有限公司 Voice message interaction method, device, equipment and storage medium

Families Citing this family (1)

Publication number Priority date Publication date Assignee Title
CN116229987B (en) * 2022-12-13 2023-11-21 广东保伦电子股份有限公司 Campus voice recognition method, device and storage medium

Citations (6)

Publication number Priority date Publication date Assignee Title
CN101552004A (en) * 2009-05-13 2009-10-07 哈尔滨工业大学 Method for recognizing in-set speaker
CN104966517A (en) * 2015-06-02 2015-10-07 华为技术有限公司 Voice frequency signal enhancement method and device
CN105280183A (en) * 2015-09-10 2016-01-27 百度在线网络技术(北京)有限公司 Voice interaction method and system
CN105355202A (en) * 2014-08-22 2016-02-24 现代自动车株式会社 Voice recognition apparatus, vehicle having the same, and method of controlling the vehicle
CN108922515A (en) * 2018-05-31 2018-11-30 平安科技(深圳)有限公司 Speech model training method, audio recognition method, device, equipment and medium
CN109473102A (en) * 2017-09-07 2019-03-15 上海新同惠自动化系统有限公司 A kind of robot secretary intelligent meeting recording method and system

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
US8639516B2 (en) * 2010-06-04 2014-01-28 Apple Inc. User-specific noise suppression for voice quality improvements
CN102592607A (en) * 2012-03-30 2012-07-18 北京交通大学 Voice converting system and method using blind voice separation
US9837102B2 (en) * 2014-07-02 2017-12-05 Microsoft Technology Licensing, Llc User environment aware acoustic noise reduction
CN104637494A (en) * 2015-02-02 2015-05-20 哈尔滨工程大学 Double-microphone mobile equipment voice signal enhancing method based on blind source separation
EP3217399B1 (en) * 2016-03-11 2018-11-21 GN Hearing A/S Kalman filtering based speech enhancement using a codebook based approach
CN108198569B (en) * 2017-12-28 2021-07-16 北京搜狗科技发展有限公司 Audio processing method, device and equipment and readable storage medium
CN110942779A (en) * 2019-11-13 2020-03-31 苏宁云计算有限公司 Noise processing method, device and system

Also Published As

Publication number Publication date
WO2021093380A1 (en) 2021-05-20
CA3160740A1 (en) 2021-05-20

Similar Documents

Publication Publication Date Title
CN110942779A (en) Noise processing method, device and system
US11887582B2 (en) Training and testing utterance-based frameworks
EP3819903B1 (en) Audio data processing method and apparatus, device and storage medium
US20210005198A1 (en) Detecting Self-Generated Wake Expressions
CN110310623B (en) Sample generation method, model training method, device, medium, and electronic apparatus
US10777193B2 (en) System and device for selecting speech recognition model
CN107644638B (en) Audio recognition method, device, terminal and computer readable storage medium
US8306819B2 (en) Enhanced automatic speech recognition using mapping between unsupervised and supervised speech model parameters trained on same acoustic training data
JP2021086154A (en) Method, device, apparatus, and computer-readable storage medium for speech recognition
US8719019B2 (en) Speaker identification
CN107507621B (en) Noise suppression method and mobile terminal
CN111627432B (en) Active outbound intelligent voice robot multilingual interaction method and device
WO2020043162A1 (en) System and method for performing multi-model automatic speech recognition in challenging acoustic environments
CN113643693B (en) Acoustic model conditioned on sound characteristics
CN112289299A (en) Training method and device of speech synthesis model, storage medium and electronic equipment
US11468892B2 (en) Electronic apparatus and method for controlling electronic apparatus
CN112309365A (en) Training method and device of speech synthesis model, storage medium and electronic equipment
KR20160138837A (en) System, method and computer program for speech recognition and translation
CN112562723B (en) Pronunciation accuracy determination method and device, storage medium and electronic equipment
CN112687286A (en) Method and device for adjusting noise reduction model of audio equipment
CN111883135A (en) Voice transcription method and device and electronic equipment
CN113889091A (en) Voice recognition method and device, computer readable storage medium and electronic equipment
US20170206898A1 (en) Systems and methods for assisting automatic speech recognition
Panek et al. Challenges in adopting speech control for assistive robots
CN114694667A (en) Voice output method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200331