CN112634921B - Voice processing method, device and storage medium - Google Patents


Info

Publication number
CN112634921B
CN112634921B (application CN201910955242.9A)
Authority
CN
China
Prior art keywords
voice data
noise
data
voice
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910955242.9A
Other languages
Chinese (zh)
Other versions
CN112634921A (en)
Inventor
高星
赵立军
Current Assignee
Beijing Zhongguancun Kejin Technology Co Ltd
Original Assignee
Beijing Zhongguancun Kejin Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Zhongguancun Kejin Technology Co Ltd filed Critical Beijing Zhongguancun Kejin Technology Co Ltd
Priority to CN201910955242.9A priority Critical patent/CN112634921B/en
Publication of CN112634921A publication Critical patent/CN112634921A/en
Application granted granted Critical
Publication of CN112634921B publication Critical patent/CN112634921B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The application discloses a voice processing method, an apparatus, and a storage medium. The method is applied to a voice processing system and comprises the following steps: acquiring first voice data, wherein the first voice data comprises target voice data and background noise data; determining a noise audio segment containing only background noise in the first voice data, and removing the noise audio segment from the first voice data to generate second voice data; and performing noise suppression processing for suppressing background noise data on the second voice data to generate third voice data. These embodiments can improve the quality of voice recognition.

Description

Voice processing method, device and storage medium
Technical Field
The present invention relates to the field of communications, and in particular, to a method, an apparatus, and a storage medium for processing speech.
Background
With the rapid development of mobile communication technology, voice recognition is being adopted rapidly across industries. In banks and other financial institutions, for example, it can support quality inspection of back-office customer service, convert collected customer-service speech into text, extract customer information from the transcripts, and customize corresponding services based on that information.
When target voice data is collected in a practical application scenario, environmental noise is usually mixed in. Such noise generally comprises stationary noise (such as white noise) and non-stationary noise (such as the speech of surrounding people or car horns outside a window), so noise-reduction processing is first performed on the collected voice data to make the processed target voice data match the data in the voice recognition library as closely as possible. Current voice recognition technology has only a limited ability to suppress non-stationary noise; it removes stationary noise effectively, but speech distortion occurs easily and the residual noise sounds unnatural, so the quality of voice recognition is low.
No effective solution has yet been proposed for this technical problem of low voice-recognition quality in the prior art.
Disclosure of Invention
Embodiments of the present disclosure provide a voice processing method, apparatus, and storage medium, which can improve the quality of voice recognition.
In order to solve the technical problems, the embodiment of the invention is realized as follows:
in a first aspect, an embodiment of the present disclosure provides a voice processing method, including:
acquiring first voice data, wherein the first voice data comprises target voice data and background noise data;
determining a noise audio fragment only containing background noise in the first voice data, and removing the noise audio fragment from the first voice data to generate second voice data; and
performing noise suppression processing for suppressing the background noise data on the second voice data to generate third voice data.
In a second aspect, the disclosed embodiments also provide a storage medium, which includes a stored program, wherein the method according to the first aspect is performed by a processor when the program is run.
In a third aspect, there is also provided, according to an embodiment of the present disclosure, a speech processing apparatus, applied to a speech processing system, including:
the data acquisition module is used for acquiring first voice data, wherein the first voice data comprises target voice data and background noise data;
the noise removing module is used for determining a noise audio fragment only containing background noise in the first voice data, removing the noise audio fragment from the first voice data and generating second voice data; and
the noise suppression module is used for performing noise suppression processing for suppressing the background noise data on the second voice data to generate third voice data.
In a fourth aspect, embodiments of the present disclosure further provide a speech processing apparatus, applied to a speech processing system, including a processor; and
a memory, coupled to the processor, for providing the processor with instructions to perform the following processing steps:
acquiring first voice data, wherein the first voice data comprises target voice data and background noise data;
determining a noise audio fragment only containing background noise in the first voice data, and removing the noise audio fragment from the first voice data to generate second voice data; and
performing noise suppression processing for suppressing the background noise data on the second voice data to generate third voice data.
In the embodiment of the invention, a voice processing system acquires first voice data, wherein the first voice data comprises target voice data and background noise data; a noise audio segment containing only background noise is determined in the first voice data, and the noise audio segment is removed from the first voice data to generate second voice data; noise suppression processing for suppressing background noise data is then performed on the second voice data to generate third voice data. Because the noise audio segments containing only background noise are removed from the first voice data before noise suppression is applied to what remains, the quality of voice recognition is improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this application, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and do not constitute an undue limitation on the disclosure. In the drawings:
FIG. 1 is a block diagram of a hardware architecture of a computing device for implementing a method according to embodiment 1 of the present disclosure;
FIG. 2 is a flow chart of a speech processing method according to an embodiment of the disclosure;
FIG. 3 is a schematic diagram of a speech processing method according to an embodiment of the disclosure;
fig. 4 is a schematic diagram of a speech processing device according to another embodiment of the disclosure.
Detailed Description
In order to better understand the technical solutions of the present disclosure, the following description clearly and completely describes the technical solutions of the embodiments of the present disclosure with reference to the drawings. It is apparent that the described embodiments are merely some, but not all, of the embodiments of the present disclosure. All other embodiments obtained by one of ordinary skill in the art without inventive effort, based on the embodiments in this disclosure, shall fall within the scope of the present disclosure.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
According to the present embodiment, an embodiment of a voice processing method is provided. It should be noted that the steps shown in the flowcharts of the drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.
The method embodiments provided herein may be performed in a mobile terminal, a computer terminal, a server, or a similar computing device. FIG. 1 shows a block diagram of a hardware architecture of a computing device for implementing a voice processing method. As shown in fig. 1, the computing device may include one or more processors (which may include, but are not limited to, processing means such as a microprocessor (MCU) or a programmable logic device (FPGA)), memory for storing data, and transmission means for communication functions. In addition, the computing device may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power supply, and/or a camera. It will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 1 is merely illustrative and is not intended to limit the configuration of the electronic device described above. For example, the computing device may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
It should be noted that the one or more processors and/or other data processing circuits described above may be referred to herein generally as "data processing circuits". A data processing circuit may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Furthermore, the data processing circuit may be a single stand-alone processing module, or incorporated in whole or in part into any of the other elements in the computing device. As referred to in the embodiments of the present disclosure, the data processing circuit acts as a kind of processor control (for example, selection of a variable-resistance termination path connected to an interface).
The memory may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the speech processing methods in the embodiments of the present disclosure, and the processor executes the software programs and modules stored in the memory, thereby performing various functional applications and data processing, that is, implementing the speech processing methods of the application programs. The memory may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid state memory. In some examples, the memory may further include memory remotely located with respect to the processor, which may be connected to the computing device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission means is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communications provider of the computing device. In one example, the transmission means comprises a network adapter (Network Interface Controller, NIC) connectable to other network devices via the base station to communicate with the internet. In one example, the transmission device may be a Radio Frequency (RF) module, which is used to communicate with the internet wirelessly.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computing device.
It should be noted herein that in some alternative embodiments, the computing device shown in FIG. 1 described above may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both hardware and software elements. It should be noted that fig. 1 is only one specific example and is intended to illustrate the types of components that may be present in the computing device described above.
In the above-described operating environment, the present embodiment provides a speech processing method, which is implemented by a speech processing system. Fig. 2 is a flow chart of a voice processing method according to an embodiment of the disclosure, and referring to fig. 2, the method includes:
s202: acquiring first voice data, wherein the first voice data comprises target voice data and background noise data;
s204: determining a noise audio segment only containing background noise in the first voice data, and removing the noise audio segment from the first voice data to generate second voice data; and
s206: and performing noise suppression processing for suppressing background noise data on the second voice data to generate third voice data.
In the embodiment of the invention, a voice processing system acquires first voice data, wherein the first voice data comprises target voice data and background noise data; a noise audio segment containing only background noise is determined in the first voice data, and the noise audio segment is removed from the first voice data to generate second voice data; noise suppression processing for suppressing background noise data is then performed on the second voice data to generate third voice data. Because the noise audio segments containing only background noise are removed from the first voice data before noise suppression is applied to what remains, the quality of voice recognition is improved.
In step S202, the voice processing system acquires first voice data, where the first voice data includes target voice data and background noise data. The voice processing system can be applied to customer service scenarios: it may acquire the first voice data of customer service personnel to check their service quality, or acquire the first voice data of a customer and perform voice recognition on it to obtain the customer's basic information so that customized services can be provided accordingly. It can also be applied to intelligent language translation scenarios, in which the first voice data of a user is acquired for voice translation processing. The application scenario is not particularly limited herein.
The first voice data includes target voice data and background noise data. The acquired first voice data is voice data in which multiple sounds are mixed together. The target voice data is the speech to be recognized; for example, in a customer service scenario, the voice uttered by a customer service person is acquired as the target voice data. The background noise data is the non-target sound received while the target voice is acquired, and includes stationary noise (such as white noise or the noise emitted by machines such as air conditioners and refrigerators) and non-stationary noise (such as the speech of surrounding people or car horns outside a window). For example, when checking the service quality of a customer service person, the ambient sound of a rotating fan is acquired together with the voice uttered by that person; this fan sound is background noise data.
In the above step S204, a noise audio segment containing only background noise is determined in the first voice data, and the noise audio segment is removed from the first voice data to generate second voice data. The first voice data is audio acquired over a period of time, and its content changes from moment to moment. For example, in 10s of acquired first voice data, the target voice data may exist only in the first 1s-5s, while the remaining 6s-10s contain only background noise. The segments containing only background noise are determined and removed, and the remaining voice data is the second voice data.
In step S206, noise suppression processing for suppressing background noise data is performed on the second voice data, and third voice data is generated.
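The three steps S202-S206 can be sketched end to end. The following is a minimal illustrative sketch, not the patented implementation: the function names, the 16 kHz mono float representation, and the placeholder noise test are all assumptions; the actual segment classifier and suppression step are described in the paragraphs that follow.

```python
import numpy as np

SAMPLE_RATE = 16000      # assumed: 16 kHz mono float samples
SEGMENT_SECONDS = 5      # the "preset time period" from the description

def split_segments(audio, sr=SAMPLE_RATE, seconds=SEGMENT_SECONDS):
    """Action (a1): divide the first voice data into fixed-length segments."""
    step = sr * seconds
    return [audio[i:i + step] for i in range(0, len(audio), step)]

def is_noise_only(segment):
    """Placeholder for the segment classifier of step S204; the real test
    uses average energy, preset-frequency energy, and spectral flatness."""
    return np.mean(segment ** 2) < 1e-4

def remove_noise_segments(audio):
    """Step S204: drop segments that contain only background noise."""
    kept = [s for s in split_segments(audio) if not is_noise_only(s)]
    return np.concatenate(kept) if kept else np.zeros(0)

def suppress_noise(audio, gain=1.1):
    """Step S206 stand-in for the gain/white-noise adjustment detailed later."""
    return audio * gain

def process(first_voice_data):
    second_voice_data = remove_noise_segments(first_voice_data)  # S204
    third_voice_data = suppress_noise(second_voice_data)         # S206
    return third_voice_data
```

Fed 10s of audio whose second half is at silence level, `process` returns only the processed first half, mirroring the 10s example above.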
Optionally, determining a noise audio segment containing only background noise in the first voice data includes:
(a1) dividing the first voice data into a plurality of audio segments according to a preset time period; and
(a2) determining noise audio segments among the plurality of audio segments according to thresholds of preset voice parameters.
In the above-mentioned action (a1), the voice processing system divides the first voice data into a plurality of audio segments according to a preset time period. The preset time period may be, for example, 2s or 5s, and is not particularly limited herein; with a 2s period, for instance, a 1-minute segment of first voice data is divided into 30 audio segments.
In the above-mentioned action (a2), noise audio segments are determined among the plurality of audio segments according to preset thresholds of the voice parameters. The voice parameters include the average signal energy of the voice data, the preset-frequency energy, and the spectral flatness, where the preset-frequency energy is the energy of the first voice data at frequencies below a preset value. The voice processing system calculates the average energy over a period of time from the energy of the audio segment in that period. The preset frequency is set according to the low-frequency characteristics of noise; it may be set in the range 100Hz-600Hz or to other values, and the preset-frequency energy is the energy value corresponding to the preset frequency after normalization. Spectral flatness exploits the fact that the spectrum of noise is relatively flat compared with the spectrum of speech, which makes it possible to distinguish noise from speech; according to the spectral characteristics of noise, a preferred preset spectral-flatness range is 0.01 to 0.5, although other values may be used and no particular limitation is imposed here. In a preferred embodiment, the average energy, the preset-frequency energy, and the spectral flatness are normalized, and the corresponding parameter thresholds are set using the normalized values.
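As an illustration of these three voice parameters, the snippet below computes one common definition of each for a single audio segment. The patent does not give exact formulas, so the choices here (power spectrum via FFT, preset-frequency energy as a normalized low-band fraction, flatness as geometric mean over arithmetic mean of the power spectrum) are assumptions.

```python
import numpy as np

def segment_features(seg, sr=16000, preset_freq_hz=600):
    """Return (average energy, preset-frequency energy, spectral flatness)
    for one audio segment; definitions are illustrative, not the patent's."""
    spectrum = np.abs(np.fft.rfft(seg)) ** 2          # power spectrum
    freqs = np.fft.rfftfreq(len(seg), d=1.0 / sr)

    # Average energy: mean squared amplitude over the segment.
    avg_energy = np.mean(np.asarray(seg, dtype=np.float64) ** 2)

    # Preset-frequency energy: energy below the preset value (e.g. 600 Hz),
    # normalized by total spectral energy as the text suggests.
    low = spectrum[freqs < preset_freq_hz].sum()
    preset_freq_energy = low / (spectrum.sum() + 1e-12)

    # Spectral flatness: geometric mean / arithmetic mean of the power
    # spectrum; near 1 for flat (noise-like) spectra, near 0 for tonal speech.
    p = spectrum + 1e-12
    flatness = np.exp(np.mean(np.log(p))) / np.mean(p)

    return avg_energy, preset_freq_energy, flatness
```

White noise yields a markedly higher flatness than a pure tone, which is exactly the property the text relies on to separate noise from speech.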
Optionally, determining noise audio segments among the plurality of audio segments according to thresholds of the preset voice parameters includes:
(b1) respectively acquiring the average energy, the preset-frequency energy, and the spectral flatness of each audio segment; and
(b2) determining as noise audio segments those audio segments whose average energy is smaller than a first threshold, whose preset-frequency energy is smaller than a second threshold, and whose spectral flatness is larger than a third threshold.
In the above actions (b1) and (b2), the voice processing system obtains the average energy, the preset-frequency energy, and the spectral flatness of each audio segment, and determines as noise audio segments those segments whose average energy is smaller than the first threshold, whose preset-frequency energy is smaller than the second threshold, and whose spectral flatness is larger than the third threshold. The first threshold is set according to the energy characteristics of noise audio; a preferred range is 400-600, although other values may be used and no particular limitation is imposed here. The second threshold is likewise set according to the energy characteristics of noise audio, with a preferred range of 100-300, although other ranges are possible; in addition, since the preset-frequency energy is the energy of the first voice data at frequencies below the preset value, an audio segment that contains no energy below the preset frequency is judged not to meet the second-threshold condition. The third threshold is set according to the spectral flatness of noise; a preferred range is 0.01 to 0.04, although other values may be used and no particular limitation is imposed here. The audio segments that simultaneously satisfy all three conditions (average energy smaller than the first threshold, preset-frequency energy smaller than the second threshold, and spectral flatness larger than the third threshold) are determined to be noise audio segments.
In a specific embodiment, the voice processing system obtains the average energy, preset-frequency energy, and spectral flatness of ten audio segments, ordered first to tenth by time. The listed average energy values are 120, 150, 80, 70, 200, 90, 110, and 180; with a first threshold of 100, the third, fourth, and sixth audio segments meet the first-threshold condition. The (frequency, energy) pairs at the preset frequency are (550Hz, 410), (580Hz, 450), (400Hz, 300), (350Hz, 480), (5000Hz, 500), (500Hz, 350), (510Hz, 550), and (300Hz, 520); with a preset frequency of 600Hz and a second threshold of 400, the third and sixth audio segments meet the second-threshold condition. The listed spectral-flatness values are 0.01, 0.07, 0.05, 0.01, 0.03, 0.06, and 0.02; with a third threshold of 0.04, the second, third, and sixth audio segments meet the third-threshold condition. From these results, the audio segments that simultaneously satisfy average energy smaller than the first threshold, preset-frequency energy smaller than the second threshold, and spectral flatness larger than the third threshold are the third and sixth audio segments, which are therefore determined to be noise audio segments.
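The three-threshold test of action (b2), as applied in this worked example, can be written as a single predicate. The function name is illustrative, and the default thresholds are the worked example's values (100, 400, 0.04):

```python
def meets_noise_criteria(avg_energy, preset_freq_energy, flatness,
                         first_threshold=100, second_threshold=400,
                         third_threshold=0.04):
    """Action (b2): a segment is noise only if all three conditions hold."""
    return (avg_energy < first_threshold
            and preset_freq_energy < second_threshold
            and flatness > third_threshold)
```

Applied to the example's third segment (80, 300, 0.05) and sixth segment (90, 350, 0.06) this returns True; applied to the first segment (120, 410, 0.01) it returns False, matching the selection in the text.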
Optionally, performing noise suppression processing for suppressing background noise data on the second voice data includes: adjusting the feature values of the second voice data to be within a preset range so that the target voice data can be identified relative to the background noise data.
In this embodiment, the feature values of the second voice data are adjusted to be within a preset range so that the target voice data can be identified relative to the background noise data. The preset range may be obtained by performing data training in advance, using voice recognition technology, on voice data related to the application scenario of the second voice data so as to obtain a voice data model; the feature values of the second voice data are then adjusted into the preset range defined by the feature values of that model. This brings the feature values of the second voice data closer to those in the voice data model and thereby improves the quality of voice recognition.
Optionally, adjusting the feature values of the second voice data to be within the preset range includes:
(c1) gaining the time-domain amplitude of the second voice data by a first proportional threshold according to a preset voice recognition database, to obtain gained second voice data; and
(c2) superimposing preset white noise data on the gained second voice data according to the preset voice recognition database.
in the above-mentioned action (c 1), the time domain amplitude of the second voice data is increased by the first proportional threshold according to the preset voice recognition database, so as to obtain the increased second voice data, where the voice recognition database may be a voice database obtained by performing data training through a voice recognition technology in advance according to an application scenario of the second voice data, or may be a voice database set in other manners, and no particular limitation is imposed here. The first proportional threshold may be set to 0.01 or 0.1, where no special limitation is imposed, and in a preferred embodiment, the setting range of the first proportional threshold is 0.01-0.5, the time domain amplitude of the second voice data is increased by the first proportional threshold, so as to obtain the increased second voice data, and by adjusting the time domain amplitude of the second voice data, the influence of the too large or too small time domain amplitude of the voice data at a certain moment on voice recognition is avoided, thereby improving the quality of voice recognition.
In the above-mentioned action (c2), the preset white noise data is superimposed on the gained second voice data according to the preset voice recognition database. The preset voice recognition database is obtained from voice training data, and that training data includes environmental white noise, so the database itself also contains white noise data. Accordingly, preset white noise data is superimposed on the gained second voice data; the preset white noise may, for example, have a time-domain amplitude of 0.03, or another amplitude value, or other white-noise characteristics. The second voice data with white noise superimposed is the third voice data, which is closer to the voice data in the preset voice recognition database; performing voice recognition on the third voice data with the preset voice recognition module therefore improves the quality of voice recognition.
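Actions (c1) and (c2) amount to a proportional amplitude gain followed by low-amplitude white-noise superposition. The sketch below assumes the gain means scaling by (1 + ratio) and that the preset white noise is Gaussian with the stated time-domain amplitude; both readings are interpretations, since the patent does not pin down the exact operations, and the function name is illustrative.

```python
import numpy as np

def gain_and_add_white_noise(second_voice_data, gain_ratio=0.1,
                             noise_amp=0.03, seed=0):
    """(c1) gain the time-domain amplitude by a proportional factor, then
    (c2) superimpose preset white noise; returns the third voice data."""
    gained = second_voice_data * (1.0 + gain_ratio)       # (c1) amplitude gain
    rng = np.random.default_rng(seed)
    white = noise_amp * rng.standard_normal(len(gained))  # (c2) white noise
    return gained + white
```

With `gain_ratio=0.1` a unit-amplitude signal comes out at 1.1, and with `noise_amp=0.03` the superimposed noise has a standard deviation of roughly 0.03, matching the example amplitude in the text.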
Optionally, the voice processing system includes a dual-microphone headset, and the method includes:
(d1) performing noise reduction processing on the first voice data through the dual-microphone headset.
In the above-mentioned action (d1), the voice processing system includes a dual-microphone headset, and noise suppression is performed on the first voice data by the headset: the primary microphone is placed close to the source of the target voice data, the secondary microphone is farther from that source than the primary microphone, and active noise reduction is performed according to the phase difference between the first voice data received by the two microphones, thereby improving the quality of voice recognition.
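The patent states only that the phase (arrival-time) difference between the two microphones drives the active noise reduction, without giving an algorithm. The sketch below is one deliberately simplified reading: if background noise reaches both microphones at nearly equal amplitude while the target speaker dominates the primary microphone, subtracting the secondary channel cancels mostly noise. Real dual-microphone systems use adaptive filtering rather than this fixed subtraction, and the function name is illustrative.

```python
import numpy as np

def two_mic_noise_reduce(primary, secondary, alpha=1.0):
    """Subtract the (noise-dominated) secondary channel from the primary.

    alpha models the relative noise gain between the two microphones; a
    fixed alpha is a simplification of phase-difference-based reduction."""
    return primary - alpha * np.asarray(secondary)
```

In the idealized case where the secondary microphone picks up only the noise, the subtraction recovers the target speech exactly.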
Further, when the dual-microphone headset performs voice noise reduction, the dual-microphone noise suppression may be applied when the first voice data is received, or after the third voice data has been obtained by the noise-reduction process described above; no particular limitation is imposed here.
In the embodiment of the invention, a voice processing system acquires first voice data, wherein the first voice data comprises target voice data and background noise data; a noise audio segment containing only background noise is determined in the first voice data, and the noise audio segment is removed from the first voice data to generate second voice data; noise suppression processing for suppressing background noise data is then performed on the second voice data to generate third voice data. Because the noise audio segments containing only background noise are removed from the first voice data before noise suppression is applied to what remains, the quality of voice recognition is improved.
Further, referring to fig. 1, according to a second aspect of the present embodiment, there is provided a storage medium. The storage medium includes a stored program, wherein the speech processing method of any one of the above is executed by a processor when the program is run.
Thus, according to this embodiment, the speech processing system obtains first voice data including target voice data and background noise data, determines a noise audio segment containing only background noise in the first voice data, and removes that segment to generate second voice data; noise suppression processing for suppressing the background noise data is then performed on the second voice data to generate third voice data. Removing the noise-only segment before suppression improves the quality of voice recognition.
The storage medium provided in the embodiments of the present application can implement each process in the foregoing method embodiments, and achieve the same functions and effects, which are not repeated here.
It should be noted that, for simplicity of description, the foregoing method embodiments are described as a series of acts; however, those skilled in the art will understand that the present invention is not limited by the order of the acts described, since some steps may be performed in other orders or concurrently. Those skilled in the art will also appreciate that the embodiments described in the specification are preferred embodiments, and that the acts and modules referred to are not necessarily required by the present invention.
From the description of the above embodiments, it will be clear to a person skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or by hardware alone; in many cases the former is preferred. Based on this understanding, the technical solution of the present invention, or the part of it contributing over the prior art, may be embodied in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, or optical disk) that includes instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) to perform the methods of the embodiments of the present invention.
Example 2
Fig. 3 is a schematic diagram of a voice processing apparatus according to an embodiment of the disclosure; the apparatus 300 corresponds to the voice processing method of embodiment 1. Referring to fig. 3, the apparatus 300 includes:
a data acquisition module 301, configured to acquire first voice data, where the first voice data includes target voice data and background noise data;
a noise removing module 302, configured to determine a noise audio segment containing only background noise in the first voice data, and remove the noise audio segment from the first voice data to generate second voice data; and
a noise suppression module 303, configured to perform noise suppression processing for suppressing background noise data on the second voice data to generate third voice data.
Optionally, the noise removal module 302 is specifically configured to:
dividing the first voice data into a plurality of audio segments according to a preset time period; and
determining the noise audio segment among the plurality of audio segments according to thresholds of preset voice parameters.
Optionally, the preset voice parameters include average energy, preset frequency energy, and spectral flatness, where the preset frequency energy is the energy corresponding to components of the first voice data whose frequency is below a preset value.
Optionally, the noise removal module 302 is further specifically configured to:
acquiring the average energy, the preset frequency energy, and the spectral flatness of each audio segment; and
determining as a noise audio segment any audio segment whose average energy is smaller than a first threshold, whose preset frequency energy is smaller than a second threshold, and whose spectral flatness is larger than a third threshold.
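The three-feature rule above can be sketched as follows. The segment length, all threshold values, the 250 Hz cutoff, and the function name are illustrative assumptions, since the patent leaves the concrete thresholds and the preset frequency as configuration presets.

```python
import numpy as np

def is_noise_segment(segment, fs, e_thresh=1e-3, low_e_thresh=5e-4,
                     flat_thresh=0.5, low_freq_hz=250):
    """Classify one fixed-length audio segment as background noise.
    All thresholds are placeholders, not values from the patent."""
    # Average energy of the segment (time domain).
    avg_energy = np.mean(segment ** 2)

    # "Preset frequency energy": power below a preset frequency.
    spectrum = np.abs(np.fft.rfft(segment)) ** 2
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / fs)
    low_energy = spectrum[freqs < low_freq_hz].sum() / len(segment)

    # Spectral flatness: geometric mean over arithmetic mean of the
    # power spectrum (close to 1 for noise-like, flat spectra).
    eps = 1e-12
    flatness = np.exp(np.mean(np.log(spectrum + eps))) / (np.mean(spectrum) + eps)

    return bool(avg_energy < e_thresh and low_energy < low_e_thresh
                and flatness > flat_thresh)
```

A quiet, spectrally flat segment satisfies all three conditions and is removed; a segment carrying voiced speech fails the average-energy test (and typically the flatness test) and is kept.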
Optionally, the noise suppression module 303 is configured to:
adjusting a characteristic value of the second voice data to be within a preset range, so that the target voice data can be recognized relative to the background noise data.
Optionally, the noise suppression module 303 is specifically configured to:
applying a gain of a first proportional threshold to the time-domain amplitude of the second voice data according to a preset voice recognition database, to obtain gained second voice data; and
superposing preset white-noise data on the gained second voice data according to the preset voice recognition database.
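The gain-then-dither step can be sketched as follows. The `target_peak` and `noise_floor` values, the fixed random seed, and the function name are illustrative assumptions; the patent derives its proportional gain and white-noise level from a preset voice recognition database rather than from fixed constants.

```python
import numpy as np

def normalize_for_asr(audio, target_peak=0.6, noise_floor=1e-4, seed=0):
    """Hypothetical sketch of the two sub-steps above: scale the
    time-domain amplitude toward a target peak, then superpose a
    faint white-noise dither."""
    peak = np.max(np.abs(audio))
    if peak > 0:
        # Proportional gain on the time-domain amplitude.
        audio = audio * (target_peak / peak)
    # Superpose preset white-noise data (fixed seed for repeatability).
    rng = np.random.default_rng(seed)
    return audio + noise_floor * rng.standard_normal(len(audio))
```

The intent is to push the waveform's feature values into the range the downstream recognizer was trained on; the faint white noise keeps near-silent stretches from being statistically unlike the training data.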
Optionally, the apparatus further includes a voice recognition module, configured to:
perform voice recognition on the third voice data according to a preset voice recognition module.
Optionally, the voice processing system includes a dual-microphone headset, and the apparatus further includes:
a noise reduction processing module, configured to perform noise reduction processing on the first voice data through the dual-microphone headset.
Thus, according to this embodiment, the speech processing system obtains first voice data including target voice data and background noise data, determines a noise audio segment containing only background noise in the first voice data, and removes that segment to generate second voice data; noise suppression processing for suppressing the background noise data is then performed on the second voice data to generate third voice data. Removing the noise-only segment before suppression improves the quality of voice recognition.
The voice processing apparatus provided in this embodiment of the present application can implement each process in the foregoing method embodiment and achieve the same functions and effects, which are not repeated here.
Example 3
Fig. 4 is a schematic diagram of a speech processing apparatus according to another embodiment of the present disclosure; the apparatus 400 corresponds to the method of the first aspect of embodiment 1. Referring to fig. 4, the apparatus 400 includes: a processor 410; and a memory 420 coupled to the processor 410 and configured to provide the processor 410 with instructions for processing the following steps: acquiring first voice data, where the first voice data includes target voice data and background noise data; determining a noise audio segment containing only background noise in the first voice data, and removing the noise audio segment from the first voice data to generate second voice data; and performing noise suppression processing for suppressing the background noise data on the second voice data to generate third voice data.
Optionally, determining a noise audio segment containing only background noise in the first voice data includes: dividing the first voice data into a plurality of audio segments according to a preset time period; and determining the noise audio segment among the plurality of audio segments according to thresholds of preset voice parameters.
Optionally, the preset voice parameters include average energy, preset frequency energy, and spectral flatness, where the preset frequency energy is the energy corresponding to components of the first voice data whose frequency is below a preset value.
Optionally, determining the noise audio segment among the plurality of audio segments according to thresholds of preset voice parameters includes: acquiring the average energy, the preset frequency energy, and the spectral flatness of each audio segment; and determining as a noise audio segment any audio segment whose average energy is smaller than a first threshold, whose preset frequency energy is smaller than a second threshold, and whose spectral flatness is larger than a third threshold.
Optionally, performing noise suppression processing for suppressing background noise data on the second voice data includes: adjusting a characteristic value of the second voice data to be within a preset range, so that the target voice data can be recognized relative to the background noise data.
Optionally, adjusting the characteristic value of the second voice data to be within a preset range includes: applying a gain of a first proportional threshold to the time-domain amplitude of the second voice data according to a preset voice recognition database, to obtain gained second voice data; and superposing preset white-noise data on the gained second voice data according to the preset voice recognition database.
Optionally, the memory 420 is further configured to provide the processor 410 with instructions for processing the following step: performing voice recognition on the third voice data according to a preset voice recognition module.
Optionally, the voice processing system includes a dual-microphone headset, and the steps further include: performing noise reduction processing on the first voice data through the dual-microphone headset.
Thus, according to this embodiment, the speech processing system obtains first voice data including target voice data and background noise data, determines a noise audio segment containing only background noise in the first voice data, and removes that segment to generate second voice data; noise suppression processing for suppressing the background noise data is then performed on the second voice data to generate third voice data. Removing the noise-only segment before suppression improves the quality of voice recognition.
The voice processing device provided by the embodiment of the present application can implement each process in the foregoing method embodiment, and achieve the same functions and effects, which are not repeated here.
The foregoing embodiment numbers of the present invention are merely for description and do not represent the relative merits of the embodiments.
In the foregoing embodiments of the present invention, each embodiment is described with its own emphasis; for any part not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other manners. The apparatus embodiments described above are merely exemplary; for example, the division of the units is merely a logical function division, and other division manners are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through some interfaces, units, or modules, and may be electrical or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
If the integrated units are implemented in the form of software functional units and sold or used as stand-alone products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part of it contributing over the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
The foregoing is merely a preferred embodiment of the present invention; it should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations shall also fall within the protection scope of the present invention.

Claims (8)

1. A speech processing method applied to a speech processing system, comprising:
acquiring first voice data, wherein the first voice data comprises target voice data and background noise data;
determining a noise audio segment containing only background noise in the first voice data, and removing the noise audio segment from the first voice data to generate second voice data; applying a gain of a first proportional threshold to the time-domain amplitude of the second voice data according to a preset voice recognition database, to obtain gained second voice data;
and superposing preset white-noise data on the gained second voice data according to the preset voice recognition database, to generate third voice data.
2. The method of claim 1, wherein determining a noise audio segment containing only background noise in the first speech data comprises:
dividing the first voice data into a plurality of audio segments according to a preset time period; and
determining the noise audio segment among the plurality of audio segments according to thresholds of preset voice parameters.
3. The method of claim 2, wherein the preset voice parameters include average energy, preset frequency energy, and spectral flatness, wherein the preset frequency energy is the energy corresponding to components of the first voice data whose frequency is below a preset value.
4. The method according to any one of claims 2-3, wherein determining the noise audio segment among the plurality of audio segments according to thresholds of preset voice parameters comprises:
acquiring the average energy, the preset frequency energy, and the spectral flatness of each audio segment; and
determining as a noise audio segment any audio segment whose average energy is smaller than a first threshold, whose preset frequency energy is smaller than a second threshold, and whose spectral flatness is larger than a third threshold.
5. The method of claim 1, wherein the speech processing system comprises a dual-microphone headset, and the method comprises:
performing noise reduction processing on the first voice data through the dual-microphone headset.
6. A storage medium comprising a stored program, wherein the method of any one of claims 1 to 4 is performed by a processor when the program runs.
7. A speech processing device for use in a speech processing system, comprising:
the data acquisition module is used for acquiring first voice data, wherein the first voice data comprises target voice data and background noise data;
a noise removing module, configured to determine a noise audio segment containing only background noise in the first voice data, and remove the noise audio segment from the first voice data to generate second voice data; and to apply a gain of a first proportional threshold to the time-domain amplitude of the second voice data according to a preset voice recognition database, to obtain gained second voice data; and
a noise suppression module, configured to superpose preset white-noise data on the gained second voice data according to the preset voice recognition database, to generate third voice data.
8. A speech processing device for use in a speech processing system, comprising:
a first processor; and
a memory, coupled to the first processor, for providing instructions to the first processor to process the following processing steps:
acquiring first voice data, wherein the first voice data comprises target voice data and background noise data;
determining a noise audio segment containing only background noise in the first voice data, and removing the noise audio segment from the first voice data to generate second voice data; applying a gain of a first proportional threshold to the time-domain amplitude of the second voice data according to a preset voice recognition database, to obtain gained second voice data;
and superposing preset white-noise data on the gained second voice data according to the preset voice recognition database, to generate third voice data.
CN201910955242.9A 2019-10-09 2019-10-09 Voice processing method, device and storage medium Active CN112634921B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910955242.9A CN112634921B (en) 2019-10-09 2019-10-09 Voice processing method, device and storage medium

Publications (2)

Publication Number Publication Date
CN112634921A CN112634921A (en) 2021-04-09
CN112634921B true CN112634921B (en) 2024-02-13

Family

ID=75283321

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910955242.9A Active CN112634921B (en) 2019-10-09 2019-10-09 Voice processing method, device and storage medium

Country Status (1)

Country Link
CN (1) CN112634921B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114187910A (en) * 2021-12-16 2022-03-15 平安证券股份有限公司 Information input method, device, equipment and storage medium based on voice recognition

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101872616A (en) * 2009-04-22 2010-10-27 索尼株式会社 Endpoint detection method and system using same
CN103117067A (en) * 2013-01-19 2013-05-22 渤海大学 Voice endpoint detection method under low signal-to-noise ratio
CN103903634A (en) * 2012-12-25 2014-07-02 中兴通讯股份有限公司 Voice activation detection (VAD), and method and apparatus for the VAD
CN104464722A (en) * 2014-11-13 2015-03-25 北京云知声信息技术有限公司 Voice activity detection method and equipment based on time domain and frequency domain
CN105118502A (en) * 2015-07-14 2015-12-02 百度在线网络技术(北京)有限公司 End point detection method and system of voice identification system
CN109256137A (en) * 2018-10-09 2019-01-22 深圳市声扬科技有限公司 Voice acquisition method, device, computer equipment and storage medium
CN109817241A (en) * 2019-02-18 2019-05-28 腾讯音乐娱乐科技(深圳)有限公司 Audio-frequency processing method, device and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150081287A1 (en) * 2013-09-13 2015-03-19 Advanced Simulation Technology, inc. ("ASTi") Adaptive noise reduction for high noise environments
US9842608B2 (en) * 2014-10-03 2017-12-12 Google Inc. Automatic selective gain control of audio data for speech recognition

Also Published As

Publication number Publication date
CN112634921A (en) 2021-04-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant