CN117037853A - Audio signal endpoint detection method, device, medium and electronic equipment - Google Patents


Info

Publication number
CN117037853A
CN117037853A (application CN202310726709.9A)
Authority
CN
China
Prior art keywords
initial
voice
image
voice signal
signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310726709.9A
Other languages
Chinese (zh)
Inventor
曾焕炳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Jiewei Intelligent Technology Co ltd
Original Assignee
Xiamen Jiewei Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Jiewei Intelligent Technology Co ltd filed Critical Xiamen Jiewei Intelligent Technology Co ltd
Priority to CN202310726709.9A priority Critical patent/CN117037853A/en
Publication of CN117037853A publication Critical patent/CN117037853A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/87: Detection of discrete points within a voice signal
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering

Abstract

The embodiment of the application provides an endpoint detection method and apparatus for an audio signal, a medium, and electronic equipment. The method comprises the following steps: determining the short-time energy and zero-crossing rate corresponding to an initial voice signal, and from them an initial voice start point and an initial voice end point; intercepting the voice segment between the initial voice start point and the initial voice end point in the initial voice signal and denoising it to obtain a denoised voice signal; and identifying a target voice start point and a target voice end point corresponding to the initial voice signal from a first waveform image, a first spectrum image, and a first time-domain envelope image converted from the denoised voice signal, together with a second waveform image, a second spectrum image, and a second time-domain envelope image converted from the voice segment between the initial voice start point and the initial voice end point in the initial voice signal. The technical scheme of the embodiment of the application can improve the accuracy and stability of voice endpoint detection.

Description

Audio signal endpoint detection method, device, medium and electronic equipment
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a method, an apparatus, a medium, and an electronic device for detecting an endpoint of an audio signal.
Background
VAD (Voice Activity Detection), also called voice endpoint detection, detects the presence or absence of speech in a noisy environment. It is generally applied in voice processing systems such as speech coding and speech enhancement to reduce the coding rate, save communication bandwidth, lower the energy consumption of mobile devices, and improve the recognition rate. Current schemes often perform voice endpoint detection by combining short-time energy and zero-crossing rate decisions; however, this approach depends heavily on the environment and the signal-to-noise ratio, and its accuracy cannot be guaranteed. How to improve the accuracy and stability of voice endpoint detection has therefore become an urgent technical problem.
Disclosure of Invention
The embodiment of the application provides an endpoint detection method, apparatus, medium and electronic equipment for audio signals, which can improve the accuracy and stability of voice endpoint detection at least to a certain extent.
Other features and advantages of the application will be apparent from the following detailed description, or may be learned by the practice of the application.
According to an aspect of an embodiment of the present application, there is provided an endpoint detection method for an audio signal, including:
determining short-time energy and zero crossing rate corresponding to the initial voice signal according to the received initial voice signal;
determining an initial voice starting point and an initial voice ending point corresponding to the initial voice signal according to the short-time energy, the zero crossing rate, a preset initial short-time energy threshold value and an initial zero crossing rate threshold value corresponding to the initial voice signal;
intercepting a voice fragment between the initial voice starting point and the initial voice ending point in the initial voice signal and carrying out noise reduction treatment on the voice fragment to obtain a noise-removed voice signal;
converting the denoised speech signal into a first waveform image, a first spectral image and a first envelope image comprising a time domain;
converting a voice segment between the initial voice starting point and the initial voice ending point in the initial voice signal into a second waveform image, a second frequency spectrum image and a second envelope image comprising a time domain;
and performing recognition on the first waveform image, the first spectrum image, the first envelope image, the second waveform image, the second spectrum image and the second envelope image to determine a target voice start point and a target voice end point corresponding to the initial voice signal.
According to an aspect of an embodiment of the present application, there is provided an endpoint detection apparatus for an audio signal, including:
the first determining module is used for determining short-time energy and zero crossing rate corresponding to the initial voice signal according to the received initial voice signal;
the second determining module is used for determining an initial voice starting point and an initial voice ending point corresponding to the initial voice signal according to the short-time energy corresponding to the initial voice signal, the zero crossing rate, a preset initial short-time energy threshold value and an initial zero crossing rate threshold value;
the denoising module is used for intercepting a voice fragment between the initial voice starting point and the initial voice ending point in the initial voice signal and denoising the voice fragment to obtain a denoised voice signal;
the first conversion module is used for converting the denoised voice signal into a first waveform image, a first spectrum image and a first envelope image comprising a time domain;
the second conversion module is used for converting the voice segment between the initial voice start point and the initial voice end point in the initial voice signal into a second waveform image, a second spectrum image and a second envelope image comprising a time domain;
and the processing module is used for determining a target voice start point and a target voice end point corresponding to the initial voice signal according to the first waveform image, the first spectrum image, the first envelope image, the second waveform image, the second spectrum image and the second envelope image.
According to an aspect of the embodiments of the present application, there is provided a computer-readable medium having stored thereon a computer program which, when executed by a processor, implements an endpoint detection method for an audio signal as described in the above embodiments.
According to an aspect of an embodiment of the present application, there is provided an electronic apparatus including: one or more processors; and a storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the endpoint detection method for audio signals as described in the above embodiments.
According to an aspect of embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the endpoint detection method of the audio signal provided in the above-described embodiment.
In some embodiments of the present application, the short-time energy and zero-crossing rate corresponding to a received initial voice signal are determined, and an initial voice start point and an initial voice end point corresponding to the initial voice signal are determined from the short-time energy, the zero-crossing rate, and the preset initial short-time energy and zero-crossing rate thresholds. The voice segment between the initial voice start point and the initial voice end point in the initial voice signal is intercepted and denoised to obtain a denoised voice signal. The denoised voice signal is converted into a first waveform image, a first spectrum image and a first time-domain envelope image, and the voice segment between the initial voice start point and the initial voice end point in the initial voice signal is converted into a second waveform image, a second spectrum image and a second time-domain envelope image, so that a target voice start point and a target voice end point corresponding to the initial voice signal are determined by recognition over these six images. The start point and end point of the voice signal are thus determined by combining short-time energy, zero-crossing rate and image recognition, which improves the accuracy and stability of the resulting endpoint detection.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application. It is evident that the drawings in the following description are only some embodiments of the present application and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art. In the drawings:
fig. 1 shows a flow diagram of a method of endpoint detection of an audio signal according to an embodiment of the application;
fig. 2 shows a block diagram of an end-point detection device of an audio signal according to an embodiment of the application;
fig. 3 shows a schematic diagram of a computer system suitable for use in implementing an embodiment of the application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the application may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the application.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
Fig. 1 shows a flow chart of a method of endpoint detection of an audio signal according to an embodiment of the application. The method may be applied to a terminal device or a server; the terminal device may include, but is not limited to, one or more of a smart phone, a tablet computer, a laptop computer, and a desktop computer, and the server may be a physical server or a cloud server.
As shown in fig. 1, the method for detecting an endpoint of an audio signal at least includes steps S110 to S160, and is described in detail as follows (hereinafter, the method is applied to a terminal device for example, and is abbreviated as "terminal"):
in step S110, short-time energy and zero crossing rate corresponding to the initial voice signal are determined according to the received initial voice signal.
In an example, a sound receiving device (such as a microphone) may be disposed on the terminal, and when voice recognition is required, for example, voice input, subtitle recognition, etc., the terminal may obtain an initial voice signal input by a user through the sound receiving device; in another example, the terminal may obtain the initial voice signal from a local storage space or the internet. The method for acquiring the corresponding initial voice signal can be selected by those skilled in the art according to the actual implementation requirements, and is not particularly limited.
Then, after acquiring the initial voice signal, the terminal may calculate its corresponding short-time energy and zero-crossing rate from the voice signal.
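The patent leaves the exact computation to conventional methods; a minimal numpy sketch of frame-wise short-time energy and zero-crossing rate might look like the following (frame length, hop size, and the toy silence-plus-tone signal are illustrative assumptions, not values from the patent):

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split a 1-D signal into overlapping frames."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def short_time_energy(frames):
    """Sum of squared samples in each frame."""
    return np.sum(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    """Fraction of adjacent-sample sign changes in each frame."""
    signs = np.sign(frames)
    signs[signs == 0] = 1  # count exact zeros as positive
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

# toy signal: half a second of silence, then a 440 Hz tone, at 8 kHz
fs = 8000
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(fs // 2) / fs)
x = np.concatenate([np.zeros(fs // 2), tone])
frames = frame_signal(x)
ste = short_time_energy(frames)
zcr = zero_crossing_rate(frames)
```

On this toy signal the silent frames have zero energy and zero-crossing rate, while the tone frames show both clearly, which is exactly the contrast the later thresholding step exploits.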
In step S120, an initial speech start point and an initial speech end point corresponding to the initial speech signal are determined according to the short-time energy, the zero-crossing rate, the preset initial short-time energy threshold value and the initial zero-crossing rate threshold value corresponding to the initial speech signal.
In one embodiment, it should be appreciated that a speech signal can generally be divided into silence segments, unvoiced segments and voiced segments. The silence segment is a background-noise segment with the lowest average energy; the voiced segment is the speech signal produced by vocal cord vibration and has the highest average energy; the unvoiced segment is produced by friction, impact or plosion of air in the mouth, with an average energy in between. The waveform characteristics of voiced and unvoiced segments differ markedly: the voiced signal changes slowly, while the unvoiced signal changes sharply in amplitude and crosses the zero level many times, so the zero-crossing rate of the unvoiced segment is usually the highest. Endpoint detection therefore first determines whether a segment is speech or silence and, if speech, whether it is unvoiced or voiced.
For this purpose, the terminal device may query and obtain a preset initial short-time energy threshold and an initial zero-crossing rate threshold, where the initial short-time energy threshold includes a high energy value and a low energy value. And determining an initial voice starting point and an initial voice ending point corresponding to the initial voice signal according to the short-time energy, the zero-crossing rate, the initial short-time energy threshold value and the initial zero-crossing rate threshold value corresponding to the initial voice signal. It should be noted that, the endpoint detection based on the short-time energy and the zero crossing rate may be performed by using an existing detection method, which is not described herein.
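The patent defers to existing double-threshold detection; one common form of that heuristic, sketched here under the assumption that frame-wise energies and zero-crossing rates have already been computed (all threshold values below are illustrative), finds the high-energy voiced core and then widens it using the low energy threshold and the zero-crossing rate:

```python
import numpy as np

def double_threshold_vad(ste, zcr, amp1, amp2, zcr_thresh):
    """Classic double-threshold detection: find the voiced core with the
    high energy threshold amp1, then widen it with the low threshold amp2
    and the zero-crossing rate (to catch unvoiced onsets and offsets)."""
    n = len(ste)
    core = [i for i in range(n) if ste[i] > amp1]
    if not core:
        return None  # no speech found
    start, end = core[0], core[-1]
    while start > 0 and (ste[start - 1] > amp2 or zcr[start - 1] > zcr_thresh):
        start -= 1
    while end < n - 1 and (ste[end + 1] > amp2 or zcr[end + 1] > zcr_thresh):
        end += 1
    return start, end

ste = np.array([0.1, 0.1, 0.3, 2.0, 5.0, 4.0, 0.4, 0.1])
zcr = np.array([0.02, 0.02, 0.3, 0.1, 0.1, 0.1, 0.3, 0.02])
seg = double_threshold_vad(ste, zcr, amp1=1.0, amp2=0.2, zcr_thresh=0.25)
```

Here the high threshold isolates frames 3 to 5, and the low-energy and zero-crossing extensions pull the boundaries out to frames 2 and 6.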
In step S130, a speech segment between the initial speech start point and the initial speech end point in the initial speech signal is intercepted and noise reduction is performed to obtain a noise-removed speech signal.
In this embodiment, the terminal may intercept a voice segment between the initial voice start point and the initial voice end point in the initial voice signal, and may reduce the influence of the silence segment on subsequent recognition. And performing noise reduction processing on the intercepted voice fragments so as to remove noise in the voice fragments and obtain a noise-removed voice signal. It should be noted that, the terminal may use an existing denoising algorithm to denoise the intercepted speech segment, which is not limited in particular.
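The denoising algorithm itself is left open by the text. As one stand-in, a basic spectral-subtraction pass (estimating the noise spectrum from an assumed-quiet leading portion; frame size and durations are illustrative assumptions) could look like:

```python
import numpy as np

def spectral_subtraction(x, fs, noise_dur=0.1, frame_len=256):
    """Subtract a noise magnitude spectrum, estimated from the first
    `noise_dur` seconds, from every frame (magnitudes floored at zero)."""
    hop = frame_len // 2
    win = np.hanning(frame_len)  # 50% overlap: Hann windows sum to ~1
    n_frames = 1 + (len(x) - frame_len) // hop
    spec = np.stack([np.fft.rfft(win * x[i * hop:i * hop + frame_len])
                     for i in range(n_frames)])
    n_noise = max(1, int(noise_dur * fs) // hop)
    noise_mag = np.abs(spec[:n_noise]).mean(axis=0)
    mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
    clean = spec * (mag / np.maximum(np.abs(spec), 1e-12))
    y = np.zeros(len(x))  # overlap-add resynthesis
    for i in range(n_frames):
        y[i * hop:i * hop + frame_len] += np.fft.irfft(clean[i], frame_len)
    return y

# toy check: noise everywhere, a tone only in the second half
fs = 8000
rng = np.random.default_rng(0)
x = 0.05 * rng.standard_normal(fs)
x[fs // 2:] += 0.5 * np.sin(2 * np.pi * 440 * np.arange(fs // 2) / fs)
y = spectral_subtraction(x, fs)
```

After subtraction, the energy in the noise-only portion of the output is visibly lower than in the input, while the tone is largely preserved.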
In step S140, the denoised speech signal is converted into a first waveform image, a first spectral image and a first envelope image comprising a time domain.
In this embodiment, the terminal may convert the denoised speech signal into a corresponding first waveform image, first spectrum image and first time-domain envelope image. It should be noted that a speech signal is a complex multi-frequency signal in which each frequency component has a different amplitude. When the components are arranged by frequency, the curve connecting their peaks is the envelope, and the shape of the envelope varies with the sound emitted. The acoustic wave generated by vocal cord vibration resonates as it passes through the vocal tract formed by the oral cavity, nasal cavity and so on; the result of resonance is that certain regions of the spectrum are reinforced, so the shape of the envelope also varies from person to person. In general, the envelope has several peaks and valleys, of which the first three formants carry most of the information of the speech; their frequency and amplitude vary with the sound emitted.
In an example, in the field of image processing, a filter shaped as a spectral envelope, i.e. a spectral envelope filter, can be established, whose values at each point of the image correspond to the correction parameters of the image spectrum. The region with larger value corresponds to an energy gathering region which needs to be reserved in the image spectrum and has important effect on the reconstructed image; the region with smaller value corresponds to the background noise region to be suppressed in the image spectrum, the image spectrum corrected by the spectrum envelope filter basically reserves the image characteristics of the original spectrum, and the frequency components of the background noise are suppressed while the energy gathering region of the image is well protected, so that the filtered image is well combined with the edge protection and smoothness.
In step S150, a speech segment between the initial speech start point and the initial speech end point in the initial speech signal is converted into a second waveform image, a second spectrum image, and a second envelope image including a time domain.
In this embodiment, the terminal may convert a speech segment between an initial speech start point and an initial speech end point in the initial speech signal, which is not subjected to the denoising process, into a corresponding second waveform image, second spectrum image, and second envelope image including a time domain.
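The text does not specify how the three representations are rendered. As a sketch of the underlying arrays (before any rendering to pixels), the waveform is the signal itself, the spectrum can come from frame-wise FFTs, and the time-domain amplitude envelope from the analytic signal; frame parameters and the test tone are illustrative assumptions:

```python
import numpy as np

def analytic_envelope(x):
    """Amplitude envelope via the analytic signal (FFT-based Hilbert transform)."""
    n = len(x)
    spec = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    h[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        h[n // 2] = 1.0
    return np.abs(np.fft.ifft(spec * h))

def to_representations(x, frame_len=256, hop=128):
    """Waveform, magnitude spectrogram, and time-domain amplitude envelope:
    the three representations the method renders as images."""
    n = 1 + (len(x) - frame_len) // hop
    win = np.hanning(frame_len)
    spec = np.abs(np.stack([np.fft.rfft(win * x[i * hop:i * hop + frame_len])
                            for i in range(n)]))
    return x.copy(), spec, analytic_envelope(x)

fs = 8000
x = 0.5 * np.sin(2 * np.pi * 440 * np.arange(fs) / fs)  # 440 cycles, exact fit
wave, spec, env = to_representations(x)
```

For a pure tone that fits the window exactly, the envelope is flat at the tone's amplitude, which is what makes envelope images useful for locating where speech energy starts and stops.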
In step S160, a target speech start point and a target speech end point corresponding to the initial speech signal are determined according to the first waveform image, the first spectrum image, the first envelope image, the second waveform image, the second spectrum image, and the second envelope image.
In an embodiment, the terminal device may perform image recognition based on the first waveform image, the first spectrum image, the first envelope image, the second waveform image, the second spectrum image, and the second envelope image, so as to determine the target voice start point and the target voice end point corresponding to the initial voice signal.
In an example, the terminal may perform end point detection according to each image to determine a voice start point and a voice end point corresponding to each image, and then perform weighted average calculation on the voice start point and the voice end point determined based on each image to determine a target voice start point and a target voice end point corresponding to the initial voice signal.
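The weights for this averaging are not specified in the text; a minimal equal-weight combination of six hypothetical per-image endpoint estimates (all sample indices below are made up for illustration) could be:

```python
import numpy as np

# Hypothetical per-image endpoint estimates (sample indices), one per image;
# the weights are not given in the text, so equal weights are assumed here.
starts = np.array([4010, 3985, 4022, 4000, 3990, 4005])
ends = np.array([7950, 7990, 7940, 7970, 7985, 7960])
w = np.full(6, 1 / 6)
target_start = int(round(float(np.sum(w * starts))))
target_end = int(round(float(np.sum(w * ends))))
```

Unequal weights (e.g. favouring the denoised images) would slot into `w` without changing the rest.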
In an example, when performing endpoint detection on an image, the local maxima of the image contour can be located. For each local maximum, the points on both sides are examined by comparing each point with the next: as long as a point is lower than its predecessor, the contour is still falling, and the points on either side of the maximum where the descent stops are taken as the start point and end point of the word or syllable. The next local maximum is then processed in the same way, so that the target voice start points and end points are found in sequence.
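The peak-walking procedure just described can be sketched directly on a 1-D contour (the contour values here are illustrative):

```python
import numpy as np

def syllable_endpoints(contour):
    """For each local maximum of the contour, walk outwards on both sides
    while the contour keeps falling; where the descent stops is taken as
    the start/end of that word or syllable."""
    peaks = [i for i in range(1, len(contour) - 1)
             if contour[i] > contour[i - 1] and contour[i] >= contour[i + 1]]
    spans = []
    for p in peaks:
        s = p
        while s > 0 and contour[s - 1] < contour[s]:
            s -= 1
        e = p
        while e < len(contour) - 1 and contour[e + 1] < contour[e]:
            e += 1
        spans.append((s, e))
    return spans

contour = np.array([0.0, 0.1, 0.5, 0.9, 0.4, 0.1, 0.2, 0.8, 0.3, 0.0])
spans = syllable_endpoints(contour)
```

On this contour the two peaks at indices 3 and 7 yield two overlapping syllable spans, with the valley at index 5 shared as one span's end and the next span's start.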
Thus, based on the embodiment shown in fig. 1, the short-time energy and zero-crossing rate corresponding to the received initial voice signal are determined, and the initial voice start point and initial voice end point are determined from them together with the preset initial short-time energy and zero-crossing rate thresholds. The voice segment between the initial voice start point and the initial voice end point in the initial voice signal is intercepted and denoised to obtain a denoised voice signal, which is converted into a first waveform image, a first spectrum image and a first time-domain envelope image; the original voice segment is likewise converted into a second waveform image, a second spectrum image and a second time-domain envelope image. The target voice start point and target voice end point corresponding to the initial voice signal are then determined from the first waveform image, the first spectrum image, the first envelope image, the second waveform image, the second spectrum image and the second envelope image. The start point and end point of the voice signal are thus determined by combining short-time energy, zero-crossing rate and image recognition, which improves the accuracy and stability of the endpoint detection result.
In one embodiment of the present application, determining a target speech start point and a target speech end point corresponding to the initial speech signal according to the first waveform image, the first spectrum image, the first envelope image, the second waveform image, the second spectrum image, and the second envelope image includes:
comparing the first waveform image with the second waveform image, the first spectrum image with the second spectrum image, and the first envelope image with the second envelope image, respectively, and determining the similarity of each pair of images;
and identifying a pair of images with highest similarity, and determining a target voice starting point and a target voice ending point corresponding to the initial voice signal.
In this embodiment, the terminal may compare images of the same type to determine the similarity of each pair: the first waveform image with the second waveform image, the first spectrum image with the second spectrum image, and the first envelope image with the second envelope image.
Then, the terminal may perform end point detection based on a pair of images having the highest similarity, thereby determining a target voice start point and a target voice end point corresponding to the initial voice signal. In an example, the terminal may perform endpoint detection on two images in the pair of images, and then perform weighted average operation on the two endpoint detection results to determine a corresponding target voice start point and a target voice end point. In another example, the terminal may perform end point detection on an image converted from the denoised speech signal in the pair of images to determine the corresponding target speech start point and target end point. It should be appreciated that the image converted after the denoising process reduces the influence of noise on the end point detection result, thereby improving the accuracy and stability of the end point detection result.
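The similarity measure is not specified in the text; one simple choice is normalized cross-correlation, sketched here on toy random "images" (the three pairs and their noise levels are fabricated so that the ordering is predictable):

```python
import numpy as np

def similarity(a, b):
    """Normalized cross-correlation of two equal-shape images, in [-1, 1]."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float((a * b).sum() / denom) if denom > 0 else 1.0

# Toy stand-ins for the three (denoised, raw) image pairs; increasing noise
# makes the later pairs less similar, so "waveform" should score highest here.
rng = np.random.default_rng(1)
base = rng.standard_normal((32, 32))
pairs = {
    "waveform": (base, base + 0.1 * rng.standard_normal((32, 32))),
    "spectrum": (base, base + 0.5 * rng.standard_normal((32, 32))),
    "envelope": (base, base + 1.0 * rng.standard_normal((32, 32))),
}
scores = {name: similarity(a, b) for name, (a, b) in pairs.items()}
best = max(scores, key=scores.get)  # endpoint detection then runs on this pair
```

Any other image-similarity metric (SSIM, histogram distance, a learned embedding) would drop into `similarity` without changing the selection logic.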
In one embodiment of the present application, before the first waveform image is compared with the second waveform image, the first spectrum image with the second spectrum image, and the first envelope image with the second envelope image to determine the similarity of each pair, the method further includes:
and performing graphic transformation processing on the first waveform image, the first spectrum image, the first envelope image, the second waveform image, the second spectrum image and the second envelope image to obtain corresponding binary images, wherein the graphic transformation processing comprises stretching and/or amplifying.
In this embodiment, before performing endpoint detection the terminal applies a graphic transformation, including stretching and/or enlarging, to convert each image into a binary image, which improves the accuracy of both the comparison and the subsequent endpoint detection.
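A minimal version of this stretch-enlarge-binarize step (the scale factor and mean-valued threshold are illustrative assumptions; the patent only names stretching/amplifying and a binary output) might be:

```python
import numpy as np

def to_binary(img, scale=2, thresh=None):
    """Contrast-stretch to [0, 1], enlarge by nearest-neighbour repetition,
    then threshold (at the mean by default) into a binary image."""
    img = img.astype(np.float64)
    lo, hi = img.min(), img.max()
    stretched = (img - lo) / (hi - lo) if hi > lo else np.zeros_like(img)
    enlarged = np.repeat(np.repeat(stretched, scale, axis=0), scale, axis=1)
    t = stretched.mean() if thresh is None else thresh
    return (enlarged >= t).astype(np.uint8)

img = np.array([[0, 50], [100, 200]])  # tiny grayscale example
b = to_binary(img)
```

The 2x2 example doubles to 4x4, with the two darker cells mapping to 0 and the two brighter cells to 1.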
In one embodiment of the present application, capturing a speech segment between the initial speech start point and the initial speech end point in the initial speech signal and performing noise reduction processing on the speech segment to obtain a noise-removed speech signal, including:
intercepting a voice segment between the initial voice start point and the initial voice end point in the initial voice signal, and framing the voice segment for denoising decoding calculation to obtain an initial denoised voice signal;
performing noise estimation on the initial denoising voice signal, and determining a corresponding noise level;
and when the noise level meets a preset condition, inputting the initial denoised voice signal into a filter to remove the echo signal, obtaining a target denoised voice signal.
In an embodiment, the terminal may perform frame processing on the intercepted speech segments, and take each frame of noisy speech signal as input of a deep noise reduction model, and perform denoising decoding calculation through the deep noise reduction model to obtain an initial denoised speech signal. It should be noted that the deep noise reduction model may be an existing noise reduction model, which is not limited in particular by the present application.
Then, the terminal may perform noise estimation on the initial denoised voice signal to determine its noise level. When the noise level is low, no further denoising is performed; if the noise level is high, the initial denoised voice signal is input to a filter to remove the echo signal it contains, yielding the target denoised voice signal. The filter can use an adaptive algorithm to control its parameters, modelling the channel that produces the echo so that the echo signal can be estimated and cancelled. Through this second round of noise reduction, noise in the initial voice signal can be effectively removed, improving the accuracy and stability of the subsequent endpoint detection result.
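The deep noise-reduction model is out of scope here, but the noise estimate and the adaptive echo filter can be sketched: a crude RMS-based level estimate, plus a normalized-LMS filter that learns the echo path from a reference signal (the filter length, step size, and simulated echo path are illustrative assumptions):

```python
import numpy as np

def noise_level_db(x, n_lead):
    """Crude noise estimate: RMS of the assumed-quiet leading samples, in dB."""
    rms = np.sqrt(np.mean(x[:n_lead] ** 2) + 1e-12)
    return 20.0 * np.log10(rms)

def nlms_echo_cancel(mic, ref, taps=16, mu=0.5):
    """Normalized-LMS adaptive filter: estimate the echo of `ref`
    present in `mic` and subtract it, returning the residual."""
    w = np.zeros(taps)
    out = np.zeros(len(mic))
    for n in range(taps, len(mic)):
        u = ref[n - taps:n][::-1]         # most recent reference samples
        e = mic[n] - w @ u                # a priori error = echo-cancelled sample
        w += mu * e * u / (u @ u + 1e-8)  # NLMS weight update
        out[n] = e
    return out

# toy check: the "microphone" hears only a delayed, scaled copy of `ref`
rng = np.random.default_rng(0)
ref = rng.standard_normal(4000)
h = np.zeros(8); h[3] = 0.6               # simulated echo path (3-sample delay)
mic = np.convolve(ref, h)[:4000]
residual = nlms_echo_cancel(mic, ref)
level = noise_level_db(mic, 800)
```

Once the filter has converged, the residual in the later part of the signal is far smaller than the raw echo, which is the behaviour the "estimate the echo and eliminate it" description relies on.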
In an embodiment, before framing the speech segment for denoising decoding calculation, the method further comprises:
and performing format conversion on the voice fragments to obtain the voice fragments with single channels.
In this embodiment, for a speech recognition task, a single-channel speech signal can meet a recognition requirement, and a single-channel speech fragment is obtained by performing format conversion on the speech fragment, so that the calculation resources occupied by the subsequent steps can be reduced, the accuracy of a recognition result is ensured, and the calculation cost is saved.
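Downmixing to a single channel is the simplest part of the pipeline; averaging across channels is one standard way to do it (the two-sample stereo array is purely illustrative):

```python
import numpy as np

def to_mono(x):
    """Average the channels of an (n_samples, n_channels) array; mono passes through."""
    x = np.asarray(x, dtype=np.float64)
    return x if x.ndim == 1 else x.mean(axis=1)

stereo = np.array([[1.0, 3.0], [2.0, 4.0]])  # two samples, two channels
mono = to_mono(stereo)
```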
In one embodiment of the present application, determining an initial voice start point and an initial voice end point corresponding to the initial voice signal according to the short-time energy, the zero crossing rate corresponding to the initial voice signal, and a preset initial short-time energy threshold value and an initial zero crossing rate threshold value includes:
optimizing the preset initial short-time energy threshold and initial zero-crossing rate threshold according to the short-time energy and zero-crossing rate of the first preset number of voice frames of the initial voice signal, to obtain a target short-time energy threshold and a target zero-crossing rate threshold;
and determining an initial voice start point and an initial voice end point corresponding to the initial voice signal according to the short-time energy and zero-crossing rate of the initial voice signal and the target short-time energy threshold and target zero-crossing rate threshold.
In this embodiment, since the first short period of the collected sound signal is mostly silence or background noise, the zero-crossing rate threshold zcr, the low energy threshold amp2, and the high energy threshold amp1 can be calculated from the first few frames of the sound signal (typically 10 frames, i.e. the preset number of leading voice frames), which are known to be "stationary". To calculate amp1 and amp2, the short-time average energy or average amplitude of each frame in the first 10 frames is first computed; the maximum value is denoted emax and the minimum value emin.
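The per-frame short-time energy and zero-crossing rate used above can be computed as follows; the frame length, hop size, and the sum-of-squares energy definition are illustrative choices, since the text does not fix them:

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split a 1-D signal into overlapping frames of frame_len samples."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

def short_time_energy(frames):
    """Sum of squared samples per frame."""
    return np.sum(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose signs differ, per frame."""
    signs = np.sign(frames)
    signs[signs == 0] = 1          # treat exact zeros as positive
    return np.mean(np.abs(np.diff(signs, axis=1)) / 2, axis=1)
```

Silence frames yield low energy and low-to-moderate zero-crossing rate, which is what makes the first few frames usable for threshold calibration.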
Before the short-time energy is calculated, the signal is first passed through a high-pass filter, namely a pre-emphasis filter, in order to filter out low-frequency interference, especially 50 Hz or 60 Hz power-frequency interference, and to boost the high-frequency components that are more useful for speech recognition. Applying this filter before the short-time energy is calculated also eliminates direct-current drift, suppresses random noise, and raises the energy of the unvoiced parts.
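A standard first-order pre-emphasis filter of the kind described is y[n] = x[n] - alpha * x[n-1]; the coefficient 0.97 below is a conventional choice, not one taken from the source:

```python
import numpy as np

def pre_emphasis(x: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """First-order high-pass pre-emphasis: y[n] = x[n] - alpha * x[n-1].
    Attenuates low-frequency content (DC drift, 50/60 Hz mains) and
    boosts the high frequencies that matter for speech recognition."""
    y = np.empty_like(x, dtype=float)
    y[0] = x[0]
    y[1:] = x[1:] - alpha * x[:-1]
    return y
```

On a constant (DC) input the output after the first sample is reduced to (1 - alpha) of the input level, which illustrates the suppression of direct-current drift mentioned above.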
According to the short-time energy amp of the first 10 frames of the voice signal, the energy threshold values are adjusted according to the following formulas:

amp1′ = min(amp1, max(amp)/4);

amp2′ = min(amp2, max(amp)/8);

wherein amp1 is the high energy threshold value in the initial short-time energy thresholds, amp2 is the low energy threshold value in the initial short-time energy thresholds, and amp1′ and amp2′ are the corresponding adjusted threshold values.
In this way, the threshold values are adjusted to the actual conditions of the voice signal, so that voice endpoints can be detected more reliably and the accuracy of the subsequent endpoint detection result is improved.
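The threshold adjustment above can be sketched directly from the formulas; the function below assumes amp holds the short-time energies of the first 10 ("silence") frames:

```python
import numpy as np

def adjust_energy_thresholds(amp, amp1, amp2, n_init=10):
    """Tighten the preset high/low energy thresholds using the short-time
    energy of the first n_init frames, following
    amp1' = min(amp1, max(amp)/4) and amp2' = min(amp2, max(amp)/8)."""
    init = np.asarray(amp)[:n_init]
    amp1_adj = min(amp1, init.max() / 4)
    amp2_adj = min(amp2, init.max() / 8)
    return amp1_adj, amp2_adj
```

When the leading frames are quiet, max(amp) is small, so both thresholds are pulled down and weak speech is less likely to be missed.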
In one embodiment of the present application, before determining the short-time energy and the zero-crossing rate corresponding to the initial speech signal according to the received initial speech signal, the method further includes:
and performing detrending (trend term removal) and filtering processing on the initial voice signal.
In this embodiment, a linear or slowly varying trend error may appear in the time sequence while the voice signal is being collected, for example because of zero drift of the amplifier with temperature change, so that the zero line of the voice signal deviates from the baseline, and the magnitude of the deviation may even vary with time. This deviation of the zero line from the baseline over time is known as the trend term of the signal. Trend term errors can distort the correlation function and the power spectrum function in subsequent processing calculations, and can even completely destroy the authenticity and correctness of the spectrum estimate in the low-frequency band.
Therefore, trend term elimination and filtering can be performed on the initial voice signal, improving the accuracy of the subsequent endpoint detection result. In one example, the trend term elimination and filtering may be based on the least squares method; other algorithms may be used in other embodiments, and the present application is not limited in this respect.
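A least-squares detrend, as in the example above, fits a low-order polynomial to the signal and subtracts it; the polynomial order below is an assumption (order 1 removes a linear drift):

```python
import numpy as np

def detrend_least_squares(x: np.ndarray, order: int = 1) -> np.ndarray:
    """Remove a slowly varying trend (e.g. amplifier zero drift) by
    fitting an order-'order' polynomial with least squares and
    subtracting it from the signal."""
    t = np.arange(len(x))
    coeffs = np.polyfit(t, x, order)      # least-squares polynomial fit
    return x - np.polyval(coeffs, t)
```

A purely linear drift is removed exactly, restoring the zero line to the baseline before the short-time energy and zero-crossing rate are computed.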
The following describes an embodiment of the apparatus of the present application, which may be used to perform the method of endpoint detection of an audio signal in the above-described embodiment of the present application. For details not disclosed in the embodiments of the apparatus of the present application, please refer to the embodiments of the method for detecting an endpoint of an audio signal.
Fig. 2 shows a block diagram of an end point detection device of an audio signal according to an embodiment of the application.
Referring to fig. 2, an endpoint detection apparatus for an audio signal according to an embodiment of the present application includes:
the first determining module is used for determining short-time energy and zero crossing rate corresponding to the initial voice signal according to the received initial voice signal;
the second determining module is used for determining an initial voice starting point and an initial voice ending point corresponding to the initial voice signal according to the short-time energy corresponding to the initial voice signal, the zero crossing rate, a preset initial short-time energy threshold value and an initial zero crossing rate threshold value;
The denoising module is used for intercepting a voice fragment between the initial voice starting point and the initial voice ending point in the initial voice signal and denoising the voice fragment to obtain a denoised voice signal;
the first conversion module is used for converting the denoising voice signal into a first waveform image, a first spectrum image and a first envelope image comprising a time domain;
the second conversion module is used for converting the voice fragment between the initial voice starting point and the initial voice ending point in the initial voice signal into a second waveform image, a second spectrum image and a second envelope image comprising a time domain;
and the processing module is used for determining a target voice starting point and a target voice ending point corresponding to the initial voice signal according to the first waveform image, the first spectrum image, the first envelope image, the second waveform image, the second spectrum image and the second envelope image.
In one embodiment of the application, the processing module is configured to: comparing the first waveform image with the second waveform image, the first spectrum image with the second spectrum image and the first envelope image with the second envelope image respectively, and determining the similarity corresponding to each pair of images; and identifying the pair of images with the highest similarity, and determining a target voice starting point and a target voice ending point corresponding to the initial voice signal.
In one embodiment of the present application, before comparing the first waveform image with the second waveform image, the first spectrum image with the second spectrum image, and the first envelope image with the second envelope image, respectively, the processing module is further configured to: and performing graphic transformation processing on the first waveform image, the first spectrum image, the first envelope image, the second waveform image, the second spectrum image and the second envelope image to obtain corresponding binary images, wherein the graphic transformation processing comprises stretching and/or amplifying.
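The binarization and similarity comparison performed by the processing module can be sketched as follows; the pixel-agreement measure is one simple choice, since the text does not fix the similarity metric:

```python
import numpy as np

def binarize(img: np.ndarray, thresh: float = 0.5) -> np.ndarray:
    """Convert a grayscale image (values in [0, 1]) to a binary image."""
    return (img >= thresh).astype(np.uint8)

def binary_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Fraction of pixels on which two equal-sized binary images agree."""
    assert a.shape == b.shape
    return float(np.mean(a == b))
```

Each of the three image pairs (waveform, spectrum, envelope) would be binarized and scored this way, and the pair with the highest score used to fix the target endpoints.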
In one embodiment of the present application, the denoising module is configured to: intercepting a voice fragment between the initial voice starting point and the initial voice ending point in the initial voice signal, and carrying out framing treatment on the voice fragment so as to carry out denoising decoding calculation to obtain an initial denoising voice signal; performing noise estimation on the initial denoising voice signal, and determining a corresponding noise level; and when the noise level meets a preset condition, inputting the initial denoising voice signal into a filter to remove an echo signal, and obtaining a target denoising voice signal.
In one embodiment of the present application, before framing the speech segment for denoising decoding calculation, the denoising module is further configured to: and performing format conversion on the voice fragments to obtain the voice fragments with single channels.
In one embodiment of the present application, the second determining module is configured to: optimizing the initial short-time energy threshold value and the initial zero crossing rate threshold value according to the short-time energy and zero crossing rate corresponding to a preset number of leading voice frames in the initial voice signal, together with the preset initial short-time energy threshold value and initial zero crossing rate threshold value, to obtain a target short-time energy threshold value and a target zero crossing rate threshold value; and determining an initial voice starting point and an initial voice ending point corresponding to the initial voice signal according to the short-time energy, the zero crossing rate, the target short-time energy threshold value and the target zero crossing rate threshold value corresponding to the initial voice signal.
In one embodiment of the present application, before determining the short-time energy and the zero-crossing rate corresponding to the initial speech signal according to the received initial speech signal, the first determining module is further configured to: perform detrending (trend term removal) and filtering processing on the initial voice signal.
Fig. 3 shows a schematic diagram of a computer system suitable for use in implementing an embodiment of the application.
It should be noted that, the computer system of the electronic device shown in fig. 3 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
As shown in fig. 3, the computer system includes a central processing unit (Central Processing Unit, CPU) 301 that can perform various appropriate actions and processes, such as performing the methods described in the above embodiments, according to a program stored in a Read-Only Memory (ROM) 302 or a program loaded from a storage section 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data required for the system operation are also stored. The CPU 301, ROM 302, and RAM 303 are connected to each other through a bus 304. An Input/Output (I/O) interface 305 is also connected to bus 304.
The following components are connected to the I/O interface 305: an input section 306 including a keyboard, a mouse, and the like; an output portion 307 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage section 308 including a hard disk or the like; and a communication section 309 including a network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 309 performs communication processing via a network such as the internet. The drive 310 is also connected to the I/O interface 305 as needed. A removable medium 311 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory is installed on the drive 310 as needed, so that a computer program read therefrom is installed into the storage section 308 as needed.
In particular, according to embodiments of the present application, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program containing program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 309, and/or installed from the removable medium 311. When executed by the central processing unit (CPU) 301, the computer program performs the various functions defined in the system of the present application.
It should be noted that, the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-Only Memory (ROM), an erasable programmable read-Only Memory (Erasable Programmable Read Only Memory, EPROM), flash Memory, an optical fiber, a portable compact disc read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with a computer-readable computer program embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
A computer program embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Where each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by software, or may be implemented by hardware, and the described units may also be provided in a processor. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.
As another aspect, the present application also provides a computer-readable medium that may be contained in the electronic device described in the above embodiment; or may exist alone without being incorporated into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the methods described in the above embodiments.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause a computing device (may be a personal computer, a server, a touch terminal, or a network device, etc.) to perform the method according to the embodiments of the present application.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A method for endpoint detection of an audio signal, comprising:
determining short-time energy and zero crossing rate corresponding to the initial voice signal according to the received initial voice signal;
determining an initial voice starting point and an initial voice ending point corresponding to the initial voice signal according to the short-time energy, the zero crossing rate, a preset initial short-time energy threshold value and an initial zero crossing rate threshold value corresponding to the initial voice signal;
intercepting a voice fragment between the initial voice starting point and the initial voice ending point in the initial voice signal and carrying out noise reduction treatment on the voice fragment to obtain a noise-removed voice signal;
Converting the denoised speech signal into a first waveform image, a first spectral image and a first envelope image comprising a time domain;
converting a voice segment between the initial voice starting point and the initial voice ending point in the initial voice signal into a second waveform image, a second frequency spectrum image and a second envelope image comprising a time domain;
and identifying according to the first waveform image, the first spectrum image, the first envelope image, the second waveform image, the second spectrum image and the second envelope image, and determining a target voice starting point and a target voice ending point corresponding to the initial voice signal.
2. The method of claim 1, wherein determining a target speech start point and a target speech end point for the initial speech signal based on the first waveform image, the first spectral image, the first envelope image, the second waveform image, the second spectral image, and the second envelope image comprises:
comparing the first waveform image with the second waveform image, the first spectrum image with the second spectrum image and the first envelope image with the second envelope image respectively, and determining the similarity corresponding to each pair of images;
And identifying a pair of images with highest similarity, and determining a target voice starting point and a target voice ending point corresponding to the initial voice signal.
3. The method of claim 2, wherein prior to comparing the first and second waveform images, the first and second spectral images, and the first and second envelope images, respectively, to determine the corresponding similarity for each pair of images, the method further comprises:
and performing graphic transformation processing on the first waveform image, the first spectrum image, the first envelope image, the second waveform image, the second spectrum image and the second envelope image to obtain corresponding binary images, wherein the graphic transformation processing comprises stretching and/or amplifying.
4. The method of claim 1, wherein capturing and denoising a speech segment between the initial speech start point and the initial speech end point in the initial speech signal to obtain a denoised speech signal comprises:
intercepting a voice fragment between the initial voice starting point and the initial voice ending point in the initial voice signal, and carrying out framing treatment on the voice fragment so as to carry out denoising decoding calculation to obtain an initial denoising voice signal;
Performing noise estimation on the initial denoising voice signal, and determining a corresponding noise level;
and when the noise level meets a preset condition, inputting the initial denoising voice signal into a filter to remove an echo signal, and obtaining a target denoising voice signal.
5. The method of claim 4, wherein prior to framing the speech segment for denoising decoding calculation, the method further comprises:
and performing format conversion on the voice fragments to obtain the voice fragments with single channels.
6. The method of claim 1, wherein determining the initial speech start point and the initial speech end point corresponding to the initial speech signal according to the short-time energy, the zero-crossing rate, the preset initial short-time energy threshold value, and the initial zero-crossing rate threshold value corresponding to the initial speech signal comprises:
optimizing the initial short-time energy threshold value and the initial zero crossing rate threshold value according to the short-time energy and zero crossing rate corresponding to a preset number of leading voice frames in the initial voice signal, together with the preset initial short-time energy threshold value and initial zero crossing rate threshold value, to obtain a target short-time energy threshold value and a target zero crossing rate threshold value;
And determining an initial voice starting point and an initial voice ending point corresponding to the initial voice signal according to the short-time energy, the zero crossing rate, the target short-time energy threshold value and the target zero crossing rate threshold value corresponding to the initial voice signal.
7. The method of claim 1, wherein prior to determining the short-time energy and zero-crossing rate corresponding to the initial speech signal from the received initial speech signal, the method further comprises:
and performing detrending (trend term removal) and filtering processing on the initial voice signal.
8. An end point detection device for an audio signal, comprising:
the first determining module is used for determining short-time energy and zero crossing rate corresponding to the initial voice signal according to the received initial voice signal;
the second determining module is used for determining an initial voice starting point and an initial voice ending point corresponding to the initial voice signal according to the short-time energy corresponding to the initial voice signal, the zero crossing rate, a preset initial short-time energy threshold value and an initial zero crossing rate threshold value;
the denoising module is used for intercepting a voice fragment between the initial voice starting point and the initial voice ending point in the initial voice signal and denoising the voice fragment to obtain a denoised voice signal;
The first conversion module is used for converting the denoising voice signal into a first waveform image, a first spectrum image and a first envelope image comprising a time domain;
the second conversion module is used for converting the voice fragment between the initial voice starting point and the initial voice ending point in the initial voice signal into a second waveform image, a second spectrum image and a second envelope image comprising a time domain;
and the processing module is used for determining a target voice starting point and a target voice ending point corresponding to the initial voice signal according to the first waveform image, the first spectrum image, the first envelope image, the second waveform image, the second spectrum image and the second envelope image.
9. A computer readable medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the method of endpoint detection of an audio signal according to any of claims 1 to 7.
10. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which when executed by the one or more processors cause the one or more processors to implement the endpoint detection method of an audio signal as claimed in any one of claims 1 to 7.
CN202310726709.9A 2023-06-16 2023-06-16 Audio signal endpoint detection method, device, medium and electronic equipment Pending CN117037853A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310726709.9A CN117037853A (en) 2023-06-16 2023-06-16 Audio signal endpoint detection method, device, medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310726709.9A CN117037853A (en) 2023-06-16 2023-06-16 Audio signal endpoint detection method, device, medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN117037853A true CN117037853A (en) 2023-11-10

Family

ID=88625111

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310726709.9A Pending CN117037853A (en) 2023-06-16 2023-06-16 Audio signal endpoint detection method, device, medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN117037853A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination