CN116364107A

CN116364107A - Voice signal detection method, device, equipment and storage medium

Info

Publication number: CN116364107A
Application number: CN202310271742.7A
Authority: CN
Inventors: 张林山
Original assignee: Zhuhai Eeasy Electronic Tech Co ltd
Current assignee: Zhuhai Eeasy Electronic Tech Co ltd
Priority date: 2023-03-20
Filing date: 2023-03-20
Publication date: 2023-06-30

Abstract

The invention is applicable to the technical field of computers and provides a voice signal detection method, device, equipment and storage medium. The method comprises: performing filtering processing and a frequency division operation on an audio frame to be detected to obtain the frequency spectrum of the audio frame within a preset frequency range, and obtaining the frequency band energy and frequency band energy characteristics of each frequency band in the spectrum; judging whether the audio frame to be detected is within a preset frame number range, and determining the audio frame to be a non-voice signal when it is within the frame number range; and, when the audio frame is not within the frame number range, calculating the total energy of the audio frame from the frequency band energy of each frequency band, obtaining a frequency band likelihood ratio for each frequency band according to the frequency band energy characteristics, a pre-established voice signal model and a pre-established non-voice signal model, and judging whether the audio frame is a voice signal according to the total energy and the frequency band likelihood ratios, thereby improving the accuracy of voice detection.

Description

Voice signal detection method, device, equipment and storage medium

Technical Field

The present invention belongs to the field of computer technology, and in particular, relates to a method, an apparatus, a device, and a storage medium for detecting a voice signal.

Background

With the rapid development of artificial intelligence, the importance of speech recognition technology has become increasingly apparent in practice. Speech signal detection is one of the key technologies in the speech processing stage of a speech recognition system, and its accuracy directly determines whether the entire system can be truly intelligent. Existing speech signal detection typically uses Voice Activity Detection (VAD) to identify and eliminate long periods of silence in audio, conserving bandwidth without degrading speech quality. However, a VAD algorithm based on zero-crossing rate and short-time energy cannot accurately detect speech in audio recorded in a noisy environment. Detection methods based on deep neural networks can detect speech accurately in high-noise environments, but their algorithmic complexity, computational load and power consumption are all high.

Disclosure of Invention

The invention aims to provide a voice signal detection method, device, equipment and storage medium, so as to solve the problem that voice signal detection is inaccurate because the prior art cannot provide an effective voice signal detection method.

In one aspect, the present invention provides a method for detecting a voice signal, the method comprising the steps of:

performing filtering processing and frequency division operation on an audio frame to be detected to obtain a frequency spectrum of the audio frame to be detected in a preset frequency range, and obtaining frequency band energy and frequency band energy characteristics of each frequency band in the frequency spectrum;

judging whether the audio frame to be detected is within a preset frame number range, and determining the audio frame to be detected as a non-voice signal when the audio frame to be detected is within the frame number range;

when the audio frame to be detected is not in the frame number range, calculating total energy of the audio frame to be detected according to the frequency band energy of each frequency band, acquiring frequency band likelihood ratios of each frequency band according to the frequency band energy characteristics, a pre-established voice signal model and the non-voice signal model, and judging whether the audio frame to be detected is a voice signal according to the total energy and the frequency band likelihood ratios.

In another aspect, the present invention provides a voice signal detection apparatus, the apparatus comprising:

the frequency spectrum acquisition unit is used for carrying out filtering processing and frequency division operation on the audio frame to be detected to obtain the frequency spectrum of the audio frame to be detected in a preset frequency range, and acquiring the frequency band energy and the frequency band energy characteristics of each frequency band in the frequency spectrum;

a range judging unit, configured to judge whether the audio frame to be detected is within a preset frame number range, and when the audio frame to be detected is within the frame number range, determine the audio frame to be detected as a non-speech signal;

and the signal detection unit is used for calculating the total energy of the audio frame to be detected according to the frequency band energy of each frequency band when the audio frame to be detected is not in the frame number range, acquiring the frequency band likelihood ratio of each frequency band according to the frequency band energy characteristics, a pre-established voice signal model and the non-voice signal model, and judging whether the audio frame to be detected is a voice signal according to the total energy and the frequency band likelihood ratio.

In another aspect, the present invention also provides a speech signal detection device comprising a memory, a processor and a computer program stored in said memory and executable on said processor, characterized in that the steps of the method as described above are implemented when said computer program is executed by said processor.

In another aspect, the invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method as described above.

In the invention, filtering processing and a frequency division operation are performed on an audio frame to be detected to obtain the frequency spectrum of the audio frame within a preset frequency range, and the frequency band energy and frequency band energy characteristics of each frequency band in the spectrum are obtained. Whether the audio frame is within a preset frame number range is then judged, and the audio frame is determined to be a non-voice signal when it is within that range. When the audio frame is not within the frame number range, its total energy is calculated from the frequency band energy of each frequency band, a frequency band likelihood ratio is obtained for each frequency band according to the frequency band energy characteristics, a pre-established voice signal model and a pre-established non-voice signal model, and whether the audio frame is a voice signal is judged according to the total energy and the frequency band likelihood ratios, so that the accuracy of voice detection is improved.

Drawings

Fig. 1 is a flowchart of a voice signal detection method according to a first embodiment of the present invention;

Fig. 2 is a flowchart of a voice signal detection method according to a second embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a voice signal detection device according to a third embodiment of the present invention;

Fig. 4 is a schematic structural diagram of voice signal detection equipment according to a fourth embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

The following describes in detail the implementation of the present invention in connection with specific embodiments:

Embodiment one:

Fig. 1 shows the implementation flow of the voice signal detection method according to the first embodiment of the present invention. For convenience of explanation, only the portions relevant to the embodiment of the present invention are shown, which are described in detail below:

In step S101, filtering processing and a frequency division operation are performed on the audio frame to be detected to obtain a frequency spectrum of the audio frame within a preset frequency range, and the frequency band energy and frequency band energy characteristics of each frequency band in the frequency spectrum are obtained.

The embodiment of the invention is applicable to voice signal detection equipment, which detects voice signals in audio by examining each audio frame of that audio. The voice signal detection equipment may be a personal computer or a server. After receiving an audio frame to be detected, the equipment first filters the frame to obtain a signal within a preset frequency range. As an example, components below 80 Hz and above 4000 Hz may be filtered out, so that the filtered signal contains neither direct-current and alternating-current interference below 80 Hz nor high-frequency components above 4000 Hz. A frequency division operation is then performed on the filtered audio frame to obtain its frequency spectrum. Finally, the spectral characteristics of the frequency spectrum and the frequency band energy characteristics of each frequency band in the spectrum are obtained. The spectral characteristics include the spectral energy distribution of the audio frame, that is, the energy of each frequency band, where the frequency band energy is the total energy of one section of the spectrum; the frequency band energy characteristics include the energy at each frequency within the band.
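As a minimal sketch of step S101, the following Python computes per-band energies for one frame. It is not the patented implementation: the band edges in `BAND_EDGES_HZ`, the 8 kHz sampling rate and the function name `band_energies` are assumptions made for illustration, and the filtering step is approximated by simply discarding spectral bins outside the 80 Hz to 4000 Hz range rather than by a separate time-domain filter.

```python
import cmath
import math

def band_energies(frame, sample_rate, band_edges_hz):
    """Sum DFT power into consecutive [low, high) bands for one audio frame."""
    n = len(frame)
    energies = [0.0] * (len(band_edges_hz) - 1)
    for k in range(n // 2 + 1):  # non-negative frequency bins only
        freq = k * sample_rate / n
        # Skipping bins outside the outer edges stands in for the band-limiting filter.
        if not band_edges_hz[0] <= freq < band_edges_hz[-1]:
            continue
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        power = abs(coeff) ** 2 / n
        for b in range(len(energies)):
            if band_edges_hz[b] <= freq < band_edges_hz[b + 1]:
                energies[b] += power
                break
    return energies

BAND_EDGES_HZ = [80, 250, 500, 1000, 2000, 3000, 4000]  # hypothetical band layout
SAMPLE_RATE = 8000  # assumed sampling rate

# 20 ms test frame: a 600 Hz tone plus a DC offset that the 80 Hz lower edge rejects.
frame = [0.5 + math.sin(2 * math.pi * 600 * i / SAMPLE_RATE) for i in range(160)]
energies = band_energies(frame, SAMPLE_RATE, BAND_EDGES_HZ)
strongest_band = max(range(len(energies)), key=energies.__getitem__)
```

In this test frame the tone's energy lands in the 500 Hz to 1000 Hz band (index 2), while the DC offset contributes nothing because bin 0 lies below the 80 Hz edge. A production system would normally use an FFT library instead of the naive DFT shown here.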

In step S102, it is determined whether the audio frame to be detected is within a preset frame number range, and when the audio frame to be detected is within the frame number range, the audio frame to be detected is determined as a non-speech signal.

In the embodiment of the invention, the beginning of recorded audio is assumed to contain only environmental noise. Therefore, to decide whether the audio frame to be detected is a non-voice signal, it is first judged whether the frame is within the preset frame number range; when it is, the frame is directly determined to be a non-voice signal, which simplifies the judgment of non-voice signals. Specifically, the sequence number of the audio frame to be detected, that is, its position among all audio frames of the input audio, may be obtained, and whether the frame is within the preset frame number range is then determined from this sequence number. When the frame is within the frame number range, it is determined to be a non-voice signal. For example, if the audio frame to be detected is the 5th frame of the audio to be detected and the preset frame number is 8, the frame is within the preset frame number range and is determined to be a non-voice signal.
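The frame number check can be sketched in a few lines of Python. The function name and the 1-based, inclusive interpretation of the range are assumptions; the text only gives the example of frame 5 with a preset frame number of 8.

```python
def in_initial_noise_range(frame_index, preset_frame_count):
    """True when the 1-based frame index is among the first preset_frame_count frames."""
    return 1 <= frame_index <= preset_frame_count

# Example from the text: the 5th frame with a preset frame number of 8.
fifth_is_noise = in_initial_noise_range(5, 8)
# A later frame falls outside the range and goes on to the model-based judgment.
twentieth_is_noise = in_initial_noise_range(20, 8)
```

Frames for which the check returns true are labeled non-voice without any further computation, which is why this test is placed before the likelihood-ratio stage.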

In a preferred embodiment, when the audio frame to be detected is a non-voice signal, the pre-established non-voice signal model is updated according to the spectral characteristics of the current audio frame. The non-voice signal model, which is used to extract the energy characteristics of non-voice audio frames, is thus updated in real time from the audio currently being detected, which improves its effectiveness in subsequent use.

In step S103, when the audio frame to be detected is not within the frame number range, the total energy of the audio frame to be detected is calculated according to the frequency band energy of each frequency band, the frequency band likelihood ratio of each frequency band is obtained according to the frequency band energy characteristics, the pre-established speech signal model and the non-speech signal model, and whether the audio frame to be detected is a speech signal is judged according to the total energy and the frequency band likelihood ratio.

In the embodiment of the invention, the voice signal model is used to extract the energy characteristics of voice audio frames. When the audio frame to be detected is not within the preset frame number range, the total energy of the current audio frame is calculated from the frequency band energy of each frequency band, and a frequency band likelihood ratio is obtained for each frequency band according to the frequency band energy characteristics of that band, the pre-established voice signal model and the pre-established non-voice signal model. The frequency band likelihood ratio is the ratio of the voice component to the non-voice component contained in each frequency band of the current audio frame. Once the total energy of the audio frame and the frequency band likelihood ratio of each band have been obtained, whether the current audio frame is a voice signal is judged according to the total energy and the frequency band likelihood ratios.

In the embodiment of the invention, after the audio to be detected is received, filtering processing and a frequency division operation are performed on the audio frame to be detected to obtain its frequency spectrum within a preset frequency range, and the frequency band energy and frequency band energy characteristics of each frequency band in the spectrum are obtained. Whether the audio frame is within the preset frame number range is then judged; when it is, the frame is determined to be a non-voice signal. When it is not, the total energy of the frame is calculated from the frequency band energy of each band, a frequency band likelihood ratio is obtained for each band according to the frequency band energy characteristics, the pre-established voice signal model and the pre-established non-voice signal model, and whether the frame is a voice signal is judged according to the total energy and the frequency band likelihood ratios, so that the accuracy of voice detection is improved.

Embodiment two:

Fig. 2 shows the implementation flow of the voice signal detection method according to the second embodiment of the present invention. For convenience of explanation, only the portions relevant to the embodiment of the present invention are shown, which are described in detail below:

In step S201, filtering processing and a frequency division operation are performed on the audio frame to be detected to obtain a frequency spectrum of the audio frame within a preset frequency range, and the frequency band energy and frequency band energy characteristics of each frequency band in the frequency spectrum are obtained.

In step S202, it is determined whether the audio frame to be detected is within a preset frame number range, and when the audio frame to be detected is within the frame number range, the audio frame to be detected is determined as a non-speech signal.

In the embodiment of the present invention, steps S201 to S202 are the same as the implementation of steps S101 to S102 in the first embodiment, and reference may be made to the corresponding description of the first embodiment, which is not repeated herein.

In step S203, when the audio frame to be detected is not within the frame number range, the total energy of the audio frame to be detected is calculated according to the frequency band energy of each frequency band, the frequency band likelihood ratio of each frequency band is obtained according to the frequency band energy characteristics, the pre-established speech signal model and the non-speech signal model, and whether the audio frame to be detected is a speech signal is determined according to the total energy and the frequency band likelihood ratio.

In the embodiment of the invention, when the audio frame to be detected is not within the preset frame number range, the total energy of the current audio frame is calculated from the frequency band energy of each frequency band, and a frequency band likelihood ratio is then obtained for each frequency band according to the frequency band energy characteristics of that band, the pre-established voice signal model and the pre-established non-voice signal model. Once the total energy and the frequency band likelihood ratios have been obtained, whether the current audio frame is a voice signal is judged according to them, so that the voice signal model and the non-voice signal model together enable accurate voice detection for audio frames in a strong-noise environment. Specifically, to calculate the total energy of the audio frame from the frequency band energies, the frequency band energies of all bands may be added together, and the sum is taken as the total energy of the audio frame to be detected.

In a specific embodiment, the likelihood ratio of each frequency band is obtained from the frequency band energy characteristics, the pre-established voice signal model and the pre-established non-voice signal model as follows. The voice energy characteristics of voice audio frames are obtained in advance through the voice signal model, and the non-voice energy characteristics of non-voice audio frames are obtained in advance through the non-voice signal model. For each frequency band of the audio frame to be detected, the similarity between the frequency band energy characteristics and the voice energy characteristics is obtained; for convenience of description this is called the voice similarity, and it represents the probability that the current frequency band has voice energy characteristics. The similarity between the frequency band energy characteristics and the non-voice energy characteristics is likewise obtained; this is called the non-voice similarity, and it represents the probability that the current frequency band has non-voice energy characteristics. The ratio of the voice similarity to the non-voice similarity is taken as the likelihood ratio of the current frequency band.

Preferably, the voice signal model and the non-voice signal model are Gaussian mixture models, so that the voice signal and the non-voice signal are accurately quantified and their relevant characteristics are accurately obtained.
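The likelihood-ratio computation described above can be sketched as follows, under stated assumptions: each band is described by a single scalar feature (for instance a log band energy), each model is a one-dimensional Gaussian mixture given as a list of `(weight, mean, variance)` tuples, and the small `floor` that guards against division by zero is an addition not found in the text. All names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def gmm_likelihood(x, components):
    """Likelihood of x under a mixture given as (weight, mean, variance) tuples."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in components)

def band_likelihood_ratios(features, speech_models, noise_models, floor=1e-12):
    """Voice similarity divided by non-voice similarity, one ratio per band."""
    ratios = []
    for x, speech, noise in zip(features, speech_models, noise_models):
        voice_similarity = gmm_likelihood(x, speech)
        non_voice_similarity = gmm_likelihood(x, noise)
        ratios.append(voice_similarity / max(non_voice_similarity, floor))
    return ratios

# Two bands with illustrative single-component models (values are made up).
speech_models = [[(1.0, 8.0, 1.0)], [(1.0, 8.0, 1.0)]]
noise_models = [[(1.0, 2.0, 1.0)], [(1.0, 2.0, 1.0)]]
ratios = band_likelihood_ratios([8.0, 2.0], speech_models, noise_models)
```

With these illustrative parameters the first band, whose feature sits at the speech mean, yields a ratio well above 1, and the second band, whose feature sits at the noise mean, yields a ratio well below 1. Practical implementations often work with log-likelihood ratios instead, to avoid very large or very small values.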

Preferably, when judging whether the audio frame to be detected is a voice signal according to the total energy and the frequency band likelihood ratios, it is first judged whether the total energy of the audio frame is greater than a preset voice energy threshold. If the total energy is greater than the voice energy threshold, the current audio frame is determined to be a voice signal. If the total energy is not greater than the voice energy threshold, the frequency band likelihood ratio of each frequency band is compared with a preset likelihood ratio threshold: if the frequency band likelihood ratio is greater than the likelihood ratio threshold, the current audio frame is determined to be a voice signal; otherwise, it is determined to be a non-voice signal. The audio frame is thus judged at multiple levels, by energy and by the likelihood ratio of each band, which improves the accuracy and success rate of the judgment.

In a specific embodiment, when judging whether the frequency band likelihood ratio is greater than the preset likelihood ratio threshold, the frequency band likelihood ratio of each frequency band is compared with the likelihood ratio threshold separately, and if the frequency band likelihood ratio of at least one frequency band is greater than the threshold, the audio frame to be detected is determined to be a voice signal, thereby improving the accuracy of voice signal detection.
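The two-level judgment above can be sketched as a short Python function. The function and parameter names are hypothetical and the threshold values in the examples are arbitrary; the text does not give concrete thresholds.

```python
def is_voice_frame(band_energy, band_ratios, energy_threshold, ratio_threshold):
    """Two-level judgment: total energy first, then per-band likelihood ratios."""
    total_energy = sum(band_energy)  # total energy is the sum of the band energies
    if total_energy > energy_threshold:
        return True
    # Otherwise one band exceeding the likelihood ratio threshold is enough.
    return any(ratio > ratio_threshold for ratio in band_ratios)

# Loud frame: accepted on energy alone.
loud = is_voice_frame([30.0, 25.0], [0.1, 0.2], energy_threshold=50.0, ratio_threshold=2.0)
# Quiet frame with one voice-like band: accepted on the likelihood ratio.
quiet_voice = is_voice_frame([1.0, 2.0], [0.5, 3.0], energy_threshold=50.0, ratio_threshold=2.0)
# Quiet frame with no voice-like band: rejected.
quiet_noise = is_voice_frame([1.0, 2.0], [0.5, 0.9], energy_threshold=50.0, ratio_threshold=2.0)
```

Checking the cheap energy condition first means the per-band comparison only decides the outcome for low-energy frames, where the energy test alone would miss quiet speech.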

In step S204, when the audio frame to be detected is a speech signal, the speech signal model is updated according to the spectral characteristics of the audio frame to be detected.

In the embodiment of the invention, when the current audio frame to be detected is determined to be a voice signal after the multi-level judgment, the voice signal model is updated according to the spectral characteristics of the audio frame, so that the parameters of the voice signal model are updated in time and the effectiveness of the model is ensured.

In step S205, when the audio frame to be detected is a non-speech signal, the non-speech signal model is updated according to the spectral characteristics of the audio frame to be detected.

In the embodiment of the invention, when the current audio frame to be detected is determined to be a non-voice signal after the multi-level judgment, the non-voice signal model is updated according to the spectral characteristics of the audio frame, so that the parameters of the non-voice signal model are updated in time and the effectiveness of the model is ensured.
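The text does not specify how the models are updated in steps S204 and S205. Purely as an illustration, the sketch below assumes each band's model is a single Gaussian stored as a `(mean, variance)` tuple and updates it with an exponential moving average; the update rule, the `rate` and `min_var` parameters and all names are assumptions, not the patented method.

```python
def update_gaussian(mean, var, x, rate=0.05, min_var=1e-3):
    """Move one (mean, variance) pair toward a new observation x."""
    new_mean = (1.0 - rate) * mean + rate * x
    new_var = (1.0 - rate) * var + rate * (x - new_mean) ** 2
    return new_mean, max(new_var, min_var)  # keep the variance from collapsing

def update_models(is_voice, features, speech_model, noise_model):
    """Update only the model that matches the decision for this frame."""
    target = speech_model if is_voice else noise_model
    for band, x in enumerate(features):
        target[band] = update_gaussian(*target[band], x)

speech_model = [(0.0, 1.0)]
noise_model = [(0.0, 1.0)]
# A frame judged non-voice updates the non-voice model and leaves the voice model alone.
update_models(False, [10.0], speech_model, noise_model)
```

Updating only the model that matches the decision keeps the two models from drifting toward each other, at the cost that a wrong decision also feeds the wrong model; a small update rate limits that effect.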

In the embodiment of the invention, after the audio to be detected is received, filtering processing and a frequency division operation are performed on the audio frame to be detected to obtain its frequency spectrum within a preset frequency range, and the frequency band energy and frequency band energy characteristics of each frequency band in the spectrum are obtained. Whether the audio frame is within the preset frame number range is then judged. When it is not, the total energy of the frame is calculated from the frequency band energy of each band, a frequency band likelihood ratio is obtained for each band according to the frequency band energy characteristics, the pre-established voice signal model and the pre-established non-voice signal model, and whether the frame is a voice signal is judged according to the total energy and the frequency band likelihood ratios. When the frame is a voice signal, the voice signal model is updated according to the spectral characteristics of the frame; when the frame is a non-voice signal, the non-voice signal model is updated according to the spectral characteristics of the frame. The parameters of both models are thus updated in time, which improves the effectiveness of the models and the accuracy of voice signal detection.

Embodiment three:

Fig. 3 shows the structure of a voice signal detection device according to the third embodiment of the present invention. For convenience of explanation, only the portions relevant to the embodiment of the present invention are shown, including:

the frequency spectrum obtaining unit 31 is configured to perform filtering processing and frequency division operation on an audio frame to be detected, obtain a frequency spectrum of the audio frame to be detected within a preset frequency range, and obtain frequency band energy and frequency band energy characteristics of each frequency band in the frequency spectrum;

a range judging unit 32 for judging whether the audio frame to be detected is within a preset frame number range, and determining the audio frame to be detected as a non-speech signal when the audio frame to be detected is within the frame number range;

the signal detection unit 33 is configured to calculate total energy of the audio frame to be detected according to the frequency band energy of each frequency band when the audio frame to be detected is not within the frame number range, obtain a frequency band likelihood ratio of each frequency band according to the frequency band energy characteristics, the pre-established speech signal model and the non-speech signal model, and determine whether the audio frame to be detected is a speech signal according to the total energy and the frequency band likelihood ratio.

In the embodiment of the present invention, each unit of the voice signal detection device may be implemented by corresponding hardware or software. The units may be independent software or hardware units, or may be integrated into a single software or hardware unit; this does not limit the present invention. For the specific implementation of each unit, reference may be made to the description of the foregoing method embodiments, which is not repeated here.

Embodiment four:

Fig. 4 shows the structure of voice signal detection equipment provided in the fourth embodiment of the present invention. For convenience of explanation, only the portions relevant to the embodiment of the present invention are shown.

The voice signal detection equipment 4 of the embodiment of the present invention comprises a processor 40, a memory 41 and a computer program 42 stored in the memory 41 and executable on the processor 40. The processor 40, when executing the computer program 42, implements the steps of the above-described embodiments of the voice signal detection method, such as steps S101 to S103 shown in Fig. 1. Alternatively, the processor 40, when executing the computer program 42, performs the functions of the units in the above-described device embodiment, for example the functions of the units 31 to 33 shown in Fig. 3.

For the steps implemented when the processor 40 of the voice signal detection equipment 4 executes the computer program 42 to carry out the voice signal detection method, reference may be made to the description of the foregoing method embodiments, which is not repeated here.

Embodiment five:

In an embodiment of the present invention, there is provided a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps in the above-described embodiments of the voice signal detection method, for example, steps S101 to S103 shown in Fig. 1. Alternatively, the computer program, when executed by a processor, implements the functions of the units in the above-described device embodiment, such as the functions of the units 31 to 33 shown in Fig. 3.

The computer-readable storage medium of embodiments of the present invention may include any entity, device or recording medium capable of carrying computer program code, such as ROM/RAM, a magnetic disk, an optical disk or flash memory.

The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims

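1. A method for detecting a speech signal, the method comprising the steps of:

performing filtering processing and frequency division operation on an audio frame to be detected to obtain a frequency spectrum of the audio frame to be detected in a preset frequency range, and obtaining frequency band energy and frequency band energy characteristics of each frequency band in the frequency spectrum;

judging whether the audio frame to be detected is within a preset frame number range, and determining the audio frame to be detected as a non-voice signal when the audio frame to be detected is within the frame number range;

when the audio frame to be detected is not in the frame number range, calculating total energy of the audio frame to be detected according to the frequency band energy of each frequency band, acquiring frequency band likelihood ratios of each frequency band according to the frequency band energy characteristics, a pre-established voice signal model and a pre-established non-voice signal model, and judging whether the audio frame to be detected is a voice signal according to the total energy and the frequency band likelihood ratios.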

2. The method of claim 1, further comprising, after the step of determining whether the audio frame to be detected is a speech signal based on the total energy and the band likelihood ratio:

when the audio frame to be detected is a voice signal, updating the voice signal model according to the frequency spectrum characteristics of the audio frame to be detected;

and when the audio frame to be detected is a non-voice signal, updating the non-voice signal model according to the frequency spectrum characteristics of the audio frame to be detected.

3. The method of claim 1, wherein the step of determining whether the audio frame to be detected is a speech signal based on the total energy and a band likelihood ratio comprises:

judging whether the total energy is larger than a preset voice energy threshold value, if so, determining that the audio frame to be detected is a voice signal;

if the total energy is not greater than the voice energy threshold, judging whether the frequency band likelihood ratio is greater than a preset likelihood ratio threshold, and if the frequency band likelihood ratio is greater than the likelihood ratio threshold, determining that the audio frame to be detected is a voice signal;

and if the frequency band likelihood ratio is not greater than the likelihood ratio threshold, determining that the audio frame to be detected is a non-voice signal.

4. The method of claim 3, wherein the step of judging whether the frequency band likelihood ratio is greater than a preset likelihood ratio threshold, and if the frequency band likelihood ratio is greater than the likelihood ratio threshold, determining that the audio frame to be detected is a voice signal comprises:

and respectively comparing the frequency band likelihood ratio of each frequency band with the likelihood ratio threshold, and if the frequency band likelihood ratio of at least one frequency band is larger than the likelihood ratio threshold, determining that the audio frame to be detected is a voice signal.

5. The method of claim 1, wherein the step of calculating the total energy of the audio frame to be detected from the band energies of the bands comprises:

and adding the frequency band energy of each frequency band, and setting the obtained result as the total energy of the audio frame to be detected.

6. The method of claim 1, wherein the step of obtaining likelihood ratios for the frequency bands based on the frequency band energy characteristics, a pre-established speech signal model, and the non-speech signal model comprises:

the method comprises the steps of obtaining voice energy characteristics of a voice signal through the voice signal model, obtaining non-voice energy characteristics of a non-voice signal through the non-voice signal model, obtaining voice similarity between the frequency band energy characteristics and the voice energy characteristics, and obtaining non-voice similarity between the frequency band energy characteristics and the non-voice energy characteristics, wherein the ratio of the voice similarity to the non-voice similarity is the likelihood ratio of a current frequency band.

7. The method of claim 1, wherein the speech signal model and the non-speech signal model are Gaussian mixture models.

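8. A speech signal detection apparatus, the apparatus comprising:

a frequency spectrum acquisition unit, configured to perform filtering processing and frequency division operation on an audio frame to be detected to obtain a frequency spectrum of the audio frame to be detected in a preset frequency range, and to obtain frequency band energy and frequency band energy characteristics of each frequency band in the frequency spectrum;

a range judging unit, configured to judge whether the audio frame to be detected is within a preset frame number range, and when the audio frame to be detected is within the frame number range, determine the audio frame to be detected as a non-speech signal;

and a signal detection unit, configured to calculate the total energy of the audio frame to be detected according to the frequency band energy of each frequency band when the audio frame to be detected is not in the frame number range, acquire the frequency band likelihood ratio of each frequency band according to the frequency band energy characteristics, a pre-established speech signal model and a pre-established non-speech signal model, and judge whether the audio frame to be detected is a speech signal according to the total energy and the frequency band likelihood ratio.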

9. A speech signal detection apparatus comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 7 when the computer program is executed by the processor.

10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 7.