CN116229983A - Method and device for determining snore signals, electronic equipment and storage medium - Google Patents


Info

Publication number: CN116229983A
Authority: CN (China)
Prior art keywords: frame, snore, segment, voice, section
Legal status: Pending
Application number: CN202211591068.2A
Other languages: Chinese (zh)
Inventor: 张虎
Current Assignee: Shenzhen Breo Technology Co Ltd
Original Assignee: Shenzhen Breo Technology Co Ltd
Application filed by Shenzhen Breo Technology Co Ltd
Priority to CN202211591068.2A
Publication of CN116229983A

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/02 — Speaker identification or verification: preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L17/26 — Speaker identification or verification: recognition of special voice characteristics, e.g. for use in lie detectors; recognition of animal voices
    • G10L25/24 — Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/66 — Speech or voice analysis techniques specially adapted for comparison or discrimination, for extracting parameters related to health condition
    • Y02D30/70 — Reducing energy consumption in wireless communication networks

Abstract

The invention provides a method, an apparatus, an electronic device and a storage medium for determining a snore signal, and belongs to the field of signal processing. The method comprises the following steps: acquiring a sound signal, performing voice segment detection on the sound signal, and detecting at least two voice segments before a target voice segment; if the starting frame of the target voice segment is detected in the sound signal, determining within the target voice segment whether a first current frame is a snore segment starting frame; if the snore segment starting frame is determined, determining a snore segment ending frame starting from the frame following the snore segment starting frame; and taking the sound signal between the snore segment starting frame and the snore segment ending frame as the snore signal. With the invention, real-time snore signal detection can be realized on an embedded system.

Description

Method and device for determining snore signals, electronic equipment and storage medium
This application is a divisional of application No. CN 202211118325.0, entitled "Method, apparatus, electronic device and storage medium for determining a snore signal", filed on September 15, 2022.
Technical Field
The present invention relates to the field of signal processing, and in particular, to a method, an apparatus, an electronic device, and a storage medium for determining a snore signal.
Background
Snoring is a loud, nasal sound emitted after falling asleep. It not only troubles the user and disturbs any bed partner, but also threatens the user's own health.
For snore segment detection, there are currently few real-time methods applicable to embedded systems. Most existing approaches are post-processing methods with fixed models and parameters, such as Gaussian mixture models and hidden Markov models, more complex machine learning algorithms such as clustering and support vector machines, or even deep learning algorithms. These algorithms are computationally expensive and have poor real-time performance, and deep learning algorithms in particular require server-grade computing and storage resources, making them difficult to realize on an embedded system.
Therefore, a method for realizing real-time snore signal detection on an embedded system is needed.
Disclosure of Invention
In order to solve the problems in the prior art, embodiments of the present invention provide a method, an apparatus, an electronic device and a storage medium for determining a snore signal. The technical solution is as follows:
according to an aspect of the present invention, there is provided a method of determining a snore signal, the method comprising:
acquiring a sound signal, performing voice segment detection on the sound signal, and detecting at least two voice segments before a target voice segment;
if the starting frame of the target voice segment is detected in the sound signal, determining within the target voice segment whether a first current frame is a snore segment starting frame;
if the snore segment starting frame is determined, determining a snore segment ending frame starting from the frame following the snore segment starting frame;
and taking the sound signal between the snore segment starting frame and the snore segment ending frame as the snore signal.
According to another aspect of the present invention there is provided an apparatus for determining a snore signal, the apparatus comprising:
the sound signal acquisition module is used for acquiring a sound signal, performing voice segment detection on the sound signal, and detecting at least two voice segments before a target voice segment;
the first determining module is used for determining whether the first current frame is a snore segment starting frame or not in the target human voice segment if the starting frame of the target human voice segment is detected in the sound signal;
the second determining module is used for determining a snore segment ending frame from the next frame of the snore segment starting frame if the snore segment starting frame is determined; and taking the sound signal between the snore segment starting frame and the snore segment ending frame as a snore signal.
According to another aspect of the present invention, there is provided an electronic apparatus including:
a processor; and
a memory in which a program is stored,
wherein the program comprises instructions which, when executed by the processor, cause the processor to perform the above-described method of determining a snore signal.
According to another aspect of the present invention, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the above method of determining a snore signal.
In the invention, after the sound signal is collected in real time, voice segment detection can be performed in real time to determine the voice segments in the sound signal; after at least two voice segments are determined, the snore segment starting frame and the snore segment ending frame are detected frame by frame starting from the starting frame of the target voice segment. Compared with machine learning algorithms such as Gaussian mixture models, hidden Markov models, clustering and support vector machines, this processing requires far less computation, so real-time snore signal detection can be realized on an embedded system.
Drawings
Further details, features and advantages of the invention are disclosed in the following description of exemplary embodiments with reference to the following drawings, in which:
FIG. 1 shows a first flowchart of a method for determining a snore signal provided according to an exemplary embodiment of the present invention;
FIG. 2 shows a second flowchart of a method for determining a snore signal provided according to an exemplary embodiment of the present invention;
FIG. 3 shows a flowchart of a voice segment determination method provided according to an exemplary embodiment of the present invention;
FIG. 4 shows a third flowchart of a method for determining a snore signal provided according to an exemplary embodiment of the present invention;
FIG. 5 shows a fourth flowchart of a method for determining a snore signal provided according to an exemplary embodiment of the present invention;
FIG. 6 shows a fifth flowchart of a method for determining a snore signal provided according to an exemplary embodiment of the present invention;
FIG. 7 shows a flowchart of a snore segment start frame determination method provided according to an exemplary embodiment of the present invention;
FIG. 8 shows a schematic block diagram of a snore segment detection apparatus according to an exemplary embodiment of the present invention;
FIG. 9 shows a block diagram of an exemplary electronic device that can be used to implement embodiments of the invention.
Detailed Description
Embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the invention are shown in the drawings, it should be understood that the invention may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the invention. It should be understood that the drawings and embodiments of the invention are for illustration purposes only and are not intended to limit the scope of the present invention.
It should be understood that the various steps recited in the method embodiments of the present invention may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the invention is not limited in this respect.
The term "including" and variations thereof as used herein are open-ended, i.e., "including, but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Related definitions of other terms will be given in the description below. It should be noted that the terms "first," "second," and the like herein are merely used for distinguishing between different devices, modules, or units and not for limiting the order of, or interdependence between, the functions performed by such devices, modules, or units.
It should be noted that references to "a" and "an" in this disclosure are intended to be illustrative rather than limiting; those skilled in the art will appreciate that they should be construed as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the devices in the embodiments of the present invention are for illustrative purposes only and are not intended to limit the scope of such messages or information.
The embodiment of the invention provides a method for determining a snore signal, which can be implemented by devices adopting an embedded system, such as a massager, a smart bracelet, a mobile terminal or a smart speaker. The method provided by the embodiment of the invention can be performed by any one of these devices, or jointly by several devices, which is not limited by the invention.
The method will be described with reference to a flowchart of a method for determining a snore signal shown in fig. 1.
As shown in fig. 1, the method includes the following steps 101-104.
Step 101, acquiring a sound signal, performing voice segment detection on the sound signal, and detecting at least two voice segments before a target voice segment;
Step 102, if the starting frame of the target voice segment is detected in the sound signal, determining within the target voice segment whether a first current frame is a snore segment starting frame;
Step 103, if the snore segment starting frame is determined, determining a snore segment ending frame starting from the frame following the snore segment starting frame;
Step 104, taking the sound signal between the snore segment starting frame and the snore segment ending frame as the snore signal.
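As a minimal sketch, steps 101-104 reduce to a single frame-by-frame scan of a detected voice segment. The two predicates below are hypothetical stand-ins for the scoring logic described later in this document; only the control flow follows the text:

```python
def extract_snore(frames, is_snore_start, is_snore_end):
    """Scan a target voice segment frame by frame (steps 102-104).
    `frames` holds per-frame data; `is_snore_start` / `is_snore_end`
    stand in for the patent's start/end scoring (assumed interfaces)."""
    start = None
    for i in range(len(frames)):
        if start is None:
            if is_snore_start(i):        # step 102: test each frame
                start = i
        elif is_snore_end(i):            # step 103: searched from start + 1
            return frames[start:i + 1]   # step 104: the snore signal
    return None                          # no complete snore found
```

Because every decision uses only the current frame and running statistics, no buffering of the whole recording is needed, which is what makes the method embedded-friendly.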
The first current frame covers two cases. In the first case, the first current frame is the starting frame of the target voice segment; in the second case, the first current frame is an intermediate frame after the starting frame of the target voice segment.
Optionally, for the first case, that is, the first current frame is the start frame of the target voice segment, the process of determining whether the first current frame is the snore segment start frame in the target voice segment may be as follows:
in the target voice section, whether the starting frame of the target voice section is the snore section starting frame is determined based on the similarity of preset characteristics and/or snore characteristics of at least two voice sections before the target voice section and the starting frame of the target voice section.
In some possible embodiments, the similarity score of the start frame of the target voice segment may be calculated based on the similarity of the preset characteristics of the at least two voice segments and the start frame of the target voice segment, and whether the start frame of the target voice segment is the snore segment start frame may be determined based on the similarity score of the start frame of the target voice segment.
In some possible embodiments, the snore characteristic score of the starting frame of the target voice segment may be calculated based on the snore characteristics of at least two voice segments and the starting frame of the target voice segment, and whether the starting frame of the target voice segment is the snore segment starting frame may be determined based on the snore characteristic score of the starting frame of the target voice segment.
In some possible embodiments, the similarity score of the starting frame of the target vocal segment may be calculated based on the similarity of the preset characteristics of the starting frames of the at least two vocal segments and the target vocal segment; calculating a snore characteristic score of a starting frame of the target voice segment based on snore characteristics of at least two voice segments and the starting frame of the target voice segment; based on the similarity score and the snore characteristic score, it is determined whether the starting frame of the target human voice segment is a snore segment starting frame.
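The three variants above differ only in which scores feed the final decision. A sketch of the combined variant follows; the AND rule and both minimum scores are assumptions for illustration, since the patent leaves the exact decision rule open:

```python
def start_frame_is_snore(similarity_score, snore_score,
                         sim_min=3.0, snore_min=4.0):
    """Combine the similarity score and the snore characteristic score
    to decide whether the starting frame of the target voice segment is
    a snore segment starting frame. Thresholds are hypothetical."""
    return similarity_score >= sim_min and snore_score >= snore_min
```

In the single-score variants, one of the two operands simply drops out of the conjunction.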
Optionally, the similarity of preset characteristics may include any one or more of the following: energy similarity, spectrum similarity, mel-cepstral coefficient similarity and noise similarity.
Optionally, the snore characteristics may include any one or more of the following: snore energy characteristics, snore fundamental frequency characteristics and snore zero-crossing rate characteristics.
For energy similarity, the process of calculating a similarity score for a starting frame of a target human voice segment may be as follows: and respectively calculating the energy average value of each voice section in at least two voice sections, calculating the difference ratio of the energy average value between each voice section, and if the difference ratio is in the range of the preset ratio, increasing the corresponding first similarity score in the similarity score of the initial frame of the target voice section.
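The energy-similarity contribution can be sketched as follows. The difference-ratio range and score increment are hypothetical values, as the patent specifies neither:

```python
import numpy as np

def energy_similarity_score(segments, ratio_range=(0.0, 0.5), score=1.0):
    """Energy-similarity contribution to the starting frame's similarity
    score. Each segment is a 1-D array of samples; the pairwise difference
    ratio of the per-segment energy means must fall in `ratio_range`
    (assumed bounds) for the score to be added."""
    means = [float(np.mean(np.square(np.asarray(s, float)))) for s in segments]
    total = 0.0
    for i in range(len(means)):
        for j in range(i + 1, len(means)):
            diff_ratio = abs(means[i] - means[j]) / max(means[i], means[j])
            if ratio_range[0] <= diff_ratio <= ratio_range[1]:
                total += score
    return total
```

Snores from the same sleeper tend to have comparable loudness, which is why similar segment energies raise the score.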
For spectral similarity, the process of calculating a similarity score for a starting frame of a target human voice segment may be as follows: determining the number of first wave peaks meeting the wave peak position coincidence condition in the frequency spectrums of at least two voice segments, and if the number of the first wave peaks is larger than a first number threshold value, increasing a corresponding second similarity score in the similarity score of the initial frame of the target voice segment.
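One way to count "first wave peaks meeting the wave peak position coincidence condition" is to match local spectral maxima across segments within a bin tolerance. The tolerance value is an assumption; the patent does not define the coincidence condition numerically:

```python
def coinciding_peak_count(spectra, tol=2):
    """Count peaks of the first spectrum that coincide (within `tol`
    frequency bins, an assumed tolerance) with a peak in every other
    spectrum. Each spectrum is a sequence of magnitude values."""
    def peaks(spec):
        # interior local maxima only
        return [i for i in range(1, len(spec) - 1)
                if spec[i] > spec[i - 1] and spec[i] > spec[i + 1]]
    ref, rest = peaks(spectra[0]), [peaks(s) for s in spectra[1:]]
    return sum(1 for p in ref
               if all(any(abs(p - q) <= tol for q in ps) for ps in rest))
```

The returned count is then compared against the first number threshold to decide whether the second similarity score is added.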
For mel-frequency coefficient similarity, the process of calculating the similarity score for the starting frame of the target human voice segment may be as follows: and calculating the Mel cepstrum coefficient of each of the at least two voice segments and the Mel cepstrum coefficient of the initial frame of the target voice segment, respectively calculating first cross-correlation coefficients between any two Mel cepstrum coefficients in the initial frames of the at least two voice segments and the target voice segment, and if the first cross-correlation coefficients are both larger than a first preset threshold, adding a corresponding third similarity score in the similarity score of the initial frame of the target voice segment.
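The pairwise cross-correlation test above can be sketched with a Pearson-style normalized correlation; the patent does not fix the exact correlation formula, and the threshold value is an assumption:

```python
import numpy as np

def mfcc_cross_correlation(a, b):
    """Normalized cross-correlation of two MFCC vectors (one common
    choice; the patent leaves the formula open)."""
    a = np.asarray(a, float); a = a - a.mean()
    b = np.asarray(b, float); b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def mfcc_similarity_score(mfccs, threshold=0.8, score=1.0):
    """Third-similarity contribution: every pairwise correlation among
    the MFCC vectors must exceed `threshold` (assumed value)."""
    pairs = [(i, j) for i in range(len(mfccs)) for j in range(i + 1, len(mfccs))]
    if all(mfcc_cross_correlation(mfccs[i], mfccs[j]) > threshold
           for i, j in pairs):
        return score
    return 0.0
```

The same correlation helper also serves the noise-similarity check, where high correlation with a preset noise MFCC subtracts from the score instead.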
For noise similarity, the process of calculating the similarity score for the starting frame of the target human voice segment may be as follows: respectively calculating second cross-correlation coefficients between any one mel cepstrum coefficient in the initial frames of at least two voice segments and the target voice segment and the mel cepstrum coefficient of preset noise, and subtracting a corresponding fourth similarity score from the similarity score of the initial frame of the target voice segment if any one of the second cross-correlation coefficients is larger than a second preset threshold; if the second cross-correlation coefficients are smaller than the third preset threshold value, a corresponding fifth similarity score is added to the similarity score of the initial frame of the target voice section.
Wherein for spectral similarity, the process of calculating a similarity score for a starting frame of the target human voice segment may further comprise:
if the number of the first wave peaks is larger than a preset number threshold, determining the number of the second wave peaks meeting the wave peak position coincidence condition compared with at least two voice segments in the frequency spectrum of the initial frame of the target voice segment, and if the number of the second wave peaks is larger than the second number threshold, increasing a corresponding sixth similarity score in the similarity score of the initial frame of the target voice segment.
For snore energy characteristics, the process of calculating a snore characteristic score for a starting frame of a target segment of human voice may be as follows:
respectively calculating a low-frequency energy mean value and a medium-frequency energy mean value of each of at least two voice segments, and calculating a first ratio of the low-frequency energy mean value to the medium-frequency energy mean value of each voice segment, wherein if the first ratio is larger than a ratio threshold, a corresponding first snore characteristic score is added in a snore characteristic score of a starting frame of a target voice segment;
and calculating a low-frequency energy mean value and a medium-frequency energy mean value of the initial frame of the target voice section, and calculating a second ratio of the low-frequency energy mean value to the medium-frequency energy mean value of the initial frame of the target voice section, wherein if the second ratio is larger than a ratio threshold, a corresponding second snore characteristic score is added in the snore characteristic score of the initial frame of the target voice section.
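The first and second ratios above are both a low-frequency-to-mid-frequency energy ratio computed over a spectrum. A sketch follows; the band edges in Hz are assumptions, since the patent gives no numeric bands:

```python
import numpy as np

def low_mid_energy_ratio(spectrum, freqs,
                         low_band=(20.0, 300.0), mid_band=(300.0, 1000.0)):
    """Mean low-frequency energy over mean mid-frequency energy.
    `spectrum` is a magnitude/power spectrum, `freqs` the matching bin
    frequencies in Hz; the band edges are hypothetical."""
    spectrum = np.asarray(spectrum, float)
    freqs = np.asarray(freqs, float)
    low = spectrum[(freqs >= low_band[0]) & (freqs < low_band[1])].mean()
    mid = spectrum[(freqs >= mid_band[0]) & (freqs < mid_band[1])].mean()
    return float(low / mid)
```

Snore energy is concentrated at low frequencies, so a ratio above the ratio threshold counts toward the snore characteristic score.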
For the snore base frequency characteristic, the process of calculating the snore characteristic score of the starting frame of the target human voice segment may be as follows:
respectively calculating the fundamental frequency of each voice section in at least two voice sections, and calculating the fundamental frequency difference value between each voice section, if the fundamental frequency of each voice section is in the snore fundamental frequency range and the fundamental frequency difference value is in the fundamental frequency difference value range, adding a corresponding third snore characteristic score in the snore characteristic score of the initial frame of the target voice section;
and calculating the fundamental frequency of the initial frame of the target voice section, and if the fundamental frequency of the initial frame of the target voice section is in the snore fundamental frequency range, adding a corresponding fourth snore characteristic score into the snore characteristic score of the initial frame of the target voice section.
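The fundamental frequency of a segment or frame can be estimated by autocorrelation, restricted to a plausible snore range. This is one common pitch-estimation method, not necessarily the one used in the patent, and the range bounds are assumptions:

```python
import numpy as np

def fundamental_frequency(x, fs, fmin=20.0, fmax=500.0):
    """Autocorrelation pitch estimate of signal `x` sampled at `fs` Hz,
    with the candidate lag restricted to [fs/fmax, fs/fmin]."""
    x = np.asarray(x, float) - np.mean(x)
    ac = np.correlate(x, x, mode='full')[len(x) - 1:]   # non-negative lags
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag
```

The estimate is then tested against the snore fundamental frequency range, and segment-to-segment differences against the fundamental frequency difference range.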
For the snore zero-crossing rate characteristic, the process of calculating the snore characteristic score of the starting frame of the target human voice segment may be as follows:
respectively calculating the average zero-crossing rate of each voice section in at least two voice sections, and if the average zero-crossing rate of each voice section is within the range of the zero-crossing rate of the snore, adding a corresponding fifth snore characteristic score into the snore characteristic score of the initial frame of the target voice section;
and calculating the zero crossing rate of the starting frame of the target voice section, and if the zero crossing rate of the starting frame of the target voice section is in the snore zero crossing rate range, adding a corresponding sixth snore characteristic score into the snore characteristic score of the starting frame of the target voice section.
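The zero-crossing rate used in both checks can be computed as the fraction of adjacent sample pairs whose signs differ:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Average zero-crossing rate of one frame: fraction of adjacent
    sample pairs that change sign."""
    signs = np.signbit(np.asarray(frame, float)).astype(int)
    return float(np.mean(np.abs(np.diff(signs))))
```

Snores, being low-pitched and quasi-periodic, sit in a characteristic zero-crossing-rate band, which is what the snore zero-crossing-rate range encodes.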
Wherein the ratio threshold used may be updated as follows:
when the initial frame of the target voice section is determined to be the initial frame of the snore section, updating a ratio threshold based on the first ratio and the second ratio; and/or
If the middle frame of the target voice section is determined to be the snore section starting frame after the starting frame of the target voice section is determined to be not the snore section starting frame, the ratio threshold is updated based on the average value ratio of the low-frequency energy average value and the medium-frequency energy average value of the spectrum average value between the starting frame of the target voice section and the snore section starting frame.
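The patent only states that the ratio threshold is updated from the observed ratios; one plausible realization is an exponential blend toward a fraction of the newly observed values. The blend weight `alpha` and safety factor `margin` are assumptions:

```python
def update_ratio_threshold(old_threshold, observed_ratios,
                           alpha=0.9, margin=0.8):
    """Adapt the low/mid energy ratio threshold toward the ratios
    observed on confirmed snore frames. The exponential blend and the
    factors alpha/margin are hypothetical, not from the patent."""
    observed = sum(observed_ratios) / len(observed_ratios)
    return alpha * old_threshold + (1.0 - alpha) * margin * observed
```

Updating the threshold per sleeper lets the detector track individual differences in spectral balance over the night.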
Optionally, for the second case, that is, any intermediate frame after the first current frame is the starting frame of the target voice segment, the process of determining whether the first current frame is the starting frame of the snore segment in the target voice segment may be as follows:
if the middle frame of the target voice section is determined to meet any one or more of the following snore starting conditions in the target voice section, the middle frame of the target voice section is determined to be the snore starting frame.
The snore starting conditions include: snore duration conditions, mel cepstrum coefficient conditions, non-noise conditions, zero crossing rate conditions, spectral conditions, fundamental frequency conditions.
Determining that the intermediate frame of the target voice segment meets the snore duration condition means: determining that the number of frames between the starting frame of the target voice segment and the first current frame is within the snore duration range, and/or determining that the duration between the starting frame of the target voice segment and the ending frame of the previous voice segment is greater than the snore interval duration.
Determining that the intermediate frame of the target human voice segment meets the mel-frequency cepstrum coefficient condition means that: and calculating a first mel-frequency coefficient mean value between the initial frame of the target voice section and the intermediate frame of the target voice section, calculating a first cross-correlation coefficient of the first mel-frequency coefficient mean value and the mel-frequency coefficient reference value, and determining that the first cross-correlation coefficient is larger than a first preset threshold value.
Determining that the intermediate frame of the target human voice segment meets the non-noise condition means that: and calculating a second cross-correlation coefficient of the first mel-frequency cepstrum coefficient mean value and the mel-frequency cepstrum coefficient of the preset noise, and determining that the second cross-correlation coefficient is smaller than a second preset threshold.
Determining that the intermediate frame of the target voice segment meets the zero crossing rate condition means that: and calculating the average zero crossing rate between the starting frame of the target voice section and the intermediate frame of the target voice section, and determining that the average zero crossing rate is within a preset zero crossing rate range.
Determining that the intermediate frame of the target human voice segment meets the spectrum condition means that: and calculating a frequency spectrum average value between the initial frame of the target voice section and the middle frame of the target voice section, calculating a low-frequency energy average value and a medium-frequency energy average value on the frequency spectrum average value, calculating a mean value ratio of the low-frequency energy average value and the medium-frequency energy average value, and determining that the mean value ratio is larger than a ratio threshold.
Determining that the intermediate frame of the target voice segment meets the fundamental frequency condition means: determining a fundamental frequency value from the spectrum mean, and determining that this fundamental frequency value is within the snore fundamental frequency range.
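Gathering the six conditions into one check, and following the text's "any one or more" rule, gives the sketch below. Both dicts use hypothetical keys, and all threshold values passed in would come from the (partly adaptive) thresholds described above:

```python
def intermediate_frame_is_snore_start(stats, th):
    """Test an intermediate frame against the six snore-start conditions.
    `stats` holds per-frame measurements, `th` the thresholds; both use
    assumed key names. Per the text, any one satisfied condition suffices."""
    conditions = [
        th['dur_lo'] <= stats['n_frames'] <= th['dur_hi'],   # duration
        stats['mfcc_corr'] > th['mfcc_min'],                 # mel cepstrum
        stats['noise_corr'] < th['noise_max'],               # non-noise
        th['zcr_lo'] <= stats['zcr'] <= th['zcr_hi'],        # zero-crossing
        stats['lf_mf_ratio'] > th['ratio_min'],              # spectrum
        th['f0_lo'] <= stats['f0'] <= th['f0_hi'],           # fundamental
    ]
    return any(conditions)
```

A stricter deployment might require several conditions at once; the text as written asks only for at least one.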
Wherein the ratio threshold used may be updated as follows:
when the initial frame of the target voice section is determined to be the snore section initial frame, respectively calculating the low-frequency energy mean value and the medium-frequency energy mean value of each voice section in at least two voice sections before the target voice section, and calculating a first ratio of the low-frequency energy mean value to the medium-frequency energy mean value of each voice section; calculating a low-frequency energy mean value and an intermediate-frequency energy mean value of a starting frame of the target voice section, and calculating a second ratio of the low-frequency energy mean value to the intermediate-frequency energy mean value of the starting frame of the target voice section; updating the ratio threshold based on the first ratio and the second ratio; and/or
And when the middle frame of the target voice section is determined to be the snore section starting frame, updating the ratio threshold based on the average value ratio of the low-frequency energy average value and the medium-frequency energy average value of the spectrum average value between the starting frame of the target voice section and the snore section starting frame.
The mel-frequency coefficient reference value used therein may be updated as follows:
When the initial frame of the target voice section is determined to be the snore section initial frame, updating a mel cepstrum coefficient reference value based on the mel cepstrum coefficient of each voice section in at least two voice sections before the target voice section and the mel cepstrum coefficient of the initial frame of the target voice section; and/or
And updating the mel-frequency coefficient reference value based on the mel-frequency coefficient mean value between the starting frame of the target voice section and the snore section starting frame when the intermediate frame of the target voice section is determined to be the snore section starting frame.
Optionally, a precondition for determining whether the intermediate frame of the target voice segment is the snore segment starting frame may be that at least one snore signal has already been detected in the at least two voice segments.
Alternatively, the processing of step 103 may be as follows:
starting from the next frame of the snore segment starting frame, the second current frame is subjected to the following detection processing:
determining whether the second current frame is an end frame of the target voice segment;
if the second current frame is the ending frame of the target voice section, taking the second current frame as the ending frame of the snore section;
if the second current frame is not the ending frame of the target voice section, determining whether the second current frame is the ending frame of the snore section, and if so, taking the second current frame as the ending frame of the snore section; if not, the detection processing is carried out on the next frame of the second current frame.
Alternatively, the process of determining whether the second current frame is a snore segment end frame may be as follows:
and determining whether the second current frame is a snore segment ending frame based on the mel-cepstrum coefficient reference value and the mel-cepstrum coefficient of the second current frame.
The specific steps can be as follows: calculating a second mel-frequency cepstrum coefficient mean value between the snore segment starting frame and the second current frame, and calculating a third cross-correlation coefficient of the second mel-frequency cepstrum coefficient mean value and the mel-frequency cepstrum coefficient reference value; if the third cross-correlation coefficient is smaller than the third preset threshold value, the second current frame is determined to be the snore segment ending frame.
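The end-frame test above reduces to one correlation against the reference. A sketch, reusing a Pearson-style normalized correlation (the formula and the threshold value 0.6 are assumptions; the patent only names a "third preset threshold"):

```python
import numpy as np

def is_snore_end_frame(mfcc_running_mean, mfcc_reference, third_threshold=0.6):
    """The snore segment ends when the correlation between the running
    MFCC mean (start frame through second current frame) and the MFCC
    reference value drops below the third preset threshold."""
    a = np.asarray(mfcc_running_mean, float)
    b = np.asarray(mfcc_reference, float)
    a, b = a - a.mean(), b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    corr = float(a @ b / denom) if denom else 0.0
    return corr < third_threshold
```

Using the running mean rather than a single frame makes the end decision robust to one-frame glitches inside the snore.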
The mel-frequency cepstral coefficient reference value used therein may be updated as follows:
when the start frame of the target voice segment is determined to be the snore segment start frame, updating the reference value based on the mel-frequency cepstral coefficient of each of the at least two voice segments preceding the target voice segment and that of the start frame of the target voice segment; and/or
when the intermediate frame of the target voice segment is determined to be the snore segment start frame, updating the reference value based on the mean mel-frequency cepstral coefficient between the start frame of the target voice segment and the snore segment start frame.
In the embodiment of the invention, voice segment detection can be performed in real time as the sound signal is collected, determining the voice segments in the sound signal; once at least two voice segments have been determined, the snore segment start frame and end frame are detected frame by frame from the start frame of the target voice segment. Compared with machine learning algorithms such as Gaussian mixture models, hidden Markov models, clustering models and support vector machines, this processing requires far less computation, so real-time snore signal detection can be realized on an embedded system.
The method for determining a snore signal provided in this embodiment will be described with reference to the flowchart shown in fig. 2. In this embodiment, the similarity of preset characteristics is mainly used to determine whether the start frame of the target voice segment is the snore segment start frame.
As shown in fig. 2, the method for determining a snore signal provided in this embodiment includes the following steps 201-204.
Step 201, acquiring a sound signal, performing voice segment detection on the sound signal, and detecting at least two voice segments before a target voice segment.
In one possible embodiment, an audio acquisition device provided on the device collects sound signals in real time. The collected sound signal is subjected to framing, pre-emphasis, impulse-noise removal, mean removal and other processing, after which voice segment detection is performed on each frame to determine the voice segments in the sound signal.
Specifically, referring to the flowchart of the human voice segment determining method shown in fig. 3, it may be determined whether each frame of the voice signal is a start frame with reference to the following steps 301 to 303, and an end frame is determined based on the following step 304.
In step 301, the current sound signal frame and the previous preset number of sound signal frames are used as the first target sound frame segment to be detected.
In one possible implementation, after determining the first target sound frame segment, the frame energy, zero-crossing rate, spectral entropy and transformed smoothed spectral entropy of each sound signal frame may be calculated. The transformed smoothed spectral entropy is obtained by computing the spectral entropy of the power spectrum of the sound signal frame and then smoothing and transforming it. A specific calculation may be: compute the direct or segmented spectral entropy of the power spectrum of the sound signal frame, smooth it, negate the smoothed entropy, and add a shift parameter to obtain the transformed smoothed spectral entropy of the frame.
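The transformed smoothed spectral entropy described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the patent's exact formula: the exponential smoothing, the shift value and all function names are assumptions.

```python
import numpy as np

def spectral_entropy(power_spectrum):
    """Shannon entropy (in bits) of a normalized power spectrum."""
    p = power_spectrum / (np.sum(power_spectrum) + 1e-12)
    return float(-np.sum(p * np.log2(p + 1e-12)))

def transformed_smoothed_entropy(prev_smoothed, new_entropy, alpha=0.9, shift=10.0):
    """One step of the transform: smooth the running entropy (here with a
    simple exponential average -- an assumption), negate it, and add a
    shift parameter so that voiced frames, which have low entropy, score high."""
    smoothed = alpha * prev_smoothed + (1 - alpha) * new_entropy
    return smoothed, -smoothed + shift
```

A flat 8-bin spectrum yields the maximum entropy log2(8) = 3 bits, which the transform maps to a low "voice-likeness" value.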
Step 302, determining whether the current sound signal frame is a human sound frame based on the preset noise, the frame energy, the spectral entropy, the transformed smoothed spectral entropy, the harmonic characteristic and the zero-crossing rate of the current sound signal frame.
In one possible implementation, when the current sound signal frame is determined to meet the voice frame condition, it is determined to be a voice frame, and step 303 is performed for the judgment based on the first target sound frame segment. The voice frame condition may include any one or more of the following: the frame energy is larger than a frame-energy threshold; the spectral entropy is larger than a spectral-entropy threshold; the transformed smoothed spectral entropy is larger than its threshold; the frame has harmonic characteristics; the zero-crossing rate is larger than a voice-frame zero-crossing-rate threshold; and the frame energy and entropy do not match the noise characteristics of the preset noise. The preset noise may include automobile engine noise, automobile horn noise, alarm noise and the like, which is not limited in this embodiment.
When the current sound signal frame satisfies none of the above voice frame conditions, it is determined not to be a voice frame, the next frame is detected, and steps 301-303 are repeated.
The respective thresholds may be set within a reasonable range according to extensive experiments and may also be updated dynamically; this embodiment does not limit them.
As an example, a threshold may be updated when it differs greatly from the value being compared against it; for instance, when the transformed smoothed spectral entropy of the current frame differs greatly from the corresponding threshold, that threshold may be updated. Specifically:
when the value to be compared is larger than the corresponding threshold and the difference reaches a set value, the threshold is increased based on a first weighting factor for threshold increase;
when the value to be compared is smaller than the corresponding threshold and the difference reaches the set value, the threshold is decreased based on a second weighting factor for threshold decrease.
In one possible implementation, the threshold value may be adjusted based on the following equation (1):
Th_new = (1-a)*Th+a*x (1)
where Th_new is the updated threshold, Th is the threshold before updating, x is the value to be compared, and a is a weighting factor in the range (0, 1). When increasing the threshold, a may take the value a1; when decreasing it, a may take the value a2.
The first weighting factor a1 corresponding to the increasing threshold and the second weighting factor a2 corresponding to the decreasing threshold may be empirically set, and the values of a1 and a2 may be the same or different, which is not limited in this embodiment.
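The threshold update of equation (1) can be sketched as follows. The margin test and the concrete values of a1, a2 and the margin are illustrative assumptions; only the update formula itself comes from the text.

```python
def update_threshold(th, x, a1=0.05, a2=0.1, margin=0.2):
    """Equation (1): Th_new = (1-a)*Th + a*x, applied only when the value
    to be compared differs from the threshold by a set relative margin."""
    if x > th * (1 + margin):        # value well above threshold: raise it
        return (1 - a1) * th + a1 * x
    if x < th * (1 - margin):        # value well below threshold: lower it
        return (1 - a2) * th + a2 * x
    return th                        # difference too small: keep threshold
```

Because a is in (0, 1), the threshold moves only a fraction of the way toward x on each update, giving slow, stable adaptation.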
Step 303, determining whether the current sound signal frame is a start frame based on the preset noise and the frame energy, spectral entropy, transformed smoothed spectral entropy, harmonic characteristics and zero-crossing rate of the first target sound frame segment.
In one possible implementation, within the first target sound frame segment it is determined whether each sound signal frame satisfies the above voice frame condition, and the number of frames satisfying it is counted. When this count is greater than a frame-count threshold, the current sound signal frame may be taken as a start frame.
When the count is not greater than the frame-count threshold, the current sound signal frame is considered not to be a start frame; detection moves to the next frame, repeating steps 301-303.
In the method for determining a snore signal provided in this embodiment, at least two voice segments are detected before the target voice segment. Thus, if fewer than two voice segments have been detected in the sound signal so far, then after the start frame is detected in step 303, step 304 is entered to detect the end frame; if at least two voice segments have already been detected, then after the start frame is detected in step 303, step 202 below is entered to determine whether the start frame is a snore segment start frame.
Step 304, taking the current sound signal frame and the preceding preset number of sound signal frames as a second target sound frame segment to be detected, and judging whether the current sound signal frame is an end frame based on the preset noise and the frame energy, spectral entropy, transformed smoothed spectral entropy, harmonic characteristics and zero-crossing rate of the second target sound frame segment.
In one possible implementation, once a certain number of frames have elapsed after the start frame was detected in step 303, it is determined frame by frame whether the current sound signal frame is an end frame. Specifically, within the second target sound frame segment, it is determined whether each sound signal frame satisfies the above voice frame condition, and the number of satisfying frames is counted.
When that count is greater than the frame-count threshold, the current sound signal frame is considered a voice frame, and the processing of step 304 is performed on the next frame to determine whether it is an end frame.
When the count is not greater than the frame-count threshold, the current sound signal frame is taken as the end frame. The sound signal between the start frame and the end frame is then regarded as one voice segment.
Thereafter, the process of steps 301-304 described above may be continued for the next frame of the ending frame, with individual segments of the sound signal being determined, and snore segment detection being performed in the segments based on steps 202-204 as follows.
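The frame-counting decisions of steps 303 and 304 share one shape: count how many frames in the window satisfy the voice frame condition and compare against a threshold. A minimal sketch, with the function name and boolean-flag representation as assumptions:

```python
def is_boundary_frame(frame_flags, count_threshold, detecting_start=True):
    """Decide whether the current frame is a segment start/end frame.
    frame_flags: booleans, one per frame of the target sound frame segment,
    True where the frame satisfies the voice frame condition (steps 301-304)."""
    n_voice = sum(frame_flags)
    if detecting_start:
        return n_voice > count_threshold   # enough voiced frames: start frame
    return n_voice <= count_threshold      # too few voiced frames: end frame
```

Using the same counter with the inequality flipped for start versus end detection keeps the two passes symmetric.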
Step 202, if the start frame of the target voice segment is detected in the voice signal, determining whether the start frame of the target voice segment is a snore segment start frame based on the similarity of the preset characteristics of at least two voice segments and the start frame of the target voice segment.
In one possible embodiment, when at least two voice segments have been detected in the sound signal, snore segment detection may be performed starting from the third voice segment. Optionally, to ensure the accuracy of snore segment detection, each of the at least two voice segments preceding the target voice segment should satisfy a preset length condition (e.g., fall within the typical range of snore durations), and the interval between adjacent voice segments should be greater than a preset interval time.
In this embodiment, the voice segment on which snore segment detection is performed is referred to as the target voice segment; the same applies hereinafter.
Compared with speech, snoring is fairly repetitive. Therefore, each time the start frame of a target voice segment is detected, it can be judged whether the at least two voice segments preceding it are similar to one another: if so, the collected sound is more likely to be snoring. Likewise, it can be judged whether the start frame of the target voice segment is similar to those preceding voice segments: if so, the start frame is more likely to belong to a snore and can be taken as the snore segment start frame.
Optionally, in order to further improve the accuracy of determining whether the starting frame is the snore segment starting frame, the processing in step 202 may be as follows:
calculating a similarity score for the start frame based on the similarity of preset characteristics between the at least two voice segments and the start frame of the target voice segment;
determining, based on the similarity score, whether the start frame of the target voice segment is the snore segment start frame.
The similarity of the preset characteristics may include any one or more of the following: energy similarity, spectral similarity, mel-cepstral coefficient similarity, noise similarity.
In one possible implementation, it may be determined whether the start frame of the target voice segment is similar in energy to the preceding at least two voice segments; if so, the similarity score of the start frame may be increased.
It may be determined whether the start frame is similar in spectrum to the preceding at least two voice segments; if so, the similarity score may be increased.
It may be determined whether the start frame is similar in mel-frequency cepstral coefficients to the preceding at least two voice segments; if so, the similarity score may be increased.
Using the noise characteristics of the preset noise, it may be determined whether the start frame or the preceding voice segments match those characteristics; if they do, they may belong to noise, and the similarity score of the start frame may be reduced.
The specific calculation method of each of the above-described similarities will be described below.
For energy similarity, the similarity score of the start frame may be computed as follows: calculate the mean energy of each of the at least two voice segments and the difference ratio between their mean energies; if the ratio falls within a preset ratio range, add the corresponding first similarity score to the similarity score of the start frame. As an example, take the two voice segments preceding the target voice segment; for each, sort the per-frame energies in descending order, take a certain number of frames (e.g., the top 10) and average their energies to obtain that segment's mean energy. Then compute the difference between the two mean energies and its difference ratio (e.g., the ratio of the difference to either segment's mean energy). If the ratio lies within the preset range (e.g., 0 to 0.1), the energies of the two segments are similar, and 1 point may be added to the similarity score of the start frame.
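The energy-similarity check above can be sketched as follows. The choice of normalizing the difference by the larger mean, and all names and defaults, are illustrative assumptions.

```python
import numpy as np

def energy_similarity_score(seg1_energy, seg2_energy, top_n=10,
                            ratio_range=(0.0, 0.1)):
    """Average the top-N per-frame energies of each voice segment, then
    compare the relative difference of the two means against the preset
    ratio range; return 1 point if similar, else 0."""
    m1 = float(np.mean(np.sort(seg1_energy)[::-1][:top_n]))
    m2 = float(np.mean(np.sort(seg2_energy)[::-1][:top_n]))
    ratio = abs(m1 - m2) / max(m1, m2)   # assumption: normalize by larger mean
    return 1 if ratio_range[0] <= ratio <= ratio_range[1] else 0
```

Averaging only the largest frame energies makes the comparison robust to the quiet tails of each segment.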
For spectral similarity, the similarity score of the start frame may be computed as follows: determine the number of first peaks in the spectra of the at least two voice segments that satisfy the peak-position coincidence condition; if this number is greater than a first number threshold, add the corresponding second similarity score to the similarity score of the start frame. As an example, take the two voice segments preceding the target voice segment and compute the average spectrum of each; search each average spectrum for a certain number (e.g., 5) of largest peaks and record their positions; compare the peak positions of the two segments one by one and record the number of closely positioned peaks (the peak-position coincidence condition). If that number exceeds the first number threshold, the spectra of the two segments are similar, and 1 point may be added to the similarity score of the start frame.
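The peak-position comparison can be sketched as below; the local-maximum peak picker, the bin tolerance used for "close" positions, and the function names are assumptions of this sketch.

```python
import numpy as np

def peak_positions(avg_spectrum, n_peaks=5):
    """Bin indices of the n largest local maxima of an average spectrum."""
    s = np.asarray(avg_spectrum, float)
    idx = [i for i in range(1, len(s) - 1)
           if s[i] > s[i - 1] and s[i] >= s[i + 1]]
    idx.sort(key=lambda i: s[i], reverse=True)   # strongest peaks first
    return sorted(idx[:n_peaks])

def matched_peak_count(peaks1, peaks2, tolerance=2):
    """Peaks of the first list whose positions coincide with some peak of
    the second list within a bin tolerance (the coincidence condition)."""
    return sum(1 for p in peaks1
               if any(abs(p - q) <= tolerance for q in peaks2))
```

The returned count is what gets compared against the first number threshold.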
For mel-frequency cepstral coefficient similarity, the similarity score of the start frame may be computed as follows: calculate the mel-frequency cepstral coefficient of each of the at least two voice segments and of the start frame of the target voice segment; compute the first cross-correlation coefficient between each pair among these; if the first cross-correlation coefficients are all greater than a first preset threshold, add the corresponding third similarity score to the similarity score of the start frame. As an example, take the start frame of the target voice segment and the two voice segments preceding it, and compute the mel-frequency cepstral coefficient of each voice segment (for a segment, the coefficient of each frame may be computed and averaged, the mean serving as the segment's coefficient). Compute the cross-correlation coefficient between the coefficients of the two segments; if it is greater than the first preset threshold, the two segments are similar, and 1 point may be added to the similarity score of the start frame. Also compute the cross-correlation coefficients between the start frame's coefficient and those of the first and second segments; if both are greater than the first preset threshold, the start frame is similar to both segments, and another 1 point may be added to the similarity score of the start frame.
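A sketch of the MFCC similarity scoring. The zero-lag normalized cross-correlation (effectively cosine similarity) and the threshold value are assumptions; the patent does not specify the correlation formula.

```python
import numpy as np

def xcorr_coeff(a, b):
    """Zero-lag normalized cross-correlation of two coefficient vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.dot(a, b) /
                 (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def mfcc_similarity_score(mfcc_seg1, mfcc_seg2, mfcc_start, threshold=0.8):
    """1 point if the two segments' MFCCs correlate above the first preset
    threshold; another point if the start frame correlates with both."""
    score = 0
    if xcorr_coeff(mfcc_seg1, mfcc_seg2) > threshold:
        score += 1          # the two preceding segments are mutually similar
    if (xcorr_coeff(mfcc_start, mfcc_seg1) > threshold
            and xcorr_coeff(mfcc_start, mfcc_seg2) > threshold):
        score += 1          # start frame is similar to both segments
    return score
```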
For noise similarity, the similarity score of the start frame may be computed as follows: compute the second cross-correlation coefficient between the mel-frequency cepstral coefficient of the preset noise and that of each of the at least two voice segments and the start frame of the target voice segment; if any second cross-correlation coefficient is greater than a second preset threshold, subtract the corresponding fourth similarity score from the similarity score of the start frame; if the second cross-correlation coefficients are all smaller than a third preset threshold, add the corresponding fifth similarity score. As an example, take the start frame of the target voice segment and the two voice segments preceding it, and compute the mel-frequency cepstral coefficients of the start frame, each segment and the preset noise. Compute the cross-correlation coefficients between the preset noise and the coefficients of the first and second segments: if either is greater than the second preset threshold, that segment is similar to the preset noise and 1 point is subtracted from the similarity score of the start frame; if both are smaller than the third preset threshold, neither segment is similar to the preset noise and 1 point is added. Likewise compute the cross-correlation coefficient between the preset noise and the start frame's coefficient: if it is greater than the second preset threshold, 1 point is subtracted from the similarity score of the start frame; if it is smaller than the third preset threshold, 1 point is added.
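The noise penalty/bonus can be sketched as follows, again using zero-lag normalized cross-correlation as an assumed similarity measure; the two thresholds are illustrative.

```python
import numpy as np

def _xcorr(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.dot(a, b) /
                 (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def noise_similarity_score(seg_mfccs, start_mfcc, noise_mfcc,
                           th_high=0.8, th_low=0.3):
    """Correlation with the preset-noise MFCC above the second preset
    threshold subtracts a point; below the third preset threshold adds one.
    Segments are scored jointly, the start frame separately."""
    score = 0
    seg_corrs = [_xcorr(m, noise_mfcc) for m in seg_mfccs]
    if any(c > th_high for c in seg_corrs):
        score -= 1                      # some segment resembles the noise
    if all(c < th_low for c in seg_corrs):
        score += 1                      # no segment resembles the noise
    c0 = _xcorr(start_mfcc, noise_mfcc)
    if c0 > th_high:
        score -= 1
    elif c0 < th_low:
        score += 1
    return score
```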
Optionally, for spectral similarity, computing the similarity score of the start frame may further include: if the number of first peaks is greater than the preset number threshold, determining the number of second peaks in the spectrum of the start frame that satisfy the peak-position coincidence condition relative to the at least two voice segments; if this number is greater than a second number threshold, adding the corresponding sixth similarity score to the similarity score of the start frame. As an example, in the spectral similarity calculation above, if the number of closely positioned peaks in the two voice segments exceeds the first number threshold, the search may continue for the largest peaks in the spectrum of the start frame, recording their positions; the peak positions of the start frame and the two voice segments are then compared one by one and the number of close positions recorded. If that number exceeds the second number threshold, the spectrum of the start frame is similar to those of the two voice segments, and 1 point may be added to its similarity score.
After the similarity calculations are complete, the similarity score of the start frame is evaluated: if it is greater than a score threshold, the start frame may be taken as the snore segment start frame; otherwise, the start frame is considered not to be the snore segment start frame.
Note that the score assigned to each similarity may be adjusted; for example, if the mel-frequency cepstral coefficient similarity proves more accurate, the third similarity score corresponding to it may be raised. This embodiment does not limit the scores.
If the starting frame of the target voice segment is not the snore segment starting frame, two processing modes can be adopted thereafter:
The first processing mode: detect the end frame of the target voice segment frame by frame, without further snore segment detection.
The second processing mode: after the start frame of the target voice segment is determined, while detecting its end frame frame by frame, determine whether an intermediate frame is the snore segment start frame.
The specific process of detecting the end frame of the target voice segment frame by frame in the two processing manners is the same as that in step 304, and will not be repeated here.
For the first processing mode, if the start frame of the target voice segment is not the snore segment start frame, the target voice segment is considered not to contain snoring; snore segment detection is not continued within it, and once the next voice segment is detected, snore segment detection is performed on that segment.
To improve the accuracy of snore segment detection, the second processing mode may be adopted to determine whether an intermediate frame of the target voice segment is the snore segment start frame. Any existing snore segment detection method may be used for this; this embodiment does not limit the specific method. Optionally, to further improve accuracy, a method for determining whether an intermediate frame is the snore segment start frame will be described in another embodiment and is not detailed here.
Alternatively, the condition for entering the second processing mode may be that at least one snore signal was detected in the voice segments preceding the target voice segment. As an example, suppose that in the third detected voice segment the start frame was not determined to be a snore segment start frame, so snore segment detection was not continued there; in the fourth detected voice segment the start frame was determined to be a snore segment start frame, so a snore signal was determined in that segment and the snore count incremented by 1. Then, when the fifth voice segment is detected, if its start frame is not determined to be a snore segment start frame, the snore count allows the second processing mode to be entered to determine whether an intermediate frame is the snore segment start frame. The rationale is that when no snore signal has yet been detected, an intermediate frame of a voice segment is unlikely to be a snore segment start frame; to reduce computation, intermediate frames are only examined after a snore signal has been detected.
For convenience of description, in this embodiment, a current frame involved in detecting a snore segment start frame is referred to as a first current frame, and a current frame involved in detecting a snore segment end frame is referred to as a second current frame.
Step 203, if the snore segment start frame is determined, determining the snore segment end frame from the next frame of the snore segment start frame.
Specifically, the process of step 203 may be as follows:
starting from the frame following the snore segment start frame, the following detection processing is performed on the second current frame:
determining whether the second current frame is the end frame of the target voice segment;
if the second current frame is the end frame of the target voice segment, taking the second current frame as the snore segment end frame;
if the second current frame is not the end frame of the target voice segment, determining whether the second current frame is the snore segment end frame; if so, taking the second current frame as the snore segment end frame; if not, performing the detection processing on the frame following the second current frame.
In a possible implementation, after determining the snore segment start frame, it may be determined whether the next frame is an end frame; the specific process is the same as step 304 and is not repeated here. If an end frame is determined, the snore is considered to have ended, and the second current frame (i.e., the end frame) may be taken as the snore segment end frame.
If the second current frame is not determined to be an end frame, it is determined whether the second current frame is a snore segment end frame.
Alternatively, whether the second current frame is the snore segment end frame may be determined based on the mel-frequency cepstral coefficient reference value and the mel-frequency cepstral coefficient of the second current frame.
Here, the mel-frequency cepstral coefficient reference value refers to a reference mel-frequency cepstral coefficient of snoring.
In one possible implementation, the mel-frequency cepstral coefficient of the second current frame is calculated, its cross-correlation coefficient with the reference value computed, and this coefficient compared against the corresponding threshold. If it is smaller than the threshold, the second current frame is considered not to belong to the snore and is taken as the snore segment end frame.
In another possible implementation, the mel-frequency cepstral coefficient of each frame from the snore segment start frame to the second current frame is calculated and averaged, giving the mean coefficient over that span. The cross-correlation coefficient between this mean and the reference value is then computed and compared against the corresponding threshold. If it is smaller, the second current frame is considered not to belong to the snore and is taken as the snore segment end frame. Compared with the previous implementation, the end frame is determined after a delay of some frames, which helps ensure the extracted snore signal contains more of the snore.
If the cross-correlation coefficient is not less than the corresponding threshold, the process of step 203 is continued for the next frame.
Alternatively, if the cross-correlation coefficient is not smaller than the corresponding threshold, it is further judged whether the number of frames from the snore segment start frame to the second current frame exceeds a snore frame-count threshold. If it does, the second current frame may be taken as the snore segment end frame; if not, the processing of step 203 continues with the next frame.
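The end-frame test of step 203 can be sketched as one function covering both the correlation criterion and the optional frame-count cap. The correlation formula, threshold value and parameter names are assumptions of this sketch.

```python
import numpy as np

def is_snore_end_frame(mfcc_frames, mfcc_ref, th_end=0.5,
                       n_frames_since_start=None, max_snore_frames=None):
    """Correlate the running MFCC mean (snore start frame up to the current
    frame) with the reference value; a correlation below the threshold --
    or, if given, exceeding the snore frame-count threshold -- ends the
    snore segment."""
    mean_mfcc = np.mean(np.asarray(mfcc_frames, float), axis=0)
    ref = np.asarray(mfcc_ref, float)
    corr = float(np.dot(mean_mfcc, ref) /
                 (np.linalg.norm(mean_mfcc) * np.linalg.norm(ref) + 1e-12))
    if corr < th_end:
        return True                     # frame no longer resembles the snore
    if max_snore_frames is not None and n_frames_since_start is not None:
        return n_frames_since_start > max_snore_frames
    return False
```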
Optionally, the mel-frequency cepstral coefficient reference value may be adaptively updated; this embodiment provides the following two updating modes:
mode one: and when the initial frame of the target voice section is determined to be the snore section initial frame, updating the reference value of the mel-frequency spectrum coefficient based on the mel-frequency spectrum coefficient of each voice section in at least two voice sections before the target voice section and the mel-frequency spectrum coefficient of the initial frame of the target voice section.
In one possible implementation manner, when it is determined that the starting frame of the target voice section is the snore section starting frame, the mel-frequency coefficient of each frame of the at least two voice sections may be calculated, the mel-frequency coefficient of the starting frame (i.e., the snore section starting frame) is calculated, and the average value is obtained after the average value is obtained, so as to obtain the mel-frequency coefficient average value of the at least two voice sections and the starting frame. And taking the mean value of the mel-frequency spectrum coefficient as a reference value of the mel-frequency spectrum coefficient used by the target human voice segment.
Mode two: if, after the start frame of the target voice segment is determined not to be the snore segment start frame, an intermediate frame is determined to be the snore segment start frame, the reference value is updated based on the mean mel-frequency cepstral coefficient between the start frame of the target voice segment and the snore segment start frame.
In one possible implementation, when an intermediate frame of the target voice segment is determined to be the snore segment start frame, the mel-frequency cepstral coefficient of each frame from the start frame of the target voice segment to the snore segment start frame is calculated and averaged; the reference value is then replaced with this mean.
When detecting the snore segment of the next target voice segment, if the snore segment start frame is determined, the mel cepstrum coefficient reference value used by that target voice segment is determined based on the above processing, thereby achieving adaptive updating of the reference value.
Because the snore signals of different target voice segments may have different mel cepstrum coefficients, this updating process allows the reference value used by each target voice segment to adapt to the snore of that segment, which improves the accuracy of determining the snore segment end frame and hence the accuracy of snore segment detection.
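The two update modes above can be sketched as follows. This is a minimal illustration, assuming each frame's mel cepstrum coefficients (MFCCs) are available as a plain list of floats; the function names and data layout are illustrative, not taken from the patent:

```python
def mean_mfcc(frames):
    # Element-wise mean over a list of per-frame MFCC vectors
    n = len(frames)
    return [sum(f[i] for f in frames) / n for i in range(len(frames[0]))]

def update_reference_mode_one(prev_segment_mfccs, start_frame_mfcc):
    # Mode one: average the per-frame MFCCs of the preceding voice segments
    # together with the snore segment start frame's MFCC, and use the result
    # as the reference value for the target voice segment.
    return mean_mfcc(prev_segment_mfccs + [start_frame_mfcc])

def update_reference_mode_two(mfccs_start_to_snore_start):
    # Mode two: average the per-frame MFCCs from the voice segment start
    # frame up to the snore segment start frame; this mean replaces the
    # current reference value.
    return mean_mfcc(mfccs_start_to_snore_start)
```

Either function returns the new reference vector; the caller simply stores it for use with the next target voice segment.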
Step 204, the sound signal between the snore segment start frame and the snore segment end frame is used as the snore signal.
In one possible embodiment, the snore segment start frame and end frame may be recorded. When the snore signal needs to be analyzed, the corresponding signal can be extracted according to these two frames. This embodiment does not limit the subsequent processing of the snore signal.
The embodiment can obtain the following beneficial effects:
(1) After the sound signal is collected, voice segment detection can be performed in real time to determine the voice segments in the signal; once at least two voice segments have been determined, snore segment detection proceeds from the start frame of the target voice segment based on the similarity of preset characteristics between the at least two voice segments and that start frame. Compared with machine learning algorithms such as Gaussian mixture models, hidden Markov models, clustering models, and support vector machines, this processing requires far less computation and enables real-time snore segment detection on an embedded system.
(2) In calculating the similarity score of the start frame, the similarities of several preset characteristics are used, such as energy similarity, spectrum similarity, mel cepstrum coefficient similarity, and noise similarity, which improves detection precision and noise robustness.
(3) The mel cepstrum coefficient reference value is adaptively adjusted, so that the reference value used by each target voice segment can match the snore of that segment, which improves the accuracy of determining the snore segment end frame and hence the accuracy of snore segment detection.
The method for determining the snore signal according to the present embodiment will be described with reference to a flowchart of a method for determining the snore signal shown in fig. 4. In this embodiment, the snore feature is mainly used to determine whether the initial frame of the target voice segment is the initial frame of the snore segment.
As shown in fig. 4, the method for determining a snore signal provided in this embodiment includes the following steps 401-404.
Step 401: acquire a sound signal, perform voice segment detection on it, and detect at least two voice segments preceding the target voice segment.
In one possible embodiment, an audio acquisition device may be provided on the apparatus to collect sound signals in real time. The collected signal is framed, pre-emphasized, cleaned of impulse noise, and de-meaned; voice segment detection is then performed on each frame to determine the voice segments in the sound signal.
The specific method for determining the voice section may refer to the flowchart of the voice section determining method shown in fig. 3 in the foregoing embodiment, and the specific implementation is the same as the foregoing embodiment, which is not repeated herein.
After determining the ending frame of the voice segment, the process of steps 301-304 described above may be continued for the next frame of the ending frame, each voice segment in the sound signal may be determined, and snore segment detection may be performed in the voice segment based on steps 402-404 as follows.
In the method provided in this embodiment, at least two voice segments may be detected before the target voice segment. Accordingly, if fewer than two voice segments have been detected in the sound signal, the end frame is simply detected after the start frame; if at least two voice segments have already been detected, then after a start frame is detected, step 402 below determines whether that start frame is a snore segment start frame.
Step 402: if the start frame of the target voice segment is detected in the sound signal, determine whether it is a snore segment start frame based on the snore characteristics of the at least two voice segments and of the start frame.
In one possible embodiment, once two voice segments have been detected in the sound signal, snore segment detection may be performed from the third voice segment onward. Optionally, to ensure the accuracy of snore segment detection, each of the at least two voice segments preceding the target voice segment should satisfy a preset length condition (e.g., fall within the range of typical snore lengths), and the interval between two adjacent voice segments should be greater than a preset interval.
In this embodiment, the voice section for detecting the snore section is referred to as a target voice section, and the same applies hereinafter.
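The preconditions on the preceding voice segments can be sketched as a simple check. This assumes each segment is represented as a (start_frame, end_frame) tuple; the representation and parameter names are illustrative, not from the patent:

```python
def eligible_for_snore_detection(segments, min_len, max_len, min_gap):
    # segments: (start_frame, end_frame) tuples for the voice segments
    # preceding the target segment. Requires at least two segments whose
    # lengths fall in the plausible snore-length range, with adjacent
    # segments separated by more than min_gap frames.
    if len(segments) < 2:
        return False
    if any(not (min_len <= end - start <= max_len) for start, end in segments):
        return False
    return all(s1 - e0 > min_gap
               for (_, e0), (s1, _) in zip(segments, segments[1:]))
```

Only when this returns True is the start frame of the target voice segment tested against the snore characteristics.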
Snores have certain characteristic properties, so each time the start frame of a target voice segment is detected, it can be judged whether the at least two preceding voice segments and the start frame exhibit snore characteristics. If they do, the collected sound is likely to be a snore, and the start frame may be taken as the snore segment start frame.
Optionally, to further improve the accuracy of determining whether the starting frame is the snore segment starting frame, the processing in step 402 may be as follows:
calculating a snore characteristic score of the starting frame based on the snore characteristics of the starting frames of the at least two segments and the target segment;
Based on the snore characteristic score of the starting frame, determining whether the starting frame of the target human voice segment is the snore segment starting frame.
Wherein the snore characteristics may include any one or more of the following: snore energy characteristics, snore fundamental frequency characteristics and snore zero crossing rate characteristics.
The principle of snoring characteristics is as follows:
Snores have the property that their low-frequency energy is higher than their mid-frequency energy, so whether a sound is a snore can be judged by comparing the two. Here, the low-frequency band may be 50 Hz to 400 Hz and the mid-frequency band 400 Hz to 4000 Hz; these bands are chosen to match snores.
The fundamental frequency of a snore falls within a certain range; statistics on pre-collected snore signals can determine this range, so the snore fundamental-frequency characteristic can be judged against it.
The zero-crossing rate of a snore likewise falls within a certain range, which can be determined by per-frame zero-crossing statistics on pre-collected snore signals, so the snore zero-crossing-rate characteristic can be judged against that range.
The specific calculation method of the snore characteristics will be described below.
For snore energy characteristics, the process of calculating a snore characteristic score for a starting frame may be as follows:
respectively calculating a low-frequency energy mean value and an intermediate-frequency energy mean value of each of at least two voice segments, and calculating a first ratio of the low-frequency energy mean value to the intermediate-frequency energy mean value of each voice segment, wherein if the first ratio is greater than a ratio threshold, a corresponding first snore characteristic score is increased in a snore characteristic score of a starting frame;
and calculating a low-frequency energy mean value and a medium-frequency energy mean value of the initial frame, and calculating a second ratio of the low-frequency energy mean value to the medium-frequency energy mean value of the initial frame, and if the second ratio is larger than a ratio threshold, increasing a corresponding second snore characteristic score in the snore characteristic score of the initial frame.
As an example, the two voice segments preceding the target voice segment are processed. For each voice segment, each frame of the sound signal can be subjected to a fast Fourier transform (FFT), and the band energy mean is calculated from the transform result by the following formula (2):
A_mean = (1 / (n2 − n1 + 1)) · Σ_{n=n1}^{n2} |f(n)|    (2)
wherein A_mean is the band energy mean, n1 is the frequency bin corresponding to the lower limit frequency, n2 is the frequency bin corresponding to the upper limit frequency, and |f(n)| is the absolute value of the nth spectral value of the Fourier transform result. Alternatively, |f(n)| may be replaced by the square of the absolute value, |f(n)|².
When calculating the low-frequency energy mean value, the a_mean may represent the low-frequency energy mean value, n1 is a frequency point (e.g. 50 Hz) corresponding to the low-frequency lower limit frequency, and n2 is a frequency point (e.g. 400 Hz) corresponding to the low-frequency upper limit frequency.
When calculating the intermediate frequency energy average value, the above a_mean may represent the intermediate frequency energy average value, n1 is a frequency point (e.g. 400 Hz) corresponding to the intermediate frequency lower limit frequency, and n2 is a frequency point (e.g. 4000 Hz) corresponding to the intermediate frequency upper limit frequency.
Furthermore, the ratio of the low-frequency energy mean to the mid-frequency energy mean of each voice segment may be calculated; this embodiment calls it the first ratio. If the first ratios of both voice segments are greater than the ratio threshold, the two voice segments have the snore energy characteristic, and one point can be added to the snore characteristic score of the start frame.
Similarly, the ratio of the low-frequency energy mean to the mid-frequency energy mean of the start frame may be calculated; this embodiment calls it the second ratio. If the second ratio is greater than the ratio threshold, one point can be added to the snore characteristic score of the start frame.
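Formula (2) and the ratio test can be sketched as follows. The bin indices here are hypothetical stand-ins for the 50–400 Hz and 400–4000 Hz bands; in practice the bin for a frequency f is about f·N_fft/fs, where fs is the sample rate and N_fft the FFT length:

```python
def band_energy_mean(spectrum, n1, n2):
    # Formula (2): mean of |f(n)| over bins n1..n2 inclusive
    return sum(abs(spectrum[n]) for n in range(n1, n2 + 1)) / (n2 - n1 + 1)

def energy_ratio_points(spectrum, low_bins, mid_bins, ratio_thresh):
    # One point toward the snore characteristic score if the low/mid
    # band-energy ratio exceeds the threshold.
    low_mean = band_energy_mean(spectrum, *low_bins)
    mid_mean = band_energy_mean(spectrum, *mid_bins)
    return 1 if mid_mean > 0 and low_mean / mid_mean > ratio_thresh else 0
```

The same function scores a voice segment (using its averaged spectrum) or the start frame (using that frame's spectrum).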
For snore base characteristics, the process of calculating a snore characteristic score for a starting frame may be as follows:
respectively calculating the fundamental frequency of each voice section in at least two voice sections, and calculating the fundamental frequency difference value between each voice section, if the fundamental frequency of each voice section is in the snore fundamental frequency range and the fundamental frequency difference value is in the fundamental frequency difference value range, adding a corresponding third snore characteristic score in the snore characteristic score of the initial frame;
And calculating the fundamental frequency of the initial frame, and if the fundamental frequency of the initial frame is in the snore fundamental frequency range, adding a corresponding fourth snore characteristic score into the snore characteristic score of the initial frame.
As an example, the two voice segments preceding the target voice segment are processed: the average spectrum of each voice segment is calculated, and the fundamental frequency is obtained from the average spectrum and used as the fundamental frequency of that segment, after which the difference between the two segments' fundamental frequencies is calculated. If each segment's fundamental frequency lies within the snore fundamental-frequency range, the two segments have the snore fundamental-frequency characteristic; if the difference also lies within the fundamental-frequency difference range, the two segments' fundamentals show a certain continuity consistent with snoring, and one point can be added to the snore characteristic score of the start frame.
If the fundamental frequency of the start frame lies within the snore fundamental-frequency range, the start frame has the snore fundamental-frequency characteristic, and one point can be added to its snore characteristic score.
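A minimal sketch of this scoring step follows. Taking the strongest non-DC bin of the average spectrum as the fundamental is a crude placeholder for a real pitch estimator, and the range/difference parameters are illustrative assumptions:

```python
def fundamental_from_spectrum(avg_spectrum, bin_hz):
    # Crude fundamental estimate: the strongest non-DC bin of the segment's
    # average spectrum, converted to Hz via the bin spacing.
    peak = max(range(1, len(avg_spectrum)), key=lambda n: avg_spectrum[n])
    return peak * bin_hz

def fundamental_points(f0_a, f0_b, f0_range, max_diff):
    # One point if both segments' fundamentals lie in the snore range and
    # differ by no more than max_diff (the continuity condition).
    lo, hi = f0_range
    if lo <= f0_a <= hi and lo <= f0_b <= hi and abs(f0_a - f0_b) <= max_diff:
        return 1
    return 0
```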
For the snore zero crossing rate characteristic, the process of calculating the snore characteristic score for the starting frame may be as follows:
respectively calculating the average zero-crossing rate of each voice section in at least two voice sections, and if the average zero-crossing rate of each voice section is within the range of the zero-crossing rate of the snore, adding a corresponding fifth snore characteristic score in the snore characteristic score of the initial frame;
And calculating the zero crossing rate of the initial frame, and if the zero crossing rate of the initial frame is in the snore zero crossing rate range, adding a corresponding sixth snore characteristic score in the snore characteristic score of the initial frame.
As an example, the two voice segments preceding the target voice segment are processed. For each voice segment, the zero-crossing rate of each frame can be calculated and averaged to obtain the average zero-crossing rate. If the average zero-crossing rate of each segment lies within the snore zero-crossing-rate range, the two segments have the snore zero-crossing-rate characteristic, and one point can be added to the snore characteristic score of the start frame. Alternatively, the average zero-crossing rate of a segment may be computed by sorting the per-frame zero-crossing rates in descending order and averaging the first set number of frames (e.g., 10 frames).
Similarly, the zero-crossing rate of the start frame is calculated. If it lies within the snore zero-crossing-rate range, the start frame has the snore zero-crossing-rate characteristic, and one point can be added to its snore characteristic score.
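Both the per-frame zero-crossing rate and the optional top-N averaging variant can be sketched as follows (treating samples ≥ 0 as positive; normalization conventions vary):

```python
def frame_zcr(frame):
    # Zero-crossing rate: fraction of adjacent sample pairs with a sign change
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def segment_avg_zcr(frames, top_n=None):
    # Plain mean of per-frame ZCRs, or (the optional variant from the text)
    # the mean of only the top_n largest per-frame ZCRs.
    rates = sorted((frame_zcr(f) for f in frames), reverse=True)
    if top_n is not None:
        rates = rates[:top_n]
    return sum(rates) / len(rates)
```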
After the snore characteristics have been evaluated, the snore characteristic score of the start frame can be examined: if it is greater than the score threshold, the start frame may be taken as the snore segment start frame; otherwise, the start frame may be considered not to be the snore segment start frame.
Note that the points assigned to each snore characteristic may be adjusted; for example, if the snore energy characteristic proves more accurate, its points may be raised. This embodiment does not limit the weighting.
Optionally, the ratio threshold between the low-frequency and mid-frequency energy means used in calculating the snore energy characteristic may also be adaptively updated. This embodiment provides the following two update modes:
Mode one: when the start frame of the target voice segment is determined to be the snore segment start frame, the ratio threshold can be updated based on the first ratios and the second ratio determined above.
In one possible implementation, when the start frame of the target voice segment is determined to be the snore segment start frame, a weighted average may be taken of the first ratios of the at least two preceding voice segments, the second ratio of the target voice segment, and the ratio threshold to be updated; the result serves as the updated ratio threshold for the next target voice segment.
Mode two: if, after determining that the start frame of the target voice segment is not the snore segment start frame, an intermediate frame is determined to be the snore segment start frame, the ratio threshold is updated based on the ratio of the low-frequency energy mean to the mid-frequency energy mean of the average spectrum between the start frame of the target voice segment and the snore segment start frame.
In one possible implementation, when an intermediate frame of the target voice segment is determined to be the snore segment start frame, that mean ratio and the ratio threshold to be updated may be weighted-averaged, and the result serves as the updated ratio threshold for the next target voice segment.
A ratio threshold that has never been updated may be set to an initial value. Upper and lower limits may also be set for the threshold based on testing, so that it always remains within a reasonable range.
Because the low-to-mid-frequency energy ratio of snoring differs between users, and may differ even for the same user at different times, this processing adaptively adjusts the ratio threshold whenever a snore segment start frame is determined, so that the threshold matches the current snore. This improves the accuracy of determining the snore segment start frame and hence the accuracy of snore segment detection.
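The weighted-average update with clamping can be sketched as follows; the weight and the upper/lower limits are illustrative values, not figures from the patent:

```python
def update_ratio_threshold(old_thresh, observed_ratios,
                           old_weight=0.5, lower=1.2, upper=5.0):
    # Weighted average of the threshold being updated and the low/mid
    # energy ratios observed for the current snore, clamped to tested
    # upper/lower limits so the threshold stays in a reasonable range.
    observed = sum(observed_ratios) / len(observed_ratios)
    new = old_weight * old_thresh + (1.0 - old_weight) * observed
    return min(max(new, lower), upper)
```

In mode one, observed_ratios would hold the first ratios of the preceding segments plus the second ratio of the target segment; in mode two, the single mean ratio between the segment start frame and the snore segment start frame.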
If the starting frame of the target voice segment is not the snore segment starting frame, two processing modes can be adopted thereafter:
the first processing mode is as follows: detecting the ending frame of the target human voice segment frame by frame, and detecting the snore segment no longer;
The second processing mode is as follows: in the process of detecting the end frame of the target voice segment frame by frame after the start frame of the target voice segment is determined, whether the intermediate frame is the snore segment start frame is determined.
The specific process of detecting the end frame of the target voice segment frame by frame in the two processing manners is the same as that in step 304, and will not be repeated here.
In the first processing mode, if the start frame of the target voice segment is not the snore segment start frame, the target voice segment is considered to contain no snore; snore segment detection is not continued within it, and detection resumes on the next detected voice segment.
In order to improve the accuracy of snore segment detection, the second processing method can be adopted to determine whether the intermediate frame of the target human voice segment is the initial frame of the snore segment. The specific processing mode can adopt the existing snore segment detection method, and the specific method for determining whether the intermediate frame of the target human voice segment is the initial frame of the snore segment is not limited in this embodiment. Optionally, in order to further improve the accuracy of snore segment detection, a method for determining whether the intermediate frame of the target voice segment is the initial frame of the snore segment will be described in another embodiment, which is not described herein.
Alternatively, the condition for entering the second processing mode may be that at least one snore signal has already been detected in a voice segment preceding the target voice segment. As an example, suppose the start frame of the detected third voice segment is not determined to be a snore segment start frame; snore segment detection is then not continued within that segment. If the start frame of the detected fourth voice segment is determined to be a snore segment start frame, the corresponding snore signal is determined within that segment and a snore count is incremented by 1. When the fifth voice segment is detected, if its start frame is not determined to be a snore segment start frame, the snore count indicates that the second processing mode may be entered to determine whether an intermediate frame is the snore segment start frame. The rationale is that, before any snore signal has been detected, the probability that an intermediate frame of a voice segment is a snore segment start frame is small; to reduce computational cost, intermediate frames are judged only after a snore signal has been detected.
For convenience of description, in this embodiment, a current frame involved in detecting a snore segment start frame is referred to as a first current frame, and a current frame involved in detecting a snore segment end frame is referred to as a second current frame.
Step 403: if the snore segment start frame is determined, determine the snore segment end frame starting from the next frame after the snore segment start frame.
Specifically, the process of step 403 may be as follows:
starting from the next frame of the snore segment starting frame, the second current frame is subjected to the following detection processing:
determining whether the second current frame is an end frame of the target voice segment;
if the second current frame is the ending frame of the target voice section, taking the second current frame as the ending frame of the snore section;
if the second current frame is not the ending frame of the target voice section, determining whether the second current frame is the ending frame of the snore section, and if so, taking the second current frame as the ending frame of the snore section; if not, the detection processing is carried out on the next frame of the second current frame.
In one possible implementation, after the snore segment start frame has been determined, it may first be determined whether the next frame is an end frame of the voice segment; the specific process is the same as step 304 and is not repeated here. If an end frame is determined, the snore is considered to have ended, and the second current frame (i.e., that end frame) may be taken as the snore segment end frame.
If the second current frame is not determined to be an end frame, it is determined whether the second current frame is a snore segment end frame.
Alternatively, whether the second current frame is the snore segment end frame may be determined based on the mel cepstrum coefficient reference value and the mel cepstrum coefficient of the second current frame.
Here, the mel cepstrum coefficient reference value refers to the reference mel cepstrum coefficient of a snore.
In one possible implementation, the mel cepstrum coefficient of the second current frame may be calculated, along with the cross-correlation coefficient between it and the reference value, and it is judged whether the cross-correlation coefficient is smaller than the corresponding threshold. If it is, the second current frame is considered not to belong to the snore and is taken as the snore segment end frame.
In another possible implementation, the mel cepstrum coefficient of each frame between the snore segment start frame and the second current frame may be calculated and averaged as the mel cepstrum coefficient mean over that span. The cross-correlation coefficient between this mean and the reference value is then calculated, and it is judged whether it is smaller than the corresponding threshold; if so, the second current frame is considered not to belong to the snore and is taken as the snore segment end frame. Compared with the previous implementation, the end frame is determined after a delay of several frames, which ensures that the snore signal retains more of the snore's information.
If the cross-correlation coefficient is not less than the corresponding threshold, the process of step 403 is continued for the next frame.
Alternatively, if the cross-correlation coefficient is not smaller than the corresponding threshold, it is further determined whether the number of frames from the snore segment start frame to the second current frame is greater than the snore frame-count threshold. If it is, the second current frame may be taken as the snore segment end frame; if it is not, the processing of step 403 above continues with the next frame.
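The end-frame test can be sketched as follows, assuming MFCC vectors as plain lists and interpreting "cross-correlation coefficient" as a normalized (Pearson-style) correlation; the threshold and frame-count values are illustrative:

```python
def mfcc_xcorr(a, b):
    # Normalized cross-correlation of two MFCC vectors
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den if den else 0.0

def is_snore_end(frame_mfcc, ref_mfcc, frames_since_start,
                 corr_thresh=0.6, max_frames=150):
    # End the snore segment when the frame's similarity to the snore
    # reference drops below threshold, or when the segment already
    # exceeds the snore frame-count threshold.
    if mfcc_xcorr(frame_mfcc, ref_mfcc) < corr_thresh:
        return True
    return frames_since_start > max_frames
```

The delayed variant of the text would pass the running MFCC mean between the start frame and the second current frame as frame_mfcc instead of a single frame's coefficients.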
Optionally, the mel cepstrum coefficient reference value may be adaptively updated. This embodiment provides the following two update modes:
Mode one: when the start frame of the target voice segment is determined to be the snore segment start frame, the mel cepstrum coefficient reference value is updated based on the mel cepstrum coefficients of each of the at least two voice segments preceding the target voice segment and the mel cepstrum coefficient of the start frame of the target voice segment.
In one possible implementation, when the start frame of the target voice segment is determined to be the snore segment start frame, the mel cepstrum coefficient of each frame of the at least two voice segments and the mel cepstrum coefficient of the start frame (i.e., the snore segment start frame) may be calculated and averaged, yielding the mel cepstrum coefficient mean over the at least two voice segments and the start frame. This mean is then used as the mel cepstrum coefficient reference value for the target voice segment.
Mode two: if, after determining that the start frame of the target voice segment is not the snore segment start frame, an intermediate frame of the target voice segment is determined to be the snore segment start frame, the mel cepstrum coefficient reference value is updated based on the mean of the mel cepstrum coefficients between the start frame of the target voice segment and the snore segment start frame.
In one possible implementation, when an intermediate frame of the target voice segment is determined to be the snore segment start frame, the mel cepstrum coefficient of each frame from the start frame of the target voice segment to the snore segment start frame may be calculated and averaged, and the resulting mean may then replace the mel cepstrum coefficient reference value.
When detecting the snore segment of the next target voice segment, if the snore segment start frame is determined, the mel cepstrum coefficient reference value used by that target voice segment is determined based on the above processing, thereby achieving adaptive updating of the reference value.
Because the snore signals of different target voice segments may have different mel cepstrum coefficients, this updating process allows the reference value used by each target voice segment to adapt to the snore of that segment, which improves the accuracy of determining the snore segment end frame and hence the accuracy of snore segment detection.
Step 404, taking the sound signal between the snore segment start frame and the snore segment end frame as the snore signal.
In one possible embodiment, the snore segment start frame and end frame may be recorded. When the snore signal needs to be analyzed, the corresponding signal can be extracted according to these two frames. This embodiment does not limit the subsequent processing of the snore signal.
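Extracting the snore signal from the recorded frame indices amounts to a slice of the raw samples. The frame and hop lengths here are whatever the framing stage used; the function is an illustrative sketch, not the patent's exact procedure:

```python
def extract_snore(signal, start_frame, end_frame, frame_len, hop_len):
    # Cut the samples spanned by frames [start_frame, end_frame] out of the
    # raw signal, assuming frame k covers samples k*hop_len .. k*hop_len+frame_len.
    start = start_frame * hop_len
    end = end_frame * hop_len + frame_len
    return signal[start:end]
```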
The embodiment can obtain the following beneficial effects:
(1) After the sound signal is collected, voice segment detection can be performed in real time to determine the voice segments in the signal; once at least two voice segments have been determined, snore segment detection proceeds from the start frame of the target voice segment based on the snore characteristics of the at least two voice segments and that start frame. Compared with machine learning algorithms such as Gaussian mixture models, hidden Markov models, clustering models, and support vector machines, this processing requires far less computation and enables real-time snore segment detection on an embedded system.
(2) In calculating the snore characteristic score of the start frame, several snore characteristics are used, such as the snore energy, snore fundamental-frequency, and snore zero-crossing-rate characteristics, which improves detection accuracy and noise robustness.
(3) The low-to-mid-frequency ratio threshold is adaptively adjusted to match the current snore, improving the accuracy of determining the snore segment start frame.
(4) The mel cepstrum coefficient reference value is adaptively adjusted, so that the reference value used by each target voice segment matches the snore of that segment, improving the accuracy of determining the snore segment end frame.
The method for determining the snore signal according to the present embodiment will be described with reference to a flowchart of a method for determining the snore signal shown in fig. 5. In this embodiment, the similarity of the preset characteristics and the snore characteristics are used to determine whether the initial frame of the target voice segment is the initial frame of the snore segment.
As shown in fig. 5, the method for determining a snore signal provided in this embodiment includes the following steps 501-504.
Step 501, a sound signal is acquired, voice segment detection is performed on the sound signal, and at least two voice segments are detected before a target voice segment.
In one possible embodiment, an audio acquisition device may be provided on the device, and may acquire sound signals in real time. The collected sound signals are subjected to framing, pre-emphasis, impulse noise removal, mean removal, and other processing; voice segment detection is then performed on each frame of the sound signal to determine the voice segments in the sound signal.
The specific method for determining the voice section may refer to the flowchart of the voice section determining method shown in fig. 3 in the foregoing embodiment, and the specific implementation is the same as the foregoing embodiment, which is not repeated herein.
After determining the ending frame of the voice segment, the process of steps 301-304 described above may be continued for the next frame of the ending frame, each voice segment in the sound signal may be determined, and snore segment detection may be performed in the voice segment based on steps 502-504 as follows.
In the method for determining a snore signal provided in this embodiment, at least two voice segments may be detected before the target voice segment. Therefore, if two voice segments have not yet been detected in the sound signal, the end frame is simply detected after the start frame is detected; if at least two voice segments have already been detected in the sound signal, after the start frame is detected, step 502 is entered to determine whether the start frame is a snore segment start frame.
Step 502, if a start frame of the target voice segment is detected in the voice signal, determining whether the start frame of the target voice segment is a snore segment start frame based on the similarity of the preset characteristics and the snore characteristics of at least two voice segments and the start frame of the target voice segment.
In one possible embodiment, when two voice segments have been detected in the sound signal, snore segment detection may be performed in the voice segments starting from the third voice segment. Optionally, in order to ensure the accuracy of snore segment detection, in the at least two voice segments preceding the target voice segment, the length of each voice segment satisfies a preset length condition (e.g., falls within the range of snore lengths), and the interval between two adjacent voice segments is greater than a preset interval time.
In this embodiment, the voice section for detecting the snore section is referred to as a target voice section, and the same applies hereinafter.
Compared with the voice sent by speaking, the snore has certain repeatability, so that each time the initial frame of the target voice section is detected, whether at least two voice sections before the target voice section have similarity can be judged. If at least two voice segments in front of the target voice segment have similarity, the probability that the collected sound belongs to snore is higher. And, it can judge whether the initial frame of the target voice section has similarity with the previous at least two voice sections. If the initial frame of the target voice section has similarity with at least two previous voice sections, the probability that the initial frame belongs to snore is higher.
The snore has certain characteristics, so that whenever the starting frame of the target voice segment is detected, the snore characteristics of the at least two preceding voice segments and of the starting frame can be calculated. If these snore characteristics are present, the probability that the collected sound belongs to snore is high.
Therefore, the similarity of the preset characteristics and the snore characteristics can be adopted to judge whether the initial frame of the target voice section is the snore section initial frame, and if the similarity and the snore characteristics are provided, the initial frame can be used as the snore section initial frame.
Alternatively, the processing in step 502 may be as follows:
calculating a similarity score of the starting frame based on the similarity of the preset characteristics of the at least two voice segments and the starting frame of the target voice segment;
calculating a snore characteristic score of the starting frame based on the snore characteristics of the at least two voice segments and the starting frame of the target voice segment;
based on the similarity score and the snore characteristic score, it is determined whether the starting frame of the target voice segment is a snore segment starting frame.
Wherein the similarity of the preset characteristics comprises any one or more of the following: energy similarity, spectral similarity, mel-cepstral coefficient similarity, noise similarity. The snoring characteristics include any one or more of the following: snore energy characteristics, snore fundamental frequency characteristics and snore zero crossing rate characteristics.
In one possible implementation, if at least two voice segments have the above similarity with the starting frame of the target voice segment, a corresponding score may be added to the similarity score of the starting frame; if the starting frames of the at least two voice segments and the target voice segment have the snore characteristics, the corresponding score can be increased in the snore characteristic score of the starting frame.
Further, it may be determined whether the similarity score is greater than a first score threshold and the snore characteristic score is greater than a second score threshold. If both thresholds are exceeded, the starting frame can be used as the snore segment starting frame.
Alternatively, the similarity score and the snore characteristic score may be weighted and averaged to obtain a snore score for the starting frame, and it is determined whether the snore score is greater than a third score threshold. If it is, the starting frame is considered to have both the similarity and the snore characteristics, and is used as the snore segment starting frame.
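The weighted-average variant above can be sketched as follows; the equal weights and the third score threshold are illustrative assumptions, not values fixed by this embodiment:

```python
# Hypothetical sketch of combining the two scores into a single snore score.
# The weights (0.5 each) and the threshold value are assumed, not specified.
def snore_score_decision(similarity_score, characteristic_score,
                         w_sim=0.5, w_char=0.5, third_threshold=2.0):
    """Weighted average of the similarity score and the snore characteristic
    score; returns True if the combined score clears the third score
    threshold, i.e. the starting frame is treated as the snore segment
    starting frame."""
    combined = w_sim * similarity_score + w_char * characteristic_score
    return combined > third_threshold
```

A starting frame scoring high on both counts passes, while a frame with weak similarity and weak snore characteristics is rejected even if one score alone is nonzero.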
The similarity and snore characteristics will be described separately.
For similarity:
It may be determined whether the starting frame of the target voice segment has similarity to the energies of the previous at least two voice segments, and if so, the similarity score of the starting frame may be increased.
It may be determined whether the starting frame of the target voice segment has similarity to the spectra of the previous at least two voice segments, and if so, the similarity score of the starting frame may be increased.
It may be determined whether the starting frame of the target voice segment has similarity to the mel-frequency cepstrum coefficients of the previous at least two voice segments, and if so, the similarity score of the starting frame may be increased.
Based on the noise characteristics of preset noise, it can be determined whether the starting frame of the target voice segment and the previous at least two voice segments conform to the noise characteristics; if so, the starting frame or the voice segments may belong to noise, and the similarity score of the starting frame can be reduced.
The specific calculation method of each of the above-described similarities will be described below.
For energy similarity, the process of calculating the similarity score of the starting frame may be as follows: calculate the energy mean of each of the at least two voice segments, calculate the difference ratio of the energy means between the voice segments, and if the difference ratio is within a preset ratio range, increase the similarity score of the starting frame by a corresponding first similarity score. As an example, the two voice segments before the target voice segment are taken for processing: the frame energies of the frames in each voice segment are arranged in descending order, a certain number of frames (e.g., the first 10 frames) are taken, and the mean of their frame energies is calculated and used as the energy mean of the voice segment; the energy mean difference between the two voice segments is calculated, and the difference ratio of the energy mean difference is computed (e.g., the ratio of the energy mean difference to the energy mean of either voice segment); if the difference ratio is within the preset ratio range (e.g., 0-0.1), indicating that the energies of the two voice segments are similar, the similarity score of the starting frame may be increased by 1.
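The energy-similarity step can be sketched as follows; the top-10 frame count and the 0-0.1 ratio range follow the example above, while using the larger mean as the denominator is an assumption (the text allows either segment's mean):

```python
# Hypothetical sketch of the energy-similarity check; frame energies are
# assumed precomputed per frame, and the larger mean is used as denominator.
import numpy as np

def energy_mean(frame_energies, top_k=10):
    """Mean of the top_k largest frame energies in a voice segment."""
    ordered = np.sort(np.asarray(frame_energies, dtype=float))[::-1]
    return float(np.mean(ordered[:top_k]))

def energy_similarity_score(seg1_energies, seg2_energies,
                            ratio_range=(0.0, 0.1), score=1):
    """Return `score` if the two segments' energy means differ by a small
    enough relative amount, else 0."""
    m1 = energy_mean(seg1_energies)
    m2 = energy_mean(seg2_energies)
    diff_ratio = abs(m1 - m2) / max(m1, m2)  # relative difference
    lo, hi = ratio_range
    return score if lo <= diff_ratio <= hi else 0
```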
For spectral similarity, the process of calculating the similarity score of the starting frame may be as follows: determine the number of first peaks meeting the peak position coincidence condition in the spectra of the at least two voice segments, and if the number of first peaks is greater than a first number threshold, increase the similarity score of the starting frame by a corresponding second similarity score. As an example, the two voice segments before the target voice segment are taken for processing: the average spectrum of each voice segment is calculated, a certain number (e.g., 5) of the largest peaks are searched for in the average spectrum, and the positions of these peaks are recorded; the peak positions of the two voice segments are compared one by one, and the number of close peak positions (i.e., those meeting the peak position coincidence condition) is recorded; if the number of close peak positions exceeds the first number threshold, indicating that the spectra of the two voice segments are similar, the similarity score of the starting frame may be increased by 1.
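A minimal sketch of the peak-position comparison; the strict local-maximum definition, the bin tolerance for "close" positions, and the peak count are all assumptions:

```python
# Hypothetical sketch of spectral peak matching between two average spectra.
# The tolerance (in FFT bins) used to judge "close" peak positions is assumed.
import numpy as np

def top_peak_positions(spectrum, n_peaks=5):
    """Indices of interior local maxima, keeping the n_peaks largest."""
    s = np.asarray(spectrum, dtype=float)
    idx = [i for i in range(1, len(s) - 1) if s[i] > s[i-1] and s[i] > s[i+1]]
    idx.sort(key=lambda i: s[i], reverse=True)   # largest peaks first
    return sorted(idx[:n_peaks])

def count_matching_peaks(peaks_a, peaks_b, tolerance=2):
    """Count peaks in peaks_a lying within `tolerance` bins of some peak
    in peaks_b (the peak position coincidence condition)."""
    return sum(any(abs(p - q) <= tolerance for q in peaks_b) for p in peaks_a)
```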
For mel-frequency cepstrum coefficient similarity, the process of calculating the similarity score of the starting frame may be as follows: calculate the mel-frequency cepstrum coefficients of the at least two voice segments and of the starting frame of the target voice segment, calculate the first cross-correlation coefficients between any two of these mel-frequency cepstrum coefficients, and if the first cross-correlation coefficients are all greater than a first preset threshold, increase the similarity score of the starting frame by a corresponding third similarity score. As an example, the starting frame of the target voice segment and the two voice segments before it are taken, and the mel-frequency cepstrum coefficients of each are calculated; for a voice segment, the mel-frequency cepstrum coefficients of each frame can be calculated and averaged, with the mean taken as the mel-frequency cepstrum coefficients of the voice segment. The cross-correlation coefficient between the mel-frequency cepstrum coefficients of the two voice segments is calculated, and if it is greater than the first preset threshold, indicating that the two voice segments have mel-frequency cepstrum coefficient similarity, the similarity score of the starting frame may be increased by 1. The cross-correlation coefficients between the starting frame and the mel-frequency cepstrum coefficients of the first voice segment, and between the starting frame and the second voice segment, are then calculated; if both are greater than the first preset threshold, indicating that the starting frame and the two voice segments have mel-frequency cepstrum coefficient similarity, the similarity score of the starting frame may be increased by 1.
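Assuming the mel-cepstral vectors have already been computed, the cross-correlation scoring can be sketched as follows; `xcorr` here is a normalized (Pearson-style) correlation, and the 0.8 value stands in for the first preset threshold, both assumptions:

```python
# Hypothetical sketch of mel-cepstral similarity scoring over precomputed
# MFCC vectors. The normalization and the 0.8 threshold are assumptions.
import numpy as np

def xcorr(a, b):
    """Normalized (Pearson-style) cross-correlation of two vectors."""
    a = np.asarray(a, dtype=float); b = np.asarray(b, dtype=float)
    a = a - a.mean(); b = b - b.mean()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mfcc_similarity_score(mfcc_seg1, mfcc_seg2, mfcc_start, threshold=0.8):
    """+1 if the two segments correlate, +1 more if the starting frame
    correlates with both segments."""
    score = 0
    if xcorr(mfcc_seg1, mfcc_seg2) > threshold:
        score += 1
    if (xcorr(mfcc_start, mfcc_seg1) > threshold and
            xcorr(mfcc_start, mfcc_seg2) > threshold):
        score += 1
    return score
```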
For noise similarity, the process of calculating the similarity score of the starting frame may be as follows: calculate the second cross-correlation coefficients between the mel-frequency cepstrum coefficients of each of the at least two voice segments and of the starting frame of the target voice segment on the one hand, and the mel-frequency cepstrum coefficients of preset noise on the other; if any second cross-correlation coefficient is greater than a second preset threshold, subtract a corresponding fourth similarity score from the similarity score of the starting frame; if all second cross-correlation coefficients are smaller than a third preset threshold, increase the similarity score of the starting frame by a corresponding fifth similarity score. As an example, the starting frame of the target voice segment and the two voice segments before it are taken, and the mel-frequency cepstrum coefficients of the starting frame, of each voice segment, and of the preset noise are calculated. The cross-correlation coefficients between the preset noise and the mel-frequency cepstrum coefficients of the first and second voice segments are calculated; if either is greater than the second preset threshold, indicating that a voice segment is similar to the preset noise, the similarity score of the starting frame is decreased by 1; if both are smaller than the third preset threshold, indicating that neither voice segment is similar to the preset noise, the similarity score of the starting frame is increased by 1. The cross-correlation coefficient between the preset noise and the mel-frequency cepstrum coefficients of the starting frame is also calculated; if it is greater than the second preset threshold, the similarity score of the starting frame may be decreased by 1, and if it is smaller than the third preset threshold, the similarity score of the starting frame may be increased by 1.
Optionally, for spectral similarity, the process of calculating the similarity score of the starting frame may further include: if the number of first peaks is greater than the first number threshold, determining the number of second peaks in the spectrum of the starting frame that meet the peak position coincidence condition with respect to the at least two voice segments, and if the number of second peaks is greater than a second number threshold, increasing the similarity score of the starting frame by a corresponding sixth similarity score. As an example, in the above process of calculating spectral similarity, if the number of close peak positions in the two voice segments exceeds the first number threshold, the search may continue for a certain number of the largest peaks in the spectrum of the starting frame, with their positions recorded; the peak positions of the starting frame and of the two voice segments are compared one by one, and the number of close peak positions is recorded; if this number exceeds the second number threshold, indicating that the starting frame is similar to the spectra of the two voice segments, the similarity score of the starting frame may be increased by 1.
For snore characteristics, the principle is as follows:
Snore is characterized by low-frequency energy being higher than intermediate-frequency energy, so whether a sound belongs to snore can be judged by checking whether its low-frequency energy is higher than its intermediate-frequency energy. Here, the low frequency may be the range from 50 Hz to 400 Hz and the intermediate frequency the range from 400 Hz to 4000 Hz; these ranges are chosen to match snore.
The base frequency of the snore is in a certain range, and the base frequency statistics can be carried out on the pre-collected snore signals to determine the base frequency range of the snore. Therefore, whether the snore fundamental frequency characteristic is provided or not can be judged through the snore fundamental frequency range.
The zero crossing rate of the snore is also in a certain range, and the frame zero crossing rate statistics can be carried out on the pre-collected snore signals to determine the zero crossing rate range of the snore. Therefore, whether the snore zero crossing rate characteristic is provided or not can be judged through the snore zero crossing rate range.
The specific calculation method of the snore characteristics will be described below.
For snore energy characteristics, the process of calculating a snore characteristic score for a starting frame may be as follows:
respectively calculating a low-frequency energy mean value and an intermediate-frequency energy mean value of each of at least two voice segments, and calculating a first ratio of the low-frequency energy mean value to the intermediate-frequency energy mean value of each voice segment, wherein if the first ratio is greater than a ratio threshold, a corresponding first snore characteristic score is increased in a snore characteristic score of a starting frame;
and calculating a low-frequency energy mean value and a medium-frequency energy mean value of the initial frame, and calculating a second ratio of the low-frequency energy mean value to the medium-frequency energy mean value of the initial frame, and if the second ratio is larger than a ratio threshold, increasing a corresponding second snore characteristic score in the snore characteristic score of the initial frame.
As an example, the two voice segments before the target voice segment are taken for processing. For each voice segment, each frame of the voice signal can be subjected to a fast Fourier transform (FFT) to obtain a Fourier transform result, and the frequency band energy mean is calculated through the above formula (2).
Furthermore, the ratio of the low-frequency energy mean to the intermediate-frequency energy mean of each voice segment may be calculated, which is referred to in this embodiment as the first ratio. If the first ratios of both voice segments are greater than the ratio threshold, the two voice segments have the snore energy characteristic, and the snore characteristic score of the starting frame may be increased by 1.
Similarly, the ratio of the low-frequency energy mean to the intermediate-frequency energy mean of the starting frame may be calculated, which is referred to in this embodiment as the second ratio. If the second ratio of the starting frame is greater than the ratio threshold, the snore characteristic score of the starting frame may be increased by 1.
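The band-energy ratio test can be sketched as follows, using the 50-400 Hz and 400-4000 Hz ranges given above; the sampling rate and frame length in the test signals are assumptions, and formula (2) is replaced here by a plain magnitude-squared sum:

```python
# Hypothetical sketch of the low/intermediate band energy ratio. Band edges
# follow the 50-400 Hz and 400-4000 Hz ranges above; the energy measure
# (sum of FFT magnitude squared) stands in for formula (2).
import numpy as np

def band_energy(frame, fs, f_lo, f_hi):
    """Sum of FFT magnitude-squared over [f_lo, f_hi) Hz for one frame."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    mask = (freqs >= f_lo) & (freqs < f_hi)
    return float(spec[mask].sum())

def low_mid_ratio(frame, fs):
    """Ratio of low-frequency to intermediate-frequency energy."""
    low = band_energy(frame, fs, 50.0, 400.0)
    mid = band_energy(frame, fs, 400.0, 4000.0)
    return low / mid if mid > 0 else float('inf')
```

A 100 Hz tone yields a ratio far above 1 (snore-like), while a 1000 Hz tone yields a ratio far below 1.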
For snore base characteristics, the process of calculating a snore characteristic score for a starting frame may be as follows:
respectively calculating the fundamental frequency of each voice section in at least two voice sections, and calculating the fundamental frequency difference value between each voice section, if the fundamental frequency of each voice section is in the snore fundamental frequency range and the fundamental frequency difference value is in the fundamental frequency difference value range, adding a corresponding third snore characteristic score in the snore characteristic score of the initial frame;
And calculating the fundamental frequency of the initial frame, and if the fundamental frequency of the initial frame is in the snore fundamental frequency range, adding a corresponding fourth snore characteristic score into the snore characteristic score of the initial frame.
As an example, the two voice segments before the target voice segment are taken for processing; the average spectrum of each voice segment is calculated, and the fundamental frequency is obtained from the average spectrum and used as the fundamental frequency of the voice segment, after which the fundamental frequency difference between the two voice segments is calculated. If the fundamental frequency of each voice segment is within the snore fundamental frequency range, the two voice segments have the snore fundamental frequency characteristic; if the fundamental frequency difference is within the fundamental frequency difference range, the fundamental frequencies of the two voice segments have a certain continuity and conform to the characteristics of snore, and the snore characteristic score of the starting frame may be increased by 1.
If the fundamental frequency of the starting frame is within the snore fundamental frequency range, the starting frame has the snore fundamental frequency characteristic, and the snore characteristic score of the starting frame may be increased by 1.
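A crude sketch of the fundamental-frequency check; taking the largest spectral peak as the F0 estimate, and the snore F0 range (30-250 Hz) and maximum difference (30 Hz), are all assumptions:

```python
# Hypothetical sketch of the snore fundamental-frequency characteristic.
# The largest-peak F0 proxy, the F0 range, and max_diff are assumed values.
import numpy as np

def fundamental_from_spectrum(avg_spectrum, fs, n_fft):
    """Frequency (Hz) of the largest bin in an average spectrum,
    used here as a crude fundamental-frequency estimate."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    return float(freqs[int(np.argmax(avg_spectrum))])

def base_freq_score(f0_seg1, f0_seg2, f0_start,
                    f0_range=(30.0, 250.0), max_diff=30.0):
    """+1 if both segments' F0s lie in the snore range and are close,
    +1 more if the starting frame's F0 lies in the snore range."""
    lo, hi = f0_range
    score = 0
    if (lo <= f0_seg1 <= hi and lo <= f0_seg2 <= hi
            and abs(f0_seg1 - f0_seg2) <= max_diff):
        score += 1
    if lo <= f0_start <= hi:
        score += 1
    return score
```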
For the snore zero crossing rate characteristic, the process of calculating the snore characteristic score for the starting frame may be as follows:
respectively calculating the average zero-crossing rate of each voice section in at least two voice sections, and if the average zero-crossing rate of each voice section is within the range of the zero-crossing rate of the snore, adding a corresponding fifth snore characteristic score in the snore characteristic score of the initial frame;
And calculating the zero crossing rate of the initial frame, and if the zero crossing rate of the initial frame is in the snore zero crossing rate range, adding a corresponding sixth snore characteristic score in the snore characteristic score of the initial frame.
As an example, the two voice segments before the target voice segment are taken for processing. For each voice segment, the zero-crossing rate of each frame of the signal can be calculated and averaged to obtain the average zero-crossing rate. If the average zero-crossing rate of each voice segment is within the snore zero-crossing rate range, the two voice segments have the snore zero-crossing rate characteristic, and the snore characteristic score of the starting frame may be increased by 1. Alternatively, the average zero-crossing rate may be calculated by, for each voice segment, computing the zero-crossing rate of each frame, sorting the per-frame zero-crossing rates in descending order, and averaging the zero-crossing rates of the first set number of frames (e.g., 10 frames) as the average zero-crossing rate of the voice segment.
Similarly, the zero-crossing rate is calculated for the starting frame. If the zero-crossing rate of the starting frame is within the snore zero-crossing rate range, the starting frame has the snore zero-crossing rate characteristic, and the snore characteristic score of the starting frame may be increased by 1.
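The zero-crossing-rate computation can be sketched as follows; the top-10 averaging follows the optional scheme above, and the sign convention for counting crossings is an assumption:

```python
# Hypothetical sketch of the zero-crossing rate (ZCR) computation.
# A crossing is counted whenever the signs of adjacent samples differ.
import numpy as np

def frame_zcr(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    s = np.sign(np.asarray(frame, dtype=float))
    return float(np.mean(s[:-1] != s[1:]))

def segment_avg_zcr(frames, top_k=10):
    """Mean of the top_k largest per-frame ZCRs in a voice segment."""
    zcrs = sorted((frame_zcr(f) for f in frames), reverse=True)
    return float(np.mean(zcrs[:top_k]))
```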
Note that, the above-mentioned respective scores may be adjusted, for example, if the accuracy of the snore energy characteristic is higher, the score corresponding to the snore energy characteristic may be increased, which is not limited in this embodiment.
Optionally, the ratio threshold of the low-frequency energy mean to the intermediate-frequency energy mean used in calculating the snore energy characteristic may be adaptively updated; two updating modes are provided in this embodiment:
mode one: when the initial frame of the target voice section is determined to be the initial frame of the snore section, the ratio threshold in the frequency spectrum condition can be updated based on the first ratio and the second ratio determined in the process.
In one possible implementation manner, when it is determined that the starting frame of the target voice segment is the snore segment starting frame, weighted average may be performed based on the first ratio of at least two voice segments before the target voice segment, the second ratio of the target voice segment, and the ratio threshold to be updated, and the obtained weighted average is used as the updated ratio threshold and used as the ratio threshold for the next target voice segment.
Mode two: if, after the starting frame of the target voice segment is determined not to be the snore segment starting frame, an intermediate frame of the target voice segment is determined to be the snore segment starting frame, the ratio threshold is updated based on the ratio of the low-frequency energy mean to the intermediate-frequency energy mean of the average spectrum between the starting frame of the target voice segment and the snore segment starting frame.
In one possible implementation manner, when it is determined that the intermediate frame of the target voice segment is the snore segment start frame, the average value ratio and the ratio threshold to be updated may be weighted averaged, and the obtained weighted average value is used as the updated ratio threshold and used as the ratio threshold for the next target voice segment.
A ratio threshold that has never been updated may be set to an initial value. Upper and lower limits may also be set for the ratio threshold based on a number of tests, so that the ratio threshold remains within a reasonable range.
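Update mode one can be sketched as a clamped weighted average; the weight given to the old threshold and the clamping limits are assumptions:

```python
# Hypothetical sketch of the adaptive ratio-threshold update (mode one):
# weighted average of the observed first/second ratios and the current
# threshold, clamped to an assumed reasonable range.
def update_ratio_threshold(old_threshold, first_ratios, second_ratio,
                           w_old=0.5, lower=1.0, upper=3.0):
    """Blend the observed low/mid ratios of the preceding segments
    (first_ratios) and of the starting frame (second_ratio) into the
    current threshold, then clamp to [lower, upper]."""
    observed = list(first_ratios) + [second_ratio]
    w_each = (1.0 - w_old) / len(observed)
    new = w_old * old_threshold + w_each * sum(observed)
    return min(max(new, lower), upper)
```

Observing ratios equal to the current threshold leaves it unchanged, while much larger observed ratios pull it up only as far as the upper limit.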
Because the ratio of low-frequency to intermediate-frequency energy of each user's snore is different, and even the same user's snore may differ at different moments, through the above processing, when the starting frame of the target voice segment is determined to be the snore segment starting frame, the low-to-intermediate-frequency ratio threshold can be adaptively adjusted so that it adapts to the current snore, improving the accuracy of determining the snore segment starting frame and thus the accuracy of snore segment detection.
If the starting frame of the target voice segment is not the snore segment starting frame, two processing modes can be adopted thereafter:
The first processing mode: detect the ending frame of the target voice segment frame by frame, without further snore segment detection;
The second processing mode is as follows: in the process of detecting the end frame of the target voice segment frame by frame after the start frame of the target voice segment is determined, whether the intermediate frame is the snore segment start frame is determined.
The specific process of detecting the end frame of the target voice segment frame by frame in the two processing manners is the same as that in step 304, and will not be repeated here.
For the first processing mode, if the starting frame of the target voice segment is not the snore segment starting frame, the target voice segment is considered not to contain snore, snore segment detection is not continued in this target voice segment, and snore segment detection is performed on the next voice segment after it is detected.
In order to improve the accuracy of snore segment detection, the second processing mode can be adopted to determine whether an intermediate frame of the target voice segment is the snore segment starting frame. Any existing snore segment detection method may be adopted; this embodiment does not limit the specific method for determining whether the intermediate frame of the target voice segment is the snore segment starting frame. Optionally, in order to further improve the accuracy of snore segment detection, a method for determining whether the intermediate frame of the target voice segment is the snore segment starting frame will be described in another embodiment and is not described here.
Alternatively, the condition for entering the second processing mode may be that at least one snore signal has been detected in a voice segment detected before the target voice segment. As an example, suppose that in the detected third voice segment the starting frame is not determined to be the snore segment starting frame, so snore segment detection is not continued in the third voice segment; in the detected fourth voice segment the starting frame is determined to be the snore segment starting frame, so the corresponding snore signal is determined in the fourth voice segment and the snore count is increased by 1; when the fifth voice segment is detected, if its starting frame is not determined to be the snore segment starting frame, the snore count indicates that the second processing mode can be entered to determine whether an intermediate frame is the snore segment starting frame. The reason for this alternative is that, when no snore signal has been detected, the probability that an intermediate frame of a voice segment is the snore segment starting frame is small; to reduce the consumption of computing resources, the snore segment starting frame judgment on intermediate frames can therefore be performed only after a snore signal has been detected.
For convenience of description, in this embodiment, a current frame involved in detecting a snore segment start frame is referred to as a first current frame, and a current frame involved in detecting a snore segment end frame is referred to as a second current frame.
In step 503, if the snore segment start frame is determined, the snore segment end frame is determined from the next frame of the snore segment start frame.
Specifically, the process of step 503 may be as follows:
starting from the next frame of the snore segment starting frame, the second current frame is subjected to the following detection processing:
determining whether the second current frame is an end frame of the target voice segment;
if the second current frame is the ending frame of the target voice section, taking the second current frame as the ending frame of the snore section;
if the second current frame is not the ending frame of the target voice section, determining whether the second current frame is the ending frame of the snore section, and if so, taking the second current frame as the ending frame of the snore section; if not, the detection processing is carried out on the next frame of the second current frame.
In one possible implementation, after the snore segment starting frame is determined, it may be determined whether the next frame is an ending frame; the specific process is the same as that of step 304 and is not repeated here. If an ending frame is determined, the snore is considered to have ended, and the second current frame (i.e., the ending frame) may be used as the snore segment ending frame.
If the second current frame is not determined to be an end frame, it is determined whether the second current frame is a snore segment end frame.
Alternatively, whether the second current frame is a snore segment end frame may be determined based on the mel cepstral coefficient reference value and the mel cepstral coefficient of the second current frame.
Here, the mel cepstral coefficient reference value refers to a reference mel cepstral coefficient of snoring.
In one possible implementation, a mel-cepstral coefficient of the second current frame may be calculated, and a cross-correlation coefficient of the mel-cepstral coefficient and a mel-cepstral coefficient reference value may be calculated, to determine whether the cross-correlation coefficient is less than a corresponding threshold. If the cross-correlation coefficient is smaller than the corresponding threshold, the second current frame is considered not to belong to snore, and the second current frame is taken as a snore segment ending frame.
In another possible implementation, the mel cepstral coefficient of each frame between the snore segment start frame and the second current frame may be calculated and averaged to serve as the mel cepstral coefficient mean from the snore segment start frame to the second current frame. A cross-correlation coefficient of this mean and the mel cepstral coefficient reference value is then calculated, and it is judged whether the cross-correlation coefficient is less than the corresponding threshold. If so, the second current frame is considered not to belong to snoring and is taken as the snore segment end frame. Compared with the previous implementation, the snore segment end frame is determined with a delay of several frames, which ensures that the snore signal contains more of the snore's information.
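As an illustration of this implementation, the sketch below computes the mel cepstral coefficient mean over the frames from the snore segment start frame to the second current frame and its normalized cross-correlation with the reference value. The Pearson form of the cross-correlation coefficient and the threshold of 0.7 are assumptions, not values fixed by the patent:

```python
import math

def mfcc_mean(frames_mfcc):
    """Element-wise mean of a list of equal-length MFCC vectors."""
    n = len(frames_mfcc)
    return [sum(col) / n for col in zip(*frames_mfcc)]

def cross_correlation(a, b):
    """Normalized (Pearson) cross-correlation of two coefficient vectors."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a)
                    * sum((y - mb) ** 2 for y in b))
    return num / den if den else 0.0

def is_snore_end(frames_mfcc, reference, threshold=0.7):
    """Correlate the MFCC mean of the frames from the snore segment start
    frame to the second current frame with the reference value; a
    below-threshold correlation marks the snore segment end frame.
    (threshold=0.7 is an assumed placeholder.)"""
    return cross_correlation(mfcc_mean(frames_mfcc), reference) < threshold
```

Averaging before correlating is what introduces the few-frame delay noted above: a single dissimilar frame does not immediately end the segment.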
If the cross-correlation coefficient is not less than the corresponding threshold, the process of step 503 is continued for the next frame.
Alternatively, if the cross-correlation coefficient is not less than the corresponding threshold, it may be further judged whether the number of frames from the snore segment start frame to the second current frame is greater than a snore frame-count threshold. If so, the second current frame may be taken as the snore segment end frame; if not, the processing of step 503 above continues for the next frame.
Optionally, the mel cepstral coefficient reference value may be adaptively updated; this embodiment provides the following two update modes:
Mode one: when the start frame of the target human voice segment is determined to be the snore segment start frame, the mel cepstral coefficient reference value is updated based on the mel cepstral coefficients of each of the at least two human voice segments before the target human voice segment and the mel cepstral coefficient of the start frame of the target human voice segment.
In one possible implementation, when the start frame of the target human voice segment is determined to be the snore segment start frame, the mel cepstral coefficient of each frame of the at least two human voice segments and the mel cepstral coefficient of the start frame (i.e., the snore segment start frame) may be calculated and averaged, yielding the mel cepstral coefficient mean of the at least two human voice segments and the start frame. This mean is taken as the mel cepstral coefficient reference value used by the target human voice segment.
Mode two: if, after the start frame of the target human voice segment is determined not to be the snore segment start frame, an intermediate frame of the target human voice segment is determined to be the snore segment start frame, the mel cepstral coefficient reference value is updated based on the mel cepstral coefficient mean between the start frame of the target human voice segment and the snore segment start frame.
In one possible implementation, when an intermediate frame of the target human voice segment is determined to be the snore segment start frame, the mel cepstral coefficient of each frame from the start frame of the target human voice segment to the snore segment start frame can be calculated and averaged to obtain the mel cepstral coefficient mean. The reference value is then replaced with this mean.
For snore segment detection of the next target human voice segment, if a snore segment start frame is determined, the mel cepstral coefficient reference value used by that segment is determined based on the above processing, thereby realizing adaptive updating of the reference value.
Because the snore signals of different target human voice segments may have different mel cepstral coefficients, this updating process adapts the mel cepstral coefficient reference value used by each target human voice segment to the snoring of that segment, thereby improving the accuracy of determining the snore segment end frame and hence the accuracy of snore segment detection.
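Both update modes reduce to averaging collections of per-frame mel cepstral coefficient vectors; a minimal sketch, with all function names hypothetical:

```python
def mfcc_mean(vectors):
    """Element-wise mean of a list of equal-length MFCC vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def update_reference_mode_one(prev_segments_mfcc, start_frame_mfcc):
    """Mode one: pool the per-frame MFCCs of the (at least two) previous
    human voice segments with the MFCC of the snore segment start frame
    and average them into the new reference value."""
    pooled = [f for seg in prev_segments_mfcc for f in seg] + [start_frame_mfcc]
    return mfcc_mean(pooled)

def update_reference_mode_two(frames_mfcc):
    """Mode two: average the per-frame MFCCs from the segment start frame
    up to the intermediate frame identified as the snore segment start frame."""
    return mfcc_mean(frames_mfcc)
```

The returned vector simply replaces the previous reference value before the next target human voice segment is processed.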
Step 504, the sound signal between the snore segment start frame and the snore segment end frame is used as the snore signal.
In one possible embodiment, a snore segment start frame and a snore segment end frame may be recorded. When the snore signal needs to be analyzed, the corresponding snore signal can be intercepted according to the snore section starting frame and the snore section ending frame. The present embodiment is not limited to the subsequent processing of the snore signal.
The embodiment can obtain the following beneficial effects:
(1) After sound signals are collected in real time, human voice segment detection can be performed in real time to determine the human voice segments in the sound signal; after at least two human voice segments are determined, snore segment detection is performed from the start frame of the target human voice segment, based on the similarity of preset features between the start frames of the at least two human voice segments and the target human voice segment, and on snore features. Compared with machine learning algorithms such as Gaussian mixture models, hidden Markov models, clustering models, and support vector machines, this processing requires far less computation, so real-time snore segment detection can be achieved on an embedded system.
(2) In calculating the similarity score of the start frame, multiple preset-feature similarities are used, such as energy similarity, spectrum similarity, mel cepstral coefficient similarity, and noise similarity; in calculating the snore feature score of the start frame, multiple snore features are used, such as the snore energy feature, snore fundamental frequency feature, and snore zero-crossing rate feature. This improves detection precision and noise robustness.
(3) The low/mid-frequency ratio threshold is adaptively adjusted so that it adapts to the current snoring, improving the accuracy of determining the snore segment start frame.
(4) The mel cepstral coefficient reference value is adaptively adjusted so that the reference value used by each target human voice segment matches the snoring of that segment, improving the accuracy of determining the snore segment end frame.
The method for determining the snore signal according to this embodiment is described below with reference to the flowchart shown in fig. 6. This embodiment mainly concerns determining whether an intermediate frame of the target human voice segment is the snore segment start frame.
As shown in fig. 6, the method for determining a snore signal provided in this embodiment includes the following steps 601-604.
Step 601, acquiring a sound signal, detecting a voice section of the sound signal, and detecting at least two voice sections before a target voice section.
In one possible embodiment, an audio acquisition device may be provided on the apparatus to collect sound signals in real time. The collected sound signals undergo framing, pre-emphasis, impulse noise removal, mean removal, and other processing; human voice segment detection is then performed on each frame of the sound signal to determine the human voice segments in the sound signal.
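A minimal sketch of this preprocessing (impulse noise removal omitted; the frame length, hop size, and pre-emphasis factor are assumed placeholder values, not taken from the patent):

```python
def preprocess(samples, frame_len=256, hop=128, alpha=0.97):
    """Mean removal, pre-emphasis y[n] = x[n] - alpha*x[n-1], and framing.
    (Impulse noise removal is omitted; frame_len, hop, and alpha are
    assumed placeholder values.)"""
    mean = sum(samples) / len(samples)
    x = [s - mean for s in samples]                                   # mean removal
    y = [x[0]] + [x[i] - alpha * x[i - 1] for i in range(1, len(x))]  # pre-emphasis
    # Split into overlapping frames for per-frame human voice detection.
    return [y[i:i + frame_len] for i in range(0, len(y) - frame_len + 1, hop)]
```

Each returned frame is then fed to the human voice segment detection of steps 301-304.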
The specific method for determining the voice section may refer to the flowchart of the voice section determining method shown in fig. 3 in the foregoing embodiment, and the specific implementation is the same as the foregoing embodiment, which is not repeated herein.
After the end frame of a human voice segment is determined, the processing of steps 301-304 above may continue from the next frame after the end frame; in this way each human voice segment in the sound signal is determined, and snore segment detection is performed in the human voice segments according to steps 602-604 below.
In the method for determining a snore signal provided in this embodiment, at least two human voice segments are detected before the target human voice segment. Therefore, if fewer than two human voice segments have been detected in the sound signal, the end frame is simply detected after the start frame is detected; if at least two human voice segments have already been detected, then after a start frame is detected, step 602 is entered to determine whether that start frame is a snore segment start frame.
In step 602, if a start frame of the target voice segment is detected in the voice signal and the start frame is not a snore segment start frame, it is determined whether an intermediate frame of the target voice segment is a snore segment start frame.
In one possible embodiment, once two human voice segments have been detected in the sound signal, snore segment detection may be performed starting from the third human voice segment. Optionally, to ensure the accuracy of snore segment detection, the length of each of the at least two human voice segments before the target human voice segment should meet a preset length condition (e.g., fall within the typical snore length range), and the interval between two adjacent human voice segments should be longer than a preset interval time.
In this embodiment, the voice section for detecting the snore section is referred to as a target voice section, and the same applies hereinafter.
For convenience of description, in this embodiment, a current frame involved in detecting a snore segment start frame is referred to as a first current frame, and a current frame involved in detecting a snore segment end frame is referred to as a second current frame.
Wherein, the first current frame includes two cases, and the first case refers to: the first current frame is the initial frame of the target voice section; the second case refers to: the first current frame is any intermediate frame after the start frame of the target human voice segment.
For the first case, the specific process of determining whether the start frame is the snore segment start frame may follow an existing snore segment detection method; this embodiment does not limit the specific method. Alternatively, to further improve the accuracy of snore segment detection, the method described in any of the above embodiments for determining whether the start frame of the target human voice segment is the snore segment start frame may be used, which is not repeated here.
What this embodiment improves is the second case, i.e., the method of determining whether an intermediate frame of the target human voice segment is the snore segment start frame. The method is as follows:
If it is determined that the first current frame in the target voice segment meets any one or more of the following snore starting conditions, determining that the first current frame is a snore segment starting frame, where the snore starting conditions include: snore duration conditions, mel cepstrum coefficient conditions, non-noise conditions, zero crossing rate conditions, spectral conditions, fundamental frequency conditions.
Alternatively, the condition for entering the second case may be that at least one snore signal has been detected in a human voice segment detected before the target human voice segment. As an example, assume that in the detected third human voice segment the start frame is not determined to be a snore segment start frame, so snore segment detection is not continued in the third human voice segment; if the start frame of the detected fourth human voice segment is determined to be a snore segment start frame, the corresponding snore signal is determined in the fourth human voice segment and the snore count is incremented by 1; when the fifth human voice segment is detected, if its start frame is not determined to be a snore segment start frame, the second case can be entered based on the snore count, to determine whether an intermediate frame is the snore segment start frame. The reason for this alternative is that when no snore signal has yet been detected, the probability that an intermediate frame of a human voice segment is a snore segment start frame is small; to reduce the consumption of computing resources, the snore segment start frame judgment may be performed on intermediate frames only after a snore signal has been detected.
Referring to the snore segment start frame determination method flowchart shown in fig. 7, the process of step 602 may include the following steps 701-703.
Step 701, judging whether the first current frame is an end frame;
step 702, if the first current frame is not an end frame, determining whether the first current frame meets any one or more snore starting conditions;
step 703, if yes, using the first current frame as a snore segment start frame; if not, the process of steps 701-703 is performed on the next frame.
In a possible implementation, if the detected start frame of the target human voice segment is not the snore segment start frame, then for each subsequent frame it is first determined whether the first current frame is the end frame of the target human voice segment; the specific process is the same as in step 304 and is not repeated here.
If the first current frame is detected to be an end frame, detection of the target human voice segment ends, human voice segment detection is performed on the next frame of the sound signal, and it is judged whether that frame is a start frame.
If the first current frame is not an end frame, snore segment detection continues on the first current frame (which is now an intermediate frame of the target human voice segment), i.e., it is judged whether the first current frame meets the snore start conditions. If yes, there is a high probability that the first current frame belongs to snoring, and it is taken as the snore segment start frame; if not, the above process is repeated for the next frame to judge whether that frame is the snore segment start frame. The process of determining the snore segment start frame ends once a snore segment start frame or an end frame is determined.
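The loop of steps 701-703 can be sketched as follows; both predicates are hypothetical stand-ins, `is_end_frame` for the step-304 end-frame test and `meets_snore_start` for the combined snore start conditions described below:

```python
# Hypothetical predicates: `is_end_frame` stands in for the step-304
# end-frame test, `meets_snore_start` for the combined snore start conditions.
def scan_intermediate_frames(frames, is_end_frame, meets_snore_start):
    """Steps 701-703: walk the frames after the segment start frame; stop
    at the segment end frame, otherwise return the index of the first
    frame that meets any snore start condition."""
    for idx, frame in enumerate(frames):
        if is_end_frame(frame):        # step 701: segment ended without snoring
            return None
        if meets_snore_start(frame):   # steps 702-703: snore segment start frame
            return idx
    return None                        # ran out of buffered frames
```

A `None` result means the target human voice segment ended with no snore segment start frame found.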
The above-mentioned snore starting conditions will be described separately.
Determining that the first current frame meets the snore duration condition means: determining that the number of frames between the start frame of the target human voice segment and the first current frame is within the snore duration range, and/or that the duration between the start frame and the end frame of the previous human voice segment is greater than the snore interval duration.
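A sketch of this duration check; the snore duration range and interval duration below are assumed placeholder values, since the patent does not fix them:

```python
def meets_duration_condition(frames_since_segment_start, gap_to_prev_segment_s,
                             snore_len_range=(10, 300), min_interval_s=1.5):
    """Snore duration condition: the frame count since the segment start
    frame lies within the snore duration range, and/or the time since the
    previous human voice segment's end frame exceeds the snore interval
    duration. (The range and interval values are assumed placeholders.)"""
    in_range = snore_len_range[0] <= frames_since_segment_start <= snore_len_range[1]
    long_gap = gap_to_prev_segment_s > min_interval_s
    return in_range or long_gap
```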
Determining that the first current frame satisfies the mel cepstral coefficient condition means: calculating a first mel cepstral coefficient mean between the start frame of the target human voice segment and the first current frame, calculating a first cross-correlation coefficient of this mean with the mel cepstral coefficient reference value, and determining that the first cross-correlation coefficient is greater than a first preset threshold. Specifically, the mel cepstral coefficient of each frame between the start frame and the first current frame can be calculated and averaged to obtain the first mel cepstral coefficient mean, and the first cross-correlation coefficient of this mean with the reference value is then calculated. The mel cepstral coefficient reference value refers to a reference mel cepstral coefficient of snoring; if the first cross-correlation coefficient is greater than the first preset threshold, there is a high probability that the first current frame belongs to snoring.
Determining that the first current frame satisfies the non-noise condition means: calculating a second cross-correlation coefficient of the first mel cepstral coefficient mean with the mel cepstral coefficient of a preset noise, and determining that the second cross-correlation coefficient is less than a second preset threshold. If so, there is a low probability that the first current frame belongs to noise.
Determining that the first current frame meets the zero-crossing rate condition means: calculating the average zero-crossing rate between the start frame of the target human voice segment and the first current frame, and determining that the average zero-crossing rate is within a preset zero-crossing rate range. The preset zero-crossing rate range may be obtained by statistics on snoring; if the average zero-crossing rate of the frames from the start frame to the first current frame falls within this range, there is a high probability that the first current frame belongs to snoring.
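The average zero-crossing rate can be computed as the fraction of adjacent sample pairs whose signs differ, averaged over the frames; a minimal sketch:

```python
def average_zero_crossing_rate(frames):
    """Average per-frame zero-crossing rate over the frames from the
    segment start frame to the first current frame: the fraction of
    adjacent sample pairs whose signs differ."""
    def zcr(frame):
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
        return crossings / (len(frame) - 1)
    return sum(zcr(f) for f in frames) / len(frames)
```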
Determining that the first current frame meets the spectral condition means: calculating a spectrum mean between the start frame of the target human voice segment and the first current frame, calculating a low-frequency energy mean and a mid-frequency energy mean on the spectrum mean, calculating the ratio of the low-frequency energy mean to the mid-frequency energy mean, and determining that this ratio is greater than a ratio threshold. Snoring has the property that its low-frequency energy is higher than its mid-frequency energy; therefore, if the ratio of the low-frequency energy mean to the mid-frequency energy mean exceeds the ratio threshold, there is a high probability that the first current frame belongs to snoring.
Here, the low frequency may be the range from 50 Hz to 400 Hz and the mid frequency the range from 400 Hz to 4000 Hz; these band settings are matched to snoring.
Determining that the first current frame meets the fundamental frequency condition means: determining a fundamental frequency value on the spectrum mean between the start frame of the target human voice segment and the first current frame, and determining that this fundamental frequency value is within the snore fundamental frequency range. The snore fundamental frequency range can be obtained from statistics on the fundamental frequency of snoring. If the fundamental frequency value of the spectrum mean is within this range, there is a high probability that the first current frame belongs to snoring.
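The spectral and fundamental frequency conditions both operate on a magnitude spectrum. The sketch below uses a naive DFT and the 50-400 Hz / 400-4000 Hz bands stated above; the peak-picking fundamental estimate is a simplification of whatever pitch estimator an implementation would actually use:

```python
import math

def dft_magnitude(frame):
    """Naive DFT magnitude spectrum, bins 0..N/2 (illustration only; a
    real implementation would use an FFT)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        mags.append(math.hypot(re, im))
    return mags

def low_mid_ratio(mags, sample_rate, low=(50, 400), mid=(400, 4000)):
    """Ratio of mean low-band to mean mid-band energy, using the
    50-400 Hz and 400-4000 Hz bands stated above."""
    n_fft = (len(mags) - 1) * 2
    def band_energy_mean(lo, hi):
        bins = [m * m for k, m in enumerate(mags)
                if lo <= k * sample_rate / n_fft < hi]
        return sum(bins) / len(bins)
    # Guard against an all-zero mid band to avoid division by zero.
    return band_energy_mean(*low) / max(band_energy_mean(*mid), 1e-12)

def fundamental_frequency(mags, sample_rate):
    """Crude fundamental estimate: frequency of the largest non-DC bin
    (a simplification of a real pitch estimator)."""
    n_fft = (len(mags) - 1) * 2
    k = max(range(1, len(mags)), key=lambda i: mags[i])
    return k * sample_rate / n_fft
```

A 100 Hz tone yields a large low/mid ratio and a 100 Hz fundamental, while a 1000 Hz tone yields a small ratio, matching the snore property described above.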
Optionally, the mel cepstral coefficient reference value may be adaptively updated; this embodiment provides the following two update modes:
Mode one: when the start frame of the target human voice segment is determined to be the snore segment start frame, the mel cepstral coefficient reference value is updated based on the mel cepstral coefficients of each of the at least two human voice segments before the target human voice segment and the mel cepstral coefficient of the start frame of the target human voice segment.
In one possible implementation, when the start frame of the target human voice segment is determined to be the snore segment start frame, the mel cepstral coefficient of each frame of the at least two human voice segments and the mel cepstral coefficient of the start frame (which is also the snore segment start frame) may be calculated and averaged, yielding the mel cepstral coefficient mean of the at least two human voice segments and the start frame. This mean is taken as the mel cepstral coefficient reference value used by the target human voice segment.
Mode two: when an intermediate frame of the target human voice segment is determined to be the snore segment start frame, the mel cepstral coefficient reference value is updated based on the mel cepstral coefficient mean between the start frame of the target human voice segment and the snore segment start frame.
In one possible implementation, when an intermediate frame of the target human voice segment is determined to be the snore segment start frame, the mel cepstral coefficient of each frame from the start frame of the target human voice segment to the snore segment start frame can be calculated and averaged to obtain the mel cepstral coefficient mean. The reference value is then replaced with this mean.
For snore segment detection of the next target human voice segment, if a snore segment start frame is determined, the mel cepstral coefficient reference value used by that segment is determined based on the above processing, thereby realizing adaptive updating of the reference value.
Because the snore signals of different target human voice segments may have different mel cepstral coefficients, this updating process adapts the mel cepstral coefficient reference value used by each target human voice segment to the snoring of that segment, thereby improving the accuracy of determining the snore segment end frame and hence the accuracy of snore segment detection.
Optionally, the ratio threshold in the above spectrum condition may also be updated adaptively, and this embodiment provides two updating modes as follows:
mode one: when the initial frame of the target voice section is determined to be the snore section initial frame, respectively calculating the low-frequency energy mean value and the medium-frequency energy mean value of each voice section in at least two voice sections before the target voice section, and calculating a first ratio of the low-frequency energy mean value to the medium-frequency energy mean value of each voice section; calculating a low-frequency energy mean value and an intermediate-frequency energy mean value of the initial frame, and calculating a second ratio of the low-frequency energy mean value to the intermediate-frequency energy mean value of the initial frame; the ratio threshold in the spectral condition is updated based on the first ratio and the second ratio.
In one possible implementation manner, when it is determined that the starting frame of the target voice segment is the snore segment starting frame, weighted average may be performed based on the first ratio of at least two voice segments before the target voice segment, the second ratio of the target voice segment, and the ratio threshold to be updated, and the obtained weighted average is used as the updated ratio threshold and used as the ratio threshold for the next target voice segment.
If the start frame of the human voice segment has been determined to be the snore segment start frame by the method of any of the above embodiments, the first ratio and second ratio already calculated there may be reused to update the ratio threshold in the spectral condition, without recalculating them.
Mode two: and when the middle frame of the target voice section is determined to be the snore starting frame, updating a ratio threshold based on the average ratio of the low-frequency energy average value and the medium-frequency energy average value of the spectrum average value.
In one possible implementation manner, when it is determined that the intermediate frame of the target voice segment is the snore segment start frame, the average value ratio and the ratio threshold to be updated may be weighted averaged, and the obtained weighted average value is used as the updated ratio threshold and used as the ratio threshold for the next target voice segment.
A ratio threshold that has never been updated may be set to an initial value. Upper and lower limits may also be set for the ratio threshold based on extensive testing, so that the threshold remains within a reasonable range.
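A sketch of the weighted-average update with clamping; the weight and the limit values are assumed placeholders, since the patent fixes none of them:

```python
def update_ratio_threshold(current_threshold, new_ratios, weight_new=0.3,
                           lower=1.2, upper=6.0):
    """Weighted-average update of the low/mid-frequency ratio threshold,
    clamped to tested upper and lower limits. (weight_new, lower, and
    upper are assumed placeholder values.)"""
    new_mean = sum(new_ratios) / len(new_ratios)
    updated = (1 - weight_new) * current_threshold + weight_new * new_mean
    return max(lower, min(upper, updated))
```

`new_ratios` would hold the first and second ratios of mode one, or the single mean ratio of mode two; the clamp keeps a noisy update from dragging the threshold outside the tested range.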
Because the low-to-mid-frequency energy ratio of snoring differs between users, and even the same user's snoring may differ at different times, the above processing adaptively adjusts the low/mid-frequency ratio threshold when the start frame of the target human voice segment is determined to be the snore segment start frame, so that the threshold adapts to the current snoring; this improves the accuracy of determining the snore segment start frame and hence the accuracy of snore segment detection.
Step 603, if the snore segment start frame is determined, determining the snore segment end frame from the next frame of the snore segment start frame.
Specifically, the process of step 603 may be as follows:
starting from the next frame of the snore segment starting frame, the second current frame is subjected to the following detection processing:
determining whether the second current frame is an end frame of the target voice segment;
if the second current frame is the ending frame of the target voice section, taking the second current frame as the ending frame of the snore section;
if the second current frame is not the ending frame of the target voice section, determining whether the second current frame is the ending frame of the snore section, and if so, taking the second current frame as the ending frame of the snore section; if not, the detection processing is carried out on the next frame of the second current frame.
In a possible implementation, after the snore segment start frame is determined, it may be determined whether the next frame is an end frame; the specific process is the same as that of step 304 and is not repeated here. If an end frame is determined, the snore is considered to have ended, and the second current frame (i.e., the end frame) is taken as the snore segment end frame.
If the second current frame is not determined to be an end frame, it is determined whether the second current frame is a snore segment end frame.
Alternatively, whether the second current frame is a snore segment end frame may be determined based on the mel cepstral coefficient reference value and the mel cepstral coefficient of the second current frame.
The mel cepstral coefficient reference value here may be the same as the one used in determining the snore segment start frame.
In one possible implementation, a mel-cepstral coefficient of the second current frame may be calculated, and a cross-correlation coefficient of the mel-cepstral coefficient and a mel-cepstral coefficient reference value may be calculated, to determine whether the cross-correlation coefficient is less than a corresponding threshold. If the cross-correlation coefficient is smaller than the corresponding threshold, the second current frame is considered not to belong to snore, and the second current frame is taken as a snore segment ending frame.
In another possible implementation, the mel cepstral coefficient of each frame between the snore segment start frame and the second current frame may be calculated and averaged to serve as the mel cepstral coefficient mean from the snore segment start frame to the second current frame. A cross-correlation coefficient of this mean and the mel cepstral coefficient reference value is then calculated, and it is judged whether the cross-correlation coefficient is less than the corresponding threshold. If so, the second current frame is considered not to belong to snoring and is taken as the snore segment end frame. Compared with the previous implementation, the snore segment end frame is determined with a delay of several frames, which ensures that the snore signal contains more of the snore's information.
If the cross-correlation coefficient is not less than the corresponding threshold, the process of step 603 is continued for the next frame.
Alternatively, if the cross-correlation coefficient is not less than the corresponding threshold, it may be further judged whether the number of frames from the snore segment start frame to the second current frame is greater than a snore frame-count threshold. If so, the second current frame may be taken as the snore segment end frame; if not, the processing of step 603 above continues for the next frame.
Step 604, taking the sound signal between the snore segment start frame and the snore segment end frame as the snore signal.
In one possible embodiment, a snore segment start frame and a snore segment end frame may be recorded. When the snore signal needs to be analyzed, the corresponding snore signal can be intercepted according to the snore section starting frame and the snore section ending frame. The present embodiment is not limited to the subsequent processing of the snore signal.
The embodiment can obtain the following beneficial effects:
(1) After the voice signals are collected in real time, voice segment detection can be performed in real time, voice segments in the voice signals are determined, and after at least two voice segments are determined, a snore segment start frame and a snore segment end frame are detected frame by frame from the start frame of a target voice segment. Compared with the Gaussian mixture model, the hidden Markov model, the clustering model, the support vector machine and other machine learning algorithms, the calculation amount of the processing is smaller, and real-time snore segment detection can be realized on an embedded system.
(2) When determining whether an intermediate frame of the target human voice segment is the snore segment start frame, multiple snore start conditions (the snore duration condition, mel cepstrum coefficient condition, non-noise condition, zero crossing rate condition, spectrum condition, and fundamental frequency condition) are evaluated jointly, which improves detection accuracy and noise robustness.
(3) The ratio threshold of low-frequency to medium-frequency energy is adaptively adjusted so that it matches the current snore, which improves the accuracy of determining the snore segment start frame.
(4) The mel cepstrum coefficient reference value is adaptively adjusted so that the reference value used for each target human voice segment matches the snore in that segment, which improves the accuracy of determining the snore segment start frame and end frame.
The embodiment of the invention also provides a device for determining a snore signal, which is used to implement any of the above methods of determining a snore signal. As shown in the schematic block diagram of fig. 8, the apparatus 800 for determining a snore signal includes: an acquisition module 801, a first determination module 802, and a second determination module 803.
An obtaining module 801, configured to obtain a sound signal, perform voice segment detection on the sound signal, and detect at least two voice segments before a target voice segment;
A first determining module 802, configured to determine, if a start frame of a target voice segment is detected in the voice signal, whether a first current frame is a snore segment start frame in the target voice segment;
a second determining module 803, configured to determine, if a snore segment start frame is determined, a snore segment end frame from a frame next to the snore segment start frame; and taking the sound signal between the snore segment starting frame and the snore segment ending frame as a snore signal.
Optionally, when the first current frame is the start frame of the target voice segment, the first determining module 802 is configured to:
and in the target voice section, determining whether the starting frame of the target voice section is a snore section starting frame or not based on the similarity of preset characteristics and/or snore characteristics of the at least two voice sections and the starting frame of the target voice section.
Optionally, the first determining module 802 is configured to:
calculating a similarity score of a starting frame of the target voice segment based on the similarity of preset characteristics of the at least two voice segments and the starting frame of the target voice segment; determining whether the starting frame of the target voice segment is a snore segment starting frame based on the similarity score of the starting frame of the target voice segment; or
Calculating a snore characteristic score of a starting frame of the target voice segment based on the snore characteristics of the at least two voice segments and the starting frame of the target voice segment; and determining whether the starting frame of the target voice section is the starting frame of the snore section based on the snore characteristic score of the starting frame of the target voice section.
Optionally, the first determining module 802 is configured to:
calculating a similarity score of a starting frame of the target voice segment based on the similarity of preset characteristics of the at least two voice segments and the starting frame of the target voice segment;
calculating a snore characteristic score of a starting frame of the target voice segment based on the snore characteristics of the at least two voice segments and the starting frame of the target voice segment;
and determining whether the starting frame of the target voice segment is a snore segment starting frame based on the similarity score and the snore characteristic score.
Optionally, the similarity of the preset characteristics includes any one or more of the following: energy similarity, spectrum similarity, mel-cepstral coefficient similarity, noise similarity;
the snore characteristics include any one or more of the following: snore energy characteristics, snore fundamental frequency characteristics and snore zero crossing rate characteristics.
Optionally, the first determining module 802 is configured to:
for the energy similarity, respectively calculating the energy average value of each voice section in the at least two voice sections, and calculating the difference ratio of the energy average value between each voice section, if the difference ratio is in the range of the preset ratio, adding a corresponding first similarity score in the similarity score of the initial frame of the target voice section;
for spectrum similarity, determining the first wave crest number meeting the wave crest position coincidence condition in the spectrums of the at least two voice segments, and if the first wave crest number is larger than a first number threshold value, increasing a corresponding second similarity score in the similarity score of the initial frame of the target voice segment;
for the mel-cepstrum coefficient similarity, calculating the mel-cepstrum coefficient of each of the at least two voice segments and the mel-cepstrum coefficient of the initial frame of the target voice segment, respectively calculating first cross-correlation coefficients between any two mel-cepstrum coefficients among the at least two voice segments and the initial frame of the target voice segment, and if the first cross-correlation coefficients are all larger than a first preset threshold, adding a corresponding third similarity score in the similarity score of the initial frame of the target voice segment;
For noise similarity, respectively calculating second cross-correlation coefficients between any one mel-cepstrum coefficient in the initial frames of the at least two voice segments and the target voice segment and the mel-cepstrum coefficient of preset noise, and subtracting a corresponding fourth similarity score from the similarity score of the initial frame of the target voice segment if any one of the second cross-correlation coefficients is larger than a second preset threshold; and if the second cross-correlation coefficients are smaller than a third preset threshold value, increasing a corresponding fifth similarity score in the similarity score of the initial frame of the target voice section.
Optionally, the first determining module 802 is further configured to:
for the spectrum similarity, if the first wave crest number is larger than a preset number threshold, determining the second wave crest number in the spectrum of the initial frame of the target voice segment that meets the wave crest position coincidence condition compared with the at least two voice segments, and if the second wave crest number is larger than a second number threshold, adding a corresponding sixth similarity score in the similarity score of the initial frame of the target voice segment.
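As one concrete illustration of the scoring scheme above, the energy-similarity sub-score might be computed as follows. Interpreting the "difference ratio" as the pairwise ratio of segment energy means, as well as the bounds and the score increment, are assumptions for illustration.

```python
def energy_similarity_score(mean_energies, lo=0.8, hi=1.25, delta=1):
    """First similarity sub-score: award `delta` when the per-segment energy
    means are mutually close, i.e. every pairwise ratio lies in [lo, hi].
    The ratio definition and the bounds are illustrative assumptions."""
    score = 0
    close = all(lo <= a / b <= hi
                for a in mean_energies for b in mean_energies)
    if close:
        score += delta
    return score
```

The spectrum, mel-cepstrum, and noise sub-scores would accumulate into the same running score in the same way.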
Optionally, the first determining module 802 is configured to:
for snore energy characteristics:
Respectively calculating a low-frequency energy mean value and a medium-frequency energy mean value of each voice section in the at least two voice sections, and calculating a first ratio of the low-frequency energy mean value to the medium-frequency energy mean value of each voice section, wherein if the first ratio is larger than a ratio threshold, a corresponding first snore characteristic score is added in a snore characteristic score of a starting frame of the target voice section;
calculating a low-frequency energy mean value and a medium-frequency energy mean value of a starting frame of the target voice section, calculating a second ratio of the low-frequency energy mean value to the medium-frequency energy mean value of the starting frame of the target voice section, and if the second ratio is larger than the ratio threshold, adding a corresponding second snore characteristic score into the snore characteristic score of the starting frame of the target voice section;
for snore fundamental frequency characteristics:
respectively calculating the fundamental frequency of each voice section in the at least two voice sections, and calculating the fundamental frequency difference value between each voice section, if the fundamental frequency of each voice section is in the snore fundamental frequency range and the fundamental frequency difference value is in the fundamental frequency difference value range, adding a corresponding third snore characteristic score in the snore characteristic score of the initial frame of the target voice section;
Calculating the fundamental frequency of the initial frame of the target voice section, and if the fundamental frequency of the initial frame of the target voice section is within the snore fundamental frequency range, adding a corresponding fourth snore characteristic score into the snore characteristic score of the initial frame of the target voice section;
for the snore zero crossing rate characteristic:
respectively calculating the average zero-crossing rate of each voice section in the at least two voice sections, and if the average zero-crossing rate of each voice section is within the range of the zero-crossing rate of the snore, adding a corresponding fifth snore characteristic score in the snore characteristic score of the initial frame of the target voice section;
and calculating the zero crossing rate of the starting frame of the target voice section, and if the zero crossing rate of the starting frame of the target voice section is in the snore zero crossing rate range, adding a corresponding sixth snore characteristic score into the snore characteristic score of the starting frame of the target voice section.
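Two of the per-frame quantities used by the snore characteristic scores above can be sketched as follows; the band edges chosen for "low frequency" and "medium frequency" are illustrative assumptions, since the text does not fix them.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent-sample sign changes in one frame."""
    signs = np.sign(np.asarray(frame, dtype=float))
    signs[signs == 0] = 1  # treat exact zeros as positive
    return float(np.mean(signs[:-1] != signs[1:]))

def low_mid_energy_ratio(frame, fs, low_band=(20.0, 300.0), mid_band=(300.0, 1000.0)):
    """Ratio of low-band to medium-band spectral energy for one frame.
    Band edges are assumptions; the patent does not specify them."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    low = spectrum[(freqs >= low_band[0]) & (freqs < low_band[1])].sum()
    mid = spectrum[(freqs >= mid_band[0]) & (freqs < mid_band[1])].sum()
    return low / mid if mid > 0 else float("inf")
```

A snore frame concentrates energy at low frequency, so its `low_mid_energy_ratio` is expected to exceed the ratio threshold, while its zero crossing rate falls in a low, narrow range.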
Optionally, the apparatus further includes an update module, where the update module is configured to:
updating the ratio threshold based on the first ratio and the second ratio when the initial frame of the target voice segment is determined to be the initial frame of the snore segment; and/or
If the middle frame of the target voice section is determined to be the snore section starting frame after the starting frame of the target voice section is determined to be not the snore section starting frame, the ratio threshold is updated based on the average value ratio of the low-frequency energy average value and the medium-frequency energy average value of the spectrum average value between the starting frame of the target voice section and the snore section starting frame.
Optionally, when the first current frame is any intermediate frame after the start frame of the target voice segment, the first determining module 802 is configured to:
if it is determined in the target voice segment that the intermediate frame of the target voice segment meets any one or more of the following snore start conditions, determining that the intermediate frame of the target voice segment is a snore segment start frame, where the snore start conditions include: a snore duration condition, a mel cepstrum coefficient condition, a non-noise condition, a zero crossing rate condition, a spectrum condition, and a fundamental frequency condition.
Optionally, determining that the intermediate frame of the target voice segment meets the snore duration condition means: determining that the frame number between the starting frame and the intermediate frame of the target voice segment is within the snore duration range, and/or determining that the duration between the starting frame of the target voice segment and the ending frame of the previous voice segment is greater than the snore interval duration;
determining that the intermediate frame of the target human voice segment meets the mel-frequency cepstrum coefficient condition refers to: calculating a first mel-frequency coefficient mean value between a starting frame of the target voice section and an intermediate frame of the target voice section, calculating a first cross-correlation coefficient of the first mel-frequency coefficient mean value and a mel-frequency coefficient reference value, and determining that the first cross-correlation coefficient is larger than a first preset threshold;
Determining that the intermediate frame of the target human voice segment meets a non-noise condition means that: calculating a second cross-correlation coefficient of the first mel-cepstrum coefficient mean value and a mel-cepstrum coefficient of preset noise, and determining that the second cross-correlation coefficient is smaller than a second preset threshold;
determining that the intermediate frame of the target voice segment meets the zero crossing rate condition means that: calculating the average zero crossing rate between the initial frame of the target voice section and the middle frame of the target voice section, and determining that the average zero crossing rate is within a preset zero crossing rate range;
determining that the intermediate frame of the target human voice segment meets the spectrum condition means that: calculating a frequency spectrum average value between a starting frame of the target voice section and an intermediate frame of the target voice section, calculating a low-frequency energy average value and a medium-frequency energy average value on the frequency spectrum average value, calculating a mean value ratio of the low-frequency energy average value to the medium-frequency energy average value, and determining that the mean value ratio is larger than a ratio threshold;
determining that the intermediate frame of the target human voice segment meets the fundamental frequency condition means that: and determining a fundamental frequency value on the spectrum mean value, and determining that the fundamental frequency value of the spectrum mean value is in the snore fundamental frequency range.
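The six snore start conditions above could be aggregated as in the sketch below. The all-conditions policy, parameter names, and ranges are assumptions; the patent permits "any one or more" of the conditions to suffice.

```python
def meets_snore_start_conditions(frame_count, duration_range,
                                 ref_corr, ref_corr_threshold,
                                 noise_corr, noise_corr_threshold,
                                 mean_zcr, zcr_range,
                                 low_mid_ratio, ratio_threshold,
                                 f0, f0_range):
    """Check the snore start conditions for an intermediate frame; here all
    six must hold (an assumption -- the patent allows any subset)."""
    return (
        duration_range[0] <= frame_count <= duration_range[1]  # duration
        and ref_corr > ref_corr_threshold                      # MFCC condition
        and noise_corr < noise_corr_threshold                  # non-noise
        and zcr_range[0] <= mean_zcr <= zcr_range[1]           # zero crossing rate
        and low_mid_ratio > ratio_threshold                    # spectrum condition
        and f0_range[0] <= f0 <= f0_range[1]                   # fundamental frequency
    )
```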
Optionally, the updating module is configured to:
When the initial frame of the target voice section is determined to be the initial frame of the snore section, respectively calculating a low-frequency energy mean value and a medium-frequency energy mean value of each voice section in at least two voice sections before the target voice section, and calculating a first ratio of the low-frequency energy mean value to the medium-frequency energy mean value of each voice section; calculating a low-frequency energy mean value and a medium-frequency energy mean value of a starting frame of the target voice section, and calculating a second ratio of the low-frequency energy mean value to the medium-frequency energy mean value of the starting frame of the target voice section; updating the ratio threshold based on the first ratio and the second ratio; and/or
And when the middle frame of the target voice section is determined to be the snore section starting frame, updating the ratio threshold based on the average value ratio of the low-frequency energy average value and the medium-frequency energy average value of the spectrum average value between the starting frame of the target voice section and the snore section starting frame.
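The adaptive update of the ratio threshold can be sketched as exponential smoothing toward the ratios observed on confirmed snore frames; the smoothing rule, `alpha`, and the choice of the minimum observed ratio as the target are assumptions, the patent only stating that the threshold is updated from those ratios.

```python
def update_ratio_threshold(old_threshold, observed_ratios, alpha=0.9):
    """Pull the low/medium-frequency ratio threshold toward the smallest
    ratio observed on frames confirmed as snore, so that the threshold
    tracks the current sleeper while staying below confirmed snore ratios."""
    target = min(observed_ratios)
    return alpha * old_threshold + (1.0 - alpha) * target
```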
Optionally, at least one snore signal is detected in the at least two human voice segments.
Optionally, the second determining module 803 is configured to:
starting from the next frame of the snore segment starting frame, carrying out the following detection processing on the second current frame:
determining whether the second current frame is an end frame of the target human voice segment;
If the second current frame is the ending frame of the target voice section, taking the second current frame as a snore section ending frame;
if the second current frame is not the ending frame of the target voice section, determining whether the second current frame is a snore section ending frame, and if so, taking the second current frame as the snore section ending frame; and if not, carrying out the detection processing on the next frame of the second current frame.
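The frame-by-frame end-frame scan described above amounts to the loop below; `is_end_frame` stands in for the per-frame snore-end test (e.g. the MFCC cross-correlation check) and is an assumption.

```python
def find_snore_end_frame(snore_start_frame, segment_end_frame, is_end_frame):
    """Scan from the frame after the snore segment start frame; return the
    first frame the test flags as a snore end, or the end frame of the
    human voice segment if no earlier frame qualifies."""
    frame = snore_start_frame + 1
    while frame < segment_end_frame:
        if is_end_frame(frame):
            return frame
        frame += 1
    return segment_end_frame
```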
Optionally, the second determining module 803 is configured to:
and determining whether the second current frame is a snore segment ending frame based on the mel-cepstrum coefficient reference value and the mel-cepstrum coefficient of the second current frame.
Optionally, the updating module is configured to:
when the initial frame of the target voice section is determined to be the snore section initial frame, updating the mel cepstrum coefficient reference value based on the mel cepstrum coefficient of each voice section in at least two voice sections before the target voice section and the mel cepstrum coefficient of the initial frame of the target voice section; and/or
And when the intermediate frame of the target voice section is determined to be the snore section starting frame, updating the mel-cepstrum coefficient reference value based on the mel-cepstrum coefficient mean value between the starting frame of the target voice section and the snore section starting frame.
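One plausible form of the reference-value update is a coefficient-wise weighted blend of the stored reference with the newly confirmed snore's MFCC mean; the equal weighting is an illustrative assumption, the patent only stating that the reference value is updated based on that mean.

```python
def update_mfcc_reference(old_reference, snore_mfcc_mean, weight=0.5):
    """Blend the stored mel cepstrum coefficient reference with the MFCC mean
    of the frames just confirmed as snore, coefficient by coefficient, so the
    reference tracks the current sleeper's snore."""
    return [weight * r + (1.0 - weight) * m
            for r, m in zip(old_reference, snore_mfcc_mean)]
```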
In the embodiment of the invention, after the sound signal is collected in real time, human voice segment detection can be performed in real time to determine the human voice segments in the sound signal, and once at least two human voice segments have been determined, the snore segment start frame and snore segment end frame are detected frame by frame from the start frame of the target human voice segment. Compared with machine learning algorithms such as Gaussian mixture models, hidden Markov models, clustering models, and support vector machines, this processing requires far less computation, so real-time snore segment detection can be realized on an embedded system.
The exemplary embodiment of the invention also provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores a computer program executable by the at least one processor for causing the electronic device to perform a method according to an embodiment of the invention when executed by the at least one processor.
The exemplary embodiments of the present invention also provide a non-transitory computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor of a computer, is for causing the computer to perform a method according to an embodiment of the present invention.
The exemplary embodiments of the invention also provide a computer program product comprising a computer program, wherein the computer program, when being executed by a processor of a computer, is for causing the computer to perform a method according to an embodiment of the invention.
Referring to fig. 9, a block diagram of an electronic device 900, which may be a server or a client of the present invention, will now be described; it is an example of a hardware device that may be applied to aspects of the present invention. Electronic devices are intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the invention described and/or claimed herein.
As shown in fig. 9, the electronic device 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the electronic device 900 are connected to the I/O interface 905, including: an input unit 906, an output unit 907, a storage unit 908, and a communication unit 909. The input unit 906 may be any type of device capable of inputting information to the electronic device 900; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device. The output unit 907 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. The storage unit 908 may include, but is not limited to, magnetic disks and optical disks. The communication unit 909 allows the electronic device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth devices, Wi-Fi devices, WiMAX devices, cellular communication devices, and/or the like.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the respective methods and processes described above. For example, in some embodiments, the above-described method of determining snore signals may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 900 via the ROM 902 and/or the communication unit 909. In some embodiments, the computing unit 901 may be configured to perform the above-described method of determining snore signals by any other suitable means (e.g., by means of firmware).
Program code for carrying out methods of the present invention may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic disks, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Claims (16)

1. A method of determining a snore signal, the method comprising:
Acquiring a sound signal, detecting a voice section of the sound signal, and detecting at least two voice sections before a target voice section for snore section detection;
if the starting frame of the target voice section is detected in the voice signal, determining whether a first current frame is a snore section starting frame in the target voice section, wherein the first current frame is the starting frame of the target voice section or is any intermediate frame after the starting frame of the target voice section;
if the snore segment starting frame is determined, determining a snore segment ending frame from the next frame of the snore segment starting frame;
taking the sound signal between the snore segment start frame and the snore segment end frame as a snore signal;
wherein the determining the snore segment ending frame from the next frame of the snore segment starting frame comprises:
starting from the next frame of the snore segment starting frame, carrying out the following detection processing on the second current frame:
determining whether the second current frame is an end frame of the target human voice segment;
if the second current frame is the ending frame of the target voice section, taking the second current frame as a snore section ending frame;
If the second current frame is not the ending frame of the target voice section, determining whether the second current frame is a snore section ending frame, and if so, taking the second current frame as the snore section ending frame; and if not, carrying out the detection processing on the next frame of the second current frame.
2. The method of claim 1, wherein when the first current frame is a start frame of the target voice segment, the determining in the target voice segment whether the first current frame is a snore segment start frame comprises:
in the target voice section, determining whether the starting frame of the target voice section is a snore section starting frame or not based on the similarity of preset characteristics and/or snore characteristics of the at least two voice sections and the starting frame of the target voice section;
the similarity of the preset characteristics comprises any one or more of the following: energy similarity, spectrum similarity, mel-cepstral coefficient similarity, noise similarity;
the snore characteristics include any one or more of the following: snore energy characteristics, snore fundamental frequency characteristics and snore zero crossing rate characteristics.
3. The method of claim 2, wherein determining whether the starting frame of the target human voice segment is a snore segment starting frame based on a similarity of preset characteristics or snore characteristics of the at least two human voice segments and the starting frame of the target human voice segment comprises:
Calculating a similarity score of a starting frame of the target voice segment based on the similarity of preset characteristics of the at least two voice segments and the starting frame of the target voice segment; determining whether the starting frame of the target voice segment is a snore segment starting frame based on the similarity score of the starting frame of the target voice segment; or
Calculating a snore characteristic score of a starting frame of the target voice segment based on the snore characteristics of the at least two voice segments and the starting frame of the target voice segment; and determining whether the starting frame of the target voice section is the starting frame of the snore section based on the snore characteristic score of the starting frame of the target voice section.
4. The method of claim 2, wherein determining whether the starting frame of the target human voice segment is a snore segment starting frame based on a similarity of preset characteristics and snore characteristics of the at least two human voice segments and the starting frame of the target human voice segment comprises:
calculating a similarity score of a starting frame of the target voice segment based on the similarity of preset characteristics of the at least two voice segments and the starting frame of the target voice segment;
calculating a snore characteristic score of a starting frame of the target voice segment based on the snore characteristics of the at least two voice segments and the starting frame of the target voice segment;
And determining whether the starting frame of the target voice segment is a snore segment starting frame based on the similarity score and the snore characteristic score.
5. The method according to claim 3 or 4, wherein,
for energy similarity, the calculating a similarity score for a starting frame of the target human voice segment includes: respectively calculating the energy mean value of each voice section in the at least two voice sections, and calculating the difference ratio of the energy mean value between each voice section, if the difference ratio is in the preset ratio range, increasing the corresponding first similarity score in the similarity score of the initial frame of the target voice section;
for spectral similarity, the calculating a similarity score for a starting frame of the target human voice segment includes: determining the first wave crest number meeting the wave crest position coincidence condition in the frequency spectrums of the at least two voice segments, and if the first wave crest number is larger than a first number threshold value, increasing a corresponding second similarity score in the similarity score of the initial frame of the target voice segment;
for mel-frequency coefficient similarity, the calculating a similarity score for a starting frame of the target human voice segment includes: calculating the mel-frequency cepstrum coefficient of each of the at least two voice segments and the mel-frequency cepstrum coefficient of the initial frame of the target voice segment, respectively calculating first cross-correlation coefficients between any two mel-frequency cepstrum coefficients in the initial frames of the at least two voice segments and the target voice segment, and if the first cross-correlation coefficients are all larger than a first preset threshold, adding a corresponding third similarity score in the similarity score of the initial frame of the target voice segment;
For noise similarity, the calculating a similarity score for a starting frame of the target human voice segment includes: respectively calculating second cross-correlation coefficients between any one mel cepstrum coefficient in the initial frames of the at least two voice segments and the target voice segment and the mel cepstrum coefficient of preset noise, and subtracting a corresponding fourth similarity score from the similarity score of the initial frame of the target voice segment if any one of the second cross-correlation coefficients is larger than a second preset threshold; and if the second cross-correlation coefficients are smaller than a third preset threshold value, increasing a corresponding fifth similarity score in the similarity score of the initial frame of the target voice section.
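Two of the four similarity checks in claim 5 (energy and spectrum) can be sketched as follows. This is an illustrative reconstruction, not the patented implementation: every range, tolerance, and score weight is named but not valued in the claim, so the values below are placeholders; the MFCC and noise checks would follow the same pattern using cross-correlation coefficients.

```python
import numpy as np

# Placeholder values for quantities the claim names but does not specify.
RATIO_RANGE = (0.0, 0.3)   # "preset ratio range" for the energy difference ratio
ENERGY_SCORE = 1.0         # "first similarity score"
PEAK_TOL = 2               # FFT-bin tolerance for "peak position coincidence"
FIRST_PEAK_THRESHOLD = 3   # "first number threshold"
SPECTRUM_SCORE = 1.0       # "second similarity score"

def energy_similarity_score(segments):
    """Energy check of claim 5: if the pairwise difference ratio of the
    per-segment energy means falls inside the preset range, add the first
    similarity score."""
    means = [float(np.mean(np.square(s))) for s in segments]
    score = 0.0
    for i in range(len(means)):
        for j in range(i + 1, len(means)):
            diff_ratio = abs(means[i] - means[j]) / max(means[i], means[j])
            if RATIO_RANGE[0] <= diff_ratio <= RATIO_RANGE[1]:
                score += ENERGY_SCORE
    return score

def spectral_similarity_score(segments, n_peaks=5):
    """Spectral check of claim 5: count spectrum peaks whose positions
    coincide across all segments (the "first peak number") and add the
    second similarity score when enough of them line up."""
    def top_peaks(x):
        return set(np.argsort(np.abs(np.fft.rfft(x)))[-n_peaks:].tolist())
    peak_sets = [top_peaks(s) for s in segments]
    coincident = peak_sets[0]
    for ps in peak_sets[1:]:
        coincident = {p for p in coincident
                      if any(abs(p - q) <= PEAK_TOL for q in ps)}
    return SPECTRUM_SCORE if len(coincident) > FIRST_PEAK_THRESHOLD else 0.0
```

Two identical segments maximize both scores, which matches the claim's intent of rewarding repeated, similar human-voice segments as snore candidates.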
6. The method of claim 5, wherein for spectral similarity, the calculating a similarity score for a starting frame of the target human voice segment further comprises:
if the number of the first wave peaks is larger than a preset number threshold, determining the number of second wave peaks meeting the wave peak position coincidence condition compared with the at least two voice segments in the frequency spectrum of the initial frame of the target voice segment, and if the number of the second wave peaks is larger than a second number threshold, adding a corresponding sixth similarity score in the similarity score of the initial frame of the target voice segment.
7. The method according to claim 3 or 4, wherein,
for the snore energy characteristic, the calculating the snore characteristic score of the initial frame of the target human voice segment includes:
respectively calculating a low-frequency energy mean value and a medium-frequency energy mean value of each voice section in the at least two voice sections, and calculating a first ratio of the low-frequency energy mean value to the medium-frequency energy mean value of each voice section, wherein if the first ratio is larger than a ratio threshold, a corresponding first snore characteristic score is added in a snore characteristic score of a starting frame of the target voice section;
calculating a low-frequency energy mean value and a medium-frequency energy mean value of a starting frame of the target voice section, calculating a second ratio of the low-frequency energy mean value to the medium-frequency energy mean value of the starting frame of the target voice section, and if the second ratio is larger than the ratio threshold, adding a corresponding second snore characteristic score into the snore characteristic score of the starting frame of the target voice section;
for the snore base frequency characteristic, the calculating the snore characteristic score of the initial frame of the target human voice segment includes:
respectively calculating the fundamental frequency of each voice section in the at least two voice sections, and calculating the fundamental frequency difference value between each voice section, if the fundamental frequency of each voice section is in the snore fundamental frequency range and the fundamental frequency difference value is in the fundamental frequency difference value range, adding a corresponding third snore characteristic score in the snore characteristic score of the initial frame of the target voice section;
calculating the fundamental frequency of the initial frame of the target voice section, and if the fundamental frequency of the initial frame of the target voice section is within the snore fundamental frequency range, adding a corresponding fourth snore characteristic score into the snore characteristic score of the initial frame of the target voice section;
for the snore zero crossing rate characteristic, the calculating the snore characteristic score of the initial frame of the target human voice segment includes:
respectively calculating the average zero-crossing rate of each voice section in the at least two voice sections, and if the average zero-crossing rate of each voice section is within the range of the zero-crossing rate of the snore, adding a corresponding fifth snore characteristic score in the snore characteristic score of the initial frame of the target voice section;
and calculating the zero crossing rate of the starting frame of the target voice section, and if the zero crossing rate of the starting frame of the target voice section is in the snore zero crossing rate range, adding a corresponding sixth snore characteristic score into the snore characteristic score of the starting frame of the target voice section.
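The three snore characteristics of claim 7 (low/mid-band energy ratio, fundamental frequency, zero-crossing rate) can be computed roughly as below. The sample rate, band edges, and acceptance ranges are assumptions, and the pitch estimator is a generic autocorrelation stand-in; the claim does not specify any of these.

```python
import numpy as np

FS = 8000                       # assumed sample rate, Hz
LOW_BAND = (20, 300)            # assumed "low frequency" band, Hz
MID_BAND = (300, 1000)          # assumed "intermediate frequency" band, Hz
SNORE_F0_RANGE = (30, 250)      # assumed snore fundamental-frequency range, Hz
SNORE_ZCR_RANGE = (0.005, 0.15) # assumed snore zero-crossing-rate range

def band_energy(frame, band):
    """Mean power-spectrum energy inside a frequency band."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / FS)
    mask = (freqs >= band[0]) & (freqs < band[1])
    return float(np.mean(spec[mask]))

def low_mid_ratio(frame):
    """The claimed "first"/"second" ratio: low-band over mid-band energy."""
    return band_energy(frame, LOW_BAND) / (band_energy(frame, MID_BAND) + 1e-12)

def fundamental_frequency(frame):
    """Crude autocorrelation pitch estimate restricted to the snore range."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo = int(FS / SNORE_F0_RANGE[1])
    hi = int(FS / SNORE_F0_RANGE[0])
    lag = lo + int(np.argmax(ac[lo:hi]))
    return FS / lag

def zero_crossing_rate(frame):
    """Fraction of sample-to-sample sign changes."""
    return float(np.mean(np.abs(np.diff(np.signbit(frame).astype(int)))))
```

Snoring is low-pitched and energy-concentrated at low frequencies, which is why the claim scores a large low/mid ratio, an in-range fundamental, and a low zero-crossing rate.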
8. The method of claim 7, wherein the method further comprises:
when the initial frame of the target voice section is determined to be the snore section initial frame, carrying out weighted average on the basis of the first ratio of at least two voice sections before the target voice section, the second ratio of the target voice section and a ratio threshold to be updated, and taking the obtained weighted average as the updated ratio threshold; and/or
if, after it is determined that the start frame of the target voice segment is not the snore segment start frame, the middle frame of the target voice segment is determined to be the snore segment start frame, performing a weighted average of the ratio of the low-frequency energy mean to the intermediate-frequency energy mean of the spectrum mean between the start frame of the target voice segment and the snore segment start frame with the ratio threshold to be updated, and taking the obtained weighted average as the updated ratio threshold.
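A minimal sketch of the threshold update in claim 8. The claim only says "weighted average", so the equal weighting of the observed ratios and the `history_weight` blend with the old threshold are assumptions.

```python
def update_ratio_threshold(first_ratios, second_ratio, old_threshold,
                           history_weight=0.5):
    """Weighted-average update of claim 8: combine the per-segment first
    ratios, the start-frame second ratio, and the ratio threshold to be
    updated. The weights are not given in the claim; an equal-weight mean
    of the observations blended with the old threshold is assumed here."""
    observations = list(first_ratios) + [second_ratio]
    observed_mean = sum(observations) / len(observations)
    return history_weight * old_threshold + (1.0 - history_weight) * observed_mean
```

Updating the threshold from confirmed snore segments adapts the detector to the current sleeper and recording conditions, which is the apparent purpose of this claim.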
9. The method of claim 1, wherein when the first current frame is any intermediate frame after the start frame of the target voice segment, the determining in the target voice segment whether the first current frame is a snore segment start frame comprises:
if it is determined in the target voice segment that the intermediate frame of the target voice segment meets any one or more of the following snore starting conditions, determining that the intermediate frame of the target voice segment is a snore segment starting frame, where the snore starting conditions include: a snore duration condition, a mel cepstrum coefficient condition, a non-noise condition, a zero crossing rate condition, a frequency spectrum condition, and a fundamental frequency condition;
wherein:
determining that the intermediate frame of the target voice segment meets the snore duration condition means that: determining that the frame number between the starting frame and the middle frame of the target voice section is within the snore duration range, and/or determining that the duration between the starting frame and the ending frame of the previous voice section is greater than the snore interval duration;
determining that the intermediate frame of the target human voice segment meets the mel cepstrum coefficient condition refers to: calculating a first mel cepstrum coefficient mean value between the starting frame of the target voice section and the intermediate frame of the target voice section, calculating a first cross-correlation coefficient of the first mel cepstrum coefficient mean value and a mel cepstrum coefficient reference value, and determining that the first cross-correlation coefficient is larger than a first preset threshold;
determining that the intermediate frame of the target human voice segment meets a non-noise condition means that: calculating a second cross-correlation coefficient of the first mel-cepstrum coefficient mean value and a mel-cepstrum coefficient of preset noise, and determining that the second cross-correlation coefficient is smaller than a second preset threshold;
determining that the intermediate frame of the target voice segment meets the zero crossing rate condition means that: calculating the average zero crossing rate between the initial frame of the target voice section and the middle frame of the target voice section, and determining that the average zero crossing rate is within a preset zero crossing rate range;
determining that the intermediate frame of the target human voice segment meets the spectrum condition means that: calculating a frequency spectrum average value between a starting frame of the target voice section and an intermediate frame of the target voice section, calculating a low-frequency energy average value and a medium-frequency energy average value on the frequency spectrum average value, calculating a mean value ratio of the low-frequency energy average value to the medium-frequency energy average value, and determining that the mean value ratio is larger than a ratio threshold;
determining that the intermediate frame of the target human voice segment meets the fundamental frequency condition means that: determining a fundamental frequency value on the spectrum mean value, and determining that the fundamental frequency value of the spectrum mean value is in the snore fundamental frequency range.
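The duration, MFCC, non-noise, and zero-crossing-rate conditions of claim 9 can be sketched as follows (the spectrum and fundamental-frequency conditions are analogous). All thresholds are placeholders, and, per the claim's wording, the intermediate frame qualifies when any one or more of the conditions hold.

```python
import numpy as np

def pearson(a, b):
    """Simple correlation coefficient, standing in for the claimed
    cross-correlation coefficient."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.corrcoef(a, b)[0, 1])

def snore_start_conditions(frames, mfcc_frames, mfcc_reference, noise_mfcc,
                           duration_range=(10, 300), corr_min=0.8,
                           noise_corr_max=0.5, zcr_range=(0.005, 0.15)):
    """Evaluate four of claim 9's conditions over the frames between the
    segment start and the current intermediate frame; all thresholds are
    placeholder assumptions."""
    mfcc_mean = np.mean(np.asarray(mfcc_frames, dtype=float), axis=0)
    zcr = float(np.mean([np.mean(np.abs(np.diff(np.signbit(f).astype(int))))
                         for f in frames]))
    return {
        "duration": duration_range[0] <= len(frames) <= duration_range[1],
        "mfcc": pearson(mfcc_mean, mfcc_reference) > corr_min,
        "non_noise": pearson(mfcc_mean, noise_mfcc) < noise_corr_max,
        "zcr": zcr_range[0] <= zcr <= zcr_range[1],
    }

def is_snore_start(frames, mfcc_frames, mfcc_reference, noise_mfcc, **kw):
    """Claim 9 decision: any one or more satisfied conditions mark the
    intermediate frame as a snore-segment start frame."""
    return any(snore_start_conditions(frames, mfcc_frames, mfcc_reference,
                                      noise_mfcc, **kw).values())
```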
10. The method according to claim 9, wherein the method further comprises:
when the initial frame of the target voice section is determined to be the initial frame of the snore section, respectively calculating a low-frequency energy mean value and an intermediate-frequency energy mean value of each voice section in at least two voice sections before the target voice section, and calculating a first ratio of the low-frequency energy mean value to the intermediate-frequency energy mean value of each voice section; calculating a low-frequency energy mean value and an intermediate-frequency energy mean value of a starting frame of the target voice section, and calculating a second ratio of the low-frequency energy mean value to the intermediate-frequency energy mean value of the starting frame of the target voice section; performing weighted average based on the first ratio of at least two voice segments before the target voice segment, the second ratio of the target voice segment and a ratio threshold to be updated, and taking the obtained weighted average as the updated ratio threshold; and/or
when the middle frame of the target voice section is determined to be the snore section starting frame, carrying out weighted average on the ratio of the low-frequency energy average value to the intermediate-frequency energy average value of the frequency spectrum average value between the starting frame of the target voice section and the snore section starting frame and the ratio threshold value to be updated, and taking the obtained weighted average value as the updated ratio threshold value.
11. The method of claim 9, wherein at least one snore signal is detected in each of the at least two human voice segments.
12. The method of claim 1, wherein said determining whether the second current frame is a snore segment end frame comprises:
and determining whether the second current frame is a snore segment ending frame based on the mel-cepstrum coefficient reference value and the mel-cepstrum coefficient of the second current frame.
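A hedged sketch of this end-frame test. The claim states only that the decision is based on the MFCC reference value and the second current frame's MFCC, so using a correlation drop as the dissimilarity test, and the 0.8 threshold, are assumptions.

```python
import numpy as np

def is_snore_end_frame(frame_mfcc, mfcc_reference, corr_min=0.8):
    """Claim 12 sketch: treat the second current frame as the snore-segment
    end frame once its mel cepstrum coefficients stop resembling the MFCC
    reference value."""
    a = np.asarray(frame_mfcc, dtype=float)
    b = np.asarray(mfcc_reference, dtype=float)
    corr = float(np.corrcoef(a, b)[0, 1])
    return corr < corr_min
```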
13. The method according to claim 9 or 12, further comprising:
when the initial frame of the target voice section is determined to be the snore section starting frame, calculating the mel cepstrum coefficient of each frame in at least two voice sections before the target voice section and the mel cepstrum coefficient of the initial frame of the target voice section, and averaging these mel cepstrum coefficients to obtain an updated mel cepstrum coefficient reference value; and/or
and when the middle frame of the target voice section is determined to be the snore section starting frame, calculating the mel cepstrum coefficient of each frame from the starting frame of the target voice section to the snore section starting frame, and obtaining a mel cepstrum coefficient mean value after solving the mean value to serve as an updated mel cepstrum coefficient reference value.
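Both update branches of claim 13 reduce to averaging per-frame MFCC vectors over the frames the claim names; a minimal sketch:

```python
import numpy as np

def update_mfcc_reference(frame_mfccs):
    """Claim 13 update: average the per-frame mel cepstrum coefficient
    vectors (either the reference segments plus the start frame, or the
    frames from segment start to snore start) and use the mean as the new
    MFCC reference value."""
    return np.mean(np.asarray(frame_mfccs, dtype=float), axis=0)
```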
14. An apparatus for determining a snore signal, the apparatus comprising:
an acquisition module, configured to acquire a sound signal, perform human voice segment detection on the sound signal, and detect at least two human voice segments before a target human voice segment for snore segment detection;
the first determining module is configured to determine, if a start frame of a target voice segment is detected in the voice signal, whether a first current frame is a snore segment start frame in the target voice segment, where the first current frame is the start frame of the target voice segment, or the first current frame is any intermediate frame after the start frame of the target voice segment;
the second determining module is used for determining a snore segment ending frame from the next frame of the snore segment starting frame if the snore segment starting frame is determined; taking the sound signal between the snore segment start frame and the snore segment end frame as a snore signal; wherein the determining the snore segment ending frame from the next frame of the snore segment starting frame comprises: starting from the next frame of the snore segment starting frame, carrying out the following detection processing on the second current frame: determining whether the second current frame is an end frame of the target human voice segment; if the second current frame is the ending frame of the target voice section, taking the second current frame as a snore section ending frame; if the second current frame is not the ending frame of the target voice section, determining whether the second current frame is a snore section ending frame, and if so, taking the second current frame as the snore section ending frame; and if not, carrying out the detection processing on the next frame of the second current frame.
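The frame-by-frame end-frame scan performed by the second determining module can be sketched as below, with the two per-frame tests passed in as callables (names hypothetical); frame indices are relative to the frame after the snore-segment start frame.

```python
def find_snore_end(n_frames, is_segment_end, is_snore_end):
    """Scan of claim 14's second determining module: stop at the first frame
    that is either the human-voice-segment end frame or a detected snore
    end frame."""
    for idx in range(n_frames):
        if is_segment_end(idx):
            return idx  # segment ended: use this frame as the snore end frame
        if is_snore_end(idx):
            return idx  # snore end detected before the segment end
    return None         # no end found within the given frames (assumption)
```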
15. An electronic device, comprising:
a processor; and
a memory in which a program is stored,
wherein the program comprises instructions which, when executed by the processor, cause the processor to perform the method according to any of claims 1-13.
16. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-13.
CN202211591068.2A 2022-09-15 2022-09-15 Method and device for determining snore signals, electronic equipment and storage medium Pending CN116229983A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211591068.2A CN116229983A (en) 2022-09-15 2022-09-15 Method and device for determining snore signals, electronic equipment and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211591068.2A CN116229983A (en) 2022-09-15 2022-09-15 Method and device for determining snore signals, electronic equipment and storage medium
CN202211118325.0A CN115206329B (en) 2022-09-15 2022-09-15 Method, device, electronic equipment and storage medium for determining snore signals

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN202211118325.0A Division CN115206329B (en) 2022-09-15 2022-09-15 Method, device, electronic equipment and storage medium for determining snore signals

Publications (1)

Publication Number Publication Date
CN116229983A true CN116229983A (en) 2023-06-06

Family

ID=83572394

Family Applications (3)

Application Number Title Priority Date Filing Date
CN202211591076.7A Pending CN116312546A (en) 2022-09-15 2022-09-15 Method and device for determining snore signals, electronic equipment and storage medium
CN202211591068.2A Pending CN116229983A (en) 2022-09-15 2022-09-15 Method and device for determining snore signals, electronic equipment and storage medium
CN202211118325.0A Active CN115206329B (en) 2022-09-15 2022-09-15 Method, device, electronic equipment and storage medium for determining snore signals

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202211591076.7A Pending CN116312546A (en) 2022-09-15 2022-09-15 Method and device for determining snore signals, electronic equipment and storage medium

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202211118325.0A Active CN115206329B (en) 2022-09-15 2022-09-15 Method, device, electronic equipment and storage medium for determining snore signals

Country Status (1)

Country Link
CN (3) CN116312546A (en)

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007206154A (en) * 2006-01-31 2007-08-16 Ame O Voice section detection under real environment noise
US8880207B2 (en) * 2008-12-10 2014-11-04 The University Of Queensland Multi-parametric analysis of snore sounds for the community screening of sleep apnea with non-gaussianity index
CN105534480B (en) * 2016-01-05 2018-08-14 深圳和而泰智能控制股份有限公司 Sound of snoring detection method and device
CN106691382B (en) * 2016-12-26 2020-12-15 赛博龙科技(北京)有限公司 Snore detection method and device based on time-frequency similarity
CN108697328B (en) * 2017-12-27 2021-07-13 深圳和而泰数据资源与云技术有限公司 Snore identification method and snore stopping device
CN109767784B (en) * 2019-01-31 2020-02-07 龙马智芯(珠海横琴)科技有限公司 Snore identification method and device, storage medium and processor
CN111091849B (en) * 2020-03-03 2020-12-22 龙马智芯(珠海横琴)科技有限公司 Snore identification method and device, storage medium snore stopping equipment and processor
CN113314143B (en) * 2021-06-07 2024-01-30 南京优博一创智能科技有限公司 Method and device for judging apnea and electronic equipment
CN113421586B (en) * 2021-06-18 2024-05-07 南京优博一创智能科技有限公司 Dream recognition method and device and electronic equipment
CN114176568B (en) * 2021-12-29 2023-01-17 深圳融昕医疗科技有限公司 Snore detection method based on respiratory pressure signal
CN114613391B (en) * 2022-02-18 2022-11-25 广州市欧智智能科技有限公司 Snore identification method and device based on half-band filter

Also Published As

Publication number Publication date
CN115206329A (en) 2022-10-18
CN116312546A (en) 2023-06-23
CN115206329B (en) 2023-01-24

Similar Documents

Publication Publication Date Title
US10504539B2 (en) Voice activity detection systems and methods
WO2021139327A1 (en) Audio signal processing method, model training method, and related apparatus
CN110634497B (en) Noise reduction method and device, terminal equipment and storage medium
WO2020181824A1 (en) Voiceprint recognition method, apparatus and device, and computer-readable storage medium
JP6099556B2 (en) Voice identification method and apparatus
US7133826B2 (en) Method and apparatus using spectral addition for speaker recognition
EP2828856B1 (en) Audio classification using harmonicity estimation
US20170154640A1 (en) Method and electronic device for voice recognition based on dynamic voice model selection
Pang Spectrum energy based voice activity detection
EP3671743B1 (en) Voice activity detection method
CN109616098B (en) Voice endpoint detection method and device based on frequency domain energy
CN107833581A (en) A kind of method, apparatus and readable storage medium storing program for executing of the fundamental frequency for extracting sound
JP6439682B2 (en) Signal processing apparatus, signal processing method, and signal processing program
CN111508519B (en) Method and device for enhancing voice of audio signal
CN110111811B (en) Audio signal detection method, device and storage medium
CN109346062A (en) Sound end detecting method and device
WO2017177629A1 (en) Far-talking voice recognition method and device
CN106910494B (en) Audio identification method and device
CN115206329B (en) Method, device, electronic equipment and storage medium for determining snore signals
CN113851114B (en) Method and device for determining fundamental frequency of voice signal
JP7152112B2 (en) Signal processing device, signal processing method and signal processing program
KR101434592B1 (en) Speech signal segmentation method based on sound processing of brain
JP4760179B2 (en) Voice feature amount calculation apparatus and program
Goli et al. Speech Intelligibility Improvement in Noisy Environments for Near-End Listening Enhancement
CN117746905B (en) Human activity influence assessment method and system based on time-frequency persistence analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination