CN116052712A - Voice signal processing method, device, computing equipment and storage medium - Google Patents

Voice signal processing method, device, computing equipment and storage medium

Info

Publication number
CN116052712A
CN116052712A (application CN202111262355.4A)
Authority
CN
China
Prior art keywords
voice
speech
frame
representative
active
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111262355.4A
Other languages
Chinese (zh)
Inventor
梁俊斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202111262355.4A
Publication of CN116052712A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/21: Speech or voice analysis techniques where the extracted parameters are power information
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques using neural networks
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/60: Speech or voice analysis techniques for measuring the quality of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephone Function (AREA)

Abstract

The present disclosure provides a voice signal processing method, which includes: acquiring at least one representative speech frame of a speech signal; acquiring power spectrum information and a pitch frequency corresponding to each representative speech frame of the at least one representative speech frame; determining that the speech signal is blocked by a blocking object when it is determined, based on the pitch frequency and power spectrum information corresponding to each representative speech frame, that the sound quality of the speech signal is impaired; and compensating the speech signal when it is determined that the speech signal is blocked by the blocking object. The disclosure further relates to a speech signal processing apparatus, a computing device, and a computer-readable storage medium.

Description

Voice signal processing method, device, computing equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technology, and in particular, to a voice signal processing method, a voice signal processing apparatus, a computing device, and a computer readable storage medium for a voice signal blocking situation.
Background
Currently, voice call applications are becoming increasingly popular. However, in some application scenarios, the voice signal may be blocked by a blocking object, so that its sound quality is impaired. For example, public health and safety have become increasingly important during epidemics, and wearing a mask to prevent viral transmission and avoid infection has become a necessary practice in daily life. When a mask is worn, however, the mouth is covered and speech is obstructed by the mask, so the voice sounds muffled compared with speech produced without a mask. If a user makes a voice call while wearing a mask, the intelligibility of the speech is greatly reduced after the voice has passed through speech encoding/decoding and network transmission, so that the user receiving the voice signal has difficulty hearing and understanding what is being said, which seriously degrades the experience and effect of the voice call. Existing voice call applications do not consider the case where the voice signal is blocked by a blocking object (for example, a scenario in which the user wears a mask), so conventional voice enhancement methods such as echo cancellation, noise suppression, and volume boosting cannot help in such a scenario.
Disclosure of Invention
According to a first aspect of the present disclosure, there is provided a voice signal processing method, comprising: acquiring at least one representative speech frame of the speech signal; acquiring power spectrum information corresponding to each representative voice frame in the at least one representative voice frame and pitch frequency; when it is determined that the sound quality of the speech signal is impaired based on the pitch frequency and power spectrum information corresponding to each representative speech frame, it is determined that the speech signal is blocked by a blocking object.
According to some exemplary embodiments, the acquiring at least one representative speech frame of the speech signal comprises: acquiring a plurality of active voice frames of the voice signal; performing Fourier transform on the plurality of active voice frames to obtain power spectrum information corresponding to each active voice frame in the plurality of active voice frames; based on the power spectrum information corresponding to each active voice frame, obtaining a middle-low frequency band energy value corresponding to each active voice frame; and determining the at least one representative voice frame from the plurality of active voice frames based on the middle-low frequency band energy value corresponding to each active voice frame.
According to some exemplary embodiments, the acquiring the plurality of active speech frames of the speech signal comprises: and carrying out voice activity detection on the voice signal to obtain a plurality of active voice frames of the voice signal, wherein each active voice frame comprises human voice and has a preset time period length.
According to some exemplary embodiments, the determining the at least one representative speech frame from the plurality of active speech frames based on the middle-low frequency band energy values corresponding to the active speech frames comprises: when a middle-low frequency band energy value is greater than a preset middle-low frequency band energy threshold, determining the active speech frame corresponding to that middle-low frequency band energy value to be a representative speech frame.
According to some exemplary embodiments, the determining the at least one representative speech frame from the plurality of active speech frames based on the middle-low frequency band energy values corresponding to the active speech frames comprises: ranking the plurality of active speech frames based on the middle-low frequency band energy value corresponding to each active speech frame; and determining the at least one representative speech frame from the plurality of active speech frames based on a result of the ranking.
According to some example embodiments, said determining said at least one representative speech frame from said plurality of active speech frames based on a result of said ranking comprises: starting from the active speech frame corresponding to the largest middle-low frequency band energy value, selecting a preset number of representative speech frames from the plurality of active speech frames in order of decreasing middle-low frequency band energy value.
According to some example embodiments, said determining said at least one representative speech frame from said plurality of active speech frames based on a result of said ranking comprises: starting from the active speech frame corresponding to the largest middle-low frequency band energy value, selecting a preset percentage of the active speech frames as representative speech frames, in order of decreasing middle-low frequency band energy value.
According to some exemplary embodiments, the obtaining the pitch frequency corresponding to the at least one representative speech frame comprises: and detecting the pitch frequency of the plurality of active voice frames to obtain the pitch frequency corresponding to each active voice frame.
According to some exemplary embodiments, the speech signal processing method further comprises: when it is determined that the speech signal is blocked by a blocking object, the speech signal is compensated.
According to some example embodiments, the compensating the speech signal when it is determined that the speech signal is blocked by a blocking object comprises: determining compensation gains corresponding to respective frequency bands based on the pitch frequencies; and performing voice enhancement processing on the voice signal by using the compensation gain.
According to some exemplary embodiments, the determining the compensation gain corresponding to each frequency band based on the pitch frequency comprises: dividing a pitch frequency range into a plurality of pitch intervals, and determining an interval gain for each of the plurality of pitch intervals; determining the pitch interval in which the pitch frequency is located; and determining the interval gain of the pitch interval in which the pitch frequency is located as the compensation gain.
According to some exemplary embodiments, the dividing the pitch frequency range into a plurality of pitch intervals and determining the interval gain for each of the plurality of pitch intervals comprises: for each pitch interval, respectively calculating the average power spectrum of each frequency band over multiple frames of unblocked speech signal samples and the average power spectrum of each frequency band over multiple frames of blocked speech signal samples; and determining the interval gain of each pitch interval based on the ratio of the average power spectrum of each frequency band of the unblocked speech samples to that of the blocked speech samples.
According to some exemplary embodiments, the plurality of intervals are equally divided over the pitch frequency range.
According to some exemplary embodiments, the determining that the speech signal is blocked by the blocking object when it is determined that the sound quality of the speech signal is impaired based on the pitch frequency and power spectrum information corresponding to each representative speech frame comprises: detecting the sound quality of the speech signal with a speech signal blocking detection model, based on the pitch frequency and power spectrum information corresponding to each representative speech frame, so as to generate a speech signal blocking probability; and determining that the speech signal is blocked by a blocking object when the speech signal blocking probability is greater than a preset speech signal blocking threshold; wherein the speech signal blocking detection model is obtained by training a neural network with the pitch frequency and power spectrum information of a plurality of training speech samples.
According to a second aspect of the present disclosure, there is provided a speech signal processing apparatus comprising: a representative speech frame acquisition module configured to acquire at least one representative speech frame of a speech signal; the power spectrum information acquisition module is configured to acquire power spectrum information corresponding to each representative voice frame in the at least one representative voice frame; a pitch frequency acquisition module configured to acquire a pitch frequency corresponding to each of the at least one representative speech frame; and a speech signal blocking condition determination module configured to: when it is determined that the sound quality of the speech signal is impaired based on the pitch frequency and power spectrum information corresponding to each representative speech frame, it is determined that the speech signal is blocked by a blocking object.
According to some exemplary embodiments, the voice signal processing apparatus further comprises: a compensation module configured to: when it is determined that the speech signal is blocked by a blocking object, the speech signal is compensated.
According to some exemplary embodiments, the compensation module is further configured to: determining compensation gains corresponding to respective frequency bands based on the pitch frequencies; and performing voice enhancement processing on the voice signal by using the compensation gain.
According to some example embodiments, the speech signal blocking condition determination module is further configured to: detect the sound quality of the speech signal with a speech signal blocking detection model, based on the pitch frequency and power spectrum information corresponding to each representative speech frame, so as to generate a speech signal blocking probability; and determine that the speech signal is blocked by a blocking object when the speech signal blocking probability is greater than a preset speech signal blocking threshold; the speech signal blocking detection model being obtained by training a neural network in the speech signal blocking condition determination module with the pitch frequency and power spectrum information of a plurality of training speech samples.
According to a third aspect of the present disclosure, there is provided a computing device comprising a processor and a memory configured to store computer-executable instructions configured to, when executed on the processor, cause the processor to perform the speech signal processing method provided according to the various aspects and exemplary embodiments described above.
According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium configured to store computer-executable instructions configured to, when executed on a processor, cause the processor to perform the speech signal processing method provided according to the above aspects and exemplary embodiments.
The voice signal processing method for the voice signal blocking situation achieves at least the following beneficial technical effects: first, it can determine whether a voice signal is blocked by a blocking object (for example, because the user is wearing a mask) during a voice call; second, it can classify the degree to which the sound quality is impaired by the blocking object and apply different speech enhancement parameters at the different levels, thereby improving the sound quality and solving the problem that, when someone speaks with the mouth covered, for example under a mask, the other party cannot clearly make out what is being said.
Drawings
Specific embodiments of the present disclosure will be described in detail below with reference to the drawings so that more details, features, and advantages of the present disclosure can be more fully appreciated and understood; in the drawings:
fig. 1A and 1B schematically illustrate a case where the sound quality of a voice signal is impaired in an application scenario where the voice signal is blocked by a blocking object;
fig. 2 schematically illustrates, in flow chart form, a speech signal processing method according to some exemplary embodiments of the present disclosure;
FIG. 3 further schematically illustrates details of corresponding steps in the speech signal processing method illustrated in FIG. 2 in flow chart form according to some exemplary embodiments of the present disclosure;
FIG. 4 further schematically illustrates details of corresponding steps in the method illustrated in FIG. 3 in flow chart form according to some exemplary embodiments of the present disclosure;
FIG. 5 further schematically illustrates details of corresponding steps in the speech signal processing method illustrated in FIG. 2 in flow chart form according to some exemplary embodiments of the present disclosure;
fig. 6 schematically illustrates, in flow chart form, a speech signal processing method according to further exemplary embodiments of the present disclosure;
FIG. 7 further schematically illustrates details of corresponding steps in the speech signal processing method illustrated in FIG. 6 in flowchart form, according to some exemplary embodiments of the present disclosure;
FIG. 8 further schematically illustrates details of corresponding steps in the method illustrated in FIG. 7 in flow chart form according to some exemplary embodiments of the present disclosure;
FIG. 9 further schematically illustrates details of corresponding steps in the method illustrated in FIG. 8 in flow chart form according to some exemplary embodiments of the present disclosure;
fig. 10 schematically illustrates, in flowchart form, a speech signal processing method according to further exemplary embodiments of the present disclosure;
Fig. 11 schematically shows, in the form of a spectrogram, the effect of performing a speech enhancement process on a speech signal in the case where the speech signal is blocked by a blocking object;
fig. 12 schematically illustrates a structure of a voice signal processing apparatus according to some exemplary embodiments of the present disclosure;
fig. 13 schematically illustrates a structure of a voice signal processing apparatus according to other exemplary embodiments of the present disclosure;
fig. 14 schematically illustrates the structure of a neural network included in the respective modules in the voice signal processing apparatus illustrated in fig. 12, 13;
fig. 15 schematically illustrates a structure of an exemplary computing device according to one embodiment of the disclosure.
It should be understood that the matters shown in the drawings are merely illustrative and thus are not necessarily drawn to scale. Furthermore, the same or similar features are denoted by the same or similar reference numerals throughout the drawings.
Detailed Description
The following description provides specific details of various embodiments of the disclosure so that those skilled in the art may fully understand and practice the various embodiments of the disclosure.
First, some terms involved in the embodiments of the present disclosure are explained to facilitate understanding by those skilled in the art:
Pitch frequency: refers to the vibration frequency of the vocal cords when the person makes a sound. In general, pitch frequency is related to the length, thickness, toughness, stiffness, pronunciation habits, etc. of an individual's vocal cords, reflecting the individual's characteristics to a large extent. In addition, the pitch frequency also varies with the sex and age of the person. In general, the pitch frequency of a male speaker is low, while the pitch frequency of female speakers and children is relatively high.
Long short-term memory network (LSTM network): a special RNN structure that improves on the traditional recurrent neural network (RNN) and can learn long-term dependencies. Each LSTM cell comprises an input gate, a forget gate, and an output gate, which control the input value, the memory value, and the output value, respectively. The LSTM network not only solves the traditional RNN's inability to handle long-range dependencies, but also mitigates problems common in neural networks such as exploding or vanishing gradients, and is very effective for processing sequence data.
Gated recurrent unit network (GRU network): a variant obtained by simplifying the LSTM network. Unlike the LSTM structure, each GRU unit has only an update gate and a reset gate. The update gate controls the extent to which the state information of the previous time step is carried into the current state; the larger the value of the update gate, the more state information of the previous time step is brought in. The reset gate controls how much information of the previous state is written to the current candidate set; the smaller the reset gate, the less the previous state is written. The GRU model is therefore simpler than the LSTM model, so its training efficiency is higher and its hardware requirements are lower.
As already mentioned above, in some application scenarios in which a voice call is made, the voice signal may be blocked by a blocking object, so that its sound quality is impaired. One possible application scenario is a voice call in which a user on either side is wearing a mask. When a mask is worn, the mouth is covered by the mask and the user's voice is obstructed by it while speaking, so the voice sounds muffled compared with when no mask is worn, and speech intelligibility is significantly reduced.
Referring to fig. 1A and 1B, there is schematically shown how the sound quality of a voice signal is impaired in a mask-wearing scenario. Fig. 1A schematically shows, in the form of a spectrogram, the sound quality of the same recorded speech signal without a mask and with a mask, respectively. Fig. 1B schematically shows, in the form of a spectrum plot, the speech spectrum of the same recorded speech signal without a mask and with a mask, respectively. As can be seen from the figures, when a mask is worn the sound is obstructed by the mask, so the speech signal suffers different degrees of loss in different frequency bands: the low-frequency part is only slightly impaired, but the impairment of the mid-high-frequency part is more pronounced, with the brightness in the mid-high-frequency range (2000 Hz and above) weakened (as shown in fig. 1A) and the amplitude of the spectral lines in that range reduced more strongly (as shown in fig. 1B). The mid-high-frequency speech signal has a large influence on speech intelligibility; if this part of the signal is severely damaged, the user receiving the speech signal may not understand its content, which seriously affects the effect and experience of the voice call.
Referring to fig. 2, a speech signal processing method according to some exemplary embodiments of the present disclosure, which may be applied to a case where a speech signal is blocked by a blocking object, is schematically shown in the form of a flowchart. The voice signal processing method can detect the voice quality of a voice signal based on the pitch frequency of the voice signal and the power spectrum information of each frequency point in the voice frequency band range so as to determine whether the voice quality of the voice signal is damaged, and when the voice quality of the voice signal is determined to be damaged, it is determined that the voice signal is blocked by a blocking object. Therefore, the voice signal processing method can be applied to, for example, the application scenario shown in fig. 1A and 1B, so as to detect whether or not the current voice call is in a state of wearing a mask. As shown in fig. 2, the speech signal processing method 100 may include steps 110, 120, 130, and 140:
at step 110, at least one representative speech frame of the speech signal is acquired;
in step 120, power spectrum information corresponding to each representative voice frame in the at least one representative voice frame is obtained;
in step 130, a pitch frequency corresponding to each representative speech frame in the at least one representative speech frame is obtained;
In step 140, when it is determined that the sound quality of the speech signal is impaired based on the pitch frequency and power spectrum information corresponding to each representative speech frame, it is determined that the speech signal is blocked by a blocking object.
In step 110, a predetermined number of active speech frames may be selected, according to a predetermined criterion, from a plurality of active speech frames of the speech signal (which may be detected by a voice activity detection algorithm, as a non-limiting example) as representative speech frames for use in detecting the sound quality of the speech signal. An active speech frame is a speech frame that contains a human voice signal and has a preset duration. Representative speech frames are active speech frames, selected from among the active speech frames of a speech signal, that reflect the characteristics of the speech signal and are useful for evaluating its sound quality. During a voice call, the energy of the human voice is concentrated mainly in the mid-low band of the audio frequency range; therefore, representative speech frames may generally be selected according to the energy value of each active speech frame in the mid-low band, as described in further detail below. However, it should be understood that any suitable active speech frame of a speech signal may be used as a representative speech frame, as desired.
Thus, the voice signal processing method 100 can determine whether a voice signal is blocked by a blocking object during a voice call, for example because a user is wearing a mask. By contrast, existing voice call processing does not consider the situation in which the voice signal is blocked by a blocking object, and therefore cannot recognize, for example, whether a user is wearing a mask.
Referring to fig. 3, details of step 110 in the speech signal processing method 100 shown in fig. 2 are schematically shown in flow chart form according to some exemplary embodiments of the present disclosure. As shown in fig. 3, in this exemplary embodiment, step 110 may further include steps 111, 112, 113, 114:
at step 111, a plurality of active speech frames of the speech signal are acquired;
in step 112, fourier transforming the plurality of active speech frames to obtain power spectrum information corresponding to each active speech frame in the plurality of active speech frames;
in step 113, based on the power spectrum information corresponding to each active speech frame, obtaining a middle-low frequency band energy value corresponding to each active speech frame;
at step 114, the at least one representative speech frame is determined from the plurality of active speech frames based on the mid-low band energy value corresponding to each active speech frame.
In step 111, the recorded speech signal may be analyzed by any suitable means, for example a voice activity detection algorithm, and segments that contain a human voice signal and have a predetermined duration are detected as active speech frames. Portions of the speech signal that contain no human voice, such as periods during which no one is speaking, may be discarded. All active speech frames therefore contain a human voice signal and can be used to evaluate the sound quality of the speech signal.
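As an illustration of the kind of detection step 111 describes, the following Python sketch keeps only frames whose short-term energy exceeds a threshold. It is a minimal energy-based stand-in for a real voice activity detector; the sample rate, frame length, and threshold are assumptions made for illustration, not values taken from this disclosure.

```python
import numpy as np

def split_active_frames(signal, sample_rate=16000, frame_ms=20, energy_thresh=1e-4):
    """Split a mono signal into fixed-length frames and keep the ones
    whose short-term energy suggests speech activity (toy VAD)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1)    # short-term energy per frame
    return frames[energy > energy_thresh]    # active frames only

# usage: active = split_active_frames(recorded_signal)
```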
In step 112, Fourier transforms may be performed on the obtained plurality of active speech frames to obtain, for each active speech frame, the corresponding power spectrum information at each frequency point within the audio band. It is understood that the audio band is known in the art and refers to the frequency range in which a person can hear sound, i.e. 20 Hz to 20000 Hz. Accordingly, the power spectrum information of each active speech frame obtained in step 112 is the power spectrum of the human voice signal contained in that frame at each frequency point of the audio band.
In step 113, the power spectrum information of each active speech frame may be integrated in the middle-low frequency band of the audio frequency band, so as to obtain the middle-low frequency band energy value corresponding to each active speech frame. It is to be understood that the mid-low frequency band range of the sound band range is also known in the art and generally refers to the frequency range of 20 Hz to 2000 Hz. In a general voice call application scenario, a main frequency component of a human voice signal is mostly in the middle-low frequency band range. It should be appreciated that in other exemplary embodiments, a narrower frequency range may be selected from the above-described mid-low frequency range to calculate the mid-low frequency band energy value corresponding to each active speech frame, e.g., a frequency range of 300 Hz to 1200 Hz may be selected to calculate the mid-low frequency band energy value corresponding to each active speech frame.
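The following sketch illustrates, under the same illustrative assumptions as above, how steps 112 and 113 might be realized: a windowed FFT yields the per-frequency-point power spectrum, which is then summed over a mid-low band (here the narrower 300 Hz to 1200 Hz range mentioned above).

```python
import numpy as np

def power_spectrum(frame, sample_rate=16000):
    """Power spectrum of one active frame via a windowed FFT (step 112)."""
    spec = np.fft.rfft(frame * np.hanning(len(frame)))
    power = np.abs(spec) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return freqs, power

def mid_low_band_energy(frame, sample_rate=16000, band=(300.0, 1200.0)):
    """Sum the power spectrum over the mid-low band (step 113)."""
    freqs, power = power_spectrum(frame, sample_rate)
    mask = (freqs >= band[0]) & (freqs <= band[1])
    return power[mask].sum()
```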
In step 114, in the case that the mid-low band energy value corresponding to each active speech frame is obtained, at least one representative speech frame may be determined from a plurality of active speech frames in different manners according to actual needs. In one non-limiting embodiment, the mid-low band energy value corresponding to each active speech frame may be compared with a preset mid-low band energy threshold, and when the mid-low band energy value is greater than the mid-low band energy threshold, the active speech frame corresponding to the mid-low band energy value may be determined to be a representative speech frame.
Further, referring to fig. 4, details of step 114 in the method shown in fig. 3 are further schematically shown in flow chart form according to some exemplary embodiments of the present disclosure. As shown in fig. 4, in this exemplary embodiment, step 114 may further comprise steps 1141, 1142:
in step 1141, the plurality of active speech frames are ordered based on the mid-low band energy values corresponding to the active speech frames;
in step 1142, the at least one representative speech frame is determined from the plurality of active speech frames based on the results of the ranking.
In step 1141, an appropriate ordering method may be selected according to actual needs. As a non-limiting example, the mid-low band energy values may be ordered from largest to smallest. However, any other suitable ordering is possible; this disclosure is not limited in this regard.
In step 1142, the representative speech frames may be determined from the plurality of active speech frames according to different criteria as required, provided that the selected representative speech frames contain enough of the voice signal to be usable for evaluating the sound quality of the speech signal. As a non-limiting example, a preset number of active speech frames may be selected as representative speech frames, starting from the active speech frame corresponding to the largest mid-low band energy value and proceeding in order of decreasing mid-low band energy value. In another non-limiting example, starting from the active speech frame corresponding to the largest mid-low band energy value and proceeding in the same decreasing order, a preset percentage of the active speech frames may be selected as representative speech frames. For example, the percentage may be 20%, in which case the first 20% of the sorted active speech frames are selected as representative speech frames.
Selecting representative speech frames by a preset number is relatively easy to implement, while selecting them by a preset percentage allows the number of representative frames to vary with the number of active frames and is therefore more flexible.
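A minimal sketch of step 114 under either criterion might look as follows; the 20% default mirrors the percentage example above, and the function name and parameters are illustrative.

```python
import numpy as np

def pick_representative(frames, energies, top_n=None, top_pct=0.2):
    """Sort active frames by mid-low band energy (descending) and keep
    either a fixed number or a fixed percentage as representative frames."""
    order = np.argsort(energies)[::-1]    # largest energy first
    count = top_n if top_n is not None else max(1, int(len(frames) * top_pct))
    return [frames[i] for i in order[:count]]
```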
With continued reference to fig. 2, in step 120 of the speech processing method 100, after the representative speech frames have been acquired, a Fourier transform may be performed on each representative speech frame to obtain its corresponding power spectrum information at each frequency point within the audio band. It should be understood, however, that when the representative speech frames are obtained by the method shown in fig. 3, the corresponding power spectrum information has already been calculated for each active speech frame, so when an active speech frame is determined to be a representative speech frame, its power spectrum information is also the power spectrum information of that representative speech frame.
In step 130 of the speech processing method 100, the pitch frequency corresponding to each representative speech frame may be obtained by performing pitch detection on the at least one representative speech frame. It should be appreciated that any suitable pitch detection method may be used in the speech processing method 100. The pitch frequency reflects differences in the vocal characteristics of different speakers. In general, the larger the pitch frequency of a voice, the higher the proportion of mid-high-frequency energy tends to be. For example, the pitch frequency of a female voice is typically higher, so the proportion of high-frequency energy in it is typically higher than in a male voice. Acquiring the pitch frequency corresponding to each representative speech frame of the voice signal allows the sound quality of the voice signal to be detected more accurately, and allows corresponding equalization parameters to be given according to the pitch frequency range for equalization adjustment.
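The disclosure does not prescribe a particular pitch detector; as one common choice, the following sketch estimates the pitch frequency from the strongest autocorrelation peak within the 50 Hz to 500 Hz pitch range used later in this description.

```python
import numpy as np

def pitch_autocorr(frame, sample_rate=16000, fmin=50.0, fmax=500.0):
    """Estimate pitch frequency by locating the strongest autocorrelation
    peak whose lag corresponds to the 50-500 Hz pitch range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / fmax)                  # shortest period considered
    lag_max = min(int(sample_rate / fmin), len(ac) - 1)
    best = lag_min + np.argmax(ac[lag_min:lag_max])    # frame must exceed lag_min
    return sample_rate / best
```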
In step 140 of the speech processing method 100, any suitable manner may be selected to determine whether the quality of the speech signal is impaired, as desired. As a non-limiting example, the corresponding pitch frequency and power spectrum information for each representative speech frame may be compared to the pitch frequency and power spectrum information for the corresponding non-blocking speech signal, and if the strength of the speech signal contained by each representative speech frame at the corresponding frequency point is reduced, it may be determined that the speech signal contained by each representative speech frame is acoustically impaired compared to the non-blocking speech signal.
Further, referring to fig. 5, details of step 140 in the speech signal processing method 100 shown in fig. 2 are further schematically shown in flow chart form according to some exemplary embodiments of the present disclosure. As shown in fig. 5, in this exemplary embodiment, step 140 may further include steps 141, 142:
in step 141, detecting the tone quality of the voice signal by using a voice signal blocking detection model based on the pitch frequency and power spectrum information corresponding to each representative voice frame, so as to generate a voice signal blocking probability;
in step 142, when the speech signal blocking probability is greater than a preset speech signal blocking threshold, it is determined that the speech signal is blocked by a blocking object.
In step 141, the speech signal blocking detection model may be obtained by training a corresponding neural network with the pitch frequency and power spectrum information of a plurality of training speech samples. As a non-limiting example, the pitch frequency and power spectrum information of the training speech samples may be used as inputs to a neural network capable of deep learning, with the expected output for training corresponding to, for example, "1" for samples in which the speech signal is blocked and "0" for samples in which it is not. It should be appreciated that, for a mask-wearing application scenario, the expected output may correspond to, for example, "1" for a sample recorded with a mask and "0" for a sample recorded without one. Furthermore, a neural network of any suitable architecture is possible, as long as it can be trained with the pitch frequency and power spectrum information of multiple training speech samples so as to output a probability value indicating whether the speech signal is blocked. An exemplary structure of such a neural network is described in more detail below.
In step 142, a speech signal blocking threshold may be preset, and when the speech signal blocking probability output by the speech signal blocking detection model is greater than the preset threshold, it is determined that the speech signal is blocked by a blocking object. As a non-limiting example, a high blocking threshold may be set to 0.7 and a low blocking threshold to 0.3: when the blocking probability output by the model is greater than 0.7, the speech signal is determined to be in a blocked state (e.g., a mask-wearing state); when the probability is less than 0.3, the speech signal is determined to be in an unblocked state (e.g., a state without a mask); and when the probability lies between the two thresholds (0.3 or more and 0.7 or less), no definite determination is made. In the above non-limiting example, the corresponding compensation measures may be taken only when it is determined that the speech signal is in a blocked state.
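A sketch of this dual-threshold decision follows; treating the in-between case by keeping the previous state is an assumption made for illustration, since the disclosure leaves that case open.

```python
def blocking_state(prob, prev_state, hi=0.7, lo=0.3):
    """Dual-threshold decision on the model's blocking probability.
    Between the thresholds the previous state is kept (an assumption;
    the disclosure does not specify the in-between behavior)."""
    if prob > hi:
        return "blocked"
    if prob < lo:
        return "unblocked"
    return prev_state
```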
Referring to fig. 6, a voice signal processing method according to other exemplary embodiments of the present disclosure, which may be applied to a case where a voice signal is blocked by a blocking object, is schematically shown in the form of a flowchart. The speech signal processing method 200 shown in fig. 6 is substantially identical to the speech signal processing method 100 shown in fig. 2, except that the speech signal processing method 200 further comprises the step 150: when it is determined that the speech signal is blocked by a blocking object, the speech signal is compensated.
The compensation of the voice signal in step 150 is a speech enhancement process applied to the voice signal to improve its sound quality. As a non-limiting example, the speech enhancement process may be equalizer processing, in which gain configuration parameters for each frequency band are applied to the corresponding band, so that different frequency bands of the speech signal are enhanced to different extents. However, it should be understood that any other way of compensating the voice signal is also possible; this disclosure is not limited in this regard.
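As a rough illustration of such equalizer processing, the following sketch multiplies the spectrum of a frame by a per-band gain and transforms back; the band edges and gains are assumed inputs that would come from the pitch-interval analysis described below.

```python
import numpy as np

def equalize(frame, band_gains, band_edges, sample_rate=16000):
    """Apply a per-band compensation gain in the frequency domain and
    return the enhanced time-domain frame (a toy equalizer)."""
    spec = np.fft.rfft(frame)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    for (f_lo, f_hi), g in zip(band_edges, band_gains):
        spec[(freqs >= f_lo) & (freqs < f_hi)] *= g
    return np.fft.irfft(spec, n=len(frame))

# usage: enhanced = equalize(frame, [1.0, 1.5, 2.0],
#                            [(0, 500), (500, 2000), (2000, 8000)])
```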
Referring to fig. 7, details of step 150 in the speech signal processing method 200 shown in fig. 6 are further schematically shown in flow chart form according to some exemplary embodiments of the present disclosure. As shown in fig. 7, in this exemplary embodiment, step 150 may further include steps 151, 152:
Determining a compensation gain corresponding to each frequency band based on the pitch frequency in step 151;
in step 152, a speech enhancement process is performed on the speech signal using the compensation gain.
As already mentioned above, the pitch frequency reflects differences in the vocal characteristics of different speakers. In general, the larger the pitch frequency of a voice, the higher the proportion of mid-high-frequency energy tends to be. For example, the pitch frequency of a female voice is typically higher, so the proportion of high-frequency energy in it is typically higher than in a male voice. The speech signal processing method 200 can therefore provide corresponding gain parameters for targeted compensation according to the range in which the pitch frequency falls.
Referring to fig. 8, further details of step 151 in the method shown in fig. 7 are schematically shown in flow chart form according to some exemplary embodiments of the present disclosure. As shown in fig. 8, in this exemplary embodiment, step 151 may further include steps 1511, 1512, 1513:
at step 1511, dividing the pitch frequency range into a plurality of pitch intervals, and determining an interval gain for each of the plurality of pitch intervals;
at step 1512, a pitch interval in which the pitch frequency is located is determined;
In step 1513, a section gain of a pitch section in which the pitch frequency is located is determined as the compensation gain.
In step 1511, the pitch frequency range may be divided into a plurality of pitch intervals of a certain width. As a non-limiting example, the pitch frequency range may be 50 Hz to 500 Hz, in which case it may be divided into 9 pitch intervals of 50 Hz each. It will be appreciated that in this example the pitch frequency range is divided into equally spaced pitch intervals; however, the range may also be divided differently, for example with some pitch intervals wider than others. After the pitch frequency range has been divided into pitch intervals, the interval gain for each pitch interval can be determined.
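A sketch of mapping a detected pitch frequency to its interval, using the equally spaced 9-interval example above, might look as follows.

```python
def pitch_interval(pitch_hz, f_lo=50.0, f_hi=500.0, n_intervals=9):
    """Map a pitch frequency to one of the equally spaced pitch intervals
    (9 intervals of 50 Hz each over 50-500 Hz in the example)."""
    width = (f_hi - f_lo) / n_intervals
    idx = int((pitch_hz - f_lo) // width)
    return min(max(idx, 0), n_intervals - 1)    # clamp out-of-range pitches
```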
Referring to fig. 9, details of step 1511 in the method shown in fig. 8 are further schematically shown in flow chart form according to some exemplary embodiments of the present disclosure. As shown in fig. 9, in this exemplary embodiment, step 1511 may further include steps 1511a, 1511b:
in step 1511a, for each pitch interval, calculating a respective band power spectrum average value of a plurality of frames of unobstructed speech signal samples and a respective band power spectrum average value of a plurality of frames of obstructed speech signal samples, respectively;
In step 1511b, a section gain for each pitch section is determined based on a ratio of the average of the band power spectra of the plurality of non-blocked speech samples to the average of the band power spectra of the plurality of blocked speech samples.
In step 1511a, multiple frames of unblocked speech signal samples and multiple frames of blocked speech signal samples may be classified into the respective pitch intervals according to their pitch frequencies. Taking the above 9 pitch intervals as a non-limiting example, if there are 1000 frames of speech signal samples (500 frames of unblocked samples and 500 frames of blocked samples) whose pitch frequency falls within the first pitch interval, covering 50 Hz to 100 Hz, then those 1000 frames are classified into the first pitch interval, and the interval gain of the first pitch interval is calculated from them. A per-frequency-point power spectrum impairment analysis can be performed on these 1000 frames: the average power spectrum of each frequency band over the 500 frames of unblocked samples, denoted P̄_unblocked,1(k), and the average power spectrum of each frequency band over the 500 frames of blocked samples, denoted P̄_blocked,1(k), can be calculated, where the subscript 1 denotes the first pitch interval and k is the band number.
Thus, in step 1511b, the interval gains of the first pitch interval for the respective frequency bands may be determined from the per-band ratio P̄_unblocked,1(k) / P̄_blocked,1(k). It should be appreciated that the interval gains of the remaining pitch intervals may be determined in a similar manner and are not described in detail here.
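Under the assumption that the per-frame band power spectra for one pitch interval have been collected into arrays, the interval gains of that interval could be computed as in the following sketch.

```python
import numpy as np

def interval_gains(unblocked_frames_power, blocked_frames_power):
    """Interval gain per frequency band k: the ratio of the average power
    spectrum of the unblocked sample frames to that of the blocked ones.
    Each argument is an (n_frames, n_bands) array for one pitch interval."""
    p_unblocked = unblocked_frames_power.mean(axis=0)    # average over frames
    p_blocked = blocked_frames_power.mean(axis=0)
    return p_unblocked / np.maximum(p_blocked, 1e-12)    # avoid divide-by-zero
```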
Referring to fig. 10, a voice signal processing method according to other exemplary embodiments of the present disclosure, which may be applied to a case where a voice signal is blocked by a blocking object, is schematically shown in the form of a flowchart.
As shown in fig. 10, the speech signal processing method 300 may begin at step 310. In step 310, the voice signal may be recorded using any suitable device. At step 320, voice activity detection may be performed on the recorded voice signal to obtain a plurality of active voice frames. In step 330, the obtained active speech frames may be fourier transformed to calculate power spectrum information for the corresponding frequency bands for each active speech frame. In step 340, the power spectral information for each corresponding frequency band of each active speech frame may be integrated over a mid-low frequency band to calculate mid-low frequency band energy values, and the active speech frames may be ordered based on the resulting mid-low frequency band energy values to pick at least one representative speech frame. At step 380, pitch frequency detection may be performed on each active speech frame to obtain a pitch frequency for each active speech frame. In step 350, the tone quality of the speech signal is detected by using the speech signal blocking detection model based on the pitch frequency and power spectrum information corresponding to each representative speech frame, so as to output a speech signal blocking probability. In step 360, it is determined whether the voice signal is blocked by the blocking object based on the voice signal blocking probability, and if it is determined that the voice signal is not blocked by the blocking object, the voice signal processing method 300 ends, and if it is determined that the voice signal is blocked by the blocking object, the voice signal processing method 300 proceeds to step 370. In step 370, the speech signal processing method 300 equalizes the speech signal using the equalization parameters obtained in step 390 for speech enhancement. It should be appreciated that the respective steps in the speech signal processing method 300 may be implemented using the respective methods described above with respect to the various exemplary embodiments. For example, at step 390, the speech signal processing method 300 may utilize the methods described above with respect to fig. 8 and 9 to obtain equalization parameters. Therefore, details of each step in the voice signal processing method 300 will not be described herein.
Referring to fig. 11, the effect of performing speech enhancement on a speech signal blocked by a blocking object is schematically shown in the form of spectrograms, in which (a) is the spectrogram of a segment of the speech signal when not blocked by the blocking object, (b) is the spectrogram of the same segment when blocked by the blocking object, and (c) is the spectrogram of the same blocked segment after speech enhancement. As shown in fig. 11, after the speech enhancement processing each frequency band of the blocked speech signal is strengthened, with the mid-high frequencies clearly improved, so that the spectrogram is relatively close to that of the unblocked speech signal; audibly, the sound quality is noticeably improved and the speech is clearer and easier to understand.
Referring to fig. 12, a structure of a voice signal processing apparatus according to some exemplary embodiments of the present disclosure is schematically shown. The voice signal processing apparatus 500 may be used to determine whether a voice signal is blocked by a blocking object (e.g., an application scenario with a mask) in a voice call application. As shown in fig. 12, the speech signal processing apparatus 500 may include a representative speech frame acquisition module 510, a power spectrum information acquisition module 520, a pitch frequency acquisition module 530, and a speech signal blocking condition determination module 540.
The representative speech frame acquisition module 510 is configured to acquire at least one representative speech frame of a speech signal. The power spectrum information obtaining module 520 is configured to obtain power spectrum information corresponding to each of the at least one representative speech frame. The pitch frequency acquisition module 530 is configured to acquire a pitch frequency corresponding to each of the at least one representative speech frame. The speech signal blocking condition determination module 540 is configured to: when it is determined that the sound quality of the speech signal is impaired based on the pitch frequency and power spectrum information corresponding to each representative speech frame, it is determined that the speech signal is blocked by a blocking object.
Referring to fig. 13, a structure of a voice signal processing apparatus according to other exemplary embodiments of the present disclosure is schematically shown. The speech signal processing apparatus 500 'shown in fig. 13 differs from the speech signal processing apparatus 500 shown in fig. 12 only in that the speech signal processing apparatus 500' further comprises a compensation module 550. The compensation module 550 is configured to: when it is determined that the speech signal is blocked by a blocking object, the speech signal is compensated. As a non-limiting example, the compensation module 550 may be further configured to: a compensation gain corresponding to each frequency band is determined based on the pitch frequency, and a speech enhancement process is performed on the speech signal using the compensation gain.
Furthermore, in some exemplary embodiments, the speech signal blocking condition determination module 540 may be further configured to: detect the sound quality of the speech signal with a speech signal blocking detection model, based on the pitch frequency and power spectrum information corresponding to each representative speech frame, so as to generate a speech signal blocking probability; and determine that the speech signal is blocked by a blocking object when the speech signal blocking probability is greater than a preset speech signal blocking threshold. The speech signal blocking detection model may be obtained by training a corresponding neural network with the pitch frequency and power spectrum information of a plurality of training speech samples.
It should be appreciated that the neural network trained to produce the above-described speech signal blocking detection model may have any suitable neural network structure known in the art, as long as, after training on a plurality of training speech samples, it can judge whether the speech signal is blocked by a blocking object based on the pitch frequency and power spectrum information of the input speech signal. For a better understanding of the present invention, the structure of one such neural network is schematically described below as a non-limiting example.
Referring to fig. 14, the structure of a neural network used to implement the speech signal blocking detection model in the speech signal blocking condition determination module 540 is schematically shown. As shown in fig. 14, the neural network 700 may include a feature splice layer 710, a first fully connected layer (first FC layer) 720, a first gated recurrent unit layer (first GRU layer) 730, a second gated recurrent unit layer (second GRU layer) 740, a second fully connected layer (second FC layer) 750, and an activation layer 760. The feature splice layer 710 receives the pitch frequency and power spectrum information of each representative speech frame and concatenates the input data into a feature vector. The first fully connected layer 720, the first gated recurrent unit layer 730, the second gated recurrent unit layer 740, and the second fully connected layer 750 perform inference based on the feature vector received from the feature splice layer 710 and generate an inference output value. The activation layer 760 normalizes the inference output value to a probability value between 0 and 1, which can be used as the speech signal blocking probability.
The configuration of the neural network 700 shown in fig. 14 is merely exemplary and not limiting. It should be appreciated that any suitable neural network architecture is possible, so long as it can be trained with the pitch frequency and power spectrum information of a plurality of training speech samples to obtain a speech signal blocking detection model. For example, the neural network 700 shown in fig. 14 includes two gated recurrent unit layers; however, the neural network 700 may include fewer (e.g., one) or more (e.g., three or more) gated recurrent unit layers. Furthermore, in other exemplary embodiments, a long short-term memory layer (LSTM layer) may be used instead of the gated recurrent unit layers. As described above, the activation layer 760 is configured to normalize the inference output value to a probability value between 0 and 1, and the activation function used may be, for example, a Sigmoid function or a ReLU function.
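As a non-authoritative illustration, the following PyTorch sketch mirrors the layer sequence of fig. 14 (FC, GRU, GRU, FC, Sigmoid); the feature and hidden dimensions are assumptions, since the disclosure does not specify layer widths.

```python
import torch
import torch.nn as nn

class BlockingDetector(nn.Module):
    """FC -> GRU -> GRU -> FC -> Sigmoid, mirroring fig. 14; the layer
    widths here are illustrative assumptions, not disclosed values."""
    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.fc1 = nn.Linear(n_features, hidden)              # first FC layer
        self.gru1 = nn.GRU(hidden, hidden, batch_first=True)  # first GRU layer
        self.gru2 = nn.GRU(hidden, hidden, batch_first=True)  # second GRU layer
        self.fc2 = nn.Linear(hidden, 1)                       # second FC layer

    def forward(self, x):
        # x: (batch, frames, n_features) - pitch frequency concatenated
        # with per-band power spectrum for each representative frame
        h = torch.relu(self.fc1(x))
        h, _ = self.gru1(h)
        h, _ = self.gru2(h)
        logit = self.fc2(h[:, -1])          # last time step
        return torch.sigmoid(logit)         # blocking probability in (0, 1)
```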
It should be appreciated that the inventive concept is to use the pitch frequency and power spectrum information of the speech signal to determine whether the speech signal is blocked by a blocking object, and to compensate accordingly when it is. The pitch frequency of the speech signal helps, on the one hand, to distinguish speech uttered by a person from noise; on the other hand, it also reflects the characteristics of the speaker, so the sound quality of the speech signal can be detected more accurately, and corresponding equalization parameters can be given more accurately according to the pitch frequency range for equalization adjustment. A change in the power spectrum information of the speech signal directly reflects a change in its sound quality, as shown in fig. 1B. Therefore, if the power spectrum at each frequency point of the speech signal is found to be reduced, it can be judged that the sound quality of the speech signal is impaired. Combining the pitch frequency and the power spectrum information of the speech signal thus makes it possible to detect the sound quality of the speech signal accurately and to compensate it in a targeted manner. As a non-limiting example, the pitch frequency range may be divided into a plurality of pitch intervals, whose sizes may be set according to practical requirements; different weight coefficients may then be assigned to the power spectrum variation values of the speech signal at each frequency point according to the pitch interval in which the pitch frequency of the speech signal falls, and a corresponding speech signal blocking probability may finally be generated from the pitch frequency and power spectrum information using the corresponding weight coefficients.
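As a rough numerical illustration of this weighting scheme, the sketch below scores the per-frequency-point drop in power relative to a hypothetical unblocked reference spectrum, using weight coefficients chosen by pitch interval. The interval boundaries, weight vectors, and reference spectrum are all assumptions; in the disclosed method such coefficients are effectively learned by the neural network described above.

```python
import numpy as np

# Illustrative pitch intervals (Hz) and per-frequency-point weight vectors;
# the actual interval sizes and weight coefficients would be set according
# to practical requirements or learned during training.
PITCH_INTERVALS = [(50.0, 150.0), (150.0, 250.0), (250.0, 500.0)]
INTERVAL_WEIGHTS = [
    np.linspace(1.5, 0.5, 129),  # low pitch: emphasize low-frequency points
    np.linspace(1.2, 0.8, 129),
    np.linspace(0.8, 1.2, 129),  # high pitch: emphasize high-frequency points
]

def blocking_score(pitch_hz: float, power_db: np.ndarray,
                   reference_db: np.ndarray) -> float:
    """Weighted average power-spectrum drop relative to an unblocked reference."""
    idx = next((i for i, (lo, hi) in enumerate(PITCH_INTERVALS)
                if lo <= pitch_hz < hi), 1)          # default: middle interval
    drop = np.maximum(reference_db - power_db, 0.0)  # per-point power loss (dB)
    return float(np.dot(INTERVAL_WEIGHTS[idx], drop) / drop.size)
```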
Accordingly, each of the plurality of training speech samples used for training should include the pitch frequency and power spectrum information of a speech signal, together with a label indicating whether that speech signal is blocked by a blocking object. After training, the neural network obtains the required weight coefficients and bias coefficients, thereby constructing the speech signal blocking detection model. Upon receiving the pitch frequency and power spectrum information of an actual speech signal as input, the speech signal blocking detection model generates a corresponding speech signal blocking probability, from which it can be judged whether the speech signal is blocked by a blocking object.
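A minimal training sketch follows, reusing the OcclusionDetector class sketched above and assuming a data loader that yields (pitch, spectrum, label) batches with labels of shape (batch, 1); the optimizer, learning rate, and epoch count are conventional assumptions rather than values from the disclosure.

```python
import torch
import torch.nn as nn

def train(model: nn.Module, loader, epochs: int = 10, lr: float = 1e-3) -> nn.Module:
    """Supervised training of the blocking detection model (sketch)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCELoss()  # 0/1 blocked labels vs. (0, 1) output probabilities
    for _ in range(epochs):
        for pitch, spectrum, blocked in loader:
            prob = model(pitch, spectrum)      # speech signal blocking probability
            loss = loss_fn(prob, blocked.float())
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```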
It should be understood that the respective modules described above in connection with fig. 12 and 13 correspond to the operations of the related steps in the various methods described above with respect to fig. 2 to 10, and are therefore not repeated herein. Furthermore, it should also be understood that each of the modules described above in connection with fig. 12 and 13 may be implemented in hardware, or in hardware combined with software and/or firmware. For example, the modules may be implemented as computer-executable code/instructions configured to be executed in one or more processors and stored in a computer-readable storage medium. Alternatively, these modules may be implemented as hardware logic/circuitry. For example, in some embodiments, one or more of these modules may be implemented together in a system on a chip (SoC). The SoC may include an integrated circuit chip (which includes one or more components of a processor (e.g., a Central Processing Unit (CPU), microcontroller, microprocessor, Digital Signal Processor (DSP), etc.), memory, one or more communication interfaces, and/or other circuitry), and may optionally execute received program code and/or include embedded firmware to perform functions.
Referring to fig. 15, a structure of a computing device 900 in accordance with some embodiments of the present disclosure is schematically shown in block diagram form. Computing device 900 may be used in various application scenarios described in this disclosure.
Computing device 900 may include at least one processor 902, memory 904, communication interface(s) 906, display device 908, other input/output (I/O) devices 910, and one or more mass storage 912 capable of communicating with each other, such as by a system bus 914 or other suitable means of connection.
The processor 902 may be a single processing unit or multiple processing units, all of which may include a single or multiple computing units or multiple cores. The processor 902 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. The processor 902 may be configured to, among other capabilities, obtain and execute computer-readable instructions stored in the memory 904, mass storage 912, or other computer-readable medium, such as program code for the operating system 916, program code for the application programs 918, program code for other programs 920, and the like.
Memory 904 and mass storage device 912 are examples of computer-readable storage media for storing instructions that can be executed by processor 902 to implement the various functions as previously described. For example, the memory 904 may generally include both volatile memory and nonvolatile memory (e.g., RAM, ROM, etc.). In addition, mass storage device 912 may generally include hard disk drives, solid state drives, removable media, including external and removable drives, memory cards, flash memory, floppy disks, optical disks (e.g., CDs, DVDs), storage arrays, network attached storage, storage area networks, and the like. The memory 904 and mass storage device 912 may both be referred to herein as computer-readable memory or computer-readable storage media, and may be non-transitory media capable of storing computer-readable, processor-executable program instructions as computer-executable code that may be executed by the processor 902 as a particular machine configured to implement the operations and functions described in the various exemplary embodiments of the present disclosure.
A number of program modules may be stored on the mass storage device 912. These program modules include an operating system 916, one or more application programs 918, other programs 920, and program data 922, and can be executed by the processor 902. Examples of such application programs or program modules may include, for example, computer program logic (e.g., computer-executable code or instructions) for implementing the following components/functions: the representative speech frame acquisition module 510, the power spectrum information acquisition module 520, the pitch frequency acquisition module 530, and the speech signal blocking condition determination module 540, and may also include the compensation module 550.
Although illustrated in fig. 15 as being stored in memory 904 of computing device 900, modules 916, 918, 920, and 922, or portions thereof, may be implemented using any form of computer readable media accessible by computing device 900. As used herein, "computer-readable medium" includes at least two types of computer-readable media, namely computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVDs), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information for access by a computing device. In contrast, communication media may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism. Computer storage media as defined by the present disclosure does not include communication media.
Computing device 900 may also include one or more communication interfaces 906 for exchanging data with other devices, such as over a network, direct connection, etc. Communication interface 906 may facilitate communication over a variety of networks and protocol types, including wired networks (e.g., LAN, cable, etc.) and wireless networks (e.g., WLAN, cellular, satellite, etc.), the Internet, and so forth. The communication interface 906 may also provide for communication with external storage devices (not shown) such as in a storage array, network attached storage, storage area network, or the like.
In some examples, computing device 900 may also include a display device 908, such as a monitor, for displaying information and images. Other I/O devices 910 may be devices that receive various inputs from a user and provide various outputs to the user, including but not limited to touch input devices, gesture input devices, cameras, keyboards, remote controls, mice, printers, audio input/output devices, and so on.
The terminology used herein is for the purpose of describing embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and "comprising," when used in this disclosure, specify the presence of stated features, but do not preclude the presence or addition of one or more other features. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items. It will be understood that, although the terms "first," "second," "third," etc. may be used herein to describe various features, these features should not be limited by these terms. These terms are only used to distinguish one feature from another feature.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and/or the present specification and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
In the description of the present specification, the terms "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present disclosure. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the different embodiments or examples described in this specification, and the features thereof, provided they do not contradict each other.
Various techniques are described herein in the general context of software and hardware elements or program modules. Generally, these modules include routines, programs, objects, elements, components, data structures, etc., that perform particular tasks or implement particular abstract data types. The terms "module," "functionality," and "component" as used herein generally represent software, firmware, hardware, or a combination thereof. The features of the techniques described herein are platform-independent, meaning that the techniques may be implemented on a variety of computing platforms having a variety of processors.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., a list of executable instructions for implementing the logic functions, may be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute them. Furthermore, it should also be understood that the various steps of the methods shown in the flowcharts or otherwise described herein are merely exemplary, and do not imply that the steps must be performed in the order shown or described. Rather, the various steps of the methods shown in the flowcharts or otherwise described herein may be performed in a different order than presented in the present disclosure, or may be performed simultaneously. Furthermore, the methods represented in the flowcharts or otherwise described herein may include other additional steps as desired.
It should be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, it may be implemented using any one or combination of the following techniques, as known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable gate arrays, field programmable gate arrays, and the like.
Those of ordinary skill in the art will appreciate that all or part of the steps of the methods of the above embodiments may be performed by hardware associated with program instructions, and the program may be stored in a computer readable storage medium, which when executed, includes performing one or a combination of the steps of the method embodiments.
Although the present disclosure has been described in detail in connection with some exemplary embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present disclosure is limited only by the appended claims.

Claims (18)

1. A method of speech signal processing, comprising:
acquiring at least one representative speech frame of the speech signal;
acquiring power spectrum information and a pitch frequency corresponding to each representative voice frame in the at least one representative voice frame;
determining that the voice signal is blocked by a blocking object when it is determined, based on the pitch frequency and the power spectrum information corresponding to each representative voice frame, that the sound quality of the voice signal is impaired; and
compensating the voice signal when it is determined that the voice signal is blocked by a blocking object.
2. The speech signal processing method of claim 1 wherein the acquiring at least one representative speech frame of a speech signal comprises:
acquiring a plurality of active voice frames of the voice signal;
performing a Fourier transform on the plurality of active voice frames to obtain power spectrum information corresponding to each active voice frame in the plurality of active voice frames;
based on the power spectrum information corresponding to each active voice frame, obtaining a middle-low frequency band energy value corresponding to each active voice frame;
and determining the at least one representative voice frame from the plurality of active voice frames based on the middle-low frequency band energy value corresponding to each active voice frame.
3. The speech signal processing method of claim 2 wherein the acquiring a plurality of active speech frames of the speech signal comprises:
performing voice activity detection on the voice signal to obtain a plurality of active voice frames of the voice signal, wherein each active voice frame contains human voice and has a preset duration.
4. The method for processing a speech signal according to claim 2, wherein said determining the at least one representative speech frame from the plurality of active speech frames based on the mid-low band energy values corresponding to the respective active speech frames comprises:
when the middle-low frequency band energy value is greater than a preset middle-low frequency band energy threshold value, determining the active voice frame corresponding to the middle-low frequency band energy value as a representative voice frame.
5. The method for processing a speech signal according to claim 2, wherein said determining the at least one representative speech frame from the plurality of active speech frames based on the mid-low band energy values corresponding to the respective active speech frames comprises:
ranking the plurality of active speech frames based on the middle-low frequency band energy value corresponding to each active speech frame;
based on the results of the ranking, the at least one representative speech frame is determined from the plurality of active speech frames.
6. The speech signal processing method of claim 5 wherein the determining the at least one representative speech frame from the plurality of active speech frames based on the results of the ranking comprises:
starting from the active voice frame corresponding to the maximum middle-low frequency band energy value, selecting a preset number of representative voice frames from the plurality of active voice frames in order of successively decreasing middle-low frequency band energy value.
7. The speech signal processing method of claim 5 wherein the determining the at least one representative speech frame from the plurality of active speech frames based on the results of the ranking comprises:
starting from the active voice frame corresponding to the maximum middle-low frequency band energy value, selecting a preset percentage of the plurality of active voice frames as representative voice frames in order of successively decreasing middle-low frequency band energy value.
8. The method of speech signal processing according to claim 2, wherein said obtaining a pitch frequency corresponding to the at least one representative speech frame comprises:
performing pitch detection on the plurality of active voice frames to obtain the pitch frequency corresponding to each active voice frame.
9. The speech signal processing method of claim 1 wherein the compensating the speech signal when it is determined that the speech signal is blocked by a blocking object comprises:
determining compensation gains corresponding to respective frequency bands based on the pitch frequencies; and
performing voice enhancement processing on the voice signal using the compensation gain.
10. The method of speech signal processing according to claim 9, wherein said determining compensation gains corresponding to respective frequency bands based on the pitch frequency comprises:
dividing a pitch frequency range into a plurality of pitch intervals, and determining an interval gain for each of the plurality of pitch intervals;
determining a pitch interval in which the pitch frequency is located; and
determining the interval gain of the pitch interval in which the pitch frequency is located as the compensation gain.
11. The method of speech signal processing according to claim 10, wherein the dividing the pitch frequency range into a plurality of pitch intervals and determining the interval gain for each of the plurality of pitch intervals comprises:
calculating, for each pitch interval, the average value of the power spectrum of each frequency band over a plurality of frames of unblocked voice signal samples and the average value of the power spectrum of each frequency band over a plurality of frames of blocked voice signal samples; and
determining the interval gain of each pitch interval based on the ratio of the average value of the power spectrum of each frequency band of the unblocked voice signal samples to the average value of the power spectrum of each frequency band of the blocked voice signal samples.
12. The speech signal processing method of claim 10 wherein the plurality of pitch intervals are obtained by equally dividing the pitch frequency range.
13. The speech signal processing method of claim 1 wherein the determining that the voice signal is blocked by a blocking object when it is determined that the sound quality of the voice signal is impaired based on the pitch frequency and power spectrum information corresponding to each representative voice frame comprises:
detecting the sound quality of the voice signal using a voice signal blocking detection model, based on the pitch frequency and power spectrum information corresponding to each representative voice frame, so as to generate a voice signal blocking probability; and
determining that the voice signal is blocked by a blocking object when the voice signal blocking probability is greater than a preset voice signal blocking threshold value;
wherein the voice signal blocking detection model is obtained by training a neural network using pitch frequency and power spectrum information of a plurality of training voice samples.
14. A speech signal processing apparatus comprising:
a representative speech frame acquisition module configured to acquire at least one representative speech frame of a speech signal;
a power spectrum information acquisition module configured to acquire power spectrum information corresponding to each representative speech frame in the at least one representative speech frame;
a pitch frequency acquisition module configured to acquire a pitch frequency corresponding to each of the at least one representative speech frame;
a speech signal blocking condition determination module configured to: determining that the voice signal is blocked by a blocking object when the voice quality of the voice signal is determined to be damaged based on the pitch frequency and the power spectrum information corresponding to each representative voice frame; and
a compensation module configured to: when it is determined that the speech signal is blocked by a blocking object, the speech signal is compensated.
15. The speech signal processing device of claim 14 wherein the compensation module is further configured to:
determining compensation gains corresponding to respective frequency bands based on the pitch frequencies; and
performing voice enhancement processing on the voice signal using the compensation gain.
16. The speech signal processing device of claim 14 wherein the speech signal blocking condition determination module is further configured to:
detect the sound quality of the voice signal using a voice signal blocking detection model, based on the pitch frequency and power spectrum information corresponding to each representative voice frame, so as to generate a voice signal blocking probability; and
determine that the voice signal is blocked by a blocking object when the voice signal blocking probability is greater than a preset voice signal blocking threshold value;
wherein the voice signal blocking detection model is obtained by training a neural network in the voice signal blocking condition determination module using pitch frequency and power spectrum information of a plurality of training voice samples.
17. A computing device comprising a processor and a memory configured to store computer-executable instructions configured to, when executed on the processor, cause the processor to perform the speech signal processing method of any one of claims 1 to 13.
18. A computer readable storage medium configured to store computer executable instructions configured to, when executed on a processor, cause the processor to perform the speech signal processing method of any of claims 1 to 13.