CN108269574B - Method and device for processing voice signal to represent vocal cord state of user, storage medium and electronic equipment

Method and device for processing voice signal to represent vocal cord state of user, storage medium and electronic equipment

Info

Publication number
CN108269574B
Authority
CN
China
Prior art keywords
user
tested
voice data
voice
input
Prior art date
Legal status
Active
Application number
CN201711482746.0A
Other languages
Chinese (zh)
Other versions
CN108269574A (en)
Inventor
孔常青
高建清
鹿晓亮
Current Assignee
Iflytek Medical Technology Co ltd
Original Assignee
Anhui Iflytek Medical Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Anhui Iflytek Medical Information Technology Co ltd
Priority to CN201711482746.0A
Publication of CN108269574A
Application granted
Publication of CN108269574B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04: Training, enrolment or model building
    • G10L17/06: Decision making techniques; Pattern matching strategies
    • G10L17/14: Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L17/22: Interactive procedures; Man-machine interfaces
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters

Abstract

The disclosure provides a voice signal processing method and device, a storage medium, and an electronic device. The method comprises the following steps: collecting voice data of a user to be tested, wherein the voice data is voiced speech data input by the user to be tested according to a preset condition; extracting acoustic features of the voice data of the user to be tested, wherein the acoustic features are used for representing the vocal cord state of the user to be tested; and taking the acoustic features as input, and determining the pronunciation features of the user to be tested after the acoustic features are processed by a pre-established speech classification model. With this scheme, the pronunciation features of the user to be tested can be determined through voice signal processing technology, and the implementation process is simple and convenient.

Description

Method and device for processing voice signal to represent vocal cord state of user, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of voice processing, and in particular, to a method and an apparatus for processing a voice signal, a storage medium, and an electronic device.
Background
Speech is an analog signal carrying specific information and has become an important means of acquiring and transmitting information in people's social life. Generally, a speech signal contains extremely rich information, such as text content or semantics, voiceprint features, languages or dialects, emotions, and the like, and speech signal processing aims to extract effective speech information from a complex speech environment.
In practical applications, personalized information of a user can be extracted through voice signal processing for identity recognition, for example to recognize different speakers in a conversation; alternatively, differences between users can be normalized through voice signal processing so that common information is extracted and speakers are classified, for example by gender or language.
Disclosure of Invention
The present disclosure provides a method and an apparatus for processing a speech signal, a storage medium, and an electronic device, which can determine the pronunciation characteristics of a user to be tested by using a speech signal processing technique.
In order to achieve the above object, the present disclosure provides a speech signal processing method, the method comprising:
collecting voice data of a user to be tested, wherein the voice data is voiced voice data input by the user to be tested according to a preset condition;
extracting acoustic features of the voice data of the user to be tested, wherein the acoustic features are used for representing vocal cord states of the user to be tested;
and taking the acoustic features as input, and determining the pronunciation features of the user to be tested after the acoustic features are processed by a pre-established speech classification model.
Optionally, if the preset condition is that the duration is not less than a preset duration, the collecting of the voice data of the user to be tested includes:
collecting voiced speech data input by the user to be tested in a single time; judging whether the duration of the voiced speech data input for a single time is less than the preset duration or not; if the duration of the single-input voiced speech data is not less than the preset duration, determining the single-input voiced speech data as the speech data of the user to be tested;
alternatively,
collecting voiced speech data which are discontinuously input by the user to be detected for many times; judging whether the total duration of the voiced speech data which are input discontinuously for multiple times is less than the preset duration or not; and if the total duration of the voiced speech data which are input for multiple times intermittently is not less than the preset duration, determining the voiced speech data which are input for multiple times intermittently as the speech data of the user to be tested.
Optionally, if the preset condition is that the number of interruptions is not less than a preset number of interruptions, the collecting of the voice data of the user to be tested includes:
collecting voiced speech data which are discontinuously input by the user to be detected for many times; judging whether the interruption times of the voiced speech data which are input intermittently for multiple times are smaller than the preset interruption times or not; and if the interruption times of the voiced speech data which are input discontinuously for multiple times are not less than the preset interruption times, determining the voiced speech data which are input discontinuously for multiple times as the speech data of the user to be tested.
Optionally, the extracting the acoustic features of the voice data of the user to be tested includes:
the voice data of the user to be tested is divided into at least one voice unit, and at least one of the following characteristics of each voice unit is extracted to be used as the acoustic characteristics of the voice data of the user to be tested: energy characteristics, fundamental frequency characteristics, short-time zero-crossing rate characteristics, pause characteristics, frequency perturbation characteristics, amplitude perturbation characteristics, harmonic noise ratio, cyclic period density entropy, detrended fluctuation analysis characteristics, nonlinear fundamental frequency change characteristics, voiceprint characteristics,
wherein,
the frequency perturbation characteristic is used to represent the variation of the acoustic pitch frequency between adjacent pitch periods,
the amplitude perturbation signature is used to represent the variation in the amplitude of the sound wave between adjacent pitch periods,
the cyclic period density entropy is used to represent the uncertainty of the periodicity of the speech signal,
the detrended fluctuation analysis feature is used to represent the degree of self-similarity of random noise in the speech,
the nonlinear fundamental frequency change characteristic is used for representing the stationarity of the voice signal corresponding to the voice unit.
Optionally, if N pieces of voice data of the user to be tested are collected, where N is greater than or equal to 2, the extracting of the acoustic features of the voice data of the user to be tested includes:
respectively extracting the acoustic features of each piece of voice data, and calculating the feature variance of the acoustic features over the N × M voice units to serve as acoustic features of the N pieces of voice data of the user to be tested, wherein M represents the number of voice units segmented from each piece of voice data.
Optionally, the speech classification model is constructed in a manner that:
collecting sample voice data of sample users, wherein the sample voice data is voiced voice data input by the sample users according to preset conditions, and the sample users comprise normal pronunciation characteristic users and abnormal pronunciation characteristic users;
extracting acoustic features of the sample voice data;
determining a topology of the speech classification model;
and training the voice classification model by using the topological structure and the acoustic characteristics of the sample voice data until the pronunciation characteristics output by the voice classification model are consistent with the pronunciation characteristics of the sample user.
The present disclosure provides a voice signal processing apparatus, the apparatus comprising:
the voice data acquisition module is used for acquiring voice data of a user to be detected, wherein the voice data is voiced voice data input by the user to be detected according to a preset condition;
the acoustic feature extraction module is used for extracting acoustic features of the voice data of the user to be detected, and the acoustic features are used for representing vocal cord states of the user to be detected;
and the pronunciation characteristic determining module is used for determining the pronunciation characteristics of the user to be detected after the acoustic characteristics are used as input and processed by a pre-established speech classification model.
Optionally, if the preset condition is that the duration is not less than a preset duration, then
The voice data acquisition module is used for acquiring voiced voice data input by the user to be detected in a single time; judging whether the duration of the voiced speech data input for a single time is less than the preset duration or not; if the duration of the single-input voiced speech data is not less than the preset duration, determining the single-input voiced speech data as the speech data of the user to be tested;
alternatively,
the voice data acquisition module is used for acquiring voiced voice data which are intermittently input by the user to be detected for many times; judging whether the total duration of the voiced speech data which are input discontinuously for multiple times is less than the preset duration or not; and if the total duration of the voiced speech data which are input for multiple times intermittently is not less than the preset duration, determining the voiced speech data which are input for multiple times intermittently as the speech data of the user to be tested.
Optionally, if the preset condition is that the number of interruptions is not less than a preset number of interruptions, the voice data acquisition module is configured to acquire voiced voice data that is intermittently input by the user to be tested for multiple times; judging whether the interruption times of the voiced speech data which are input intermittently for multiple times are smaller than the preset interruption times or not; and if the interruption times of the voiced speech data which are input discontinuously for multiple times are not less than the preset interruption times, determining the voiced speech data which are input discontinuously for multiple times as the speech data of the user to be tested.
Optionally, the acoustic feature extraction module is configured to divide the voice data of the user to be tested into at least one voice unit, and extract at least one of the following features of each voice unit as the acoustic feature of the voice data of the user to be tested: energy characteristics, fundamental frequency characteristics, short-time zero-crossing rate characteristics, pause characteristics, frequency perturbation characteristics, amplitude perturbation characteristics, harmonic noise ratio, cyclic period density entropy, detrended fluctuation analysis characteristics, nonlinear fundamental frequency change characteristics, voiceprint characteristics,
wherein,
the frequency perturbation characteristic is used to represent the variation of the acoustic pitch frequency between adjacent pitch periods,
the amplitude perturbation signature is used to represent the variation in the amplitude of the sound wave between adjacent pitch periods,
the cyclic period density entropy is used to represent the uncertainty of the periodicity of the speech signal,
the detrended fluctuation analysis feature is used to represent the degree of self-similarity of random noise in the speech,
the nonlinear fundamental frequency change characteristic is used for representing the stationarity of the voice signal corresponding to the voice unit.
Optionally, if N pieces of voice data of the user to be tested are collected, where N is greater than or equal to 2, the acoustic feature extraction module is configured to extract the acoustic features of each piece of voice data respectively, and calculate the feature variance of the acoustic features over the N × M voice units as acoustic features of the N pieces of voice data of the user to be tested, where M represents the number of voice units segmented from each piece of voice data.
Optionally, the apparatus further comprises:
the system comprises a sample voice data acquisition module, a voice analysis module and a voice analysis module, wherein the sample voice data acquisition module is used for acquiring sample voice data of a sample user, the sample voice data is voiced voice data input by the sample user according to a preset condition, and the sample user comprises a normal pronunciation characteristic user and an abnormal pronunciation characteristic user;
the sample acoustic feature extraction module is used for extracting acoustic features of the sample voice data;
the topological structure determining module is used for determining the topological structure of the voice classification model;
and the voice classification model training module is used for training the voice classification model by utilizing the topological structure and the acoustic characteristics of the sample voice data until the pronunciation characteristics output by the voice classification model are consistent with the pronunciation characteristics of the sample user.
The present disclosure provides an electronic device, comprising:
the above-mentioned storage device; and
a processor to execute instructions in the storage device.
According to the scheme, voiced speech data input by a user to be tested according to a preset condition can be collected and used as the voice data of the user to be tested; acoustic features representing the vocal cord state of the user to be tested are then extracted from the voiced speech data, the acoustic features are used as model input, and the pronunciation features of the user to be tested can be determined after model processing. With this scheme, the implementation process is simple and convenient, the processing saves time and labor, and no professional skills are required of the personnel involved.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
FIG. 1 is a schematic flow chart of a speech signal processing method according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart of the construction of a speech classification model according to the present disclosure;
FIG. 3 is a schematic diagram of a speech signal processing apparatus according to the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device for speech signal processing according to the present disclosure.
Detailed Description
The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.
Referring to fig. 1, a flow diagram of a speech signal processing method of the present disclosure is shown. The method may include the following steps:
s101, voice data of a user to be tested are collected, wherein the voice data are voiced voice data input by the user to be tested according to preset conditions.
When the scheme of the disclosure is used to process a voice signal, the voice data of the user to be tested can be collected first. As an example, the voice data of the user to be tested can be collected through the microphone of an intelligent terminal; for example, the intelligent terminal can be a mobile phone, a personal computer, a tablet computer, a smart speaker, or another everyday electronic device; alternatively, the intelligent terminal may also be a dedicated device, which is not specifically limited by the present disclosure.
In the scheme of the disclosure, the pronunciation characteristics of the user to be tested can be predicted through a voice signal processing technology, for example, the pronunciation characteristics can be embodied as whether the voice is crisp and loud, whether the pronunciation is stable, and the like.
For scenarios that require predicting a user's pronunciation characteristics, such as broadcasting, hosting, or poetry recitation, the user's pronunciation generally needs to be crisp and loud, with little breathiness and a certain stationarity; prediction can be carried out based on the disclosed scheme, and personnel can then be screened according to the prediction result. As another example, pronunciation characteristics of family members can be predicted in daily life: if the prediction result shows that a user's pronunciation characteristics have changed, the user's vocal cords may be damaged, and targeted recovery training can be carried out; further, if middle-aged or elderly people show changed pronunciation characteristics without excessive voice use, Parkinson's disease prevention training and the like may be appropriately performed. The application scenario is not specifically limited by the disclosed scheme.
As an example, the disclosed solution may classify users into two types: one is the normal pronunciation feature user, whose voice is usually crisp and loud, has little breathiness, and has a certain stationarity; the other is the abnormal pronunciation feature user, whose voice is generally hoarse and deep, with much breathiness and unstable pronunciation.
In phonetics, sounds produced with vocal cord vibration are called voiced sounds; vowels in most languages are voiced, and nasals, laterals, and semivowels are also voiced. In order to obtain the pronunciation features of the user to be tested, the scheme of the disclosure can collect voiced speech data input by the user to be tested according to a preset condition and perform pronunciation feature analysis.
The following explains the process of collecting voice data of a user to be tested according to the scheme of the present disclosure, taking the vowel /a/ as an example.
1. The preset condition is that the duration of a single input is not less than a preset duration
As an example, voiced speech data input by a user to be tested in a single time can be collected; judging whether the duration of the voiced speech data input in a single time is less than the preset duration or not; and if the duration of the single-time input voiced speech data is not less than the preset duration, determining the single-time input voiced speech data as the speech data of the user to be tested.
For example, the preset duration may be 10 s, that is, when the duration of a single vowel /a/ uttered by the user to be tested is not less than 10 s, the voice entry may be determined as the voice data of the user to be tested in the present disclosure.
2. The preset condition is that the total duration of multiple inputs is not less than a preset duration
As an example, voiced speech data intermittently input by a user to be tested for multiple times can be collected; judging whether the total duration of the voiced speech data which are input discontinuously for many times is less than the preset duration or not; and if the total duration of the voiced speech data which are input for a plurality of times discontinuously is not less than the preset duration, determining the voiced speech data which are input for a plurality of times discontinuously as the speech data of the user to be tested.
For example, the preset duration may be 10 s, that is, the user to be tested may utter the vowel /a/ multiple times intermittently, and as long as the accumulated total duration is not less than 10 s, the multiple intermittent voice entries may be determined as the voice data of the user to be tested in the present disclosure.
3. The preset condition is that the number of interruptions is not less than a preset number of interruptions
As an example, voiced speech data intermittently input by a user to be tested for multiple times can be collected; judging whether the interruption times of voiced speech data which is input discontinuously for multiple times are smaller than the preset interruption times or not; and if the interruption times of the voiced speech data which are input discontinuously for multiple times are not less than the preset interruption times, determining the voiced speech data which are input discontinuously for multiple times as the speech data of the user to be tested.
For example, the preset number of interruptions may be 15, that is, the user to be tested may utter the vowel /a/ intermittently, and as long as the accumulated number of interruptions is not less than 15, the intermittently input voice entries may be determined as the voice data of the user to be tested in the present disclosure.
It should be noted that the values of the preset duration and the preset number of interruptions are not limited by the scheme of the present disclosure and may be determined according to actual application requirements.
As an example, one piece of voiced speech data meeting the preset condition may be collected in the above manner as the voice data of the user to be tested; alternatively, N pieces of voiced speech data meeting the preset condition may be collected as the voice data of the user to be tested, where N is greater than or equal to 2, for example N = 20. The voiced speech data that is intermittently input multiple times in modes 2 and 3 and meets the preset condition all belongs to the voice data of the user to be tested.
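As an illustrative sketch of the collection logic under the three preset conditions above (the function name, argument layout, and threshold values are assumptions, not part of the patent):

```python
def accept_voice_data(segments, preset_duration=None, preset_interruptions=None):
    """Decide whether collected voiced entries satisfy the preset condition.

    segments: list of durations (seconds) of voiced speech entries; one element
              for a single input, several for intermittent input.
    Returns True if the entries may be used as the test user's voice data.
    """
    if preset_duration is not None and len(segments) == 1:
        # Condition 1: a single input must last at least the preset duration.
        return segments[0] >= preset_duration
    if preset_duration is not None:
        # Condition 2: the total duration of intermittent inputs must reach it.
        return sum(segments) >= preset_duration
    if preset_interruptions is not None:
        # Condition 3: the number of interruptions (gaps between entries)
        # must not be less than the preset number.
        return (len(segments) - 1) >= preset_interruptions
    return False

# e.g. a single 12 s vowel /a/ with a 10 s threshold is accepted
assert accept_voice_data([12.0], preset_duration=10.0)
```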
S102, extracting acoustic features of the voice data of the user to be tested, wherein the acoustic features are used for representing vocal cord states of the user to be tested.
Generally, users with different pronunciation characteristics have different acoustic characteristics, and the disclosed scheme can extract acoustic features from the voice data of the user to be tested and perform pronunciation feature analysis according to the acoustic features. Specifically, the acoustic features may comprise at least one of the following features:
1. energy characteristics
Users with abnormal pronunciation characteristics generally cannot produce loud sounds, i.e. the speech amplitude is small; accordingly, this characteristic can be captured through energy features.
As an example, the voice data of the user to be tested may be segmented into at least one voice unit, and the energy feature of each voice unit may be extracted. For example, the energy features may be embodied as energy means and/or energy variances.
For example, the voice data of the user to be tested may be subjected to framing processing to obtain a plurality of voice data frames; then, based on the short-time average energy corresponding to each voice data frame, the energy mean and/or energy variance of each voice unit is calculated.
As an example, the framing processing may be performed according to 25 ms/frame, and if the voice data of the user to be measured is 10s, 400 frames may be sequentially segmented according to 0ms to 25ms, 25ms to 50ms, and so on; or, in order to increase the number of voice data frames, a frame shifting scheme may be adopted to perform framing processing, for example, frame shifting is 10ms, and approximately 1000 frames may be split according to 0ms to 25ms, 10ms to 35ms, and 20ms to 45ms, and so on. The implementation manner of the framing processing in the present disclosure may not be specifically limited.
In the present disclosure, the voice unit may be a voice data frame; or, the voice unit may be voice data of the user to be tested, that is, energy characteristics of the whole voice data are calculated based on the short-time average energy of each voice data frame; or, the speech unit may be another self-defined interval, for example, the whole speech data may be divided into 20 intervals, if the speech data of the user to be tested is 10s, and in combination with the above example of the frame shifting scheme, each interval may include 50 speech data frames for 500ms, and the energy feature of the interval may be calculated based on the short-time average energy of the speech data frames included in each interval. The granularity of the speech units may not be specifically limited by the disclosed aspects.
It is to be understood that the implementation of the framing process, the granularity of the speech units, and the like, which are involved in the following acoustic feature extraction process, can be referred to as described herein, and are not described in detail below.
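A minimal NumPy sketch of the 25 ms framing with 10 ms frame shift and the resulting short-time energy statistics, treating the whole recording as one speech unit (the sampling rate and function names are assumptions):

```python
import numpy as np

def frame_signal(x, sr=8000, frame_ms=25, shift_ms=10):
    """Split a waveform into overlapping frames (25 ms window, 10 ms shift)."""
    frame_len = int(sr * frame_ms / 1000)
    shift = int(sr * shift_ms / 1000)
    n_frames = 1 + max(0, (len(x) - frame_len) // shift)
    return np.stack([x[i * shift: i * shift + frame_len] for i in range(n_frames)])

def energy_features(x, sr=8000):
    """Short-time average energy per frame, summarised as mean and variance."""
    frames = frame_signal(x, sr)
    short_time_energy = np.mean(frames ** 2, axis=1)
    return short_time_energy.mean(), short_time_energy.var()

# 10 s of audio at 8 kHz yields roughly 1000 overlapping frames
x = np.random.randn(10 * 8000)
print(energy_features(x))
```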
2. Fundamental frequency characteristics
When a user produces voiced sound, airflow passing through the glottis causes the vocal cords to vibrate in a relaxation-oscillation manner, generating a quasi-periodic excitation pulse train. The frequency of this vocal cord vibration is called the pitch frequency, or fundamental frequency for short, and the corresponding period is called the pitch period.
For users with abnormal pronunciation characteristics, the voice is generally deep and monotonous; accordingly, this characteristic can be captured through fundamental frequency features.
As an example, the voice data of the user to be tested may be segmented into at least one voice unit, and the fundamental frequency feature of each voice unit may be extracted. For example, the fundamental frequency features may be embodied as a fundamental frequency mean and/or a fundamental frequency variance.
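As an illustrative sketch (the pYIN estimator from librosa and the search range are assumptions; the patent does not prescribe a specific F0 estimator), the fundamental frequency mean and variance of a speech unit could be obtained as follows:

```python
import numpy as np
import librosa

def f0_features(x, sr=8000):
    """Frame-level fundamental frequency (F0), summarised over voiced frames."""
    f0, voiced_flag, _ = librosa.pyin(
        x, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    voiced_f0 = f0[voiced_flag & ~np.isnan(f0)]   # keep voiced frames only
    return voiced_f0.mean(), voiced_f0.var()
```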
3. Short time zero crossing rate characterization
The short-time zero-crossing rate may be used to represent the number of times the speech signal waveform crosses the horizontal axis (zero level); for normal pronunciation feature users, voiced segments typically have a lower zero-crossing rate.
As an example, the voice data of the user to be tested may be segmented into at least one voice unit, and the short-time zero-crossing rate feature of each voice unit may be extracted. For example, the short-term zero-crossing rate feature may be embodied as a number of zero-crossings and/or a zero-crossing ratio, where the zero-crossing ratio may be a ratio between the number of zero-crossings of a speech unit and a total number of sampling points of the speech unit.
Taking a common 8k sampling frequency as an example, it means that there are 8 thousand sampling points per second, and the total number of sampling points included in a speech unit can be calculated by combining the duration of each speech unit. The sampling frequency may not be particularly limited in the present disclosure.
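A small NumPy sketch of the zero-crossing count and zero-crossing ratio of one speech unit (the function name is illustrative):

```python
import numpy as np

def zero_crossing_features(unit):
    """Number of zero crossings and zero-crossing ratio of one speech unit."""
    signs = np.sign(unit)
    signs[signs == 0] = 1                      # treat exact zeros as positive
    crossings = int(np.sum(signs[1:] != signs[:-1]))
    ratio = crossings / len(unit)              # crossings per sampling point
    return crossings, ratio
```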
4. Pause feature
Users with abnormal pronunciation characteristics may be unable to pronounce continuously, which may cause pauses at positions other than those that occur when voiced speech data is intermittently collected multiple times; the disclosed scheme may therefore extract pause features from the voice data of the user to be tested.
As an example, the voice data of the user to be tested may be segmented into at least one voice unit, and the pause feature of each voice unit may be extracted. For example, the pause feature may be embodied as a number of pauses and/or a pause duration and/or a pause proportion.
For example, a voice endpoint detection tool may be used to detect silence in the voice data; where silence exists, the user has paused, so the positions of the pauses and the duration of each pause can be obtained from the detected endpoint values, and the pause features can be extracted accordingly.
Specifically, the silence detection may be performed on each speech unit, so as to obtain the pause feature of each speech unit. Or, when the granularity of the speech unit is a speech data frame and a speech unit interval, in the practical application process, some speech data frames and speech data intervals may be in a pause, that is, the whole speech data frame and the whole speech data interval have no sound, and accordingly, the scheme of the present disclosure may perform silence detection on the whole speech data to obtain the total pause times and the total pause duration of the whole speech data, and a pause proportion calculated based on the total pause duration and the total duration of the whole speech data, as the pause feature of each speech unit.
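The patent does not name a specific endpoint detection tool; as an assumption, a simple frame-energy threshold can stand in for silence detection in a sketch:

```python
import numpy as np

def pause_features(x, sr=8000, frame_ms=25, shift_ms=10, threshold=1e-4):
    """Count pauses and estimate pause duration/proportion via an energy threshold."""
    frame_len, shift = int(sr * frame_ms / 1000), int(sr * shift_ms / 1000)
    n_frames = 1 + max(0, (len(x) - frame_len) // shift)
    energy = np.array([np.mean(x[i * shift:i * shift + frame_len] ** 2)
                       for i in range(n_frames)])
    silent = energy < threshold
    # A pause is a maximal run of consecutive silent frames.
    starts = np.where(silent[1:] & ~silent[:-1])[0] + 1
    if silent[0]:
        starts = np.concatenate(([0], starts))
    n_pauses = len(starts)
    pause_duration = silent.sum() * shift / sr    # approximate, one shift per frame
    pause_ratio = pause_duration / (len(x) / sr)
    return n_pauses, pause_duration, pause_ratio
```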
5. Frequency perturbation characteristics
As an example, the voice data of the user to be tested can be divided into at least one voice unit, and the frequency perturbation feature of each voice unit is extracted, wherein the frequency perturbation feature can be used for representing the variation of the sound wave pitch frequency between adjacent pitch periods.
Generally, the frequency perturbation in the voice signal reflects the functional state of the glottal area. For a normal pronunciation feature user, the frequency varies little between adjacent periods, i.e. the frequency perturbation value is small; for an abnormal pronunciation feature user, the frequency perturbation value is large, so the voice sounds rough. For example, the frequency perturbation feature in the present disclosure may be embodied as at least one of the following physical quantities: Jitter, Jitter (Abs), RAP (Relative Average Perturbation), PPQ (Period Perturbation Quotient), DDP (ratio of the average absolute difference of differences between consecutive periods to the average period), etc., which mainly reflect the degree of roughness and secondarily the degree of hoarseness.
Jitter is used to represent the relative change of the fundamental frequency during pronunciation, and can be expressed as the following formula:

$$\text{Jitter} = \frac{\frac{1}{K-1}\sum_{i=1}^{K-1}\left|T_i - T_{i+1}\right|}{\frac{1}{K}\sum_{i=1}^{K}T_i}$$
where $T_i$ represents the pitch period value of the i-th speech subunit, and K represents the number of speech subunits included in the speech unit. It is understood that the granularity of a speech unit may be embodied as an interval, or as an entire piece of speech data; the granularity of a speech subunit may be embodied as a frame of speech data, or as at least two subintervals into which the speech unit is equally divided.
Jitter (Abs) is used to represent the absolute change of the fundamental frequency (i.e. of the pitch period) during pronunciation, and can be expressed as the following formula:

$$\text{Jitter(Abs)} = \frac{1}{K-1}\sum_{i=1}^{K-1}\left|T_i - T_{i+1}\right|$$
the calculation process of RAP, PPQ, and DDP can be implemented by reference to related technologies, and is not described in detail herein.
6. Amplitude perturbation feature
As an example, the voice data of the user to be tested can be divided into at least one voice unit, and the amplitude perturbation feature of each voice unit is extracted, wherein the amplitude perturbation feature can be used for representing the variation of the sound wave amplitude between adjacent pitch periods.
In general, the amplitude perturbation value of a normal pronunciation feature user between adjacent periods is small, indicating that the vocal cord vibration is stable and the pronunciation stability is strong. For example, the amplitude perturbation feature in the present disclosure may be embodied as at least one of the following physical quantities: Shimmer, Shimmer (dB), APQ3 (three-point Amplitude Perturbation Quotient), APQ5 (five-point Amplitude Perturbation Quotient), APQ (Amplitude Perturbation Quotient), DDA (average absolute difference of the amplitude differences between adjacent periods), which mainly reflect the degree of hoarseness.
Shimmer (dB) is used to represent the absolute change of amplitude during pronunciation, expressed in decibels, and can be expressed as the following formula:

$$\text{Shimmer(dB)} = \frac{1}{K-1}\sum_{i=1}^{K-1}\left|20\log_{10}\!\left(\frac{A_{i+1}}{A_i}\right)\right|$$
where $A_i$ represents the amplitude of the i-th speech subunit. The granularity of the speech subunit is the same as described above for the physical quantity Jitter and is not described in detail here.
Shimmer is used to express the relative change of amplitude during pronunciation, and can be expressed as the following formula:

$$\text{Shimmer} = \frac{\frac{1}{K-1}\sum_{i=1}^{K-1}\left|A_i - A_{i+1}\right|}{\frac{1}{K}\sum_{i=1}^{K}A_i}$$
the calculation processes of the APQ3, the APQ5, the APQ and the DDA can be realized by referring to the related art and are not described in detail herein.
7. Harmonic to noise ratio
As an example, the voice data of the user to be tested can be divided into at least one voice unit, the Harmonic component and the Noise component of each voice unit are extracted, and the Harmonic to Noise Ratio (HNR) is calculated, which mainly reflects the degree of hoarseness.
It should be noted that the noise component in the present disclosure is not environmental noise, but is glottal noise caused by incomplete closing of the glottis when the user to be tested vocalizes. The way of extracting harmonic components and noise components and the way of calculating the harmonic-to-noise ratio can be implemented with reference to the related art and will not be described in detail here.
8. Cyclic period density entropy
As an example, the voice data of the user to be tested can be divided into at least one voice unit, and the cyclic period density entropy of each voice unit is extracted, wherein the cyclic period density entropy can be used for representing the uncertainty of the periodicity of the voice signal.
9. Detrending fluctuation analysis features
For users with abnormal pronunciation characteristics, random noise may be generated when airflow passes through the vocal cords during sounding, and this random noise is mixed into the voice data of the user to be tested; accordingly, this characteristic can be captured through detrended fluctuation analysis features.
As an example, the voice data of the user to be tested can be divided into at least one voice unit, and the detrended fluctuation analysis feature of each voice unit can be extracted, where the detrended fluctuation analysis feature can be used to represent the degree of self-similarity of the random noise in the speech.
Detrended Fluctuation Analysis (DFA) is a relatively new type of speech characteristic based on nonlinear dynamical system theory, and its implementation is mainly divided into two parts: determining the trend of the voice data, and analyzing how the voice data fluctuates around that trend. The specific implementation process can refer to the related art and is not described in detail herein.
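The two-part procedure just described (trend estimation, then analysis of fluctuations around the trend) can be sketched as follows; the window scales and return value (the DFA scaling exponent) are assumptions for illustration:

```python
import numpy as np

def dfa_exponent(x, scales=(16, 32, 64, 128, 256)):
    """Detrended fluctuation analysis: scaling exponent of a speech unit."""
    x = np.asarray(x, dtype=float)
    profile = np.cumsum(x - x.mean())          # integrated (profile) signal
    fluctuations = []
    for n in scales:
        n_segments = len(profile) // n
        mse = []
        for k in range(n_segments):
            segment = profile[k * n:(k + 1) * n]
            t = np.arange(n)
            trend = np.polyval(np.polyfit(t, segment, 1), t)   # local linear trend
            mse.append(np.mean((segment - trend) ** 2))
        fluctuations.append(np.sqrt(np.mean(mse)))
    # The slope of log F(n) against log n measures the degree of self-similarity.
    return np.polyfit(np.log(scales), np.log(fluctuations), 1)[0]
```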
10. Non-linear fundamental frequency variation characteristic
As an example, the voice data of the user to be tested may be divided into at least one voice unit, and a nonlinear fundamental frequency variation feature of each voice unit may be extracted, where the nonlinear fundamental frequency variation feature may be used to represent the stationarity of the voice signal corresponding to the voice unit. For example, the nonlinear fundamental frequency variation feature may be embodied as the Pitch Period Entropy (PPE). The specific calculation process can be implemented with reference to related technologies and is not described in detail here.
11. Voiceprint features
As an example, the voice data of the user to be tested may be segmented into at least one voice unit, and the voiceprint feature of each voice unit may be extracted.
For example, the voiceprint feature may be an i-vector feature; alternatively, the voiceprint feature may be another voiceprint feature, such as an MFCC (Mel-Frequency Cepstral Coefficients) feature or a voiceprint feature extracted by a neural network, which is not particularly limited in the present disclosure.
As an example, if N pieces of voice data of the user to be tested are collected, the following acoustic feature may also be extracted: the acoustic features of each piece of voice data are extracted respectively, and the feature variance of the acoustic features over the N × M voice units is calculated as an acoustic feature of the N pieces of voice data of the user to be tested, where M represents the number of voice units segmented from each piece of voice data.
The feature variance reflects the differences in statistical parameters when the user to be tested pronounces at different moments; for a user with abnormal pronunciation features, the acoustic features may change over time, i.e. the stability is poor.
In practical applications, the short-time zero-crossing rate feature, the pause feature and the voiceprint feature may be extracted from the whole piece of voice data and used as the acoustic feature of every voice unit; in that case these features do not vary across voice units, and their feature variances need not be calculated. The disclosed scheme does not limit this, and it may be determined according to actual application requirements.
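A small sketch of the feature variance over the N × M speech units described above (the array layout and function name are assumptions):

```python
import numpy as np

def feature_variance_over_pieces(per_piece_features):
    """Variance of each acoustic feature over the N x M speech units.

    per_piece_features: list of N arrays, each of shape (M, D), holding the
    D-dimensional acoustic feature vectors of the M speech units segmented
    from one piece of the test user's voice data.
    Returns a length-D vector of feature variances.
    """
    all_units = np.concatenate(per_piece_features, axis=0)   # shape (N*M, D)
    return all_units.var(axis=0)
```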
S103, determining the pronunciation features of the user to be tested after the acoustic features are taken as input and processed by a pre-established speech classification model.
After the acoustic features are extracted from the voice data of the user to be tested, the voice classification model established in advance can be used for carrying out model processing, and the pronunciation features of the user to be tested are output.
It should be noted that if the acoustic features are embodied as any one of the above features, they may be directly used as the model input; if the acoustic features are embodied as at least two of the above features, the at least two acoustic features may be spliced together and then used as the model input. In this case the granularity of the speech unit corresponding to each acoustic feature may be the same or different, which is not specifically limited in the present disclosure.
As an example, model prediction may be performed once for a user to be tested; alternatively, model prediction may be performed multiple times, and the pronunciation features of the user to be tested may be determined from the average of the multiple prediction results, or the pronunciation feature that occurs most often among the multiple prediction results may be taken as the pronunciation feature of the user to be tested.
As can be seen from the above description, the implementation process of the disclosed scheme is simple and convenient, the processing saves time and labor, and no professional skills are required of the personnel involved. As an example, when the pronunciation features of middle-aged and elderly people are predicted, the pronunciation features determined by the model are not intended to replace routine hospital examinations but can assist such examinations in judgment; moreover, in the model prediction process only the voice data of the user to be tested needs to be input, so the processing does not act directly on the user to be tested and does not affect the user's physical functions.
The following explains the process of constructing a speech classification model in the present disclosure. Referring specifically to the flowchart shown in fig. 2, the method may include the following steps:
s201, sample voice data of sample users are collected, wherein the sample voice data are voiced voice data input by the sample users according to preset conditions, and the sample users comprise normal pronunciation feature users and abnormal pronunciation feature users.
During model training, sample voice data of a large number of sample users can be collected. The sample users may include normal pronunciation feature users and abnormal pronunciation feature users; as an example, the age groups of the sample users may be kept as similar as possible, which helps reduce the influence of age-related physiological differences on the classification accuracy.
The implementation process of collecting the sample voice data of the sample user can be described above with reference to S101, and is not described in detail here.
S202, extracting acoustic features of the sample voice data.
The specific implementation process can be described with reference to the above S102, and is not described in detail here.
S203, determining the topological structure of the voice classification model.
As an example, the topology in the present disclosure can be embodied as a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network), a DNN (Deep Neural Network), or the like, which is not particularly limited in this disclosure.
As one example, a neural network may include an input layer, a hidden layer, and an output layer. Wherein the input layer may be an acoustic feature; the hidden layer can be a layer or a plurality of layers, the number of nodes on each layer can be set to be 16-32, and sigmoid can be used as an activation function; the output layer may include 2 output nodes, which respectively represent a normal pronunciation feature user and an abnormal pronunciation feature user, for example, a normal pronunciation feature user may be represented by "0" and an abnormal pronunciation feature user may be represented by "1"; alternatively, the output layer may contain 1 output node, which represents the probability that the user to be tested is identified as a normal pronunciation feature user. The specific representation form of each layer of the neural network can not be limited by the scheme of the disclosure.
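As a minimal sketch of such a topology (PyTorch is used here purely for illustration; the patent does not prescribe a framework, and the class and parameter names are assumptions following the layer sizes given above):

```python
import torch.nn as nn

class SpeechClassificationModel(nn.Module):
    """Feed-forward topology: acoustic feature vector in, one sigmoid hidden
    layer with 16-32 nodes, and 2 output nodes (normal / abnormal pronunciation
    features). A single-output variant would emit the probability of the
    normal class instead."""
    def __init__(self, feature_dim, hidden_dim=32, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),   # input layer -> hidden layer
            nn.Sigmoid(),                         # sigmoid activation, as described above
            nn.Linear(hidden_dim, n_classes),     # output layer: raw class scores
        )

    def forward(self, features):
        return self.net(features)
```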
S204, training the voice classification model by using the topological structure and the acoustic characteristics of the sample voice data until the pronunciation characteristics output by the voice classification model are consistent with the pronunciation characteristics of the sample user.
After the topology of the model is determined and the acoustic features of the sample voice data are extracted, model training can be performed. As an example, the training process may adopt a cross-entropy criterion and update and optimize the model parameters with the common stochastic gradient descent method, so that when training is completed the pronunciation features predicted by the model are consistent with the pronunciation features actually possessed by the sample users. Here, "the pronunciation features output by the speech classification model are consistent with the pronunciation features of the sample users" may mean that the predicted pronunciation features are exactly the same as those of the sample users, or that the accuracy of the model in predicting pronunciation features reaches a preset value, for example 90%; this is not specifically limited in the present disclosure.
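A hedged sketch of this training step, continuing the illustrative PyTorch model above with a cross-entropy criterion and stochastic gradient descent (the epoch count and learning rate are assumptions):

```python
import torch
import torch.nn as nn

def train_speech_classifier(model, features, labels, epochs=100, lr=0.01):
    """Train the speech classification model with cross entropy and SGD.

    features: tensor of shape (num_samples, feature_dim), sample acoustic features.
    labels:   tensor of shape (num_samples,), 0 = normal, 1 = abnormal pronunciation.
    """
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        optimizer.step()
    return model
```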
The speech classification model of the scheme of the disclosure is mainly based on the characteristics of normal pronunciation feature users and abnormal pronunciation feature users in an acoustic level, obtains classification rules of different pronunciation features through statistical analysis and model training, and further determines the pronunciation features of a general user, namely a user to be tested, according to the classification rules.
Referring to fig. 3, a schematic diagram of the configuration of the speech signal processing apparatus of the present disclosure is shown. The apparatus may include:
the voice data acquisition module 301 is configured to acquire voice data of a user to be tested, where the voice data is voiced voice data input by the user to be tested according to a preset condition;
an acoustic feature extraction module 302, configured to extract an acoustic feature of the voice data of the user to be tested, where the acoustic feature is used to represent a vocal cord state of the user to be tested;
and the pronunciation feature determination module 303 is configured to determine the pronunciation feature of the user to be tested after the acoustic feature is used as an input and is processed by a pre-established speech classification model.
Optionally, if the preset condition is that the duration is not less than a preset duration, then
The voice data acquisition module is used for acquiring voiced voice data input by the user to be detected in a single time; judging whether the duration of the voiced speech data input for a single time is less than the preset duration or not; if the duration of the single-input voiced speech data is not less than the preset duration, determining the single-input voiced speech data as the speech data of the user to be tested;
alternatively,
the voice data acquisition module is used for acquiring voiced voice data which are intermittently input by the user to be detected for many times; judging whether the total duration of the voiced speech data which are input discontinuously for multiple times is less than the preset duration or not; and if the total duration of the voiced speech data which are input for multiple times intermittently is not less than the preset duration, determining the voiced speech data which are input for multiple times intermittently as the speech data of the user to be tested.
Optionally, if the preset condition is that the number of interruptions is not less than a preset number of interruptions, the voice data acquisition module is configured to acquire voiced voice data that is intermittently input by the user to be tested for multiple times; judging whether the interruption times of the voiced speech data which are input intermittently for multiple times are smaller than the preset interruption times or not; and if the interruption times of the voiced speech data which are input discontinuously for multiple times are not less than the preset interruption times, determining the voiced speech data which are input discontinuously for multiple times as the speech data of the user to be tested.
Optionally, the acoustic feature extraction module is configured to divide the voice data of the user to be tested into at least one voice unit, and extract at least one of the following features of each voice unit as the acoustic feature of the voice data of the user to be tested: energy characteristics, fundamental frequency characteristics, short-time zero-crossing rate characteristics, pause characteristics, frequency perturbation characteristics, amplitude perturbation characteristics, harmonic noise ratio, cyclic period density entropy, detrended fluctuation analysis characteristics, nonlinear fundamental frequency change characteristics, voiceprint characteristics,
wherein,
the frequency perturbation characteristic is used to represent the variation of the acoustic pitch frequency between adjacent pitch periods,
the amplitude perturbation signature is used to represent the variation in the amplitude of the sound wave between adjacent pitch periods,
the cyclic period density entropy is used to represent the uncertainty of the periodicity of the speech signal,
the detrended fluctuation analysis feature is used to represent the degree of self-similarity of random noise in the speech,
the nonlinear fundamental frequency change characteristic is used for representing the stationarity of the voice signal corresponding to the voice unit.
Optionally, if N pieces of voice data of the user to be tested are collected, where N is greater than or equal to 2, the acoustic feature extraction module is configured to extract the acoustic features of each piece of voice data respectively, and calculate the feature variance of the acoustic features over the N × M voice units as acoustic features of the N pieces of voice data of the user to be tested, where M represents the number of voice units segmented from each piece of voice data.
Optionally, the apparatus further comprises:
the system comprises a sample voice data acquisition module, a voice analysis module and a voice analysis module, wherein the sample voice data acquisition module is used for acquiring sample voice data of a sample user, the sample voice data is voiced voice data input by the sample user according to a preset condition, and the sample user comprises a normal pronunciation characteristic user and an abnormal pronunciation characteristic user;
the sample acoustic feature extraction module is used for extracting acoustic features of the sample voice data;
the topological structure determining module is used for determining the topological structure of the voice classification model;
and the voice classification model training module is used for training the voice classification model by utilizing the topological structure and the acoustic characteristics of the sample voice data until the pronunciation characteristics output by the voice classification model are consistent with the pronunciation characteristics of the sample user.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Referring to fig. 4, a schematic structural diagram of an electronic device 400 for speech signal processing of the present disclosure is shown. The electronic device 400 includes a processing component 401, which in turn includes one or more processors, and storage resources represented by a storage medium 402 for storing instructions, such as application programs, executable by the processing component 401. The application stored in the storage medium 402 may include one or more modules, each corresponding to a set of instructions. Further, the processing component 401 is configured to execute the instructions to perform the above-described speech signal processing method.
Electronic device 400 may also include a power component 403 configured to perform power management of electronic device 400; a wired or wireless network interface 404 configured to connect the electronic device 400 to a network; and an input/output (I/O) interface 405. The electronic device 400 may operate based on an operating system stored on the storage medium 402, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
The preferred embodiments of the present disclosure are described in detail with reference to the accompanying drawings, however, the present disclosure is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solution of the present disclosure within the technical idea of the present disclosure, and these simple modifications all belong to the protection scope of the present disclosure.
It should be noted that, in the foregoing embodiments, various features described in the above embodiments may be combined in any suitable manner, and in order to avoid unnecessary repetition, various combinations that are possible in the present disclosure are not described again.
In addition, any combination of various embodiments of the present disclosure may be made, and the same should be considered as the disclosure of the present disclosure, as long as it does not depart from the spirit of the present disclosure.

Claims (14)

1. A method of speech signal processing, the method comprising:
collecting voice data of a user to be tested, wherein the voice data is voiced voice data input by the user to be tested according to a preset condition;
extracting acoustic features of the voice data of the user to be tested, wherein the extracting comprises dividing the voice data of the user to be tested into at least one voice unit and extracting the acoustic features from each voice unit; the acoustic features are used for representing the vocal cord state of the user to be tested, and the acoustic features comprise feature variances of the same acoustic features over a plurality of pieces of voice data of the user to be tested, the feature variances being used for representing pronunciation changes of the user to be tested at different moments;
and taking the acoustic features as input, and determining the pronunciation features of the user to be tested after the acoustic features are processed by a pre-established speech classification model.
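To make the three steps of claim 1 concrete, the following sketch strings them together; the 0.5 s unit length, the two example features, and all function names are illustrative assumptions, not the patent's implementation (a pre-established classification model is assumed to be supplied by the caller).

    import numpy as np

    def split_into_voice_units(wave: np.ndarray, sr: int, unit_s: float = 0.5):
        # Cut the recording into fixed-length voice units; the claim only
        # requires "at least one voice unit", so the 0.5 s length is assumed.
        n = int(unit_s * sr)
        return [wave[i:i + n] for i in range(0, len(wave) - n + 1, n)] or [wave]

    def extract_acoustic_features(unit: np.ndarray) -> np.ndarray:
        # Two of the listed acoustic features, as an illustration:
        # short-time energy and short-time zero-crossing rate.
        energy = float(np.mean(unit ** 2))
        zcr = float(np.mean(np.abs(np.diff(np.sign(unit))) > 0))
        return np.array([energy, zcr])

    def determine_pronunciation_features(wave: np.ndarray, sr: int, model) -> str:
        # Claim-1 pipeline: voice units -> acoustic features -> pre-established classifier.
        units = split_into_voice_units(wave, sr)
        feats = np.stack([extract_acoustic_features(u) for u in units])
        return model.predict(feats.mean(axis=0, keepdims=True))[0]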
2. The method according to claim 1, wherein if the preset condition is that a duration is not less than a preset duration, the collecting voice data of the user to be tested comprises:
collecting voiced speech data input by the user to be tested in a single input; judging whether the duration of the single-input voiced speech data is less than the preset duration; and if the duration of the single-input voiced speech data is not less than the preset duration, determining the single-input voiced speech data as the speech data of the user to be tested;
alternatively,
collecting voiced speech data discontinuously input by the user to be tested multiple times; judging whether the total duration of the discontinuously input voiced speech data is less than the preset duration; and if the total duration of the discontinuously input voiced speech data is not less than the preset duration, determining the discontinuously input voiced speech data as the speech data of the user to be tested.
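The duration condition of claim 2 reduces to a simple check over one or more recordings; the 5-second threshold below is an assumed value, and the (waveform, sample rate) representation is an illustrative choice.

    def meets_duration_condition(recordings, preset_duration_s: float = 5.0) -> bool:
        # recordings: list of (waveform, sample_rate) pairs, either a single
        # recording or several discontinuously input recordings.
        total_s = sum(len(wave) / sr for wave, sr in recordings)
        # Accept the data only when the (total) duration reaches the preset duration.
        return total_s >= preset_duration_s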
3. The method according to claim 1, wherein if the preset condition is that the number of interruptions is not less than a preset number of interruptions, the collecting voice data of the user to be tested comprises:
collecting voiced speech data discontinuously input by the user to be tested multiple times; judging whether the number of interruptions of the discontinuously input voiced speech data is less than the preset number of interruptions; and if the number of interruptions of the discontinuously input voiced speech data is not less than the preset number of interruptions, determining the discontinuously input voiced speech data as the speech data of the user to be tested.
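Similarly, the interruption-count condition of claim 3 can be sketched as follows, counting each gap between two consecutive voiced inputs as one interruption; the threshold of 3 is an assumed value.

    def meets_interruption_condition(voiced_segments, preset_interruptions: int = 3) -> bool:
        # voiced_segments: the discontinuously input voiced recordings, in order.
        interruptions = max(len(voiced_segments) - 1, 0)
        # Accept the data only when enough interruptions occurred.
        return interruptions >= preset_interruptions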
4. The method of claim 1, wherein the extracting the acoustic features of the voice data of the user to be tested comprises:
dividing the voice data of the user to be tested into at least one voice unit, and extracting at least one of the following features of each voice unit as the acoustic features of the voice data of the user to be tested: energy characteristics, fundamental frequency characteristics, short-time zero-crossing rate characteristics, pause characteristics, frequency perturbation characteristics, amplitude perturbation characteristics, harmonic-to-noise ratio, cyclic period density entropy, detrended fluctuation analysis characteristics, nonlinear fundamental frequency change characteristics, and voiceprint characteristics,
wherein,
the frequency perturbation characteristic is used to represent the variation of the acoustic pitch frequency between adjacent pitch periods,
the amplitude perturbation signature is used to represent the variation in the amplitude of the sound wave between adjacent pitch periods,
the cyclic period density entropy is used to represent the uncertainty of the periodicity of the speech signal,
the detrended fluctuation analysis feature is used to represent the degree of self-similarity of random noise in the speech signal,
the nonlinear fundamental frequency change characteristic is used for representing the stationarity of the voice signal corresponding to the voice unit.
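Of the features listed in claim 4, the frequency perturbation (jitter) and amplitude perturbation (shimmer) are defined over adjacent pitch periods; the sketch below uses their common "local" formulations and assumes the pitch periods and per-cycle peak amplitudes have already been obtained from a separate pitch tracker.

    import numpy as np

    def jitter_local(periods_s) -> float:
        # Frequency perturbation: mean absolute difference between adjacent
        # pitch periods, normalised by the mean pitch period.
        t = np.asarray(periods_s, dtype=float)
        return float(np.mean(np.abs(np.diff(t))) / np.mean(t))

    def shimmer_local(peak_amplitudes) -> float:
        # Amplitude perturbation: mean absolute difference between adjacent
        # cycle peak amplitudes, normalised by the mean amplitude.
        a = np.asarray(peak_amplitudes, dtype=float)
        return float(np.mean(np.abs(np.diff(a))) / np.mean(a))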
5. The method according to claim 4, wherein if N pieces of voice data of the user to be tested are collected, N being greater than or equal to 2, the extracting acoustic features of the voice data of the user to be tested comprises:
respectively extracting the acoustic features of each piece of voice data, and calculating the feature variance of the acoustic features over the N × M voice units as the acoustic features of the N pieces of voice data of the user to be tested, wherein M represents the number of voice units cut out from each piece of voice data.
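The feature variance of claim 5 is a per-dimension variance taken over all N × M voice units; a minimal sketch, assuming each of the N recordings has already been reduced to an M × D feature matrix:

    import numpy as np

    def feature_variance(per_recording_features) -> np.ndarray:
        # per_recording_features: list of N arrays, each of shape (M, D),
        # i.e. M voice units per recording and D acoustic features per unit.
        stacked = np.concatenate(per_recording_features, axis=0)  # shape (N*M, D)
        # One variance value per acoustic feature dimension, used as an
        # additional acoustic feature describing pronunciation changes over time.
        return stacked.var(axis=0)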
6. The method according to any of claims 1 to 5, wherein the speech classification model is constructed by:
collecting sample voice data of sample users, wherein the sample voice data is voiced voice data input by the sample users according to a preset condition, and the sample users comprise users with normal pronunciation characteristics and users with abnormal pronunciation characteristics;
extracting acoustic features of the sample voice data;
determining a topology of the speech classification model;
and training the voice classification model by using the topological structure and the acoustic characteristics of the sample voice data until the pronunciation characteristics output by the voice classification model are consistent with the pronunciation characteristics of the sample user.
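Claim 6 fixes the training criterion but not a concrete model; as one possible illustration (an assumption, not the patent's choice), a small multilayer perceptron from scikit-learn can stand in for the speech classification model, with its hidden-layer sizes playing the role of the "topological structure":

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # Placeholder training data for illustration only: one row of acoustic
    # features per sample recording, labels 0 = normal / 1 = abnormal
    # pronunciation features.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 16))
    y = rng.integers(0, 2, size=200)

    # "Determining the topology" is approximated by fixing the hidden layers.
    model = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
    model.fit(X, y)  # train until the outputs fit the sample labels

    print(model.score(X, y))  # training accuracy as a rough convergence check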
7. A speech signal processing apparatus, characterized in that the apparatus comprises:
the voice data acquisition module is used for collecting voice data of a user to be tested, wherein the voice data is voiced voice data input by the user to be tested according to a preset condition;
the acoustic feature extraction module is used for extracting acoustic features of the voice data of the user to be tested, which comprises dividing the voice data of the user to be tested into at least one voice unit and extracting the acoustic features from each voice unit; the acoustic features are used for representing the vocal cord state of the user to be tested, and the acoustic features comprise the feature variance of the same acoustic feature over a plurality of pieces of voice data of the user to be tested, the feature variance being used for representing pronunciation changes of the user to be tested at different moments;
and the pronunciation feature determining module is used for taking the acoustic features as input and determining the pronunciation features of the user to be tested after the acoustic features are processed by a pre-established speech classification model.
8. The apparatus of claim 7, wherein if the preset condition is that a duration is not less than a preset duration, then
the voice data acquisition module is used for collecting voiced speech data input by the user to be tested in a single input; judging whether the duration of the single-input voiced speech data is less than the preset duration; and if the duration of the single-input voiced speech data is not less than the preset duration, determining the single-input voiced speech data as the speech data of the user to be tested;
alternatively,
the voice data acquisition module is used for collecting voiced speech data discontinuously input by the user to be tested multiple times; judging whether the total duration of the discontinuously input voiced speech data is less than the preset duration; and if the total duration of the discontinuously input voiced speech data is not less than the preset duration, determining the discontinuously input voiced speech data as the speech data of the user to be tested.
9. The apparatus according to claim 7, wherein if the preset condition is that the number of interruptions is not less than a preset number of interruptions, then
the voice data acquisition module is used for collecting voiced speech data discontinuously input by the user to be tested multiple times; judging whether the number of interruptions of the discontinuously input voiced speech data is less than the preset number of interruptions; and if the number of interruptions of the discontinuously input voiced speech data is not less than the preset number of interruptions, determining the discontinuously input voiced speech data as the speech data of the user to be tested.
10. The apparatus of claim 7,
the acoustic feature extraction module is configured to divide the voice data of the user to be tested into at least one voice unit, and extract at least one of the following features of each voice unit as the acoustic features of the voice data of the user to be tested: energy characteristics, fundamental frequency characteristics, short-time zero-crossing rate characteristics, pause characteristics, frequency perturbation characteristics, amplitude perturbation characteristics, harmonic-to-noise ratio, cyclic period density entropy, detrended fluctuation analysis characteristics, nonlinear fundamental frequency change characteristics, and voiceprint characteristics,
wherein,
the frequency perturbation characteristic is used to represent the variation of the acoustic pitch frequency between adjacent pitch periods,
the amplitude perturbation signature is used to represent the variation in the amplitude of the sound wave between adjacent pitch periods,
the cyclic period density entropy is used to represent the uncertainty of the periodicity of the speech signal,
the detrended fluctuation analysis feature is used to represent the degree of self-similarity of random noise in the speech signal,
the nonlinear fundamental frequency change characteristic is used for representing the stationarity of the voice signal corresponding to the voice unit.
11. The apparatus of claim 10, wherein if N pieces of voice data of the user to be tested are collected, N being greater than or equal to 2, then
the acoustic feature extraction module is used for respectively extracting the acoustic features of each piece of voice data, and calculating the feature variance of the acoustic features over the N × M voice units as the acoustic features of the N pieces of voice data of the user to be tested, wherein M represents the number of voice units cut out from each piece of voice data.
12. The apparatus of any one of claims 7 to 11, further comprising:
the system comprises a sample voice data acquisition module, a voice analysis module and a voice analysis module, wherein the sample voice data acquisition module is used for acquiring sample voice data of a sample user, the sample voice data is voiced voice data input by the sample user according to a preset condition, and the sample user comprises a normal pronunciation characteristic user and an abnormal pronunciation characteristic user;
the sample acoustic feature extraction module is used for extracting acoustic features of the sample voice data;
the topological structure determining module is used for determining the topological structure of the voice classification model;
and the voice classification model training module is used for training the voice classification model by utilizing the topological structure and the acoustic characteristics of the sample voice data until the pronunciation characteristics output by the voice classification model are consistent with the pronunciation characteristics of the sample user.
13. A storage device having stored therein a plurality of instructions, wherein said instructions are loaded by a processor for performing the steps of the method of any of claims 1 to 6.
14. An electronic device, characterized in that the electronic device comprises:
the storage device of claim 13; and
a processor to execute instructions in the storage device.
CN201711482746.0A 2017-12-29 2017-12-29 Method and device for processing voice signal to represent vocal cord state of user, storage medium and electronic equipment Active CN108269574B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711482746.0A CN108269574B (en) 2017-12-29 2017-12-29 Method and device for processing voice signal to represent vocal cord state of user, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711482746.0A CN108269574B (en) 2017-12-29 2017-12-29 Method and device for processing voice signal to represent vocal cord state of user, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN108269574A CN108269574A (en) 2018-07-10
CN108269574B true CN108269574B (en) 2021-05-25

Family

ID=62773194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711482746.0A Active CN108269574B (en) 2017-12-29 2017-12-29 Method and device for processing voice signal to represent vocal cord state of user, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN108269574B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109285551B (en) * 2018-09-18 2023-05-12 上海海事大学 Parkinson patient voiceprint recognition method based on WMFCC and DNN
CN109065069B (en) * 2018-10-10 2020-09-04 广州市百果园信息技术有限公司 Audio detection method, device, equipment and storage medium
CN109801637A (en) * 2018-12-03 2019-05-24 厦门快商通信息技术有限公司 Model Fusion method and system based on hiding factor
CN109599121A (en) * 2019-01-04 2019-04-09 平安科技(深圳)有限公司 Drunk driving detection method, device, equipment and storage medium based on Application on Voiceprint Recognition

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7139699B2 (en) * 2000-10-06 2006-11-21 Silverman Stephen E Method for analysis of vocal jitter for near-term suicidal risk assessment
CN105280181A (en) * 2014-07-15 2016-01-27 中国科学院声学研究所 Training method for language recognition model and language recognition method
CN106941005A (en) * 2017-02-24 2017-07-11 华南理工大学 A kind of vocal cords method for detecting abnormality based on speech acoustics feature

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1144174C (en) * 2001-01-16 2004-03-31 北京大学 Method for discriminating acoustic figure with base band components and sounding parameters
US20040167774A1 (en) * 2002-11-27 2004-08-26 University Of Florida Audio-based method, system, and apparatus for measurement of voice quality
US20080300867A1 (en) * 2007-06-03 2008-12-04 Yan Yuling System and method of analyzing voice via visual and acoustic data
EP2227765A1 (en) * 2007-11-02 2010-09-15 Siegbert Warkentin System and methods for assessment of the aging brain and its brain disease induced brain dysfunctions by speech analysis
WO2015102733A1 (en) * 2013-10-20 2015-07-09 Massachusetts Institute Of Technology Using correlation structure of speech dynamics to detect neurological changes
CN105448291A (en) * 2015-12-02 2016-03-30 南京邮电大学 Parkinsonism detection method and detection system based on voice
CN107170445B (en) * 2017-05-10 2020-03-31 重庆大学 Parkinsonism detection device based on voice mixed information feature collaborative optimization judgment

Also Published As

Publication number Publication date
CN108269574A (en) 2018-07-10

Similar Documents

Publication Publication Date Title
CN108320733B (en) Voice data processing method and device, storage medium and electronic equipment
Basu et al. A review on emotion recognition using speech
US8825479B2 (en) System and method for recognizing emotional state from a speech signal
CN108269574B (en) Method and device for processing voice signal to represent vocal cord state of user, storage medium and electronic equipment
Basu et al. Emotion recognition from speech using convolutional neural network with recurrent neural network architecture
CN107195296B (en) Voice recognition method, device, terminal and system
CN108198547B (en) Voice endpoint detection method and device, computer equipment and storage medium
Mariooryad et al. Compensating for speaker or lexical variabilities in speech for emotion recognition
Sinith et al. Emotion recognition from audio signals using Support Vector Machine
US20170154640A1 (en) Method and electronic device for voice recognition based on dynamic voice model selection
Zhao et al. Robust emotion recognition in noisy speech via sparse representation
Ali et al. Automatic speech recognition technique for Bangla words
EP2363852B1 (en) Computer-based method and system of assessing intelligibility of speech represented by a speech signal
CN108536668B (en) Wake-up word evaluation method and device, storage medium and electronic equipment
CN112102850A (en) Processing method, device and medium for emotion recognition and electronic equipment
Joshi et al. Speech emotion recognition: a review
Subhashree et al. Speech Emotion Recognition: Performance Analysis based on fused algorithms and GMM modelling
Guo et al. Robust speaker identification via fusion of subglottal resonances and cepstral features
Nedjah et al. Automatic speech recognition of Portuguese phonemes using neural networks ensemble
CN113823323A (en) Audio processing method and device based on convolutional neural network and related equipment
Mistry et al. Overview: Speech recognition technology, mel-frequency cepstral coefficients (mfcc), artificial neural network (ann)
Kumari et al. A new gender detection algorithm considering the non-stationarity of speech signal
US20140074478A1 (en) System and method for digitally replicating speech
Johar Paralinguistic profiling using speech recognition
Arsikere et al. Computationally-efficient endpointing features for natural spoken interaction with personal-assistant systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 230088 floor 23-24, building A5, No. 666, Wangjiang West Road, high tech Zone, Hefei, Anhui Province

Patentee after: Anhui Xunfei Medical Co.,Ltd.

Address before: Room 288, H2 / F, phase II, innovation industrial park, 2800 innovation Avenue, high tech Zone, Hefei, Anhui 230000

Patentee before: ANHUI IFLYTEK MEDICAL INFORMATION TECHNOLOGY CO.,LTD.

CP01 Change in the name or title of a patent holder

Address after: 230088 floor 23-24, building A5, No. 666, Wangjiang West Road, high tech Zone, Hefei, Anhui Province

Patentee after: IFLYTEK Medical Technology Co.,Ltd.

Address before: 230088 floor 23-24, building A5, No. 666, Wangjiang West Road, high tech Zone, Hefei, Anhui Province

Patentee before: Anhui Xunfei Medical Co.,Ltd.
