CN112967714A - Information acquisition method for English voice - Google Patents

Information acquisition method for English voice

Info

Publication number
CN112967714A
CN112967714A
Authority
CN
China
Prior art keywords
audio signal
sound source
signal
phonemes
matching degree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110223067.1A
Other languages
Chinese (zh)
Inventor
张敏
李琦
丁桂芝
牛明敏
王晓靖
李静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Railway Vocational and Technical College
Original Assignee
Zhengzhou Railway Vocational and Technical College
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Railway Vocational and Technical College filed Critical Zhengzhou Railway Vocational and Technical College
Priority to CN202110223067.1A
Publication of CN112967714A
Legal status: Withdrawn

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 — Phonemes, fenemes or fenones being the recognition units
    • G10L15/08 — Speech classification or search
    • G10L15/18 — Speech classification or search using natural language modelling
    • G10L15/183 — Speech classification or search using natural language modelling using context dependencies, e.g. language models

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses an information acquisition method for English voice, which comprises the following steps: S1, collecting and amplifying the audio signal; S2, carrying out analog filtering on the amplified audio signal; S3, converting the analog-filtered signal into a digital signal and extracting the audio characteristic parameters of the digital audio signal: attack time, spectral centroid, spectral flux, fundamental frequency, sharpness, etc.; S4, matching the audio characteristic parameters with a sound source model in a standard sound source database, matching the digital audio signal with the syllables and phonemes in the sound source model to obtain a matching degree, and performing phoneme correction according to the difference in matching degree; S5, combining the corrected phonemes into the digital audio signal; S6, performing fuzzy filtering on the synthesized digital audio signal and outputting the audio signal.

Description

Information acquisition method for English voice
Technical Field
The invention relates to the technical field of audio information acquisition and processing, in particular to an information acquisition method for English voice.
Background
With the popularization of distance education, "web lessons" play an important role as a substitute for and supplement to on-site lessons. In English teaching in particular, a teacher usually wants to deliver perfect pronunciation in classroom or training sessions, so correcting pronunciation in real time by an intelligent speech method addresses this pain point.
In the prior art, speech evaluation or correction is generally realized by comparing the teaching speech with a standard speech to give a score, or by beautifying the speech. For example, CN202010891349.4 discloses a method for generating adaptive English speech, which collects a target speech signal; analyzes and processes the collected target speech signal to obtain a corresponding signal to be retained; performs defect recognition on the signal to be retained with reference to the standard speech signal corresponding to the English speech; and inputs the speech data containing the target speech signal into a corresponding English speech output model according to the defect recognition result, obtains a speech output result, and generates the English speech, so as to improve the accuracy and intelligence of English speech output.
However, when referring to the standard English speech signal, that method does not consider the essential differences between the speaker and the standard speech signal, such as different oral articulation positions, pitch, and timbre. These differences can make the recognition of a so-called "defect" inaccurate, resulting in distortion of the English speech produced by the corresponding output model and incoherent sentences.
Meanwhile, speech beautification in the prior art cannot adapt to the characteristics of different speakers; the beautified speech is not smooth enough, and the listening experience is poor.
Disclosure of Invention
The invention aims to provide an information acquisition method for English voice, so as to solve the problems of inaccurate recognition, incoherent speech, and output distortion described in the background, which are caused by the essential differences between a speaker and a standard speech signal, such as different oral articulation positions, pitch, and timbre.
In order to achieve the purpose, the invention provides the following technical scheme:
an information acquisition method for English voice comprises the following specific steps:
s1, collecting and amplifying the audio signal;
s2, carrying out analog filtering on the amplified audio signal;
s3, converting the analog filtered signal into a digital signal and extracting the audio characteristic parameters of the digital audio signal: attack time, spectral centroid, spectral flux, fundamental frequency, sharpness, etc.;
s4, matching the audio characteristic parameters with a sound source model in a standard sound source database, then matching the digital audio signal with syllables and phonemes in the sound source model to obtain a matching degree, and performing phoneme correction according to the difference of the matching degree;
s5, combining the corrected phonemes into the digital audio signal;
s6, performing blur filtering on the synthesized digital audio signal, and outputting the audio signal.
The standard sound source database in S4 contains a plurality of sound source models of different types;
the matching degree in S4 is calculated as follows: the Pearson correlation coefficient is used, with the characteristic parameters (attack time, spectral centroid, spectral flux, fundamental frequency, sharpness, etc.) assembled into vectors; the correlation coefficient between the vectors serves as the matching degree.
The phoneme correction in S4 compares the input with the sound source model phoneme by phoneme, and phonemes whose difference exceeds a set range are corrected based on the corresponding phonemes in the sound source model. The phoneme difference is determined from phoneme attributes including tone, duration, pitch, and whether the phoneme is unvoiced, voiced or plosive. For example, /θ/ is an unvoiced consonant in which the vocal cords do not vibrate, and it must be distinguished from /ð/, /s/ and /z/; if the sharpness and energy of a /θ/ sound are large, the difference is judged to be large and correction is required.
The fuzzy filtering in S6 is implemented by combining a phase-fuzzy filter operating in the time domain with energy smoothing of the corrected phonemes according to the difference between the uncorrected phonemes and the sound source model.
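The patent gives no formula for this energy smoothing, so the following is one plausible reading, not the patented method: the corrected phoneme's energy is blended back toward its original level, with a weight that grows with how far the uncorrected phoneme deviated from the sound source model. The blend rule and `alpha` are assumptions:

```python
import numpy as np

def smooth_corrected_phoneme(corrected, original, model, alpha=0.5):
    """Energy-smooth a corrected phoneme segment (sketch of the S6 step).
    The smoothing strength scales with the uncorrected phoneme's energy
    deviation from the sound-source model; the exact rule and `alpha`
    are assumptions, since the patent does not specify a formula."""
    e_orig = np.sum(original ** 2)
    e_model = np.sum(model ** 2)
    # Relative energy deviation of the uncorrected phoneme from the model.
    deviation = abs(e_orig - e_model) / (e_model + 1e-12)
    # Larger deviation -> blend more of the original energy level back in.
    weight = min(1.0, alpha * deviation)
    e_corr = np.sum(corrected ** 2)
    target = (1 - weight) * e_corr + weight * e_orig
    # Apply a uniform gain so the segment hits the target energy.
    gain = np.sqrt(target / (e_corr + 1e-12))
    return corrected * gain

corrected = np.ones(100)          # corrected phoneme samples (placeholder)
original = np.full(100, 0.5)      # uncorrected phoneme samples (placeholder)
model = np.ones(100)              # sound-source-model phoneme (placeholder)
smoothed = smooth_corrected_phoneme(corrected, original, model)
```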
The invention also discloses an English pronunciation information acquisition system, which comprises an audio acquisition device, a pre-filtering module, an audio matching module, an audio synthesis module and a post-filtering output module;
the audio acquisition device is used for acquiring and amplifying audio signals;
the pre-filtering module is used for carrying out analog filtering on the amplified audio signal.
The audio matching module converts the analog filtered signals into digital signals, extracts audio characteristics of the digital audio signals, such as start time, spectrum centroid, spectrum flux, fundamental frequency, sharpness and the like, matches the audio characteristics with a sound source model in a standard sound source database, matches the digital audio signals with syllables and phonemes in the sound source model to obtain matching degree, and corrects phonemes according to the difference of the matching degree.
The audio synthesis module is used for combining the corrected phonemes into a digital audio signal;
the post-filter output module is used for carrying out fuzzy filtering on the synthesized digital audio signal and outputting the audio signal.
The audio acquisition device comprises a sensor for acquiring biological audio and a signal amplifier; the sensor is connected with the signal amplifier, and the signal amplifier is connected with the pre-filtering module. The pre-filtering module is a high-pass filter used for filtering out low-frequency noise.
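A common concrete choice for such a high-pass pre-filter is a first-order pre-emphasis filter; the patent does not specify the filter design, so the structure and coefficient below are assumptions:

```python
import numpy as np

def pre_emphasis(signal, coeff=0.97):
    """First-order high-pass (pre-emphasis) filter: y[n] = x[n] - a*x[n-1].
    A conventional choice for speech pre-filtering; the coefficient is an
    assumption, not taken from the patent."""
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])

sr = 16000
t = np.arange(sr) / sr
# 50 Hz hum (low-frequency noise) plus a 1 kHz speech-band tone, equal amplitude.
noisy = 0.5 * np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)
filtered = pre_emphasis(noisy)
```

After filtering, the 50 Hz component is attenuated roughly an order of magnitude more than the 1 kHz component, which is the desired high-pass behavior.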
The audio matching module further comprises a high-speed A/D converter so as to better reflect audio details.
The audio matching module further comprises an audio feature extraction module connected with the high-speed A/D converter. The audio feature extraction module is used for digital audio signal analysis and audio feature extraction, which covers the following parameters: attack time, reflecting the duration of the rising stage of the note energy; spectral centroid, reflecting the energy concentration point of the signal spectrum and the brightness of the timbre; spectral flux, reflecting the degree of variation between adjacent frames of the signal and characterizing note onsets; fundamental frequency, reflecting the frequency corresponding to the pitch of a monophonic signal; and sharpness, reflecting the energy of the high-frequency part.
The audio matching module also comprises a storage module holding an English sound source database that stores a large number of sound source models of different types, the sound source models being classified according to their audio characteristics.
The audio matching module can calculate the matching degree between the audio features of the digital audio signal and the sound source model, and decides, sentence by sentence, whether to switch the sound source model used for phoneme correction according to the matching degree. The matching degree is calculated comprehensively from several audio characteristic parameters, such as attack time, spectral centroid, spectral flux, fundamental frequency and sharpness: the Pearson correlation coefficient is used, with these characteristic parameters assembled into vectors, and the correlation coefficient between the vectors serves as the matching degree.
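The sentence-level switching decision can be sketched as follows. The feature vectors are assumed to be standardized (z-scored) placeholder values, and the switching threshold is an assumption, since the patent states only that switching is decided per sentence from the matching degree:

```python
import numpy as np

# Standardized (z-scored) feature vectors; all values are illustrative.
models = {
    "model_A": np.array([0.2, -0.5, 1.0, 0.3, -0.8]),
    "model_B": np.array([-1.0, 0.8, -0.2, 0.9, 0.1]),
}

def choose_model(sentence_feats, models, current, switch_threshold=0.9):
    """Per-sentence sound-source-model switching (sketch of the decision
    described above). The threshold value is an assumption."""
    degree = lambda name: float(np.corrcoef(sentence_feats, models[name])[0, 1])
    # Keep the current model while it still matches well enough.
    if degree(current) >= switch_threshold:
        return current
    # Otherwise switch to the best-matching model for this sentence.
    return max(models, key=degree)

sentence = np.array([-0.9, 0.7, -0.1, 1.0, 0.2])  # resembles model_B
chosen = choose_model(sentence, models, current="model_A")
```

With these placeholder vectors the sentence correlates poorly with `model_A`, so the sketch switches to `model_B`.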
The phoneme correction compares the input with the sound source model phoneme by phoneme, and phonemes whose difference exceeds a set range are corrected based on the corresponding phonemes in the sound source model.
The post-filtering in the post-filtering output module uses a fuzzy digital filter, which performs energy smoothing on the corrected phonemes according to the difference between the uncorrected phonemes and the sound source model and may be combined with a phase-fuzzy filter operating in the time domain.
Compared with the prior art, to overcome the inaccurate recognition caused by differences in oral articulation position, pitch, timbre and the like, the method matches different sound source models according to the pronunciation characteristics of the speaker and the sentences, syllables and phonemes in the sound source models, and performs phoneme-level correction after matching; the corrected phonemes are synthesized into a digital audio signal and then fuzzy-filtered, so that the speech is smoother and more natural.
Drawings
Fig. 1 is a block diagram of an english pronunciation information acquisition system.
Fig. 2 is a detailed schematic diagram of audio acquisition.
Fig. 3 is a detailed diagram of the audio matching module.
Fig. 4 is a step diagram of an information collecting method for english language voice.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
As shown in fig. 4, this embodiment provides an information acquisition method for English voice, the specific steps of which are as follows:
s1, collecting and amplifying the audio signal;
s2, carrying out analog filtering on the amplified audio signal;
s3, converting the analog filtered signal into a digital signal and extracting the audio characteristic parameters of the digital audio signal: attack time, spectral centroid, spectral flux, fundamental frequency, sharpness, etc.;
s4, matching the audio characteristic parameters with a sound source model in a standard sound source database, then matching the digital audio signal with syllables and phonemes in the sound source model to obtain a matching degree, and performing phoneme correction according to the difference of the matching degree;
s5, combining the corrected phonemes into the digital audio signal;
s6, performing blur filtering on the synthesized digital audio signal, and outputting the audio signal.
The standard sound source database in S4 contains a plurality of sound source models of different types.
The matching degree in S4 is calculated as follows: the Pearson correlation coefficient is used, with the characteristic parameters (attack time, spectral centroid, spectral flux, fundamental frequency, sharpness, etc.) assembled into vectors; the correlation coefficient between the vectors serves as the matching degree.
The phoneme correction in S4 compares the input with the sound source model phoneme by phoneme, and phonemes whose difference exceeds a set range are corrected based on the corresponding phonemes in the sound source model. The phoneme difference is determined from phoneme attributes including tone, duration, pitch, and whether the phoneme is unvoiced, voiced or plosive. For example, /θ/ is an unvoiced consonant in which the vocal cords do not vibrate, and it must be distinguished from /ð/, /s/ and /z/; if the sharpness and energy of a /θ/ sound are large, the difference is judged to be large and correction is required.
The fuzzy filtering in S6 is implemented by combining a phase-fuzzy filter operating in the time domain with energy smoothing of the corrected phonemes according to the difference between the uncorrected phonemes and the sound source model.
Example 2
As shown in fig. 1, a specific embodiment of the present invention is an english pronunciation information collecting system, which includes an audio collecting device 1, a pre-filter module 2, an audio matching module 3, an audio synthesizing module 4, and a post-filter output module 5;
the audio acquisition device 1 is used for acquiring and amplifying audio signals,
the pre-filter module 2 is configured to perform analog filtering on the amplified audio signal,
the audio matching module 3 converts the analog-filtered signal into a digital signal and extracts audio characteristics of the digital audio signal, such as attack time, spectral centroid, spectral flux, fundamental frequency and sharpness; it then matches the audio characteristics with a sound source model in a standard sound source database, matches the digital audio signal with the syllables and phonemes in the sound source model to obtain a matching degree, and corrects phonemes according to the difference in matching degree;
the audio synthesis module 4 is configured to combine the corrected phonemes into a digital audio signal;
the post-filter output module 5 is configured to perform fuzzy filtering on the synthesized digital audio signal and output an audio signal.
In a further improvement, as shown in fig. 2, the audio acquisition device 1 includes a sensor 1-1 for acquiring biological audio and a signal amplifier 1-2; the sensor 1-1 is connected with the signal amplifier 1-2, and the signal amplifier 1-2 is connected with the pre-filter module 2. The pre-filter module 2 is a high-pass filter 2' for filtering out low-frequency noise.
A further improvement is that the audio matching module 3 further comprises a high speed a/D converter 3-1, as shown in fig. 3, in order to better reflect the audio details.
In a further improvement, the audio matching module 3 further comprises an audio feature extraction module 3-2 connected to the high-speed A/D converter 3-1. The audio feature extraction module 3-2 is configured to implement digital audio signal analysis and audio feature extraction, which covers the following parameters: attack time, reflecting the duration of the rising stage of the note energy; spectral centroid, reflecting the energy concentration point of the signal spectrum and the brightness of the timbre; spectral flux, reflecting the degree of variation between adjacent frames of the signal and characterizing note onsets; fundamental frequency, reflecting the frequency corresponding to the pitch of a monophonic signal; and sharpness, reflecting the energy of the high-frequency part.
The further improvement is that the audio matching module 3 further comprises a storage module 3-3 for storing an english sound source database of a large number of sound source models of different types, and the sound source models are classified according to the audio features.
In a further improvement, the audio matching module 3 can calculate the matching degree between the audio features of the digital audio signal and the sound source model, and decides, sentence by sentence, whether to switch the sound source model used for phoneme correction according to the matching degree. The matching degree is calculated comprehensively from several audio characteristic parameters, such as attack time, spectral centroid, spectral flux, fundamental frequency and sharpness: the Pearson correlation coefficient is used, with these characteristic parameters assembled into vectors, and the correlation coefficient between the vectors serves as the matching degree.
In a further improvement, the phoneme correction compares the input with the sound source model phoneme by phoneme, and phonemes whose difference from the sound source model exceeds a set range are corrected based on the corresponding phonemes in the sound source model.
In a further improvement, the post-filtering in the post-filtering output module 5 uses a fuzzy digital filter, which performs energy smoothing on the corrected phonemes according to the difference between the uncorrected phonemes and the sound source model and may be combined with a phase-fuzzy filter operating in the time domain.
In order to overcome the inaccurate recognition caused by differences in oral articulation position, pitch, timbre and the like, different sound source models are matched according to the pronunciation characteristics of the speaker and the sentences, syllables and phonemes in the sound source models, and phoneme-level correction is performed after matching; the corrected phonemes are synthesized into a digital audio signal and then fuzzy-filtered, so that the speech is smoother and more natural.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (5)

1. An information acquisition method for English voice is characterized by comprising the following steps:
s1, collecting and amplifying the audio signal;
s2, carrying out analog filtering on the amplified audio signal;
s3, converting the analog filtered signal into a digital signal and extracting the audio characteristic parameters of the digital audio signal: attack time, spectral centroid, spectral flux, fundamental frequency, sharpness, etc.;
s4, matching the audio characteristic parameters with a sound source model in a standard sound source database, then matching the digital audio signal with syllables and phonemes in the sound source model to obtain a matching degree, and performing phoneme correction according to the difference of the matching degree;
s5, combining the corrected phonemes into the digital audio signal;
s6, performing blur filtering on the synthesized digital audio signal, and outputting the audio signal.
2. The information acquisition method for english voice according to claim 1, characterized in that: the sound source models in the standard sound source database in S4 are of a plurality of different types.
3. The information acquisition method for English voice according to claim 1 or 2, characterized in that: the matching degree calculation method in S4 is specifically as follows: the matching degree is calculated using the Pearson correlation coefficient, with the characteristic parameters such as the attack time, spectral centroid, spectral flux, fundamental frequency and sharpness taken as vectors.
4. The information acquisition method for english voice according to claim 1, characterized in that: the implementation manner of the blur filtering in S6 is as follows: and combining a phase fuzzy filter working in a time domain, and performing energy smoothing treatment on the corrected phonemes according to the difference value between the uncorrected phonemes and the sound source model.
5. The information acquisition method for English voice according to claim 4, characterized in that: a correlation coefficient of the vectors is calculated, and the correlation coefficient is used as the matching degree.
CN202110223067.1A 2021-03-01 2021-03-01 Information acquisition method for English voice Withdrawn CN112967714A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110223067.1A CN112967714A (en) 2021-03-01 2021-03-01 Information acquisition method for English voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110223067.1A CN112967714A (en) 2021-03-01 2021-03-01 Information acquisition method for English voice

Publications (1)

Publication Number Publication Date
CN112967714A (en) 2021-06-15

Family

ID=76275954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110223067.1A Withdrawn CN112967714A (en) 2021-03-01 2021-03-01 Information acquisition method for English voice

Country Status (1)

Country Link
CN (1) CN112967714A (en)

Similar Documents

Publication Publication Date Title
US11322155B2 (en) Method and apparatus for establishing voiceprint model, computer device, and storage medium
CN103928023B (en) A kind of speech assessment method and system
CN106531185B (en) voice evaluation method and system based on voice similarity
CN103617799B (en) A kind of English statement pronunciation quality detection method being adapted to mobile device
CN108847215B (en) Method and device for voice synthesis based on user timbre
Felps et al. Foreign accent conversion through concatenative synthesis in the articulatory domain
CN105825852A (en) Oral English reading test scoring method
CN101930747A (en) Method and device for converting voice into mouth shape image
CN111489763B (en) GMM model-based speaker recognition self-adaption method in complex environment
CN113436606B (en) Original sound speech translation method
CN110349565B (en) Auxiliary pronunciation learning method and system for hearing-impaired people
CN115050387A (en) Multi-dimensional singing playing analysis evaluation method and system in art evaluation
JP2002091472A (en) Rhythm display device, and reproducing device and similarity judging device for voice language and voice language processor and recording medium
CN114283822A (en) Many-to-one voice conversion method based on gamma pass frequency cepstrum coefficient
JPH05307399A (en) Voice analysis system
CN108428458A (en) A kind of vocality study electron assistant articulatory system
CN112967538B (en) English pronunciation information acquisition system
CN112967714A (en) Information acquisition method for English voice
CN115985310A (en) Dysarthria voice recognition method based on multi-stage audio-visual fusion
CN107919115A (en) A kind of feature compensation method based on nonlinear spectral conversion
CN110164414B (en) Voice processing method and device and intelligent equipment
CN112951208B (en) Method and device for speech recognition
CN110033786B (en) Gender judgment method, device, equipment and readable storage medium
CN113129923A (en) Multi-dimensional singing playing analysis evaluation method and system in art evaluation
Koreman Decoding linguistic information in the glottal airflow

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20210615)