CN112967714A - Information acquisition method for English voice - Google Patents

Information acquisition method for English voice

Info

Publication number
CN112967714A
CN112967714A
Authority
CN
China
Prior art keywords
audio signal
sound source
signal
phonemes
matching degree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110223067.1A
Other languages
Chinese (zh)
Inventor
张敏
李琦
丁桂芝
牛明敏
王晓靖
李静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Railway Vocational and Technical College
Original Assignee
Zhengzhou Railway Vocational and Technical College
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Railway Vocational and Technical College filed Critical Zhengzhou Railway Vocational and Technical College
Priority to CN202110223067.1A
Publication of CN112967714A
Legal status: Withdrawn

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 — Phonemes, fenemes or fenones being the recognition units
    • G10L15/08 — Speech classification or search
    • G10L15/18 — Speech classification or search using natural language modelling
    • G10L15/183 — Speech classification or search using natural language modelling using context dependencies, e.g. language models

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses an information acquisition method for English voice, which comprises the following steps: S1, collecting and amplifying the audio signal; S2, carrying out analog filtering on the amplified audio signal; S3, converting the analog-filtered signal into a digital signal and extracting the audio characteristic parameters of the digital audio signal: attack time, spectral centroid, spectral flux, fundamental frequency, sharpness, etc.; S4, matching the audio characteristic parameters with a sound source model in a standard sound source database, matching the digital audio signal with the syllables and phonemes in the sound source model to obtain a matching degree, and performing phoneme correction according to the difference in matching degree; S5, combining the corrected phonemes into the digital audio signal; S6, performing fuzzy filtering on the synthesized digital audio signal and outputting the audio signal.

Description

Information acquisition method for English voice
Technical Field
The invention relates to the technical field of audio information acquisition and processing, in particular to an information acquisition method for English voice.
Background
With the popularization of distance education, "web lessons" play an important role as a substitute for and supplement to on-site lessons. In English teaching in particular, a teacher usually wants to deliver perfect pronunciation in classroom or training sessions, so correcting pronunciation in real time by an intelligent speech method addresses this pain point.
In the prior art, speech evaluation or correction is generally realized by comparing the teaching speech with a standard speech to give a score, or by beautifying the speech. For example, CN202010891349.4 discloses a method for generating adaptive English speech, which collects a target speech signal; analyzes and processes the collected target speech signal to obtain a corresponding signal to be retained; performs defect recognition on the signal to be retained with reference to the standard speech signal corresponding to the English speech; and inputs the speech data containing the target speech signal into a corresponding English speech output model according to the defect recognition result, obtains a speech output result, and generates the English speech, so as to improve the accuracy and intelligence of English speech output.
However, when referring to the standard English speech signal, that method does not consider the essential differences between the speaker and the standard speech signal, such as different oral articulation positions, pitch, and timbre. These differences can make the recognition of a so-called "defect" inaccurate, resulting in distortion of the English speech produced by the corresponding output model and incoherent sentences.
Meanwhile, speech beautification in the prior art cannot adapt to the characteristics of different speakers; the beautified speech is not smooth enough, and the listening experience is poor.
Disclosure of Invention
The invention aims to provide an information acquisition method for English voice, so as to solve the problems of inaccurate recognition, incoherent speech, and output distortion described in the background, which are caused by the essential differences between a speaker and a standard speech signal, such as different oral articulation positions, pitch, and timbre.
In order to achieve the purpose, the invention provides the following technical scheme:
an information acquisition method for English voice comprises the following specific steps:
s1, collecting and amplifying the audio signal;
s2, carrying out analog filtering on the amplified audio signal;
s3, converting the analog filtered signal into a digital signal and extracting the audio characteristic parameters of the digital audio signal: attack time, spectral centroid, spectral flux, fundamental frequency, sharpness, etc.;
s4, matching the audio characteristic parameters with a sound source model in a standard sound source database, then matching the digital audio signal with syllables and phonemes in the sound source model to obtain a matching degree, and performing phoneme correction according to the difference of the matching degree;
s5, combining the corrected phonemes into the digital audio signal;
s6, performing blur filtering on the synthesized digital audio signal, and outputting the audio signal.
The standard sound source database in S4 contains a plurality of sound source models of different types;
the matching degree in S4 is calculated as follows: the Pearson correlation coefficient is used, with the characteristic parameters (attack time, spectral centroid, spectral flux, fundamental frequency, sharpness, etc.) assembled into vectors; the correlation coefficient between the vectors serves as the matching degree.
The phoneme correction in S4 compares the input with the sound source model phoneme by phoneme, and phonemes whose difference exceeds a set range are corrected based on the corresponding phonemes in the sound source model. The phoneme difference is determined from phoneme attributes including tone, duration, pitch, and whether the phoneme is unvoiced, voiced or plosive. For example, /θ/ is an unvoiced consonant in which the vocal cords do not vibrate, and it must be distinguished from /ð/, /s/ and /z/; if the sharpness and energy of a /θ/ sound are large, the difference is judged to be large and correction is required.
The fuzzy filtering in S6 is implemented by combining a phase-fuzzy filter operating in the time domain with energy smoothing of the corrected phonemes according to the difference between the uncorrected phonemes and the sound source model.
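The patent gives no formula for this energy smoothing, so the following is one plausible reading, not the patented method: the corrected phoneme's energy is blended back toward its original level, with a weight that grows with how far the uncorrected phoneme deviated from the sound source model. The blend rule and `alpha` are assumptions:

```python
import numpy as np

def smooth_corrected_phoneme(corrected, original, model, alpha=0.5):
    """Energy-smooth a corrected phoneme segment (sketch of the S6 step).
    The smoothing strength scales with the uncorrected phoneme's energy
    deviation from the sound-source model; the exact rule and `alpha`
    are assumptions, since the patent does not specify a formula."""
    e_orig = np.sum(original ** 2)
    e_model = np.sum(model ** 2)
    # Relative energy deviation of the uncorrected phoneme from the model.
    deviation = abs(e_orig - e_model) / (e_model + 1e-12)
    # Larger deviation -> blend more of the original energy level back in.
    weight = min(1.0, alpha * deviation)
    e_corr = np.sum(corrected ** 2)
    target = (1 - weight) * e_corr + weight * e_orig
    # Apply a uniform gain so the segment hits the target energy.
    gain = np.sqrt(target / (e_corr + 1e-12))
    return corrected * gain

corrected = np.ones(100)          # corrected phoneme samples (placeholder)
original = np.full(100, 0.5)      # uncorrected phoneme samples (placeholder)
model = np.ones(100)              # sound-source-model phoneme (placeholder)
smoothed = smooth_corrected_phoneme(corrected, original, model)
```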
The invention also discloses an English pronunciation information acquisition system, which comprises an audio acquisition device, a pre-filtering module, an audio matching module, an audio synthesis module and a post-filtering output module;
the audio acquisition device is used for acquiring and amplifying audio signals;
the pre-filtering module is used for carrying out analog filtering on the amplified audio signal.
The audio matching module converts the analog filtered signals into digital signals, extracts audio characteristics of the digital audio signals, such as start time, spectrum centroid, spectrum flux, fundamental frequency, sharpness and the like, matches the audio characteristics with a sound source model in a standard sound source database, matches the digital audio signals with syllables and phonemes in the sound source model to obtain matching degree, and corrects phonemes according to the difference of the matching degree.
The audio synthesis module is used for combining the corrected phonemes into a digital audio signal;
the post-filter output module is used for carrying out fuzzy filtering on the synthesized digital audio signal and outputting the audio signal.
The audio acquisition device comprises a sensor for acquiring biological audio and a signal amplifier; the sensor is connected with the signal amplifier, and the signal amplifier is connected with the pre-filtering module. The pre-filtering module is a high-pass filter used for filtering out low-frequency noise.
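A common concrete choice for such a high-pass pre-filter is a first-order pre-emphasis filter; the patent does not specify the filter design, so the structure and coefficient below are assumptions:

```python
import numpy as np

def pre_emphasis(signal, coeff=0.97):
    """First-order high-pass (pre-emphasis) filter: y[n] = x[n] - a*x[n-1].
    A conventional choice for speech pre-filtering; the coefficient is an
    assumption, not taken from the patent."""
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])

sr = 16000
t = np.arange(sr) / sr
# 50 Hz hum (low-frequency noise) plus a 1 kHz speech-band tone, equal amplitude.
noisy = 0.5 * np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)
filtered = pre_emphasis(noisy)
```

After filtering, the 50 Hz component is attenuated roughly an order of magnitude more than the 1 kHz component, which is the desired high-pass behavior.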
The audio matching module further comprises a high-speed A/D converter so as to better reflect audio details.
The audio matching module further comprises an audio feature extraction module connected with the high-speed A/D converter. The audio feature extraction module is used for digital audio signal analysis and audio feature extraction, which covers the following parameters: attack time, reflecting the duration of the rising stage of the note energy; spectral centroid, reflecting the energy concentration point of the signal spectrum and the brightness of the timbre; spectral flux, reflecting the degree of variation between adjacent frames of the signal and characterizing note onsets; fundamental frequency, reflecting the frequency corresponding to the pitch of a monophonic signal; and sharpness, reflecting the energy of the high-frequency part.
The audio matching module also comprises a storage module holding an English sound source database that stores a large number of sound source models of different types, the sound source models being classified according to their audio characteristics.
The audio matching module can calculate the matching degree between the audio features of the digital audio signal and the sound source model, and decides, sentence by sentence, whether to switch the sound source model used for phoneme correction according to the matching degree. The matching degree is calculated comprehensively from several audio characteristic parameters, such as attack time, spectral centroid, spectral flux, fundamental frequency and sharpness: the Pearson correlation coefficient is used, with these characteristic parameters assembled into vectors, and the correlation coefficient between the vectors serves as the matching degree.
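The sentence-level switching decision can be sketched as follows. The feature vectors are assumed to be standardized (z-scored) placeholder values, and the switching threshold is an assumption, since the patent states only that switching is decided per sentence from the matching degree:

```python
import numpy as np

# Standardized (z-scored) feature vectors; all values are illustrative.
models = {
    "model_A": np.array([0.2, -0.5, 1.0, 0.3, -0.8]),
    "model_B": np.array([-1.0, 0.8, -0.2, 0.9, 0.1]),
}

def choose_model(sentence_feats, models, current, switch_threshold=0.9):
    """Per-sentence sound-source-model switching (sketch of the decision
    described above). The threshold value is an assumption."""
    degree = lambda name: float(np.corrcoef(sentence_feats, models[name])[0, 1])
    # Keep the current model while it still matches well enough.
    if degree(current) >= switch_threshold:
        return current
    # Otherwise switch to the best-matching model for this sentence.
    return max(models, key=degree)

sentence = np.array([-0.9, 0.7, -0.1, 1.0, 0.2])  # resembles model_B
chosen = choose_model(sentence, models, current="model_A")
```

With these placeholder vectors the sentence correlates poorly with `model_A`, so the sketch switches to `model_B`.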
The phoneme correction compares the input with the sound source model phoneme by phoneme, and phonemes whose difference exceeds a set range are corrected based on the corresponding phonemes in the sound source model.
The post-filtering in the post-filtering output module uses a fuzzy digital filter, which performs energy smoothing on the corrected phonemes according to the difference between the uncorrected phonemes and the sound source model and may be combined with a phase-fuzzy filter operating in the time domain.
Compared with the prior art, to overcome the inaccurate recognition caused by differences in oral articulation position, pitch, timbre and the like, the method matches different sound source models according to the pronunciation characteristics of the speaker and the sentences, syllables and phonemes in the sound source models, and performs phoneme-level correction after matching; the corrected phonemes are synthesized into a digital audio signal and then fuzzy-filtered, so that the speech is smoother and more natural.
Drawings
Fig. 1 is a block diagram of an english pronunciation information acquisition system.
Fig. 2 is a detailed schematic diagram of audio acquisition.
Fig. 3 is a detailed diagram of the audio matching module.
Fig. 4 is a step diagram of an information collecting method for english language voice.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
As shown in fig. 4, this embodiment provides an information acquisition method for English voice, the specific steps of which are as follows:
s1, collecting and amplifying the audio signal;
s2, carrying out analog filtering on the amplified audio signal;
s3, converting the analog filtered signal into a digital signal and extracting the audio characteristic parameters of the digital audio signal: attack time, spectral centroid, spectral flux, fundamental frequency, sharpness, etc.;
s4, matching the audio characteristic parameters with a sound source model in a standard sound source database, then matching the digital audio signal with syllables and phonemes in the sound source model to obtain a matching degree, and performing phoneme correction according to the difference of the matching degree;
s5, combining the corrected phonemes into the digital audio signal;
s6, performing blur filtering on the synthesized digital audio signal, and outputting the audio signal.
The standard sound source database in S4 contains a plurality of sound source models of different types.
The matching degree in S4 is calculated as follows: the Pearson correlation coefficient is used, with the characteristic parameters (attack time, spectral centroid, spectral flux, fundamental frequency, sharpness, etc.) assembled into vectors; the correlation coefficient between the vectors serves as the matching degree.
The phoneme correction in S4 compares the input with the sound source model phoneme by phoneme, and phonemes whose difference exceeds a set range are corrected based on the corresponding phonemes in the sound source model. The phoneme difference is determined from phoneme attributes including tone, duration, pitch, and whether the phoneme is unvoiced, voiced or plosive. For example, /θ/ is an unvoiced consonant in which the vocal cords do not vibrate, and it must be distinguished from /ð/, /s/ and /z/; if the sharpness and energy of a /θ/ sound are large, the difference is judged to be large and correction is required.
The fuzzy filtering in S6 is implemented by combining a phase-fuzzy filter operating in the time domain with energy smoothing of the corrected phonemes according to the difference between the uncorrected phonemes and the sound source model.
Example 2
As shown in fig. 1, a specific embodiment of the present invention is an english pronunciation information collecting system, which includes an audio collecting device 1, a pre-filter module 2, an audio matching module 3, an audio synthesizing module 4, and a post-filter output module 5;
the audio acquisition device 1 is used for acquiring and amplifying audio signals,
the pre-filter module 2 is configured to perform analog filtering on the amplified audio signal,
the audio matching module 3 converts the analog-filtered signal into a digital signal and extracts audio characteristics of the digital audio signal, such as attack time, spectral centroid, spectral flux, fundamental frequency and sharpness; it then matches the audio characteristics with a sound source model in a standard sound source database, matches the digital audio signal with the syllables and phonemes in the sound source model to obtain a matching degree, and corrects phonemes according to the difference in matching degree;
the audio synthesis module 4 is configured to combine the corrected phonemes into a digital audio signal;
the post-filter output module 5 is configured to perform fuzzy filtering on the synthesized digital audio signal and output an audio signal.
In a further improvement, as shown in fig. 2, the audio acquisition device 1 includes a sensor 1-1 for acquiring biological audio and a signal amplifier 1-2; the sensor 1-1 is connected with the signal amplifier 1-2, and the signal amplifier 1-2 is connected with the pre-filter module 2. The pre-filter module 2 is a high-pass filter 2' for filtering out low-frequency noise.
A further improvement is that the audio matching module 3 further comprises a high speed a/D converter 3-1, as shown in fig. 3, in order to better reflect the audio details.
In a further improvement, the audio matching module 3 further comprises an audio feature extraction module 3-2 connected to the high-speed A/D converter 3-1. The audio feature extraction module 3-2 is configured to implement digital audio signal analysis and audio feature extraction, which covers the following parameters: attack time, reflecting the duration of the rising stage of the note energy; spectral centroid, reflecting the energy concentration point of the signal spectrum and the brightness of the timbre; spectral flux, reflecting the degree of variation between adjacent frames of the signal and characterizing note onsets; fundamental frequency, reflecting the frequency corresponding to the pitch of a monophonic signal; and sharpness, reflecting the energy of the high-frequency part.
The further improvement is that the audio matching module 3 further comprises a storage module 3-3 for storing an english sound source database of a large number of sound source models of different types, and the sound source models are classified according to the audio features.
In a further improvement, the audio matching module 3 can calculate the matching degree between the audio features of the digital audio signal and the sound source model, and decides, sentence by sentence, whether to switch the sound source model used for phoneme correction according to the matching degree. The matching degree is calculated comprehensively from several audio characteristic parameters, such as attack time, spectral centroid, spectral flux, fundamental frequency and sharpness: the Pearson correlation coefficient is used, with these characteristic parameters assembled into vectors, and the correlation coefficient between the vectors serves as the matching degree.
In a further improvement, the phoneme correction compares the input with the sound source model phoneme by phoneme, and phonemes whose difference from the sound source model exceeds a set range are corrected based on the corresponding phonemes in the sound source model.
In a further improvement, the post-filtering in the post-filtering output module 5 uses a fuzzy digital filter, which performs energy smoothing on the corrected phonemes according to the difference between the uncorrected phonemes and the sound source model and may be combined with a phase-fuzzy filter operating in the time domain.
In order to overcome the inaccurate recognition caused by differences in oral articulation position, pitch, timbre and the like, different sound source models are matched according to the pronunciation characteristics of the speaker and the sentences, syllables and phonemes in the sound source models, and phoneme-level correction is performed after matching; the corrected phonemes are synthesized into a digital audio signal and then fuzzy-filtered, so that the speech is smoother and more natural.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (5)

1. An information acquisition method for English voice is characterized by comprising the following steps:
s1, collecting and amplifying the audio signal;
s2, carrying out analog filtering on the amplified audio signal;
s3, converting the analog filtered signal into a digital signal and extracting the audio characteristic parameters of the digital audio signal: attack time, spectral centroid, spectral flux, fundamental frequency, sharpness, etc.;
s4, matching the audio characteristic parameters with a sound source model in a standard sound source database, then matching the digital audio signal with syllables and phonemes in the sound source model to obtain a matching degree, and performing phoneme correction according to the difference of the matching degree;
s5, combining the corrected phonemes into the digital audio signal;
s6, performing blur filtering on the synthesized digital audio signal, and outputting the audio signal.
2. The information acquisition method for english voice according to claim 1, characterized in that: the sound source models in the standard sound source database in S4 are of a plurality of different types.
3. The information acquisition method for English voice according to claim 1 or 2, characterized in that: the matching degree calculation method in S4 is specifically as follows: the matching degree is calculated using the Pearson correlation coefficient, with the characteristic parameters such as the attack time, spectral centroid, spectral flux, fundamental frequency and sharpness taken as vectors.
4. The information acquisition method for english voice according to claim 1, characterized in that: the implementation manner of the blur filtering in S6 is as follows: and combining a phase fuzzy filter working in a time domain, and performing energy smoothing treatment on the corrected phonemes according to the difference value between the uncorrected phonemes and the sound source model.
5. The information acquisition method for English voice according to claim 4, characterized in that: a correlation coefficient of the vectors is calculated, and the correlation coefficient is used as the matching degree.
CN202110223067.1A 2021-03-01 2021-03-01 Information acquisition method for English voice Withdrawn CN112967714A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110223067.1A CN112967714A (en) 2021-03-01 2021-03-01 Information acquisition method for English voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110223067.1A CN112967714A (en) 2021-03-01 2021-03-01 Information acquisition method for English voice

Publications (1)

Publication Number Publication Date
CN112967714A (en) 2021-06-15

Family

ID=76275954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110223067.1A Withdrawn CN112967714A (en) 2021-03-01 2021-03-01 Information acquisition method for English voice

Country Status (1)

Country Link
CN (1) CN112967714A (en)

Similar Documents

Publication Publication Date Title
US11322155B2 (en) Method and apparatus for establishing voiceprint model, computer device, and storage medium
CN103928023B (en) A kind of speech assessment method and system
CN106531185B (en) voice evaluation method and system based on voice similarity
CN103617799B (en) A kind of English statement pronunciation quality detection method being adapted to mobile device
CN108847215B (en) Method and device for voice synthesis based on user timbre
Felps et al. Foreign accent conversion through concatenative synthesis in the articulatory domain
CN105825852A (en) Oral English reading test scoring method
CN101930747A (en) Method and device for converting voice into mouth shape image
CN111489763B (en) GMM model-based speaker recognition self-adaption method in complex environment
CN113436606B (en) Original sound speech translation method
CN110349565B (en) Auxiliary pronunciation learning method and system for hearing-impaired people
CN115050387A (en) Multi-dimensional singing playing analysis evaluation method and system in art evaluation
JP2002091472A (en) Rhythm display device, and reproducing device and similarity judging device for voice language and voice language processor and recording medium
CN114283822A (en) Many-to-one voice conversion method based on gamma pass frequency cepstrum coefficient
JPH05307399A (en) Voice analysis system
CN108428458A (en) A kind of vocality study electron assistant articulatory system
CN112967538B (en) English pronunciation information acquisition system
CN112967714A (en) Information acquisition method for English voice
CN115985310A (en) Dysarthria voice recognition method based on multi-stage audio-visual fusion
CN107919115A (en) A kind of feature compensation method based on nonlinear spectral conversion
CN110164414B (en) Voice processing method and device and intelligent equipment
CN112951208B (en) Method and device for speech recognition
CN110033786B (en) Gender judgment method, device, equipment and readable storage medium
CN113129923A (en) Multi-dimensional singing playing analysis evaluation method and system in art evaluation
Koreman Decoding linguistic information in the glottal airflow

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20210615)