CN113571043A - Dialect simulation force evaluation method and device, electronic equipment and storage medium


Info

Publication number
CN113571043A
CN113571043A (application CN202110850935.9A)
Authority
CN
China
Prior art keywords
reference template
dialect
to-be-evaluated
speech
frequency spectrum
Prior art date
Legal status
Pending
Application number
CN202110850935.9A
Other languages
Chinese (zh)
Inventor
马金龙
熊佳
王伟喆
曾锐鸿
罗箫
焦南凯
盘子圣
徐志坚
谢睿
陈光尧
Current Assignee
Guangzhou Huancheng Culture Media Co ltd
Original Assignee
Guangzhou Huancheng Culture Media Co ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Huancheng Culture Media Co ltd
Priority to CN202110850935.9A
Publication of CN113571043A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/005 - Language recognition
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination


Abstract

The invention discloses a dialect simulation force evaluation method and device, electronic equipment and a storage medium, which are used for solving the technical problems of high algorithm complexity, limited model generalization capability and low calculation efficiency in existing imitation evaluation schemes. The method comprises the following steps: receiving a voice signal input by a user; extracting a to-be-evaluated speech feature from the voice signal; matching against preset dialect speech feature reference templates to obtain a target reference template for the to-be-evaluated speech feature; and calculating a simulation force score for the to-be-evaluated speech feature according to the target reference template.

Description

Dialect simulation force evaluation method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of voice evaluation, in particular to a dialect simulation force evaluation method and device, electronic equipment and a storage medium.
Background
With the rapid development of 5G and artificial intelligence and the rise of pan-entertainment products such as live streaming and short video, more and more social gameplay has been developed, giving rise to voice gameplay based on regional characteristics, for example, finding play partners by locking onto a user's IP (Internet Protocol) address location, recommending content based on the user's registered home region, creating rooms by region, and the like. In addition, dialect imitation leaderboards can draw users who speak the same dialect together on the platform, increasing the social connections among users.
Currently, mainstream imitation evaluation schemes fall into two main categories: word-level spoken language evaluation algorithms based on speech recognition, and distortion evaluation methods based on traditional speech signal processing.
However, a word-level spoken language evaluation algorithm based on speech recognition requires a complete speech recognition system, so this scheme needs large dialect data sets for model training as well as a client/server (C/S) interaction system, making the algorithm complexity high and on-device implementation difficult. The distortion evaluation method based on traditional speech signal processing places high demands on the consistency of the reference speech and the speech under test: their alignment and durations must match as closely as possible, so its interference resistance and robustness are poor. Methods tailored to specific speakers and fixed utterances are, moreover, poorly suited to speaker-independent dialect scenarios.
Disclosure of Invention
The invention provides a dialect simulation force evaluation method and device, electronic equipment and a storage medium, which are used for solving the technical problems of high algorithm complexity, limited model generalization capability and low calculation efficiency in existing imitation evaluation schemes.
The invention provides a dialect simulation force evaluation method, which comprises the following steps:
receiving a voice signal input by a user;
extracting a to-be-evaluated speech feature from the voice signal;
matching against preset dialect speech feature reference templates to obtain a target reference template for the to-be-evaluated speech feature;
and calculating a simulation force score for the to-be-evaluated speech feature according to the target reference template.
Optionally, the step of extracting the to-be-evaluated speech feature from the voice signal includes:
preprocessing the voice signal to obtain a preprocessed signal;
extracting the to-be-evaluated speech feature from the preprocessed signal.
Optionally, the step of extracting the to-be-evaluated speech feature from the preprocessed signal includes:
performing a fast Fourier transform on the preprocessed voice signal to obtain its frequency spectrum;
squaring the frequency spectrum to obtain a short-time energy spectrum;
obtaining the magnitude spectrum of the frequency spectrum and converting the magnitude spectrum into a Mel spectrum;
calculating the logarithm of the Mel spectrum according to the short-time energy spectrum and the Mel spectrum;
and performing a discrete cosine transform on the logarithm to obtain Mel frequency cepstrum coefficients, which serve as the to-be-evaluated speech feature of the preprocessed voice signal.
Optionally, the step of matching against preset dialect speech feature reference templates to obtain the target reference template for the to-be-evaluated speech feature includes:
performing a dynamic time warping calculation between the to-be-evaluated speech feature and the preset dialect speech feature reference templates, and selecting, according to the calculation result, the target reference template for the to-be-evaluated speech feature from among the preset dialect speech feature reference templates.
The invention also provides a dialect simulation force evaluation device, which comprises:
a voice signal receiving module, used for receiving a voice signal input by a user;
a to-be-evaluated speech feature extraction module, used for extracting the to-be-evaluated speech feature from the voice signal;
a target reference template matching module, used for matching against preset dialect speech feature reference templates to obtain a target reference template for the to-be-evaluated speech feature;
and a simulation force score calculation module, used for calculating the simulation force score of the to-be-evaluated speech feature according to the target reference template.
Optionally, the to-be-evaluated speech feature extraction module includes:
a preprocessing submodule, used for preprocessing the voice signal to obtain a preprocessed signal;
and a feature extraction submodule, used for extracting the to-be-evaluated speech feature from the preprocessed signal.
Optionally, the feature extraction submodule includes:
a frequency spectrum calculation unit, used for performing a fast Fourier transform on the preprocessed voice signal to obtain its frequency spectrum;
a short-time energy spectrum calculation unit, used for squaring the frequency spectrum to obtain a short-time energy spectrum;
a Mel spectrum calculation unit, used for obtaining the magnitude spectrum of the frequency spectrum and converting it into a Mel spectrum;
a logarithm calculation unit, used for calculating the logarithm of the Mel spectrum according to the short-time energy spectrum and the Mel spectrum;
and a feature calculation unit, used for performing a discrete cosine transform on the logarithm to obtain Mel frequency cepstrum coefficients, which serve as the to-be-evaluated speech feature of the preprocessed voice signal.
Optionally, the target reference template matching module includes:
a target reference template matching submodule, used for performing a dynamic time warping calculation between the to-be-evaluated speech feature and the preset dialect speech feature reference templates, and selecting, according to the calculation result, the target reference template for the to-be-evaluated speech feature.
The invention also provides an electronic device comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the dialect simulation force evaluation method according to instructions in the program code.
The present invention also provides a computer-readable storage medium for storing program code for performing the dialect simulation force evaluation method as described in any one of the above.
According to the above technical solutions, the invention has the following advantages: the invention receives a voice signal input by a user; extracts the to-be-evaluated speech feature from the voice signal; matches against preset dialect speech feature reference templates to obtain a target reference template for the to-be-evaluated speech feature; and calculates a simulation force score for the to-be-evaluated speech feature according to the target reference template. The method thus provides the user with an efficient, real-time dialect evaluation result and increases the user's enthusiasm for practice.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without inventive exercise.
Fig. 1 is a flowchart illustrating steps of a dialect simulation force evaluation method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating the steps of a dialect simulation force evaluation method according to another embodiment of the present invention;
FIG. 3 is a frequency diagram of a Mel filter bank according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an MFCC extraction process according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a DTW algorithm according to an embodiment of the present invention;
fig. 6 is a schematic diagram of an android integration flow of the dialect simulation force evaluation method provided by the embodiment of the present invention;
fig. 7 is a block diagram of a dialect simulation force evaluation device according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a dialect simulation force evaluation method and device, electronic equipment and a storage medium, which are used for solving the technical problems of high algorithm complexity, limited model generalization capability and low calculation efficiency in existing imitation evaluation schemes.
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the embodiments described below are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a flowchart illustrating a dialect simulation force evaluation method according to an embodiment of the present invention.
The dialect simulation force evaluation method provided by the invention specifically comprises the following steps:
step 101, receiving a voice signal input by a user;
in the embodiment of the present invention, the voice signal may be a human voice signal generated when the user practices dialect, or may be a prerecorded audio signal. The present invention is not particularly limited in this regard.
Step 102, extracting the to-be-evaluated speech feature from the voice signal;
It should be noted that the purpose of extracting the to-be-evaluated speech feature from the voice signal is to extract the discriminative components of the signal and remove interference information (such as background noise, emotion, etc.), so as to reduce the interference with speech analysis and the amount of computation required by the whole dialect simulation force evaluation process.
Step 103, matching against preset dialect speech feature reference templates to obtain a target reference template for the to-be-evaluated speech feature;
After the to-be-evaluated speech feature is extracted, the corresponding target reference template can be matched among the preset dialect speech feature reference templates.
It should be noted that, in order to implement the matching process of the target reference template, the speech feature reference templates of different dialects need to be stored in advance. These speech feature reference templates can be obtained from the speech signals of the different dialects.
Step 104, calculating the simulation force score of the to-be-evaluated speech feature according to the target reference template.
After the target reference template is obtained, the simulation force score may be determined based on the similarity between the to-be-evaluated speech feature and the target reference template.
In an embodiment of the invention, the simulation force score characterizes how similar the voice signal input by the user is to the reference dialect template; the higher the score, the more accurately the user imitates the dialect.
The invention receives a voice signal input by a user; extracts the to-be-evaluated speech feature from the voice signal; matches against preset dialect speech feature reference templates to obtain a target reference template; and calculates a simulation force score according to the target reference template. An efficient, real-time dialect evaluation result is thereby provided to the user, increasing the user's enthusiasm for practice.
Referring to fig. 2, fig. 2 is a flowchart illustrating a dialect simulation force evaluation method according to another embodiment of the present invention. The method specifically comprises the following steps:
step 201, receiving a voice signal input by a user;
step 202, preprocessing a voice signal to obtain a preprocessed signal;
before the voice signal is analyzed and processed, the voice signal must be preprocessed so as to eliminate the influence of aliasing, higher harmonic distortion, high frequency and other factors on the quality of the voice signal caused by the human vocal organs and equipment for acquiring the voice signal. And signals obtained by subsequent voice processing are ensured to be more uniform and smooth as far as possible.
In embodiments of the present invention, the preprocessing may include framing, windowing, and pre-emphasis. Framing cuts the voice signal into segments on the basis of short-time stationarity; the frame length is generally 20 ms and the frame shift 10 ms. For windowing, a Hamming window or a Hanning window is generally adopted. Because the main-lobe width corresponds to the frequency resolution (the wider the main lobe, the lower the frequency resolution), the window function should concentrate energy in the main lobe, or keep the relative amplitude of the largest side lobe as small as possible; the Hamming window has larger side-lobe attenuation in its amplitude-frequency characteristic and reduces the Gibbs effect, so it is generally chosen for windowing speech signals. Because the voice signal is affected by glottal excitation and oral-nasal radiation, frequency components above 800 Hz roll off at about 6 dB/octave, so the high-frequency energy needs to be boosted by pre-emphasis to compensate for the high-frequency loss; a first-order high-pass filter H(z) = 1 - 0.9375z^{-1} is generally adopted to implement pre-emphasis. In addition, the preprocessing may also include anti-aliasing filtering. A minimal preprocessing sketch is given below.
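The following Python sketch is illustrative only and not code from the patent: the 16 kHz sampling rate is an assumption, while the 20 ms frame length, 10 ms frame shift, Hamming window and 0.9375 pre-emphasis coefficient follow the values above. The input is assumed to be at least one frame long.

```python
import numpy as np

def preprocess(signal, sample_rate=16000, frame_ms=20.0, shift_ms=10.0,
               pre_emphasis=0.9375):
    """Pre-emphasize, frame and window a speech signal; returns a (frames, frame_len) array."""
    # Pre-emphasis: y[n] = x[n] - 0.9375 * x[n-1], boosting components above ~800 Hz
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])

    frame_len = int(sample_rate * frame_ms / 1000)    # 320 samples at 16 kHz
    frame_shift = int(sample_rate * shift_ms / 1000)  # 160 samples at 16 kHz
    num_frames = 1 + (len(emphasized) - frame_len) // frame_shift

    # Slice into overlapping frames (short-time stationarity assumption)
    frames = np.stack([emphasized[i * frame_shift:i * frame_shift + frame_len]
                       for i in range(num_frames)])
    # Hamming window: strong side-lobe attenuation, reduces the Gibbs effect
    return frames * np.hamming(frame_len)
```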
Step 203, extracting the to-be-evaluated speech feature from the preprocessed signal;
In practical applications, feature extraction may cover time-domain feature parameters and frequency-domain feature parameters. The time-domain feature parameters include the short-time zero-crossing rate, the short-time energy spectrum and the pitch period; the frequency-domain feature parameters include LPCC (Linear Predictive Cepstral Coefficients), ΔLPCC (first-order difference LPCC), MFCC (Mel Frequency Cepstral Coefficients) and ΔMFCC (first-order difference MFCC).
In one example, taking the MFCC as the to-be-evaluated speech feature, the step of extracting it from the preprocessed signal may include:
s31, performing fast Fourier transform on the preprocessed voice signal to obtain the frequency spectrum of the preprocessed voice signal;
in the embodiment of the present invention, a process of performing Fast Fourier Transform (FFT) on a preprocessed voice signal to obtain a frequency spectrum thereof is shown by the following formula:
Figure BDA0003182307140000061
wherein, x (k) is the frequency spectrum of the preprocessed voice signal; n represents the number of points of the fourier transform, typically 256 or 512 points; n is the position of the speech signal (usually 320 points of a frame signal, n is 0-320); x (N) is an input speech signal, k is a k-th point, and k is 0, 1.
S32, squaring the frequency spectrum to obtain a short-time energy spectrum;
In the embodiment of the invention, the short-time energy spectrum is calculated as:
E(k) = |X(k)|²
s33, obtaining a magnitude spectrum of the frequency spectrum, and converting the magnitude spectrum into a Mel frequency spectrum;
in a particular implementation, the magnitude spectrum may be changed to a mel-frequency spectrum H (k, m) with a mel-filter bank;
Figure BDA0003182307140000072
wherein f (k) represents the actual frequency fc() Representing the center frequency, m being the order of the mel-filter.
In one example, the mel filter bank frequencies are as shown in fig. 3.
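For illustration, a triangular Mel filter bank of the kind plotted in fig. 3 could be built as below. This is a sketch under assumptions: the patent does not state the number of filters or the FFT size, so 26 filters and 512 points are placeholder choices.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(num_filters=26, n_fft=512, sample_rate=16000):
    """Triangular filters H(k, m) with centers equally spaced on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             num_filters + 2)
    # Convert the Mel-spaced points back to FFT bin indices f(k)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)

    bank = np.zeros((num_filters, n_fft // 2 + 1))
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):    # rising edge toward f_c(m)
            bank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):   # falling edge toward f_c(m+1)
            bank[m - 1, k] = (right - k) / max(right - center, 1)
    return bank
```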
S34, calculating the logarithm of the Mel spectrum according to the short-time energy spectrum and the Mel spectrum;
In the embodiment of the present invention, the logarithm X'(m) of the Mel spectrum, obtained by applying the filter bank to the short-time energy spectrum and taking the logarithm, can be expressed as:
X'(m) = ln( Σ_{k=0}^{N-1} |X(k)|² · H(k, m) ), 1 ≤ m ≤ M
s35, performing Discrete Cosine Transform (DCT) on the logarithm, calculating to obtain mel-frequency cepstrum coefficients, and using the mel-frequency cepstrum coefficients as the predicted comment speech features of the preprocessed speech signal.
In the embodiment of the present invention, the mel-frequency cepstrum coefficient mfcc (r) can be calculated by the following formula:
Figure BDA0003182307140000074
it should be noted that the MFCC features obtained through the above calculation steps are static parameters, which can well reflect the static features of the speech, but do not fully utilize the dynamic features of the speech, so in an alternative example, a first-order and a second-order difference parameters of the MFCC may be added on the basis of the MFCC features as predicted speech features to better describe the time-varying characteristics of the speech signal.
In one example, the MFCC extraction flow is shown in FIG. 4.
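Combining steps S31 to S35, a minimal sketch of the extraction flow might read as follows. It reuses preprocess() and mel_filter_bank() from the earlier sketches; keeping 13 cepstral coefficients is an assumption, not a value from the patent. The delta() helper corresponds to the optional first-order difference parameters mentioned above.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sample_rate=16000, n_fft=512, num_filters=26, num_ceps=13):
    frames = preprocess(signal, sample_rate)                  # framing, windowing, pre-emphasis
    spectrum = np.fft.rfft(frames, n=n_fft)                   # S31: FFT of each frame
    energy = np.abs(spectrum) ** 2                            # S32: short-time energy spectrum
    bank = mel_filter_bank(num_filters, n_fft, sample_rate)   # S33: Mel filter bank
    log_mel = np.log(np.maximum(energy @ bank.T, 1e-10))      # S34: X'(m) = ln(sum |X(k)|^2 H(k,m))
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :num_ceps]  # S35: DCT -> MFCC(r)

def delta(features, width=2):
    """First-order difference parameters of an MFCC sequence (frames x coefficients)."""
    padded = np.pad(features, ((width, width), (0, 0)), mode='edge')
    denom = 2 * sum(i * i for i in range(1, width + 1))
    return sum(i * (padded[width + i:len(features) + width + i]
                    - padded[width - i:len(features) + width - i])
               for i in range(1, width + 1)) / denom
```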
Step 204, matching against preset dialect speech feature reference templates to obtain the target reference template for the to-be-evaluated speech feature;
it should be noted that, in order to implement the matching process of the target reference template, the speech feature reference templates of different dialects need to be stored in advance.
In a specific implementation, each reference speech can be put through the same processing as steps 201 to 203, that is, preprocessed and subjected to feature extraction, to generate a speech feature reference template, which is then stored locally according to actual circumstances. When the simulation force evaluation is actually performed on the user's voice signal, the reference templates can be loaded directly and the corresponding feature parameters read in order to determine the target reference template.
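By way of illustration only (the .npy file layout and naming are assumptions; the patent says merely that templates are generated with the same preprocessing and feature extraction and then stored locally), template generation and loading might look like:

```python
import numpy as np

def build_reference_template(dialect, reference_waveform, sample_rate=16000):
    """Run the reference speech through the same pipeline as steps 201-203 and store it."""
    features = mfcc(reference_waveform, sample_rate)
    np.save(f"template_{dialect}.npy", features)

def load_reference_template(dialect):
    """Load a precomputed dialect template's feature parameters from local storage."""
    return np.load(f"template_{dialect}.npy")
```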
In one example, the step of matching the target reference template for the to-be-evaluated speech feature from the preset dialect speech feature reference templates may include:
performing a dynamic time warping calculation between the to-be-evaluated speech feature and the preset dialect speech feature reference templates, and selecting, according to the calculation result, the target reference template from among the preset dialect speech feature reference templates.
Dynamic Time Warping (DTW) is a nonlinear warping technique that combines time warping with distance measure computation. It finds a warping function i_m = φ(i_n) that nonlinearly maps the time axis n of the test vector (the to-be-evaluated feature vector) onto the time axis m of the dialect speech feature reference template, satisfying:
D = min_φ Σ_{i_n=1}^{I_n} d( T(i_n), R(φ(i_n)) )
where D is the distance between the two vector sequences under the optimal time warping; T(i_n) is the feature vector under test; R(φ(i_n)) is the reference template feature vector; φ(i_n) is the warping function of i_n; and I_n is the number of feature frames.
Since the DTW continuously calculates the distance between the two vectors to find the optimal matching path, the warping function corresponding to the minimum cumulative distance when the two vectors are matched is obtained, which ensures the maximum acoustic similarity between them. The essence of the DTW algorithm is to use the idea of dynamic programming to automatically find a path by using a local optimization process, along which the cumulative distortion between two feature vectors is the smallest, thereby avoiding errors that may be introduced due to different durations. The DTW algorithm requires that the reference template and the test template use the same type of feature vector, the same frame length, the same window function, and the same frame shift.
The principle of the DTW algorithm is shown in fig. 5. The frame numbers n = 1 to N of the to-be-evaluated feature are marked along the horizontal axis of a two-dimensional rectangular coordinate system, and the frames m = 1 to M of the dialect speech feature reference template along the vertical axis. Drawing vertical and horizontal lines through the integer frame coordinates forms a grid, in which each intersection (t_i, r_j) represents the pairing of a frame of the to-be-evaluated speech feature with a frame of the dialect speech feature reference template. The DTW algorithm proceeds in two steps: first, compute the distance between each frame of the to-be-evaluated speech feature and each frame of the dialect speech feature reference template, obtaining the frame-matching distance matrix; then find the optimal path through the frame-matching distance matrix, which identifies the target reference template. A dynamic-programming sketch is given below.
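The following Python sketch shows the textbook dynamic-programming form of DTW; the patent does not disclose its exact variant, so the step pattern and the path-length normalization here are assumptions.

```python
import numpy as np

def dtw_distance(test, reference):
    """Cumulative DTW distance between two feature sequences (frames x coefficients)."""
    n, m = len(test), len(reference)
    # Frame-matching distance matrix: Euclidean distance d(T(i), R(j)) for every pair
    dist = np.linalg.norm(test[:, None, :] - reference[None, :, :], axis=2)

    acc = np.full((n + 1, m + 1), np.inf)   # accumulated cost along candidate paths
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local optimization: extend the cheapest of the three predecessor cells
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j - 1],
                                                 acc[i - 1, j],
                                                 acc[i, j - 1])
    # Normalize by path length so sequences of different durations stay comparable
    return acc[n, m] / (n + m)
```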
Step 205, calculating the simulation force score of the to-be-evaluated speech feature according to the target reference template.
In a specific implementation, suppose the target reference template has M frame vectors {R(1), R(2), …, R(m), …, R(M)}, where R(m) is the speech feature vector of the m-th frame, and the to-be-evaluated speech feature has N frame vectors {T(1), T(2), …, T(n), …, T(N)}, where T(n) is the speech feature vector of the n-th frame. d(T(i_n), R(i_m)) denotes the distance between the i_n-th frame feature in T and the i_m-th frame feature in R, usually expressed as the Euclidean distance. The score output is finally obtained by quantizing the Euclidean distance result, and repeated testing yields the polynomial fitting curve function:
SCORE = -0.08 * d(T(i_n), R(i_m))^2 + 100
where SCORE is the simulation force score; the simulation force score of the voice signal can be calculated with this polynomial fitting curve function.
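Expressed directly in code (the clamp to the range [0, 100] is a defensive assumption added here; only the quadratic itself appears in the patent):

```python
def simulation_score(d):
    """Map a DTW distance to a simulation force score via the fitted quadratic."""
    score = -0.08 * d * d + 100.0
    return max(0.0, min(100.0, score))
```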
The invention receives a voice signal input by a user; extracts the to-be-evaluated speech feature from the voice signal; matches against preset dialect speech feature reference templates to obtain a target reference template; and calculates a simulation force score according to the target reference template. An efficient, real-time dialect evaluation result is thereby provided to the user, increasing the user's enthusiasm for practice.
For ease of understanding, embodiments of the present invention are described below by way of specific examples.
Referring to fig. 6, fig. 6 is a schematic diagram of an android integration flow of a dialect simulation force evaluation method according to an embodiment of the present invention. The specific implementation steps are as follows:
the operator prepares dialect data sets including template voices of Wuhan dialect, Shandong dialect, northeast dialect, Hakka dialect, Cantonese, Sichuan dialect and the like;
an android client is in butt joint with a DIE (Dialect evaluation) platform, microphone data acquisition is completed through a self-contained voice acquisition frame, a complete sound segment is generated and stored locally, an init interface is called to complete initialization of Dialect evaluation method software, a sampling rate and a frame length are set, a required reference voice path and the like;
calling a processing interface to read the stored sound fragment file and perform similarity calculation with reference voice, wherein the similarity calculation comprises preprocessing, multi-feature extraction and template loading comparison operations on the stored sound fragment, and finally calling a stop interface to obtain the simulation force score calculated by the DTW. After obtaining the result, the client can give out a correspondence according to the difficulty degree of dialect simulation, and if the imitation force is strong and the score is high, the client adopts applause to express appreciation; and for the low-scoring condition, prompting the user to use more by adopting the prompt of multiple contacts, and calling the store interface to release resources to finish the dialect simulation force scene by the rear left user so as to finish the evaluation.
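As a rough, hypothetical sketch of this call sequence: the interface names init, process, stop and store come from the text above, but their real signatures on the DIE platform are not disclosed, so this Python stand-in simply wires them to the sketches from the previous sections.

```python
class DialectEvaluator:
    """Hypothetical stand-in for the DIE client interfaces described above."""

    def init(self, reference_waveform, sample_rate=16000):
        # init interface: set sampling parameters and load the reference template
        self.sample_rate = sample_rate
        self.reference = mfcc(reference_waveform, sample_rate)
        self.score = None

    def process(self, recording):
        # processing interface: preprocess, extract features, compare with template
        test = mfcc(recording, self.sample_rate)
        self.score = simulation_score(dtw_distance(test, self.reference))

    def stop(self):
        # stop interface: return the DTW-based simulation force score
        return self.score

    def store(self):
        # store interface: release resources once the user leaves the scene
        self.reference = None
```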
Referring to fig. 7, fig. 7 is a block diagram of a dialect simulation force evaluation apparatus according to an embodiment of the present invention.
The embodiment of the invention provides a dialect simulation force evaluation device, comprising:
a voice signal receiving module 701, configured to receive a voice signal input by a user;
a to-be-evaluated speech feature extraction module 702, used for extracting the to-be-evaluated speech feature from the voice signal;
a target reference template matching module 703, used for matching against preset dialect speech feature reference templates to obtain a target reference template for the to-be-evaluated speech feature;
and a simulation force score calculation module 704, used for calculating the simulation force score of the to-be-evaluated speech feature according to the target reference template.
In the embodiment of the present invention, the to-be-evaluated speech feature extraction module 702 includes:
a preprocessing submodule, used for preprocessing the voice signal to obtain a preprocessed signal;
and a feature extraction submodule, used for extracting the to-be-evaluated speech feature from the preprocessed signal.
In the embodiment of the present invention, the feature extraction submodule includes:
a frequency spectrum calculation unit, used for performing a fast Fourier transform on the preprocessed voice signal to obtain its frequency spectrum;
a short-time energy spectrum calculation unit, used for squaring the frequency spectrum to obtain a short-time energy spectrum;
a Mel spectrum calculation unit, used for obtaining the magnitude spectrum of the frequency spectrum and converting it into a Mel spectrum;
a logarithm calculation unit, used for calculating the logarithm of the Mel spectrum according to the short-time energy spectrum and the Mel spectrum;
and a feature calculation unit, used for performing a discrete cosine transform on the logarithm to obtain Mel frequency cepstrum coefficients, which serve as the to-be-evaluated speech feature of the preprocessed voice signal.
In the embodiment of the present invention, the target reference template matching module 703 includes:
a target reference template matching submodule, used for performing a dynamic time warping calculation between the to-be-evaluated speech feature and the preset dialect speech feature reference templates, and selecting, according to the calculation result, the target reference template for the to-be-evaluated speech feature.
An embodiment of the present invention further provides an electronic device, where the device includes a processor and a memory:
the memory is used for storing the program codes and transmitting the program codes to the processor;
the processor is configured to execute the dialect modeling force evaluation method of an embodiment of the present invention according to instructions in the program code.
The embodiment of the invention also provides a computer-readable storage medium, which is used for storing a program code, and the program code is used for executing the dialect simulation force evaluation method of the embodiment of the invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A dialect simulation force evaluation method, comprising:
receiving a voice signal input by a user;
extracting a to-be-evaluated speech feature from the voice signal;
matching against preset dialect speech feature reference templates to obtain a target reference template for the to-be-evaluated speech feature;
and calculating a simulation force score for the to-be-evaluated speech feature according to the target reference template.
2. The method of claim 1, wherein the step of extracting the to-be-evaluated speech feature from the voice signal comprises:
preprocessing the voice signal to obtain a preprocessed signal;
extracting the to-be-evaluated speech feature from the preprocessed signal.
3. The method of claim 2, wherein the step of extracting the to-be-evaluated speech feature from the preprocessed signal comprises:
performing a fast Fourier transform on the preprocessed voice signal to obtain its frequency spectrum;
squaring the frequency spectrum to obtain a short-time energy spectrum;
obtaining the magnitude spectrum of the frequency spectrum and converting the magnitude spectrum into a Mel spectrum;
calculating the logarithm of the Mel spectrum according to the short-time energy spectrum and the Mel spectrum;
and performing a discrete cosine transform on the logarithm to obtain Mel frequency cepstrum coefficients, which serve as the to-be-evaluated speech feature of the preprocessed voice signal.
4. The method of claim 1, wherein the step of matching against preset dialect speech feature reference templates to obtain the target reference template for the to-be-evaluated speech feature comprises:
performing a dynamic time warping calculation between the to-be-evaluated speech feature and the preset dialect speech feature reference templates, and selecting, according to the calculation result, the target reference template from among the preset dialect speech feature reference templates.
5. A dialect simulation force evaluation device, comprising:
a voice signal receiving module, used for receiving a voice signal input by a user;
a to-be-evaluated speech feature extraction module, used for extracting the to-be-evaluated speech feature from the voice signal;
a target reference template matching module, used for matching against preset dialect speech feature reference templates to obtain a target reference template for the to-be-evaluated speech feature;
and a simulation force score calculation module, used for calculating the simulation force score of the to-be-evaluated speech feature according to the target reference template.
6. The apparatus of claim 5, wherein the to-be-evaluated speech feature extraction module comprises:
a preprocessing submodule, used for preprocessing the voice signal to obtain a preprocessed signal;
and a feature extraction submodule, used for extracting the to-be-evaluated speech feature from the preprocessed signal.
7. The apparatus of claim 6, wherein the feature extraction submodule comprises:
a frequency spectrum calculation unit, used for performing a fast Fourier transform on the preprocessed voice signal to obtain its frequency spectrum;
a short-time energy spectrum calculation unit, used for squaring the frequency spectrum to obtain a short-time energy spectrum;
a Mel spectrum calculation unit, used for obtaining the magnitude spectrum of the frequency spectrum and converting it into a Mel spectrum;
a logarithm calculation unit, used for calculating the logarithm of the Mel spectrum according to the short-time energy spectrum and the Mel spectrum;
and a feature calculation unit, used for performing a discrete cosine transform on the logarithm to obtain Mel frequency cepstrum coefficients, which serve as the to-be-evaluated speech feature of the preprocessed voice signal.
8. The apparatus of claim 5, wherein the target reference template matching module comprises:
a target reference template matching submodule, used for performing a dynamic time warping calculation between the to-be-evaluated speech feature and the preset dialect speech feature reference templates, and selecting, according to the calculation result, the target reference template for the to-be-evaluated speech feature.
9. An electronic device, comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the dialect modeling force assessment method of any of claims 1-4 according to instructions in the program code.
10. A computer-readable storage medium for storing program code for performing the dialect modeling force assessment method of any one of claims 1-4.
CN202110850935.9A 2021-07-27 2021-07-27 Dialect simulation force evaluation method and device, electronic equipment and storage medium Pending CN113571043A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110850935.9A CN113571043A (en) 2021-07-27 2021-07-27 Dialect simulation force evaluation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110850935.9A CN113571043A (en) 2021-07-27 2021-07-27 Dialect simulation force evaluation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113571043A 2021-10-29

Family

ID=78167949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110850935.9A Pending CN113571043A (en) 2021-07-27 2021-07-27 Dialect simulation force evaluation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113571043A (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1302427A (en) * 1997-11-03 2001-07-04 T-内提克斯公司 Model adaptation system and method for speaker verification
CN101246685A (en) * 2008-03-17 2008-08-20 清华大学 Pronunciation quality evaluation method of computer auxiliary language learning system
CN102354496A (en) * 2011-07-01 2012-02-15 中山大学 PSM-based (pitch scale modification-based) speech identification and restoration method and device thereof
CN102543073A (en) * 2010-12-10 2012-07-04 上海上大海润信息系统有限公司 Shanghai dialect phonetic recognition information processing method
CN102982803A (en) * 2012-12-11 2013-03-20 华南师范大学 Isolated word speech recognition method based on HRSF and improved DTW algorithm
CN103928023A (en) * 2014-04-29 2014-07-16 广东外语外贸大学 Voice scoring method and system
CN103985391A (en) * 2014-04-16 2014-08-13 柳超 Phonetic-level low power consumption spoken language evaluation and defect diagnosis method without standard pronunciation
CN104103272A (en) * 2014-07-15 2014-10-15 无锡中星微电子有限公司 Voice recognition method and device and blue-tooth earphone
JP2015068897A (en) * 2013-09-27 2015-04-13 国立大学法人 東京大学 Evaluation method and device for utterance and computer program for evaluating utterance
CN109147818A (en) * 2018-10-30 2019-01-04 Oppo广东移动通信有限公司 Acoustic feature extracting method, device, storage medium and terminal device
US20200160836A1 (en) * 2018-11-21 2020-05-21 Google Llc Multi-dialect and multilingual speech recognition
CN112233651A (en) * 2020-10-10 2021-01-15 深圳前海微众银行股份有限公司 Dialect type determining method, dialect type determining device, dialect type determining equipment and storage medium
CN112951274A (en) * 2021-02-07 2021-06-11 脸萌有限公司 Voice similarity determination method and device, and program product


Similar Documents

Publication Publication Date Title
CN106935248B (en) Voice similarity detection method and device
Dhingra et al. Isolated speech recognition using MFCC and DTW
Helander et al. Voice conversion using dynamic kernel partial least squares regression
JP4802135B2 (en) Speaker authentication registration and confirmation method and apparatus
US7133826B2 (en) Method and apparatus using spectral addition for speaker recognition
CN110364140B (en) Singing voice synthesis model training method, singing voice synthesis model training device, computer equipment and storage medium
Kumar et al. Design of an automatic speaker recognition system using MFCC, vector quantization and LBG algorithm
US20190378532A1 (en) Method and apparatus for dynamic modifying of the timbre of the voice by frequency shift of the formants of a spectral envelope
US10014007B2 (en) Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
CN112489629A (en) Voice transcription model, method, medium, and electronic device
WO2023001128A1 (en) Audio data processing method, apparatus and device
CN106548785A (en) A kind of method of speech processing and device, terminal unit
CN112735454A (en) Audio processing method and device, electronic equipment and readable storage medium
CN110648655B (en) Voice recognition method, device, system and storage medium
Oura et al. Deep neural network based real-time speech vocoder with periodic and aperiodic inputs
AU2014395554B2 (en) Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
CN109741761B (en) Sound processing method and device
Jokinen et al. Estimating the spectral tilt of the glottal source from telephone speech using a deep neural network
CN114302301B (en) Frequency response correction method and related product
CN113571043A (en) Dialect simulation force evaluation method and device, electronic equipment and storage medium
CN110033786B (en) Gender judgment method, device, equipment and readable storage medium
Akhter et al. An analysis of performance evaluation metrics for voice conversion models
CN113436607A (en) Fast voice cloning method
Lipeika Optimization of formant feature based speech recognition
CN114512133A (en) Sound object recognition method, sound object recognition device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination