CN111710348A - Pronunciation evaluation method and terminal based on audio fingerprints - Google Patents

Pronunciation evaluation method and terminal based on audio fingerprints

Info

Publication number
CN111710348A
CN111710348A (application CN202010467723.8A)
Authority
CN
China
Prior art keywords
pronunciation
audio
user
standard
pronunciation audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010467723.8A
Other languages
Chinese (zh)
Inventor
刘焕玉
肖龙源
李稀敏
刘晓葳
叶志坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Kuaishangtong Technology Co Ltd
Original Assignee
Xiamen Kuaishangtong Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Kuaishangtong Technology Co Ltd filed Critical Xiamen Kuaishangtong Technology Co Ltd
Priority to CN202010467723.8A
Publication of CN111710348A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • G: PHYSICS
    • G09: EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B: EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B5/00: Electrically-operated educational appliances
    • G09B5/04: Electrically-operated educational appliances with audible presentation of the material to be studied
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction


Abstract

The invention provides an audio-fingerprint-based pronunciation evaluation method and terminal. The method comprises the following steps: collecting standard pronunciation audio and extracting the corresponding standard pronunciation audio fingerprint; acquiring user pronunciation audio and extracting the corresponding user pronunciation audio fingerprint; matching the standard pronunciation audio fingerprint against the user pronunciation audio fingerprint; and setting a scoring threshold and, if the matching degree reaches the corresponding threshold, scoring the user pronunciation audio. By implementing this method, the terminal evaluates the user's pronunciation, corrects wrong pronunciation, and improves the user's pronunciation level.

Description

Pronunciation evaluation method and terminal based on audio fingerprints
Technical Field
The invention relates to the technical field of computer-aided teaching, in particular to a pronunciation evaluation method and a pronunciation evaluation terminal based on audio fingerprints.
Background
As living standards gradually improve, people pay increasing attention to children's preschool education. Chinese pinyin, Arabic numerals, and English letters form the language foundation of preschool education, and standard pronunciation is of great importance to language learning.
At present, pronunciation teaching depends entirely on demonstration and explanation by teachers of varying experience. However, owing to the prevalence of Chinese dialects and other uncontrollable factors in human teaching, pronunciation is often non-standard, so children's pronunciation when learning Chinese pinyin, Arabic numerals, and English letters is often not accurate enough.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method and a terminal for evaluating pronunciation based on audio fingerprints to solve the above problems.
The invention provides a pronunciation evaluating method based on audio fingerprints, which comprises the following steps:
collecting standard pronunciation audio and extracting standard pronunciation audio fingerprints corresponding to the standard pronunciation audio;
acquiring user pronunciation audio, and extracting user pronunciation audio fingerprints corresponding to the user pronunciation audio;
matching the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint;
and setting a scoring threshold, and if the matching degree reaches the corresponding scoring threshold, carrying out pronunciation scoring on the user pronunciation audio.
Further, the method further comprises:
after the pronunciation scoring is carried out on the pronunciation audio of the user, when the pronunciation scoring is unqualified, the standard pronunciation audio is pushed to the user, the learned pronunciation audio is obtained, and the learned pronunciation audio fingerprint corresponding to the learned pronunciation audio is extracted;
and matching the standard pronunciation audio fingerprint with the learned pronunciation audio fingerprint, and further performing pronunciation scoring until the pronunciation scoring is qualified.
Further, extracting the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint comprises:
performing frame windowing on the standard pronunciation audio or the user pronunciation audio, and performing fast Fourier transform on each short-time analysis window to obtain a corresponding spectrogram;
extracting a local peak point in the spectrogram;
and taking the spectrogram and/or the peak point as the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint.
Further, matching the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint specifically includes:
grouping the peak points to obtain peak point combinations, and calculating hash values corresponding to the peak point combinations;
and matching the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint according to the hash value.
Further, the calculation process of the hash value corresponding to each peak point combination includes:
the peak point combination comprises an anchor point and N peak points, and a three-dimensional array corresponding to the peak points is created according to the frequency value of the peak points, the frequency value of the anchor point and the time difference between the peak points and the anchor point;
and calculating the hash value of the three-dimensional array to serve as the hash value of the peak point, and establishing an audio fingerprint hash table corresponding to the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint.
Further, matching the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint according to the hash value specifically includes:
and matching the hash values of all peak points in the audio fingerprint hash table of the standard pronunciation audio with the hash values of all peak points in the audio fingerprint hash table of the user pronunciation audio, and counting the number of the hash values which can be matched with the hash values.
Further, the present invention also provides a pronunciation evaluation terminal based on audio fingerprints, wherein the terminal comprises:
the storage module is used for storing a pronunciation scoring program and standard pronunciation audio;
the acquisition module is used for acquiring pronunciation audio of a user;
the extraction module is used for extracting the standard pronunciation audio fingerprint corresponding to the standard pronunciation audio and the user pronunciation audio fingerprint corresponding to the user pronunciation audio;
the evaluation module is used for matching the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint so as to carry out pronunciation scoring on the user pronunciation audio;
and the audio-visual module is used for playing the collected user pronunciation audio and the corresponding standard pronunciation audio and displaying the pronunciation scoring result.
Further, the extraction module comprises:
the audio conversion module is used for performing frame windowing processing on the standard pronunciation audio or the user pronunciation audio and performing fast Fourier transform on each short-time analysis window to obtain a corresponding spectrogram;
the image processing module is used for extracting a local peak point in the spectrogram;
and the hash value calculation module is used for grouping the peak points to obtain peak point combinations, calculating hash values corresponding to the peak point combinations, taking the spectrogram and/or the peak points as the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint, and establishing an audio fingerprint hash table corresponding to the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint.
Further, the evaluation module comprises:
the matching statistic module is used for matching the hash values of all peak points in the audio fingerprint hash table of the standard pronunciation audio with the hash values of all peak points in the audio fingerprint hash table of the user pronunciation audio and counting the number of the hash values which can be matched with each other;
and the matching scoring module is used for setting a scoring threshold value and scoring the pronunciation of the user audio according to the number of the matched hash values.
Further, the evaluation module is further configured to:
and when the pronunciation score is unqualified, pushing the standard pronunciation audio to the audio-visual module for a user to learn the standard pronunciation audio until the pronunciation score is qualified.
According to the audio-fingerprint-based pronunciation evaluation method and terminal provided by the invention, an audio file is converted into a spectrogram, peak points are extracted from the spectrogram, the peak points are grouped and hash values are calculated, and the spectrogram and/or the peak points serve as the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint. The standard pronunciation audio fingerprint is matched against the user pronunciation audio fingerprint according to the hash values, and if the matching degree reaches the corresponding scoring threshold, the user pronunciation audio is scored. This achieves automatic evaluation of the user's pronunciation and lets the user intuitively sense whether the pronunciation is qualified compared with the standard pronunciation. In addition, when the pronunciation score is unqualified, the standard pronunciation audio is pushed to the user; after the user studies it, scoring is performed again until the pronunciation is qualified. The user is thereby automatically guided through pronunciation-correction practice, the automation and accuracy of pronunciation teaching are improved, and non-standard pronunciation caused by uncontrollable factors in human teaching is avoided.
Drawings
Fig. 1 is a flowchart of a method for evaluating pronunciation based on audio fingerprints according to an embodiment of the present invention.
Fig. 2 is a flowchart of step S10 and/or step S20 in a method for evaluating pronunciation based on audio fingerprints according to an embodiment of the present invention.
Fig. 3 is a schematic illustration of a spectrogram in an embodiment of the present invention.
FIG. 4 is a diagram illustrating peaks of a spectrogram in an embodiment of the present invention.
Fig. 5 is a flowchart of step S30 in a method for evaluating pronunciation based on audio fingerprints according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating grouping of peak points in an embodiment of the invention.
FIG. 7 is a flowchart of a pronunciation assessment method after learning when pronunciation scores fail in an embodiment of the present invention.
Fig. 8 is a schematic block diagram of a pronunciation evaluation terminal based on audio fingerprints according to an embodiment of the present invention.
Fig. 9 is a schematic diagram illustrating the components of an extraction module in a pronunciation evaluation terminal based on audio fingerprints according to an embodiment of the present invention.
Fig. 10 is a schematic composition diagram of an evaluation module in an audio fingerprint-based pronunciation evaluation terminal according to an embodiment of the present invention.
Description of the main elements
100 terminal
110 memory module
120 acquisition module
130 extraction module
131 audio frequency conversion module
132 image processing module
133 hash value calculation module
140 evaluation module
141 matching statistical module
142 matching scoring module
150 audiovisual module
The following detailed description will further illustrate the invention in conjunction with the above-described figures.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Referring to fig. 1, the present invention provides a pronunciation evaluation method based on audio fingerprints, which includes the following steps:
and step S10, collecting standard pronunciation audio, and extracting a standard pronunciation audio fingerprint corresponding to the standard pronunciation audio.
In this embodiment, the standard pronunciation audio may be a standard pronunciation audio file published by an authoritative person or organization and downloaded from a network resource over the internet, or a standard pronunciation file generated by the intelligent terminal directly recording standard pronunciation spoken by a person and/or played by another device.
Further, the standard pronunciation audio file is trimmed and clipped into audio segments, each corresponding to a particular piece of audio content. For example, when the standard pronunciation audio file contains the continuous standard pronunciations of the 26 English letters A-Z, it is cut into standard pronunciation audio segments for single English letters, with a one-to-one correspondence between segments and letters. The standard pronunciation audio file may also contain the standard pronunciation of only a single English letter.
Further, the standard pronunciation audio file may contain the standard pronunciation of any one or more of English letters, Chinese pinyin, Arabic numerals, and the like.
And step S20, acquiring user pronunciation audio, and extracting a user pronunciation audio fingerprint corresponding to the user pronunciation audio.
In this embodiment, extracting the user pronunciation audio fingerprint corresponding to the user pronunciation audio or the standard pronunciation audio fingerprint corresponding to the standard pronunciation audio specifically includes the steps shown in fig. 2:
Step S21, performing frame windowing on the standard pronunciation audio or the user pronunciation audio, and performing a fast Fourier transform on each short-time analysis window to obtain a corresponding spectrogram as shown in fig. 3.
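Step S21 (framing, windowing, and fast Fourier transform) can be sketched in Python with NumPy; the frame length, hop size, choice of a Hamming window, and the function name are illustrative assumptions, not values given in the patent:

```python
import numpy as np

def spectrogram(signal, frame_len=2048, hop=512):
    """Frame the signal, apply a Hamming window, and FFT each
    short-time analysis window to build a magnitude spectrogram."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop:i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft keeps the non-redundant half of the spectrum per frame
    return np.abs(np.fft.rfft(frames, axis=1))  # shape: (time, freq)
```

The overlap between consecutive windows (hop smaller than frame length) is the usual way to keep time resolution while each window stays long enough for a stable spectrum.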
And step S22, extracting a local peak point in the spectrogram.
In this embodiment, the local peak points may be extracted by applying image processing to the spectrogram with OpenCV, searching for local frequency maxima within a fixed time range, and labeling the peak points as shown in fig. 4.
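The embodiment performs this local-maximum search with OpenCV image processing; as an equivalent sketch, SciPy's `maximum_filter` can mark bins that dominate their time-frequency neighborhood, with the neighborhood size and amplitude floor as assumed parameters:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def local_peaks(spec, neighborhood=20, min_mag=None):
    """Mark time-frequency bins that equal the maximum of their
    local neighborhood; an amplitude floor discards noise peaks."""
    if min_mag is None:
        min_mag = np.mean(spec)  # assumed heuristic floor
    is_max = spec == maximum_filter(spec, size=neighborhood)
    t_idx, f_idx = np.nonzero(is_max & (spec > min_mag))
    return list(zip(t_idx.tolist(), f_idx.tolist()))  # (time, freq) pairs
```

Filtering by an amplitude floor keeps the fingerprint sparse, which is what makes the later hash matching cheap.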
And step S23, taking the spectrogram and/or the peak point as the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint. The present invention preferably employs the peak point as an audio fingerprint.
And step S30, matching the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint.
In this embodiment, matching the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint specifically includes the steps shown in fig. 5:
and step S31, grouping the peak points to obtain peak point combinations, and calculating the hash value corresponding to each peak point combination.
In this embodiment, as shown in fig. 6, the hash value corresponding to each peak point combination is calculated as follows: the N+1 peak points are grouped so that each combination comprises one anchor point and N peak points, the anchor point itself being one of the peak points; for example, 6 peak points may be combined into one group. Grouping in this way reduces the storage and computation required for the peak points.
Further, a three-dimensional array is created for each peak point from the frequency value of the peak point, the frequency value of the anchor point, and the time difference between the peak point and the anchor point. The three-dimensional array thus contains three pieces of information: the frequency of the anchor point corresponding to the peak point, the frequency of the peak point, and the time difference between the peak point and its anchor point. The hash value of the three-dimensional array is computed as the hash value of the peak point, and an audio fingerprint hash table corresponding to the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint is built. In this embodiment, the hash may be computed with a secure hash algorithm such as SHA-1.
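The anchor-and-peak hashing can be sketched as follows; the fan-out value, the string encoding of the (anchor frequency, peak frequency, time difference) triple, and the function name are illustrative assumptions, while SHA-1 follows the embodiment:

```python
import hashlib

def fingerprint_hashes(peaks, fan_out=5):
    """For each anchor, pair it with the next `fan_out` peaks and
    SHA-1 hash the triple (anchor freq, peak freq, time difference).
    `peaks` is a time-sorted list of (time, frequency) pairs."""
    hashes = []
    for i, (t_a, f_a) in enumerate(peaks):
        for t_p, f_p in peaks[i + 1:i + 1 + fan_out]:
            triple = f"{f_a}|{f_p}|{t_p - t_a}"
            digest = hashlib.sha1(triple.encode()).hexdigest()
            hashes.append((digest, t_a))  # entry for the hash table
    return hashes
```

Because the triple uses only frequencies and a time difference, the hash is invariant to where in the recording the pattern occurs, which is what makes standard and user audio comparable.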
And step S32, matching the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint according to the hash value.
In this embodiment, matching the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint means looking up the hash value of every peak point of the user pronunciation audio in the standard pronunciation audio fingerprint hash table, thereby matching the hash values of all peak points in the two tables and counting the number of hash values that match between them.
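Counting the hash values that match between the two fingerprint hash tables reduces to a set lookup; this sketch assumes each hash-table entry is a (hash value, anchor time) pair, an illustrative format rather than one specified in the patent:

```python
def count_matches(standard_hashes, user_hashes):
    """Count how many user fingerprint hashes also appear in the
    standard fingerprint hash table (membership test on hash keys)."""
    standard_keys = {h for h, _ in standard_hashes}
    return sum(1 for h, _ in user_hashes if h in standard_keys)
```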
And step S40, setting a scoring threshold, and if the matching degree reaches the corresponding scoring threshold, carrying out pronunciation scoring on the user pronunciation audio.
In this embodiment, a scoring threshold is set, and the user pronunciation audio is scored according to the number of matched hash values. For example, if the standard audio fingerprint hash table contains 100 peak points and their hash values, the user pronunciation audio is judged qualified when 80 to 100 of its peak-point hash values match hash values in the standard table, and unqualified otherwise.
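The 80-of-100 example amounts to a ratio test against the scoring threshold; this sketch, with assumed function and parameter names, returns both the match ratio and the pass/fail decision:

```python
def score_pronunciation(n_matched, n_standard, pass_ratio=0.8):
    """Score as the fraction of standard-fingerprint hashes matched;
    qualified when the ratio reaches the threshold (e.g. 80 of 100
    peak hashes, as in the example in the text)."""
    ratio = n_matched / n_standard
    return ratio, ratio >= pass_ratio
```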
When the pronunciation score of the user pronunciation audio is judged to be unqualified, the pronunciation evaluating method based on the audio fingerprint further comprises the following steps as shown in fig. 7:
and step S50, after the pronunciation score is carried out on the pronunciation audio of the user, when the pronunciation score is unqualified, the standard pronunciation audio is pushed to the user, the learned pronunciation audio is obtained, and the learned pronunciation audio fingerprint corresponding to the learned pronunciation audio is extracted.
And step S60, matching the standard pronunciation audio fingerprint with the learned pronunciation audio fingerprint, and further performing pronunciation scoring until the pronunciation scoring is qualified.
The audio-fingerprint-based pronunciation evaluation method provided by the invention converts an audio file into a spectrogram, extracts peak points from the spectrogram, groups the peak points and calculates hash values, and takes the spectrogram and/or the peak points as the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint. The standard pronunciation audio fingerprint is matched against the user pronunciation audio fingerprint according to the hash values, and if the matching degree reaches the corresponding scoring threshold, the user pronunciation audio is scored. This achieves automatic evaluation of the user's pronunciation and lets the user intuitively sense whether the pronunciation is qualified compared with the standard pronunciation. In addition, when the pronunciation score is unqualified, the standard pronunciation audio is pushed to the user; after the user studies it, scoring is performed again until the pronunciation is qualified. The user is thereby automatically guided through pronunciation-correction practice, the automation and accuracy of pronunciation teaching are improved, and non-standard pronunciation caused by uncontrollable factors in human teaching is avoided.
Referring to fig. 8, as an implementation of the methods shown in the above diagrams, the present invention provides a pronunciation evaluation terminal 100 based on audio fingerprints, where the terminal 100 includes a storage module 110, a collection module 120, an extraction module 130, an evaluation module 140, and an audiovisual module 150. Fig. 8 shows only some of the modules of the terminal 100, but it is to be understood that not all of the shown modules are required to be implemented, and more or fewer modules may be implemented instead.
In this embodiment, the terminal 100 may be implemented in various forms, such as a mobile terminal, e.g., a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a smart watch, and a fixed terminal, e.g., a digital television, a desktop computer, and the like.
The storage module 110 is configured to store a pronunciation scoring program and standard pronunciation audio.
In this embodiment, the storage module 110 may be an internal storage unit of the terminal 100, such as a hard disk or a memory of a mobile phone, an external storage device of the terminal, such as a plug-in hard disk, a smart card, a secure digital card, a flash memory card, and the like, and may include both the internal storage unit and the external storage device.
The acquisition module 120 is configured to acquire a user pronunciation audio.
In this embodiment, the collecting module 120 may also be used to collect the standard pronunciation audio, directly recording standard pronunciation spoken by a person and/or played by another device to generate a standard pronunciation audio file.
The extracting module 130 is configured to extract a standard pronunciation audio fingerprint corresponding to the standard pronunciation audio and a user pronunciation audio fingerprint corresponding to the user pronunciation audio.
In this embodiment, as shown in fig. 9, the extracting module 130 further includes an audio conversion module 131, an image processing module 132, and a hash value calculating module 133. Wherein:
the audio conversion module 131 is configured to perform frame windowing on the standard pronunciation audio or the user pronunciation audio, and perform fast fourier transform on each short-time analysis window to obtain a corresponding spectrogram.
The image processing module 132 is configured to extract a local peak point in the spectrogram.
In this embodiment, extracting the local peak point in the spectrogram may be performing image processing on the spectrogram through an OpenCV technique, and searching for a local frequency maximum value within a fixed time range.
The hash value calculating module 133 is configured to group the peak points into peak point combinations and calculate the hash value corresponding to each combination. The N+1 peak points are grouped so that each combination comprises one anchor point and N peak points, the anchor point itself being one of the peak points; for example, 6 peak points may be combined into one group. Grouping in this way reduces the storage and computation required for the peak points.
Further, a three-dimensional array is created for each peak point from the frequency value of the peak point, the frequency value of the anchor point, and the time difference between the peak point and the anchor point. The three-dimensional array thus contains three pieces of information: the frequency of the anchor point corresponding to the peak point, the frequency of the peak point, and the time difference between the peak point and its anchor point. The hash value of the three-dimensional array is computed as the hash value of the peak point, and an audio fingerprint hash table corresponding to the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint is built. In this embodiment, the hash may be computed with a secure hash algorithm such as SHA-1.
The evaluating module 140 is configured to match the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint, so as to perform pronunciation scoring on the user pronunciation audio.
In this embodiment, as shown in fig. 10, the evaluating module 140 includes a matching statistic module 141 and a matching scoring module 142. Wherein:
the matching statistic module 141 is configured to match the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint according to the hash value.
In this embodiment, matching the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint means looking up the hash value of every peak point of the user pronunciation audio in the standard pronunciation audio fingerprint hash table, thereby matching the hash values of all peak points in the two tables and counting the number of hash values that match between them.
The matching scoring module 142 is configured to set a scoring threshold, and perform pronunciation scoring on the user pronunciation audio if the matching degree reaches the corresponding scoring threshold.
In this embodiment, a scoring threshold is set, and the user pronunciation audio is scored according to the number of matched hash values. For example, if the standard audio fingerprint hash table contains 100 peak points and their hash values, the user pronunciation audio is judged qualified when 80 to 100 of its peak-point hash values match hash values in the standard table, and unqualified otherwise.
The audio-visual module 150 is configured to play the collected user pronunciation audio and the corresponding standard pronunciation audio, and display a pronunciation scoring result.
According to the audio-fingerprint-based pronunciation evaluation terminal provided by the invention, an audio file is converted into a spectrogram, peak points are extracted from the spectrogram, the peak points are grouped and hash values are calculated, and the spectrogram and/or the peak points serve as the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint. The standard pronunciation audio fingerprint is matched against the user pronunciation audio fingerprint according to the hash values, and if the matching degree reaches the corresponding scoring threshold, the user pronunciation audio is scored, achieving automatic evaluation of the user's pronunciation and letting the user intuitively sense whether the pronunciation is qualified compared with the standard pronunciation. In addition, when the pronunciation score is unqualified, the standard pronunciation audio is pushed to the user; after the user studies it, scoring is performed again until the pronunciation is qualified. The user is thereby automatically guided through pronunciation-correction practice, the automation and accuracy of pronunciation teaching are improved, and non-standard pronunciation caused by uncontrollable factors in human teaching is avoided.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural.
Finally, it should be noted that the above embodiments are only intended to illustrate, not to limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art will understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit.

Claims (10)

1. A pronunciation evaluation method based on audio fingerprints is characterized by comprising the following steps:
collecting standard pronunciation audio and extracting standard pronunciation audio fingerprints corresponding to the standard pronunciation audio;
acquiring user pronunciation audio, and extracting user pronunciation audio fingerprints corresponding to the user pronunciation audio;
matching the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint;
and setting a scoring threshold, and if the matching degree reaches the corresponding scoring threshold, carrying out pronunciation scoring on the user pronunciation audio.
2. The method for evaluating pronunciation based on an audio fingerprint according to claim 1, wherein the method further comprises:
after the pronunciation scoring is carried out on the pronunciation audio of the user, when the pronunciation scoring is unqualified, the standard pronunciation audio is pushed to the user, the learned pronunciation audio is obtained, and the learned pronunciation audio fingerprint corresponding to the learned pronunciation audio is extracted;
and matching the standard pronunciation audio fingerprint with the learned pronunciation audio fingerprint, and further performing pronunciation scoring until the pronunciation scoring is qualified.
3. The method for evaluating pronunciation according to claim 1 or 2, wherein extracting the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint comprises:
performing frame windowing on the standard pronunciation audio or the user pronunciation audio, and performing fast Fourier transform on each short-time analysis window to obtain a corresponding spectrogram;
extracting a local peak point in the spectrogram;
and taking the spectrogram and/or the peak point as the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint.
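The extraction step of claim 3 can be sketched in a minimal form: frame the signal, apply a window, take the FFT of each short-time analysis window to obtain a spectrogram, and keep local maxima as candidate peak points. The frame length, hop size, window choice (Hamming), and magnitude floor below are illustrative assumptions, not values mandated by the patent.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Frame + window the signal, then FFT each short-time analysis window."""
    window = np.hamming(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    # Magnitude spectrum per frame -> (time, frequency) matrix
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

def local_peaks(spec, min_mag=1.0):
    """Return (frame, bin) pairs that dominate their 3x3 neighbourhood."""
    peaks = []
    for t in range(1, spec.shape[0] - 1):
        for f in range(1, spec.shape[1] - 1):
            patch = spec[t - 1:t + 2, f - 1:f + 2]
            if spec[t, f] >= min_mag and spec[t, f] == patch.max():
                peaks.append((t, f))
    return peaks

fs = 8000
t = np.arange(fs) / fs
spec = spectrogram(np.sin(2 * np.pi * 1000 * t))  # pure 1 kHz tone
peaks = local_peaks(spec, min_mag=10.0)
# A 1 kHz tone at fs=8000 with a 256-point FFT falls in bin 32 (1000/31.25)
```

For the pure tone every detected peak sits in the FFT bin nearest 1 kHz, which is a quick sanity check that the peak picker tracks spectral energy rather than noise.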
4. The method for evaluating pronunciation according to claim 3, wherein the matching of the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint comprises:
grouping the peak points to obtain peak point combinations, and calculating hash values corresponding to the peak point combinations;
and matching the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint according to the hash value.
5. The method for evaluating pronunciation according to claim 4, wherein the calculating of the hash value corresponding to each peak point combination comprises:
the peak point combination comprises an anchor point and N peak points, and a three-dimensional array corresponding to the peak points is created according to the frequency value of the peak points, the frequency value of the anchor point and the time difference between the peak points and the anchor point;
and calculating the hash value of the three-dimensional array to serve as the hash value of the peak point, and establishing an audio fingerprint hash table corresponding to the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint.
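The hashing scheme of claim 5 pairs each anchor point with nearby peak points and hashes the triple (peak frequency, anchor frequency, time difference). A minimal sketch follows; the use of each peak as an anchor for the next N peaks, the fan-out value, and truncated SHA-1 as the hash function are assumptions for illustration, since the claim does not fix a particular hash.

```python
import hashlib

def peak_hashes(peaks, fan_out=3):
    """peaks: list of (time, freq) tuples sorted by time.
    Each peak acts as an anchor for the next `fan_out` peaks."""
    table = {}
    for i, (t_anchor, f_anchor) in enumerate(peaks):
        for t_peak, f_peak in peaks[i + 1:i + 1 + fan_out]:
            # Three-element array of claim 5: peak frequency, anchor
            # frequency, and the time difference between them.
            triple = (f_peak, f_anchor, t_peak - t_anchor)
            h = hashlib.sha1(str(triple).encode()).hexdigest()[:10]
            # Audio fingerprint hash table: hash value -> anchor time
            table[h] = t_anchor
    return table

demo = [(0, 40), (2, 55), (3, 47), (7, 60)]
table = peak_hashes(demo)
print(len(table))  # 6 distinct peak-pair hashes from 4 peaks
```

Four peaks with a fan-out of 3 yield 3 + 2 + 1 = 6 anchor/peak pairs, hence 6 entries in the hash table.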
6. The method for evaluating pronunciation according to claim 5, wherein matching the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint according to the hash value comprises:
matching the hash values of all peak points in the audio fingerprint hash table of the standard pronunciation audio with the hash values of all peak points in the audio fingerprint hash table of the user pronunciation audio, and counting the number of hash values that can be matched.
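The matching step of claim 6 reduces to intersecting the hash keys of the two fingerprint hash tables and counting the overlap. The sketch below assumes the hash-to-anchor-time table layout used in the hashing sketch; the function name and sample values are hypothetical.

```python
# Illustrative sketch: count hash values shared by the standard and user
# audio fingerprint hash tables (hash value -> anchor time).
def count_matches(standard_table: dict, user_table: dict) -> int:
    return len(standard_table.keys() & user_table.keys())

standard_table = {"a1": 0, "b2": 2, "c3": 5, "d4": 9}
user_table = {"a1": 1, "c3": 4, "e5": 7}
print(count_matches(standard_table, user_table))  # 2
```

The resulting count is what the scoring step compares against the threshold of claims 1 and 9.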
7. A pronunciation evaluation terminal based on audio fingerprints is characterized in that the terminal comprises:
the storage module is used for storing a pronunciation scoring program and standard pronunciation audio;
the acquisition module is used for acquiring pronunciation audio of a user;
the extraction module is used for extracting the standard pronunciation audio fingerprint corresponding to the standard pronunciation audio and the user pronunciation audio fingerprint corresponding to the user pronunciation audio;
the evaluation module is used for matching the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint so as to carry out pronunciation scoring on the user pronunciation audio;
and the audio-visual module is used for playing the collected user pronunciation audio and the corresponding standard pronunciation audio and displaying the pronunciation scoring result.
8. The pronunciation evaluation terminal as claimed in claim 7, wherein the extracting module comprises:
the audio conversion module is used for performing frame windowing processing on the standard pronunciation audio or the user pronunciation audio and performing fast Fourier transform on each short-time analysis window to obtain a corresponding spectrogram;
the image processing module is used for extracting a local peak point in the spectrogram;
and the hash value calculation module is used for grouping the peak points to obtain peak point combinations, calculating hash values corresponding to the peak point combinations, taking the spectrogram and/or the peak points as the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint, and establishing an audio fingerprint hash table corresponding to the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint.
9. The pronunciation evaluation terminal as claimed in claim 7, wherein the evaluation module comprises:
the matching statistic module is used for matching the hash values of all peak points in the audio fingerprint hash table of the standard pronunciation audio with the hash values of all peak points in the audio fingerprint hash table of the user pronunciation audio and counting the number of the hash values which can be matched with each other;
and the matching scoring module is used for setting a scoring threshold value and scoring the pronunciation of the user audio according to the number of the matched hash values.
10. The pronunciation evaluation terminal according to any one of claims 7-9, wherein the evaluation module is further configured to:
and when the pronunciation score is unqualified, pushing the standard pronunciation audio to the audio-visual module for a user to learn the standard pronunciation audio until the pronunciation score is qualified.
CN202010467723.8A 2020-05-28 2020-05-28 Pronunciation evaluation method and terminal based on audio fingerprints Pending CN111710348A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010467723.8A CN111710348A (en) 2020-05-28 2020-05-28 Pronunciation evaluation method and terminal based on audio fingerprints

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010467723.8A CN111710348A (en) 2020-05-28 2020-05-28 Pronunciation evaluation method and terminal based on audio fingerprints

Publications (1)

Publication Number Publication Date
CN111710348A true CN111710348A (en) 2020-09-25

Family

ID=72538459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010467723.8A Pending CN111710348A (en) 2020-05-28 2020-05-28 Pronunciation evaluation method and terminal based on audio fingerprints

Country Status (1)

Country Link
CN (1) CN111710348A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102456346A (en) * 2010-10-19 2012-05-16 盛乐信息技术(上海)有限公司 Concatenated speech detection system and method
CN107886968A (en) * 2017-12-28 2018-04-06 广州讯飞易听说网络科技有限公司 Speech evaluating method and system
CN108961856A (en) * 2018-07-19 2018-12-07 深圳乐几科技有限公司 Verbal learning method and apparatus
CN110602303A (en) * 2019-08-30 2019-12-20 厦门快商通科技股份有限公司 Method and system for preventing telecommunication fraud based on audio fingerprint technology
CN111161758A (en) * 2019-12-04 2020-05-15 厦门快商通科技股份有限公司 Song listening and song recognition method and system based on audio fingerprint and audio equipment
CN111199750A (en) * 2019-12-18 2020-05-26 北京葡萄智学科技有限公司 Pronunciation evaluation method and device, electronic equipment and storage medium

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112233693A (en) * 2020-10-14 2021-01-15 腾讯音乐娱乐科技(深圳)有限公司 Sound quality evaluation method, device and equipment
WO2022078164A1 (en) * 2020-10-14 2022-04-21 腾讯音乐娱乐科技(深圳)有限公司 Sound quality evaluation method and apparatus, and device
CN112233693B (en) * 2020-10-14 2023-12-01 腾讯音乐娱乐科技(深圳)有限公司 Sound quality evaluation method, device and equipment
CN112214635A (en) * 2020-10-23 2021-01-12 昆明理工大学 Fast audio retrieval method based on cepstrum analysis
CN113782055A (en) * 2021-07-15 2021-12-10 北京墨闻教育科技有限公司 Student characteristic-based voice evaluation method and system
CN117219125A (en) * 2023-11-07 2023-12-12 青岛科技大学 Marine mammal sound signal imitation hidden scoring method based on audio fingerprint
CN117219125B (en) * 2023-11-07 2024-01-30 青岛科技大学 Marine mammal sound signal imitation hidden scoring method based on audio fingerprint

Similar Documents

Publication Publication Date Title
CN111710348A (en) Pronunciation evaluation method and terminal based on audio fingerprints
CN110782921B (en) Voice evaluation method and device, storage medium and electronic device
US20200286396A1 (en) Following teaching system having voice evaluation function
CN110706536B (en) Voice answering method and device
CN100514446C (en) Pronunciation evaluating method based on voice identification and voice analysis
CN111462553B (en) Language learning method and system based on video dubbing and sound correction training
US10089898B2 (en) Information processing device, control method therefor, and computer program
CN111107442B (en) Method and device for acquiring audio and video files, server and storage medium
CN109462603A (en) Voiceprint authentication method, equipment, storage medium and device based on blind Detecting
CN111951629A (en) Pronunciation correction system, method, medium and computing device
CN112287175A (en) Method and system for predicting highlight segments of video
CN111915940A (en) Method, system, terminal and storage medium for evaluating and teaching spoken language pronunciation
CN113254708A (en) Video searching method and device, computer equipment and storage medium
CN112184503A (en) Children multinomial ability scoring method and system for preschool education quality evaluation
CN111966839B (en) Data processing method, device, electronic equipment and computer storage medium
JP2013088552A (en) Pronunciation training device
CN113657509A (en) Teaching training improving method and device, terminal and storage medium
US10971148B2 (en) Information providing device, information providing method, and recording medium for presenting words extracted from different word groups
CN110046354B (en) Recitation guiding method, apparatus, device and storage medium
KR102170844B1 (en) Lecture voice file text conversion system based on lecture-related keywords
CN110288977B (en) Data processing method and device and electronic equipment
CN113409774A (en) Voice recognition method and device and electronic equipment
CN112164262A (en) Intelligent paper reading tutoring system
CN111078992A (en) Dictation content generation method and electronic equipment
CN112669181B (en) Assessment method for education practice training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200925