CN111710348A - Pronunciation evaluation method and terminal based on audio fingerprints - Google Patents

Pronunciation evaluation method and terminal based on audio fingerprints

Info

Publication number
CN111710348A
CN111710348A (application CN202010467723.8A)
Authority
CN
China
Prior art keywords
pronunciation
audio
user
standard
pronunciation audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010467723.8A
Other languages
Chinese (zh)
Inventor
刘焕玉
肖龙源
李稀敏
刘晓葳
叶志坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Kuaishangtong Technology Co Ltd
Original Assignee
Xiamen Kuaishangtong Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Kuaishangtong Technology Co Ltd filed Critical Xiamen Kuaishangtong Technology Co Ltd
Priority to CN202010467723.8A
Publication of CN111710348A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • G: PHYSICS
    • G09: EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B: EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B5/00: Electrically-operated educational appliances
    • G09B5/04: Electrically-operated educational appliances with audible presentation of the material to be studied
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction


Abstract

The invention provides an audio-fingerprint-based pronunciation evaluation method and terminal. The method comprises the following steps: collecting standard pronunciation audio and extracting the corresponding standard pronunciation audio fingerprint; acquiring user pronunciation audio and extracting the corresponding user pronunciation audio fingerprint; matching the standard pronunciation audio fingerprint against the user pronunciation audio fingerprint; and setting a scoring threshold and, if the matching degree reaches the corresponding threshold, scoring the user pronunciation audio. By implementing this method, the terminal evaluates the user's pronunciation, corrects wrong pronunciation, and improves the user's pronunciation level.

Description

Pronunciation evaluation method and terminal based on audio fingerprints
Technical Field
The invention relates to the technical field of computer-aided teaching, in particular to a pronunciation evaluation method and a pronunciation evaluation terminal based on audio fingerprints.
Background
As living standards gradually improve, people pay increasing attention to children's preschool education. Chinese pinyin, Arabic numerals, and English letters form the language foundation of preschool education, and standard pronunciation is of great importance to language learning.
At present, pronunciation teaching depends entirely on demonstration and explanation by teachers of varying experience. However, owing to the prevalence of Chinese dialects and other uncontrollable factors in human teaching, pronunciation is often non-standard, so children's pronunciation when learning Chinese pinyin, Arabic numerals, and English letters is often not accurate enough.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method and a terminal for evaluating pronunciation based on audio fingerprints to solve the above problems.
The invention provides a pronunciation evaluating method based on audio fingerprints, which comprises the following steps:
collecting standard pronunciation audio and extracting standard pronunciation audio fingerprints corresponding to the standard pronunciation audio;
acquiring user pronunciation audio, and extracting user pronunciation audio fingerprints corresponding to the user pronunciation audio;
matching the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint;
and setting a scoring threshold, and if the matching degree reaches the corresponding scoring threshold, carrying out pronunciation scoring on the user pronunciation audio.
Further, the method further comprises:
after the pronunciation scoring is carried out on the pronunciation audio of the user, when the pronunciation scoring is unqualified, the standard pronunciation audio is pushed to the user, the learned pronunciation audio is obtained, and the learned pronunciation audio fingerprint corresponding to the learned pronunciation audio is extracted;
and matching the standard pronunciation audio fingerprint with the learned pronunciation audio fingerprint, and further performing pronunciation scoring until the pronunciation scoring is qualified.
Further, extracting the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint comprises:
performing frame windowing on the standard pronunciation audio or the user pronunciation audio, and performing fast Fourier transform on each short-time analysis window to obtain a corresponding spectrogram;
extracting a local peak point in the spectrogram;
and taking the spectrogram and/or the peak point as the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint.
Further, matching the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint specifically includes:
grouping the peak points to obtain peak point combinations, and calculating hash values corresponding to the peak point combinations;
and matching the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint according to the hash value.
Further, the calculation process of the hash value corresponding to each peak point combination includes:
the peak point combination comprises an anchor point and N peak points, and a three-dimensional array corresponding to the peak points is created according to the frequency value of the peak points, the frequency value of the anchor point and the time difference between the peak points and the anchor point;
and calculating the hash value of the three-dimensional array to serve as the hash value of the peak point, and establishing an audio fingerprint hash table corresponding to the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint.
Further, matching the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint according to the hash value specifically includes:
and matching the hash values of all peak points in the audio fingerprint hash table of the standard pronunciation audio with the hash values of all peak points in the audio fingerprint hash table of the user pronunciation audio, and counting the number of the hash values which can be matched with the hash values.
Further, the present invention also provides a pronunciation evaluation terminal based on audio fingerprints, wherein the terminal comprises:
the storage module is used for storing a pronunciation scoring program and standard pronunciation audio;
the acquisition module is used for acquiring pronunciation audio of a user;
the extraction module is used for extracting the standard pronunciation audio fingerprint corresponding to the standard pronunciation audio and the user pronunciation audio fingerprint corresponding to the user pronunciation audio;
the evaluation module is used for matching the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint so as to carry out pronunciation scoring on the user pronunciation audio;
and the audio-visual module is used for playing the collected user pronunciation audio and the corresponding standard pronunciation audio and displaying the pronunciation scoring result.
Further, the extraction module comprises:
the audio conversion module is used for performing frame windowing processing on the standard pronunciation audio or the user pronunciation audio and performing fast Fourier transform on each short-time analysis window to obtain a corresponding spectrogram;
the image processing module is used for extracting a local peak point in the spectrogram;
and the hash value calculation module is used for grouping the peak points to obtain peak point combinations, calculating hash values corresponding to the peak point combinations, taking the spectrogram and/or the peak points as the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint, and establishing an audio fingerprint hash table corresponding to the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint.
Further, the evaluation module comprises:
the matching statistic module is used for matching the hash values of all peak points in the audio fingerprint hash table of the standard pronunciation audio with the hash values of all peak points in the audio fingerprint hash table of the user pronunciation audio and counting the number of the hash values which can be matched with each other;
and the matching scoring module is used for setting a scoring threshold value and scoring the pronunciation of the user audio according to the number of the matched hash values.
Further, the evaluation module is further configured to:
and when the pronunciation score is unqualified, pushing the standard pronunciation audio to the audio-visual module for a user to learn the standard pronunciation audio until the pronunciation score is qualified.
According to the audio-fingerprint-based pronunciation evaluation method and terminal provided by the invention, an audio file is converted into a spectrogram, peak points are extracted from the spectrogram, the peak points are grouped and hash values are calculated, and the spectrogram and/or the peak points serve as the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint. The standard pronunciation audio fingerprint is matched against the user pronunciation audio fingerprint according to the hash values, and if the matching degree reaches the corresponding scoring threshold, the user pronunciation audio is scored. This achieves automatic evaluation of the user's pronunciation and lets the user intuitively sense whether the pronunciation is qualified compared with the standard pronunciation. In addition, when the pronunciation score is unqualified, the standard pronunciation audio is pushed to the user; after the user studies it, scoring is performed again until the pronunciation is qualified. The user is thereby automatically guided through pronunciation-correction practice, the automation and accuracy of pronunciation teaching are improved, and non-standard pronunciation caused by uncontrollable factors in human teaching is avoided.
Drawings
Fig. 1 is a flowchart of a method for evaluating pronunciation based on audio fingerprints according to an embodiment of the present invention.
Fig. 2 is a flowchart of step S10 and/or step S20 in a method for evaluating pronunciation based on audio fingerprints according to an embodiment of the present invention.
Fig. 3 is a schematic illustration of a spectrogram in an embodiment of the present invention.
FIG. 4 is a diagram illustrating peaks of a spectrogram in an embodiment of the present invention.
Fig. 5 is a flowchart of step S30 in a method for evaluating pronunciation based on audio fingerprints according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating grouping of peak points in an embodiment of the invention.
FIG. 7 is a flowchart of a pronunciation assessment method after learning when pronunciation scores fail in an embodiment of the present invention.
Fig. 8 is a schematic block diagram of a pronunciation evaluation terminal based on audio fingerprints according to an embodiment of the present invention.
Fig. 9 is a schematic diagram illustrating the components of an extraction module in a pronunciation evaluation terminal based on audio fingerprints according to an embodiment of the present invention.
Fig. 10 is a schematic composition diagram of an evaluation module in an audio fingerprint-based pronunciation evaluation terminal according to an embodiment of the present invention.
Description of the main elements
100 terminal
110 memory module
120 acquisition module
130 extraction module
131 audio frequency conversion module
132 image processing module
133 hash value calculation module
140 evaluation module
141 matching statistical module
142 matching scoring module
150 audiovisual module
The following detailed description will further illustrate the invention in conjunction with the above-described figures.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Referring to fig. 1, the present invention provides a pronunciation evaluation method based on audio fingerprints, which includes the following steps:
and step S10, collecting standard pronunciation audio, and extracting a standard pronunciation audio fingerprint corresponding to the standard pronunciation audio.
In this embodiment, the standard pronunciation audio may be a standard pronunciation audio file published by an authoritative person or organization and downloaded from a network resource over the internet, or a standard pronunciation file generated by the intelligent terminal directly recording standard pronunciation spoken by a person and/or played by another device.
Further, the standard pronunciation audio file is trimmed and clipped into audio segments, each corresponding to a particular piece of audio content. For example, when the standard pronunciation audio file contains the continuous standard pronunciations of the 26 English letters A-Z, it is cut into standard pronunciation audio segments for single English letters, with a one-to-one correspondence between segments and letters. The standard pronunciation audio file may also contain the standard pronunciation of only a single English letter.
Further, the standard pronunciation audio file may contain the standard pronunciation of any one or more of English letters, Chinese pinyin, Arabic numerals, and the like.
And step S20, acquiring user pronunciation audio, and extracting a user pronunciation audio fingerprint corresponding to the user pronunciation audio.
In this embodiment, extracting the user pronunciation audio fingerprint corresponding to the user pronunciation audio or the standard pronunciation audio fingerprint corresponding to the standard pronunciation audio specifically includes the steps shown in fig. 2:
Step S21, performing frame windowing on the standard pronunciation audio or the user pronunciation audio, and performing a fast Fourier transform on each short-time analysis window to obtain a corresponding spectrogram as shown in fig. 3.
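Step S21 (framing, windowing, and fast Fourier transform) can be sketched in Python with NumPy; the frame length, hop size, choice of a Hamming window, and the function name are illustrative assumptions, not values given in the patent:

```python
import numpy as np

def spectrogram(signal, frame_len=2048, hop=512):
    """Frame the signal, apply a Hamming window, and FFT each
    short-time analysis window to build a magnitude spectrogram."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop:i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft keeps the non-redundant half of the spectrum per frame
    return np.abs(np.fft.rfft(frames, axis=1))  # shape: (time, freq)
```

The overlap between consecutive windows (hop smaller than frame length) is the usual way to keep time resolution while each window stays long enough for a stable spectrum.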
And step S22, extracting a local peak point in the spectrogram.
In this embodiment, the local peak points may be extracted by applying image processing to the spectrogram with OpenCV, searching for local frequency maxima within a fixed time range, and labeling the peak points as shown in fig. 4.
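The embodiment performs this local-maximum search with OpenCV image processing; as an equivalent sketch, SciPy's `maximum_filter` can mark bins that dominate their time-frequency neighborhood, with the neighborhood size and amplitude floor as assumed parameters:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def local_peaks(spec, neighborhood=20, min_mag=None):
    """Mark time-frequency bins that equal the maximum of their
    local neighborhood; an amplitude floor discards noise peaks."""
    if min_mag is None:
        min_mag = np.mean(spec)  # assumed heuristic floor
    is_max = spec == maximum_filter(spec, size=neighborhood)
    t_idx, f_idx = np.nonzero(is_max & (spec > min_mag))
    return list(zip(t_idx.tolist(), f_idx.tolist()))  # (time, freq) pairs
```

Filtering by an amplitude floor keeps the fingerprint sparse, which is what makes the later hash matching cheap.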
And step S23, taking the spectrogram and/or the peak point as the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint. The present invention preferably employs the peak point as an audio fingerprint.
And step S30, matching the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint.
In this embodiment, matching the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint specifically includes the steps shown in fig. 5:
and step S31, grouping the peak points to obtain peak point combinations, and calculating the hash value corresponding to each peak point combination.
In this embodiment, as shown in fig. 6, the hash value corresponding to each peak point combination is calculated as follows: the N+1 peak points are grouped so that each combination comprises one anchor point and N peak points, the anchor point itself being one of the peak points; for example, 6 peak points may be combined into one group. Grouping in this way reduces the storage and computation required for the peak points.
Further, a three-dimensional array is created for each peak point from the frequency value of the peak point, the frequency value of the anchor point, and the time difference between the peak point and the anchor point. The three-dimensional array thus contains three pieces of information: the frequency of the anchor point corresponding to the peak point, the frequency of the peak point, and the time difference between the peak point and its anchor point. The hash value of the three-dimensional array is computed as the hash value of the peak point, and an audio fingerprint hash table corresponding to the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint is built. In this embodiment, the hash may be computed with a secure hash algorithm such as SHA-1.
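The anchor-and-peak hashing can be sketched as follows; the fan-out value, the string encoding of the (anchor frequency, peak frequency, time difference) triple, and the function name are illustrative assumptions, while SHA-1 follows the embodiment:

```python
import hashlib

def fingerprint_hashes(peaks, fan_out=5):
    """For each anchor, pair it with the next `fan_out` peaks and
    SHA-1 hash the triple (anchor freq, peak freq, time difference).
    `peaks` is a time-sorted list of (time, frequency) pairs."""
    hashes = []
    for i, (t_a, f_a) in enumerate(peaks):
        for t_p, f_p in peaks[i + 1:i + 1 + fan_out]:
            triple = f"{f_a}|{f_p}|{t_p - t_a}"
            digest = hashlib.sha1(triple.encode()).hexdigest()
            hashes.append((digest, t_a))  # entry for the hash table
    return hashes
```

Because the triple uses only frequencies and a time difference, the hash is invariant to where in the recording the pattern occurs, which is what makes standard and user audio comparable.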
And step S32, matching the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint according to the hash value.
In this embodiment, matching the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint means looking up the hash value of every peak point of the user pronunciation audio in the standard pronunciation audio fingerprint hash table, thereby matching the hash values of all peak points in the two tables and counting the number of hash values that match between them.
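Counting the hash values that match between the two fingerprint hash tables reduces to a set lookup; this sketch assumes each hash-table entry is a (hash value, anchor time) pair, an illustrative format rather than one specified in the patent:

```python
def count_matches(standard_hashes, user_hashes):
    """Count how many user fingerprint hashes also appear in the
    standard fingerprint hash table (membership test on hash keys)."""
    standard_keys = {h for h, _ in standard_hashes}
    return sum(1 for h, _ in user_hashes if h in standard_keys)
```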
And step S40, setting a scoring threshold, and if the matching degree reaches the corresponding scoring threshold, carrying out pronunciation scoring on the user pronunciation audio.
In this embodiment, a scoring threshold is set, and the user pronunciation audio is scored according to the number of matched hash values. For example, if the standard audio fingerprint hash table contains 100 peak points and their hash values, the user pronunciation audio is judged qualified when 80 to 100 of its peak-point hash values match hash values in the standard table, and unqualified otherwise.
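The 80-of-100 example amounts to a ratio test against the scoring threshold; this sketch, with assumed function and parameter names, returns both the match ratio and the pass/fail decision:

```python
def score_pronunciation(n_matched, n_standard, pass_ratio=0.8):
    """Score as the fraction of standard-fingerprint hashes matched;
    qualified when the ratio reaches the threshold (e.g. 80 of 100
    peak hashes, as in the example in the text)."""
    ratio = n_matched / n_standard
    return ratio, ratio >= pass_ratio
```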
When the pronunciation score of the user pronunciation audio is judged to be unqualified, the pronunciation evaluating method based on the audio fingerprint further comprises the following steps as shown in fig. 7:
and step S50, after the pronunciation score is carried out on the pronunciation audio of the user, when the pronunciation score is unqualified, the standard pronunciation audio is pushed to the user, the learned pronunciation audio is obtained, and the learned pronunciation audio fingerprint corresponding to the learned pronunciation audio is extracted.
And step S60, matching the standard pronunciation audio fingerprint with the learned pronunciation audio fingerprint, and further performing pronunciation scoring until the pronunciation scoring is qualified.
The audio-fingerprint-based pronunciation evaluation method provided by the invention converts an audio file into a spectrogram, extracts peak points from the spectrogram, groups the peak points and calculates hash values, and takes the spectrogram and/or the peak points as the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint. The standard pronunciation audio fingerprint is matched against the user pronunciation audio fingerprint according to the hash values, and if the matching degree reaches the corresponding scoring threshold, the user pronunciation audio is scored. This achieves automatic evaluation of the user's pronunciation and lets the user intuitively sense whether the pronunciation is qualified compared with the standard pronunciation. In addition, when the pronunciation score is unqualified, the standard pronunciation audio is pushed to the user; after the user studies it, scoring is performed again until the pronunciation is qualified. The user is thereby automatically guided through pronunciation-correction practice, the automation and accuracy of pronunciation teaching are improved, and non-standard pronunciation caused by uncontrollable factors in human teaching is avoided.
Referring to fig. 8, as an implementation of the methods shown in the above diagrams, the present invention provides a pronunciation evaluation terminal 100 based on audio fingerprints, where the terminal 100 includes a storage module 110, a collection module 120, an extraction module 130, an evaluation module 140, and an audiovisual module 150. Fig. 8 shows only some of the modules of the terminal 100, but it is to be understood that not all of the shown modules are required to be implemented, and more or fewer modules may be implemented instead.
In this embodiment, the terminal 100 may be implemented in various forms, such as a mobile terminal, e.g., a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a smart watch, and a fixed terminal, e.g., a digital television, a desktop computer, and the like.
The storage module 110 is configured to store a pronunciation scoring program and standard pronunciation audio.
In this embodiment, the storage module 110 may be an internal storage unit of the terminal 100, such as a hard disk or a memory of a mobile phone, an external storage device of the terminal, such as a plug-in hard disk, a smart card, a secure digital card, a flash memory card, and the like, and may include both the internal storage unit and the external storage device.
The acquisition module 120 is configured to acquire a user pronunciation audio.
In this embodiment, the collecting module 120 may also be used to collect the standard pronunciation audio, directly recording standard pronunciation spoken by a person and/or played by another device to generate a standard pronunciation audio file.
The extracting module 130 is configured to extract a standard pronunciation audio fingerprint corresponding to the standard pronunciation audio and a user pronunciation audio fingerprint corresponding to the user pronunciation audio.
In this embodiment, as shown in fig. 9, the extracting module 130 further includes an audio conversion module 131, an image processing module 132, and a hash value calculating module 133. Wherein:
the audio conversion module 131 is configured to perform frame windowing on the standard pronunciation audio or the user pronunciation audio, and perform fast fourier transform on each short-time analysis window to obtain a corresponding spectrogram.
The image processing module 132 is configured to extract a local peak point in the spectrogram.
In this embodiment, extracting the local peak point in the spectrogram may be performing image processing on the spectrogram through an OpenCV technique, and searching for a local frequency maximum value within a fixed time range.
The hash value calculating module 133 is configured to group the peak points into peak point combinations and calculate the hash value corresponding to each combination. The N+1 peak points are grouped so that each combination comprises one anchor point and N peak points, the anchor point itself being one of the peak points; for example, 6 peak points may be combined into one group. Grouping in this way reduces the storage and computation required for the peak points.
Further, a three-dimensional array is created for each peak point from the frequency value of the peak point, the frequency value of the anchor point, and the time difference between the peak point and the anchor point. The three-dimensional array thus contains three pieces of information: the frequency of the anchor point corresponding to the peak point, the frequency of the peak point, and the time difference between the peak point and its anchor point. The hash value of the three-dimensional array is computed as the hash value of the peak point, and an audio fingerprint hash table corresponding to the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint is built. In this embodiment, the hash may be computed with a secure hash algorithm such as SHA-1.
The evaluating module 140 is configured to match the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint, so as to perform pronunciation scoring on the user pronunciation audio.
In this embodiment, as shown in fig. 10, the evaluating module 140 includes a matching statistic module 141 and a matching scoring module 142. Wherein:
the matching statistic module 141 is configured to match the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint according to the hash value.
In this embodiment, matching the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint means looking up the hash value of every peak point of the user pronunciation audio in the standard pronunciation audio fingerprint hash table, thereby matching the hash values of all peak points in the two tables and counting the number of hash values that match between them.
The matching scoring module 142 is configured to set a scoring threshold, and perform pronunciation scoring on the user pronunciation audio if the matching degree reaches the corresponding scoring threshold.
In this embodiment, a scoring threshold is set, and the user pronunciation audio is scored according to the number of matched hash values. For example, if the standard audio fingerprint hash table contains 100 peak points and their hash values, the user pronunciation audio is judged qualified when 80 to 100 of its peak-point hash values match hash values in the standard table, and unqualified otherwise.
The audio-visual module 150 is configured to play the collected user pronunciation audio and the corresponding standard pronunciation audio, and display a pronunciation scoring result.
According to the audio-fingerprint-based pronunciation evaluation terminal provided by the invention, an audio file is converted into a spectrogram, peak points are extracted from the spectrogram, the peak points are grouped and hash values are calculated, and the spectrogram and/or the peak points serve as the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint. The standard pronunciation audio fingerprint is matched against the user pronunciation audio fingerprint according to the hash values, and if the matching degree reaches the corresponding scoring threshold, the user pronunciation audio is scored, achieving automatic evaluation of the user's pronunciation and letting the user intuitively sense whether the pronunciation is qualified compared with the standard pronunciation. In addition, when the pronunciation score is unqualified, the standard pronunciation audio is pushed to the user; after the user studies it, scoring is performed again until the pronunciation is qualified. The user is thereby automatically guided through pronunciation-correction practice, the automation and accuracy of pronunciation teaching are improved, and non-standard pronunciation caused by uncontrollable factors in human teaching is avoided.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural.
Finally, it should be noted that the above embodiments are only intended to illustrate, not to limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art will understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit.

Claims (10)

1. A pronunciation evaluation method based on audio fingerprints is characterized by comprising the following steps:
collecting standard pronunciation audio and extracting standard pronunciation audio fingerprints corresponding to the standard pronunciation audio;
acquiring user pronunciation audio, and extracting user pronunciation audio fingerprints corresponding to the user pronunciation audio;
matching the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint;
and setting a scoring threshold, and if the matching degree reaches the corresponding scoring threshold, carrying out pronunciation scoring on the user pronunciation audio.
2. The method for evaluating pronunciation based on an audio fingerprint according to claim 1, wherein the method further comprises:
after the pronunciation scoring is carried out on the pronunciation audio of the user, when the pronunciation scoring is unqualified, the standard pronunciation audio is pushed to the user, the learned pronunciation audio is obtained, and the learned pronunciation audio fingerprint corresponding to the learned pronunciation audio is extracted;
and matching the standard pronunciation audio fingerprint with the learned pronunciation audio fingerprint, and further performing pronunciation scoring until the pronunciation scoring is qualified.
3. The method for evaluating pronunciation according to claim 1 or 2, wherein extracting the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint comprises:
performing frame windowing on the standard pronunciation audio or the user pronunciation audio, and performing fast Fourier transform on each short-time analysis window to obtain a corresponding spectrogram;
extracting a local peak point in the spectrogram;
and taking the spectrogram and/or the peak point as the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint.
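The extraction step of claim 3 can be sketched in a minimal form: frame the signal, apply a window, take the FFT of each short-time analysis window to obtain a spectrogram, and keep local maxima as candidate peak points. The frame length, hop size, window choice (Hamming), and magnitude floor below are illustrative assumptions, not values mandated by the patent.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Frame + window the signal, then FFT each short-time analysis window."""
    window = np.hamming(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    # Magnitude spectrum per frame -> (time, frequency) matrix
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

def local_peaks(spec, min_mag=1.0):
    """Return (frame, bin) pairs that dominate their 3x3 neighbourhood."""
    peaks = []
    for t in range(1, spec.shape[0] - 1):
        for f in range(1, spec.shape[1] - 1):
            patch = spec[t - 1:t + 2, f - 1:f + 2]
            if spec[t, f] >= min_mag and spec[t, f] == patch.max():
                peaks.append((t, f))
    return peaks

fs = 8000
t = np.arange(fs) / fs
spec = spectrogram(np.sin(2 * np.pi * 1000 * t))  # pure 1 kHz tone
peaks = local_peaks(spec, min_mag=10.0)
# A 1 kHz tone at fs=8000 with a 256-point FFT falls in bin 32 (1000/31.25)
```

For the pure tone every detected peak sits in the FFT bin nearest 1 kHz, which is a quick sanity check that the peak picker tracks spectral energy rather than noise.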
4. The method for evaluating pronunciation according to claim 3, wherein the matching of the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint comprises:
grouping the peak points to obtain peak point combinations, and calculating hash values corresponding to the peak point combinations;
and matching the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint according to the hash value.
5. The method for evaluating pronunciation according to claim 4, wherein the calculating of the hash value corresponding to each peak point combination comprises:
the peak point combination comprises an anchor point and N peak points, and a three-dimensional array corresponding to the peak points is created according to the frequency value of the peak points, the frequency value of the anchor point and the time difference between the peak points and the anchor point;
and calculating the hash value of the three-dimensional array to serve as the hash value of the peak point, and establishing an audio fingerprint hash table corresponding to the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint.
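The hashing scheme of claim 5 pairs each anchor point with nearby peak points and hashes the triple (peak frequency, anchor frequency, time difference). A minimal sketch follows; the use of each peak as an anchor for the next N peaks, the fan-out value, and truncated SHA-1 as the hash function are assumptions for illustration, since the claim does not fix a particular hash.

```python
import hashlib

def peak_hashes(peaks, fan_out=3):
    """peaks: list of (time, freq) tuples sorted by time.
    Each peak acts as an anchor for the next `fan_out` peaks."""
    table = {}
    for i, (t_anchor, f_anchor) in enumerate(peaks):
        for t_peak, f_peak in peaks[i + 1:i + 1 + fan_out]:
            # Three-element array of claim 5: peak frequency, anchor
            # frequency, and the time difference between them.
            triple = (f_peak, f_anchor, t_peak - t_anchor)
            h = hashlib.sha1(str(triple).encode()).hexdigest()[:10]
            # Audio fingerprint hash table: hash value -> anchor time
            table[h] = t_anchor
    return table

demo = [(0, 40), (2, 55), (3, 47), (7, 60)]
table = peak_hashes(demo)
print(len(table))  # 6 distinct peak-pair hashes from 4 peaks
```

Four peaks with a fan-out of 3 yield 3 + 2 + 1 = 6 anchor/peak pairs, hence 6 entries in the hash table.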
6. The method for evaluating pronunciation according to claim 5, wherein matching the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint according to the hash value comprises:
matching the hash values of all peak points in the audio fingerprint hash table of the standard pronunciation audio with the hash values of all peak points in the audio fingerprint hash table of the user pronunciation audio, and counting the number of hash values that can be matched.
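The matching step of claim 6 reduces to intersecting the hash keys of the two fingerprint hash tables and counting the overlap. The sketch below assumes the hash-to-anchor-time table layout used in the hashing sketch; the function name and sample values are hypothetical.

```python
# Illustrative sketch: count hash values shared by the standard and user
# audio fingerprint hash tables (hash value -> anchor time).
def count_matches(standard_table: dict, user_table: dict) -> int:
    return len(standard_table.keys() & user_table.keys())

standard_table = {"a1": 0, "b2": 2, "c3": 5, "d4": 9}
user_table = {"a1": 1, "c3": 4, "e5": 7}
print(count_matches(standard_table, user_table))  # 2
```

The resulting count is what the scoring step compares against the threshold of claims 1 and 9.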
7. A pronunciation evaluation terminal based on audio fingerprints is characterized in that the terminal comprises:
the storage module is used for storing a pronunciation scoring program and standard pronunciation audio;
the acquisition module is used for acquiring pronunciation audio of a user;
the extraction module is used for extracting the standard pronunciation audio fingerprint corresponding to the standard pronunciation audio and the user pronunciation audio fingerprint corresponding to the user pronunciation audio;
the evaluation module is used for matching the standard pronunciation audio fingerprint with the user pronunciation audio fingerprint so as to carry out pronunciation scoring on the user pronunciation audio;
and the audio-visual module is used for playing the collected user pronunciation audio and the corresponding standard pronunciation audio and displaying the pronunciation scoring result.
8. The pronunciation evaluation terminal as claimed in claim 7, wherein the extracting module comprises:
the audio conversion module is used for performing frame windowing processing on the standard pronunciation audio or the user pronunciation audio and performing fast Fourier transform on each short-time analysis window to obtain a corresponding spectrogram;
the image processing module is used for extracting a local peak point in the spectrogram;
and the hash value calculation module is used for grouping the peak points to obtain peak point combinations, calculating hash values corresponding to the peak point combinations, taking the spectrogram and/or the peak points as the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint, and establishing an audio fingerprint hash table corresponding to the standard pronunciation audio fingerprint or the user pronunciation audio fingerprint.
9. The pronunciation evaluation terminal as claimed in claim 7, wherein the evaluation module comprises:
the matching statistic module is used for matching the hash values of all peak points in the audio fingerprint hash table of the standard pronunciation audio with the hash values of all peak points in the audio fingerprint hash table of the user pronunciation audio and counting the number of the hash values which can be matched with each other;
and the matching scoring module is used for setting a scoring threshold value and scoring the pronunciation of the user audio according to the number of the matched hash values.
10. The pronunciation evaluation terminal according to any one of claims 7-9, wherein the evaluation module is further configured to:
and when the pronunciation score is unqualified, pushing the standard pronunciation audio to the audio-visual module for a user to learn the standard pronunciation audio until the pronunciation score is qualified.
CN202010467723.8A 2020-05-28 2020-05-28 Pronunciation evaluation method and terminal based on audio fingerprints Pending CN111710348A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010467723.8A CN111710348A (en) 2020-05-28 2020-05-28 Pronunciation evaluation method and terminal based on audio fingerprints

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010467723.8A CN111710348A (en) 2020-05-28 2020-05-28 Pronunciation evaluation method and terminal based on audio fingerprints

Publications (1)

Publication Number Publication Date
CN111710348A true CN111710348A (en) 2020-09-25

Family

ID=72538459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010467723.8A Pending CN111710348A (en) 2020-05-28 2020-05-28 Pronunciation evaluation method and terminal based on audio fingerprints

Country Status (1)

Country Link
CN (1) CN111710348A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102456346A (en) * 2010-10-19 2012-05-16 盛乐信息技术(上海)有限公司 Concatenated speech detection system and method
CN107886968A (en) * 2017-12-28 2018-04-06 广州讯飞易听说网络科技有限公司 Speech evaluating method and system
CN108961856A (en) * 2018-07-19 2018-12-07 深圳乐几科技有限公司 Verbal learning method and apparatus
CN110602303A (en) * 2019-08-30 2019-12-20 厦门快商通科技股份有限公司 Method and system for preventing telecommunication fraud based on audio fingerprint technology
CN111161758A (en) * 2019-12-04 2020-05-15 厦门快商通科技股份有限公司 Song listening and song recognition method and system based on audio fingerprint and audio equipment
CN111199750A (en) * 2019-12-18 2020-05-26 北京葡萄智学科技有限公司 Pronunciation evaluation method and device, electronic equipment and storage medium

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112233693A (en) * 2020-10-14 2021-01-15 腾讯音乐娱乐科技(深圳)有限公司 Sound quality evaluation method, device and equipment
WO2022078164A1 (en) * 2020-10-14 2022-04-21 腾讯音乐娱乐科技(深圳)有限公司 Sound quality evaluation method and apparatus, and device
CN112233693B (en) * 2020-10-14 2023-12-01 腾讯音乐娱乐科技(深圳)有限公司 Sound quality evaluation method, device and equipment
CN112214635A (en) * 2020-10-23 2021-01-12 昆明理工大学 Fast audio retrieval method based on cepstrum analysis
CN113782055A (en) * 2021-07-15 2021-12-10 北京墨闻教育科技有限公司 Student characteristic-based voice evaluation method and system
CN117219125A (en) * 2023-11-07 2023-12-12 青岛科技大学 Marine mammal sound signal imitation hidden scoring method based on audio fingerprint
CN117219125B (en) * 2023-11-07 2024-01-30 青岛科技大学 Marine mammal sound signal imitation hidden scoring method based on audio fingerprint

Similar Documents

Publication Publication Date Title
CN111710348A (en) Pronunciation evaluation method and terminal based on audio fingerprints
CN110782921B (en) Voice evaluation method and device, storage medium and electronic device
US20200286396A1 (en) Following teaching system having voice evaluation function
CN110706536B (en) Voice answering method and device
CN100514446C (en) Pronunciation evaluating method based on voice identification and voice analysis
CN111462553B (en) Language learning method and system based on video dubbing and sound correction training
US10089898B2 (en) Information processing device, control method therefor, and computer program
CN111107442B (en) Method and device for acquiring audio and video files, server and storage medium
CN109462603A (en) Voiceprint authentication method, equipment, storage medium and device based on blind Detecting
CN111951629A (en) Pronunciation correction system, method, medium and computing device
CN112287175A (en) Method and system for predicting highlight segments of video
CN111915940A (en) Method, system, terminal and storage medium for evaluating and teaching spoken language pronunciation
CN113254708A (en) Video searching method and device, computer equipment and storage medium
CN112184503A (en) Children multinomial ability scoring method and system for preschool education quality evaluation
CN111966839B (en) Data processing method, device, electronic equipment and computer storage medium
JP2013088552A (en) Pronunciation training device
CN113657509A (en) Teaching training improving method and device, terminal and storage medium
US10971148B2 (en) Information providing device, information providing method, and recording medium for presenting words extracted from different word groups
CN110046354B (en) Recitation guiding method, apparatus, device and storage medium
KR102170844B1 (en) Lecture voice file text conversion system based on lecture-related keywords
CN110288977B (en) Data processing method and device and electronic equipment
CN113409774A (en) Voice recognition method and device and electronic equipment
CN112164262A (en) Intelligent paper reading tutoring system
CN111078992A (en) Dictation content generation method and electronic equipment
CN112669181B (en) Assessment method for education practice training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200925