CN114170997A - Pronunciation skill detection method, pronunciation skill detection device, storage medium and electronic equipment


Info

Publication number
CN114170997A
CN114170997A
Authority
CN
China
Prior art keywords
matrix
feature
phoneme
inputting
pronunciation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111620731.2A
Other languages
Chinese (zh)
Inventor
李芳足
吴奎
金海
李浩
盛志超
竺博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202111620731.2A
Publication of CN114170997A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00-G10L 21/00
    • G10L 25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/69: Speech or voice analysis techniques specially adapted for evaluating synthetic or decoded voice signals
    • G10L 2015/025: Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

A pronunciation skill detection method, a pronunciation skill detection device, a storage medium and an electronic device are provided. The method comprises: acquiring a text to be detected and converting it into a corresponding phoneme sequence; acquiring the audio to be detected, obtained by a speaker speaking the text to be detected, and extracting acoustic features of the audio to be detected; and inputting the phoneme sequence and the acoustic features into a trained pronunciation skill detection model for pronunciation skill detection processing to obtain a first detection result and a second detection result. The first detection result indicates whether the text to be detected needs to be spoken using a pronunciation skill, and the second detection result indicates whether the speaker spoke the text to be detected using the pronunciation skill. The method can improve the accuracy of pronunciation skill detection.

Description

Pronunciation skill detection method, pronunciation skill detection device, storage medium and electronic equipment
Technical Field
The application relates to the technical field of speech recognition, and in particular to a pronunciation skill detection method, a pronunciation skill detection device, a storage medium and electronic equipment.
Background
At present, for any language, whether Chinese or English, spoken language is central to mastering the language. For English learners, for example, spoken pronunciation is often both the key point and the weak point of the learning process, and whether pronunciation skills such as linking (continuous reading), loss of plosion and voicing are used accurately reflects a learner's spoken-language ability. In the related art, a speaker's pronunciation is usually assessed by human listening; however, factors such as subjective judgment and listening fatigue affect the accuracy of the resulting pronunciation skill detection.
Disclosure of Invention
The application provides a pronunciation skill detection method, a pronunciation skill detection device, a storage medium and an electronic device, which can improve the accuracy of pronunciation skill detection.
The pronunciation skill detection method provided by the application comprises the following steps:
acquiring a text to be detected, and converting the text to be detected into a corresponding phoneme sequence;
acquiring the audio to be detected, which is obtained by a speaker speaking the text to be detected, and extracting acoustic features of the audio to be detected;
inputting the phoneme sequence and the acoustic features into a trained pronunciation skill detection model to perform pronunciation skill detection processing to obtain a first detection result and a second detection result;
the first detection result is used for indicating whether the text to be detected needs to be spoken using a pronunciation skill, and the second detection result is used for indicating whether the speaker spoke the text to be detected using the pronunciation skill.
The application provides a pronunciation skill detection device, includes:
the first acquisition module is used for acquiring a text to be detected and converting the text to be detected into a corresponding phoneme sequence;
the second acquisition module is used for acquiring the audio to be detected, obtained by the speaker speaking the text to be detected, and extracting the acoustic features of the audio to be detected;
the detection module is used for inputting the phoneme sequence and the acoustic features into the trained pronunciation skill detection model to carry out pronunciation skill detection processing so as to obtain a first detection result and a second detection result;
the first detection result is used for indicating whether the text to be detected needs to be spoken using a pronunciation skill, and the second detection result is used for indicating whether the speaker spoke the text to be detected using the pronunciation skill.
The present application provides a storage medium having stored thereon a computer program which, when loaded by a processor, performs the steps of the pronunciation skill detection method provided herein.
The electronic equipment provided by the application comprises a processor and a memory, where the memory stores a computer program and the processor executes the steps of the pronunciation skill detection method provided by the application by loading the computer program.
According to the method and the device, a text to be detected is acquired, and the audio produced by a speaker speaking that text is acquired as the audio to be detected; the audio and text are then used to detect the speaker's pronunciation. The text to be detected is converted into a corresponding phoneme sequence, the acoustic features of the audio to be detected are extracted, and the phoneme sequence and the acoustic features are input into a trained pronunciation skill detection model for pronunciation skill detection processing, yielding a first detection result and a second detection result. The first detection result indicates whether the text to be detected needs to be spoken using a pronunciation skill, and the second detection result indicates whether the speaker spoke the text using the pronunciation skill. Compared with the prior art, this artificial-intelligence-based detection replaces traditional human listening, avoiding subjective judgment and listening fatigue and thereby improving the accuracy of pronunciation skill detection.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a scene schematic diagram of a pronunciation skill detection system according to an embodiment of the present application.
Fig. 2 is a schematic flow chart of a pronunciation skill detection method according to an embodiment of the present application.
Fig. 3 is an exemplary diagram of extracting acoustic features in the embodiment of the present application.
Fig. 4 is a block diagram of a pronunciation skill detection model according to an embodiment of the present application.
FIG. 5 is a block diagram of a phoneme feature extraction network in the pronunciation skill detection model.
Fig. 6 is a block diagram showing the structure of a phoneme feature extraction submodule inside the phoneme feature extraction module of the phoneme feature extraction network.
FIG. 7 is another block diagram of the structure of the phoneme feature extraction network in the pronunciation skill detection model.
FIG. 8 is a block diagram of the structure of the feature coding module inside the acoustic feature enhancement network in the pronunciation skill detection model.
FIG. 9 is a block diagram of a refinement of the acoustic feature enhancement module within the acoustic feature enhancement network in the pronunciation skill detection model.
FIG. 10 is a block diagram of a feature fusion network in the pronunciation skill detection model.
Fig. 11 is a block diagram showing the structure of a first pronunciation skill detection network in the pronunciation skill detection model.
Fig. 12 is a block diagram showing the structure of a branch detection network inside the second pronunciation skill detection network in the pronunciation skill detection model.
Fig. 13 is a block diagram of a pronunciation skill detection apparatus according to an embodiment of the present application.
Fig. 14 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
It is to be appreciated that the principles of the present application are described as implemented in a suitable computing environment. The following description is based on the illustrated embodiments of the application and should not be taken as limiting other embodiments not detailed herein. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort fall within the protection scope of the present application.
Relational terms such as first and second, and the like may be used solely to distinguish one object or operation from another object or operation without necessarily limiting the actual sequential relationship between the objects or operations. In the description of the embodiments of the present application, "a plurality" means two or more unless specifically defined otherwise.
Artificial Intelligence (AI) is a theory, method, technique and application system that utilizes a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline that spans a wide range of fields, involving both hardware-level and software-level technology. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big-data processing, operation/interaction systems and mechatronics. Artificial intelligence software technology mainly includes Machine Learning (ML), of which Deep Learning (DL) is a newer research direction, introduced to bring machine learning closer to its original goal: artificial intelligence. At present, deep learning is mainly applied in fields such as computer vision and natural language processing.
Deep learning learns the intrinsic regularities and representation levels of sample data, and the information obtained in the process greatly helps the interpretation of data such as text, images and sound. Using deep learning and a corresponding training data set, network models realizing different functions can be obtained through training; for example, a deep learning network for gender classification can be trained on one training data set, and an image-optimization deep learning network on another.
In order to improve the efficiency of pronunciation skill detection, the application introduces deep learning into pronunciation skill detection, and correspondingly provides a pronunciation skill detection method, a pronunciation skill detection device, a storage medium and an electronic device. Wherein the pronunciation skill detection method is executable by the electronic device.
Referring to fig. 1, the present application further provides a pronunciation skill detection system. As shown in fig. 1, the system includes an electronic device 100. The electronic device may acquire a text to be detected for pronunciation skill detection and convert it into a corresponding phoneme sequence. When the electronic device is further configured with a microphone, it performs audio acquisition while a speaker speaks the text to be detected, thereby obtaining the audio to be detected, and extracts the acoustic features of that audio. It then inputs the phoneme sequence and the acoustic features into a trained pronunciation skill detection model for pronunciation skill detection processing, obtaining a first detection result and a second detection result, where the first detection result indicates whether the text to be detected needs to be spoken using a pronunciation skill and the second detection result indicates whether the speaker spoke the text using the pronunciation skill.
The electronic device 100 may be any device equipped with a processor and having processing capability, such as a mobile electronic device with a processor, such as a smart phone, a tablet computer, a palm computer, and a notebook computer, or a stationary electronic device with a processor, such as a desktop computer, a television, and a server.
In addition, as shown in fig. 1, the pronunciation skill detection system may further include a storage device 200 for storing data, including but not limited to raw data, intermediate data, result data, and the like obtained in the pronunciation skill detection process, for example, the electronic device 100 may store the obtained text to be detected, the audio to be detected, the phoneme sequence converted from the text to be detected, the acoustic features extracted from the audio to be detected, and the first detection result and the second detection result output by the pronunciation skill detection model in the storage device 200.
It should be noted that the scene diagram of the pronunciation skill detection system shown in fig. 1 is only an example. The system and scene described in this embodiment are intended to illustrate the technical solution of the embodiment more clearly and do not limit it; as pronunciation skill detection systems evolve and new service scenes appear, the technical solution provided in this embodiment remains applicable to similar technical problems.
Referring to fig. 2, fig. 2 is a flowchart illustrating a pronunciation skill detection method according to an embodiment of the present disclosure. As shown in fig. 2, the flow of the pronunciation skill detection method provided by the embodiment of the present application may be as follows:
in S310, a text to be detected is obtained, and the text to be detected is converted into a corresponding phoneme sequence.
The text to be detected refers to a text for pronunciation skill detection, and the pronunciation skill detection comprises detecting whether the text to be detected needs to be spoken by pronunciation skill or not and detecting whether the speaker speaks the text to be detected by pronunciation skill or not.
It should be noted that different languages such as Chinese and English have their own pronunciation skills. Taking English as an example, there are pronunciation skills such as linking (continuous reading), loss of plosion, and voicing.
Continuous reading (linking) means that the final phoneme of one word and the initial phoneme of the next are spelled together naturally, with no pause in between.
Loss of plosion means that when two plosives (such as /p/, /b/, /t/, /d/, /k/ and /g/) are adjacent, the first plosive is only articulated, i.e., the mouth forms the obstruction at the place of articulation, but the sound is not released; after a slight pause, the following consonant is pronounced. The first plosive is said to lose its plosion, e.g., goo(d) bye.
Voicing occurs when an unvoiced consonant is preceded by the sound /s/, has a corresponding voiced consonant, and is followed by a vowel; the unvoiced consonant is then read as its voiced counterpart. Taking "speak" as an example, the unvoiced consonant /p/ is preceded by /s/, its voiced counterpart is /b/, and it is followed by the vowel /iː/, so the original /spiːk/ is read as /sbiːk/.
Therefore, the pronunciation skill detection method provided by the application can be used for pronunciation skill detection in any language; accordingly, the text to be detected can be in any language, depending on the actual detection requirement. After obtaining the text to be detected, the electronic device converts it into a corresponding phoneme sequence, for example according to a pronunciation dictionary.
In an alternative embodiment, converting the text to be detected into the corresponding phoneme sequence includes:
removing unvoiced text units in the text to be detected to obtain a new text to be detected;
and converting each text unit in the new text to be detected into a corresponding phoneme unit to obtain a phoneme sequence.
It will be appreciated that for any text, not all of the text units need to be pronounced when the text is spoken, e.g., for punctuation in the text, pronunciation is not required.
Therefore, in order to eliminate the interference of unvoiced text units and improve the accuracy of pronunciation skill detection, when converting the text to be detected into the corresponding phoneme sequence, the electronic device first removes the unvoiced text units (such as punctuation and emoticons) to obtain a new text to be detected, and then converts each text unit of the new text into the corresponding phoneme unit according to the pronunciation dictionary.
For example, when pronunciation skill detection for English is required and the electronic device obtains the English text to be detected "Please turn on the light.", the punctuation mark "." in that text is an unvoiced text unit. Removing it yields the new text to be detected "Please turn on the light"; the electronic device then converts each text unit of the new text into the corresponding phoneme unit according to the pronunciation dictionary and obtains the phoneme sequence (rendered as an image in the original document).
In addition, in order to more clearly characterize the phoneme sequence, the electronic device may further add a start flag and an end flag before and after the phoneme sequence, respectively, characterize the beginning of the phoneme sequence by the start flag, and characterize the end of the phoneme sequence by the end flag. The specific configuration of the start flag and the end flag is not limited herein, and can be configured by those skilled in the art according to actual needs.
For example, the start flag may be configured as "<bos>" and the end flag as "<eos>"; after the electronic device adds the start flag and the end flag to the phoneme sequence above, the sequence becomes "<bos> ... <eos>" (likewise rendered as an image in the original document).
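To make this conversion step concrete, the following is a minimal sketch of the text-to-phoneme conversion described above, assuming a toy pronunciation dictionary (a real system would use a full lexicon such as CMUdict); the dictionary entries, function name and phoneme symbols are illustrative, not taken from the patent.

```python
import re

# Toy pronunciation dictionary; a real system would use a full
# lexicon such as CMUdict. All entries here are illustrative.
PRONUNCIATION_DICT = {
    "please": ["p", "l", "iy", "z"],
    "turn": ["t", "er", "n"],
    "on": ["aa", "n"],
    "the": ["dh", "ah"],
    "light": ["l", "ay", "t"],
}

def text_to_phoneme_sequence(text: str) -> list:
    """Convert a text to be detected into a phoneme sequence.

    Unvoiced text units (punctuation, symbols) are removed first;
    each remaining text unit (word) is then looked up in the
    pronunciation dictionary, and <bos>/<eos> flags mark the
    beginning and end of the sequence.
    """
    cleaned = re.sub(r"[^\w\s]", "", text)  # drop unvoiced text units
    phonemes = ["<bos>"]
    for word in cleaned.lower().split():
        phonemes.extend(PRONUNCIATION_DICT[word])
    phonemes.append("<eos>")
    return phonemes

print(text_to_phoneme_sequence("Please turn on the light."))
# ['<bos>', 'p', 'l', 'iy', 'z', 't', 'er', 'n', 'aa', 'n',
#  'dh', 'ah', 'l', 'ay', 't', '<eos>']
```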
In S320, the audio to be detected obtained by the speaker speaking the text to be detected is obtained, and the acoustic features of the audio to be detected are extracted.
In this embodiment, the electronic device not only converts the text to be detected into the corresponding phoneme sequence but also obtains the audio to be detected, produced by the speaker speaking the text. The data format of the audio is not particularly limited and can be configured by those skilled in the art according to the actual detection requirement.
The speaker may be a real person or a virtual person.
For example, when the speaker is a real person, the electronic device may record the person speaking the text to be detected through a configured audio acquisition device (built-in or external) and use the recorded audio as the audio to be detected; the electronic device may also obtain, from another electronic device, audio of the person speaking the text that was captured by that device. In either case, the electronic device can use the acquired audio to evaluate the real person's pronunciation ability with the pronunciation skill detection method provided by the application.
For another example, when the speaker is a virtual person, such as artificial-intelligence speech synthesis software, the electronic device may directly input the text to be detected into the speech synthesis software, let the software perform speech synthesis, and use the synthesized audio as the audio to be detected. Accordingly, the electronic device can use this audio to evaluate the speech synthesis capability of the software.
As described above, after the electronic device acquires the audio to be detected, it also extracts the acoustic features of that audio. Acoustic features are physical quantities representing the acoustic characteristics of speech, a general term for the acoustic representation of the various elements of sound, such as the energy concentration areas, formant frequencies, formant intensities and bandwidths that characterize timbre, and the duration, fundamental frequency and average speech power that characterize prosody.
In an optional embodiment, to further improve the accuracy of pronunciation skill detection, the extracting the acoustic features of the audio to be detected includes:
extracting the Filterbank feature, fundamental frequency feature and energy feature of the audio to be detected;
and fusing the Filterbank feature, the fundamental frequency feature and the energy feature to obtain the acoustic features.
In this embodiment, the Filterbank feature, the fundamental frequency feature and the energy feature are used as acoustic features related to pronunciation skills; accordingly, when extracting the acoustic features for pronunciation skill detection, the electronic device extracts the Filterbank feature, the fundamental frequency feature and the energy feature of the audio to be detected. For example, the electronic device may extract a 40-dimensional Filterbank feature of the audio to be detected.
After the Filterbank feature, the fundamental frequency feature and the energy feature of the audio to be detected are extracted, the electronic device fuses them according to the configured fusion strategy to obtain a fusion feature, which is used as the acoustic feature for pronunciation skill detection. The configuration of the fusion strategy is not particularly limited and may be set by those skilled in the art according to actual needs.
For example, referring to fig. 3, the fusion strategy configured in this embodiment splices the frame-aligned Filterbank, fundamental frequency and energy features along the time dimension to obtain the acoustic features for pronunciation skill detection.
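As an illustration of this extraction and fusion step, the sketch below computes a 40-dimensional Filterbank (log-mel) feature together with per-frame fundamental frequency and energy, and splices them into one frame-aligned feature map. The use of librosa, the 16 kHz sample rate and the 25 ms/10 ms frame parameters are assumptions; the patent does not fix any of these.

```python
import numpy as np
import librosa

def extract_acoustic_features(wav_path: str) -> np.ndarray:
    """Extract and fuse the Filterbank, fundamental frequency (F0)
    and energy features of the audio to be detected.

    Returns a (40 + 1 + 1) x T feature map whose columns are
    time-aligned frames.
    """
    y, sr = librosa.load(wav_path, sr=16000)
    n_fft, hop = 400, 160  # 25 ms window, 10 ms hop at 16 kHz

    # 40-dimensional Filterbank (log-mel) feature: shape (40, T)
    fbank = librosa.power_to_db(
        librosa.feature.melspectrogram(
            y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=40))
    # Fundamental frequency per frame: shape (T,)
    f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr,
                     frame_length=n_fft, hop_length=hop)
    # Frame energy (RMS): shape (T,)
    energy = librosa.feature.rms(y=y, frame_length=n_fft,
                                 hop_length=hop)[0]

    # Fuse by splicing the frame-aligned features into one feature map.
    return np.vstack([fbank, f0[np.newaxis, :], energy[np.newaxis, :]])
```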
In S330, the phoneme sequence and the acoustic features are input into the trained pronunciation skill detection model for pronunciation skill detection processing, so as to obtain a first detection result and a second detection result.
It should be noted that the present application trains a corresponding pronunciation skill detection model in advance for each language: for Chinese, a model for Chinese pronunciation skill detection is pre-trained, and for English, a model for English pronunciation skill detection is pre-trained. The structure and training method of the pronunciation skill detection model are not specifically limited and can be selected by those skilled in the art according to actual needs.
The pronunciation skill detection model is configured to take as input the acoustic features of the audio to be detected (produced by the speaker speaking the text to be detected) and the phoneme sequence derived from that text, and to output a detection result indicating whether the text needs to be spoken using a pronunciation skill and a detection result indicating whether the speaker spoke the text using the pronunciation skill.
Accordingly, in this embodiment, after obtaining the phoneme sequence and the acoustic features, the electronic device inputs them into a trained pronunciation skill detection model matched to the language of the text to be detected for pronunciation skill detection processing, and obtains the first detection result and the second detection result output by the model. The first detection result indicates whether the text to be detected needs to be spoken using a pronunciation skill, and the second detection result indicates whether the speaker spoke the text using the pronunciation skill.
For example, according to expert knowledge, the adjacent phonemes "n" and "a" in the phoneme sequence match a continuous-reading (linking) pattern and need to be read continuously. For the phoneme sequence and acoustic features corresponding to such a text to be detected, the first detection result output by the pronunciation skill detection model indicates that the text needs to be spoken using the pronunciation skill "continuous reading", and the second detection result output by the model indicates whether the speaker actually spoke the text using that skill.
In addition, the phoneme sequence may be input as the original phoneme sequence, or it may first be digitally encoded into a digital phoneme sequence, i.e., a sequence in which each phoneme is represented by a number. If the phoneme sequence is input in digital form, the phoneme sequence samples used to train the pronunciation skill detection model must likewise be in digital form.
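A digital phoneme sequence of this kind can be produced with a simple vocabulary lookup, as in the hypothetical snippet below (the phoneme vocabulary and ID assignments are illustrative):

```python
# Assumed phoneme vocabulary; the symbols and IDs are illustrative.
phoneme_vocab = {"<pad>": 0, "<bos>": 1, "<eos>": 2,
                 "p": 3, "l": 4, "iy": 5, "z": 6}

phoneme_seq = ["<bos>", "p", "l", "iy", "z", "<eos>"]
digital_seq = [phoneme_vocab[p] for p in phoneme_seq]
print(digital_seq)  # [1, 3, 4, 5, 6, 2]
```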
In an optional embodiment, the pronunciation skill detection model includes a phoneme feature extraction network, an acoustic feature enhancement network, a feature fusion network, a first pronunciation skill detection network and a second pronunciation skill detection network, and inputting the phoneme sequence and the acoustic features into the trained pronunciation skill detection model for pronunciation skill detection processing to obtain the first detection result and the second detection result includes:
inputting the phoneme sequence into a phoneme feature extraction network for feature extraction processing to obtain a phoneme feature matrix;
inputting the phoneme feature matrix into a first pronunciation skill detection network to perform pronunciation skill detection processing to obtain a first detection result;
if the first detection result represents that the text to be detected needs to be spoken by adopting pronunciation skills, inputting the acoustic features into an acoustic feature enhancement network for feature enhancement processing to obtain an enhanced acoustic feature matrix;
inputting the enhanced acoustic feature matrix and the phoneme feature matrix into a feature fusion network for feature fusion processing to obtain a fusion feature matrix;
and inputting the fusion feature matrix into a second pronunciation skill detection network to perform pronunciation skill detection processing to obtain a second detection result.
Referring to fig. 4, the pronunciation skill detection model provided in this embodiment is composed of 5 major parts, which are a phoneme feature extraction network, an acoustic feature enhancement network, a feature fusion network, a first pronunciation skill detection network and a second pronunciation skill detection network, respectively.
The phoneme feature extraction network is configured to perform feature extraction on an input phoneme sequence to obtain a phoneme feature matrix reflecting the interrelation between phonemes in the phoneme sequence.
The acoustic feature enhancement network is configured to perform feature enhancement processing on the input acoustic features, enhance features in which pronunciation skills are more relevant, and obtain an enhanced acoustic feature matrix.
The feature fusion network is configured to fuse the input phoneme feature matrix and the enhanced acoustic feature matrix, so that the resulting fusion feature matrix carries the mutual information between the phonemes and the acoustic features.
The first pronunciation skill detection network is configured to perform pronunciation skill detection processing on the input phoneme feature matrix and output a first detection result for representing whether the text to be detected needs to be spoken by adopting pronunciation skill.
The second pronunciation skill detection network is configured to perform pronunciation skill detection processing on the input fusion feature matrix and output a second detection result for representing whether the speaker speaks the text to be detected by using pronunciation skill.
Correspondingly, in this embodiment, when the phoneme sequence and the acoustic features are input into the trained pronunciation skill detection model for pronunciation skill detection processing, the electronic device may first input the phoneme sequence into the phoneme feature extraction network for feature extraction processing to obtain the phoneme feature matrix, and then input the phoneme feature matrix into the first pronunciation skill detection network for pronunciation skill detection processing to obtain the first detection result.
Meanwhile, the electronic device inputs the acoustic features into the acoustic feature enhancement network for feature enhancement processing to obtain the enhanced acoustic feature matrix, inputs the enhanced acoustic feature matrix and the phoneme feature matrix into the feature fusion network for feature fusion processing to obtain the fusion feature matrix, and inputs the fusion feature matrix into the second pronunciation skill detection network for pronunciation skill detection processing to obtain the second detection result.
In addition, the electronic device may decide whether to output the second detection result according to the first detection result. After obtaining the first and second detection results from the pronunciation skill detection model, the electronic device determines from the first detection result whether the text to be detected needs to be spoken using a pronunciation skill. If so, the electronic device outputs both detection results: the first indicates that the text needs to be spoken using the pronunciation skill, and the second indicates whether the speaker actually used it. If the text does not need to be spoken using a pronunciation skill, the electronic device may discard the second detection result and output only the first.
In other embodiments, if the first detection result indicates that the text to be detected needs to be spoken using a pronunciation skill, the electronic device inputs the acoustic features into the acoustic feature enhancement network for feature enhancement processing to obtain the enhanced acoustic feature matrix. The electronic device then inputs the enhanced acoustic feature matrix and the phoneme feature matrix into the feature fusion network for feature fusion processing to obtain the fusion feature matrix. Finally, the electronic device inputs the fusion feature matrix into the second pronunciation skill detection network for pronunciation skill detection processing to obtain the second detection result.
In addition, if the first detection result indicates that the text to be detected does not need to be spoken using a pronunciation skill, no further pronunciation skill detection is needed; in this case the electronic device no longer uses the acoustic features and may output only the first detection result.
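The conditional flow just described can be summarized in a short sketch. The five sub-networks are stand-ins for the structures detailed in the following sections; their internals, the tensor shapes and the argmax-based decision rule are illustrative assumptions, not the patent's definitive implementation.

```python
import torch
import torch.nn as nn

class PronunciationSkillDetector(nn.Module):
    """Top-level flow of the pronunciation skill detection model (sketch)."""

    def __init__(self, phoneme_net, enhance_net, fusion_net,
                 first_detect_net, second_detect_net):
        super().__init__()
        self.phoneme_net = phoneme_net            # phoneme feature extraction network
        self.enhance_net = enhance_net            # acoustic feature enhancement network
        self.fusion_net = fusion_net              # feature fusion network
        self.first_detect_net = first_detect_net  # first detection network
        self.second_detect_net = second_detect_net

    def forward(self, phoneme_ids, acoustic_features):
        phoneme_feat = self.phoneme_net(phoneme_ids)
        first_result = self.first_detect_net(phoneme_feat)

        # Variant in which the second detection only runs when the
        # first result says a pronunciation skill is required
        # (assuming class 1 of a binary classifier means "required").
        if not bool(first_result.argmax(dim=-1).any()):
            return first_result, None

        enhanced = self.enhance_net(acoustic_features)
        fused = self.fusion_net(enhanced, phoneme_feat)
        second_result = self.second_detect_net(fused)
        return first_result, second_result
```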
In an optional embodiment, the phoneme feature extraction network includes a phoneme embedding module and a phoneme feature extraction module, and the phoneme sequence is input into the phoneme feature extraction network to perform feature extraction processing, so as to obtain a phoneme feature matrix, including:
inputting the phoneme sequence into a phoneme embedding module for embedding to obtain a phoneme vector matrix;
and inputting the phoneme vector matrix into a phoneme feature extraction module for feature extraction processing to obtain a phoneme feature matrix.
Referring to fig. 5, in the present embodiment, the phoneme feature extraction network is composed of two parts, namely a phoneme embedding module and a phoneme feature extraction module, wherein the phoneme embedding module is configured to embed an input phoneme sequence and vectorize the input phoneme sequence to obtain a phoneme vector matrix; the phoneme feature extraction module is configured to perform feature extraction on the input phoneme vector matrix to obtain a phoneme feature matrix reflecting the interrelation between phonemes in the phoneme sequence.
Correspondingly, in this embodiment, when the phoneme sequence is input to the phoneme feature extraction network for feature extraction, the electronic device first inputs the phoneme sequence to the phoneme embedding module for embedding to obtain a phoneme vector matrix, and then inputs the phoneme vector matrix to the phoneme feature extraction module for feature extraction to obtain a phoneme feature matrix.
The phoneme feature extraction module comprises at least one phoneme feature extraction submodule, and inputting the phoneme vector matrix into the phoneme feature extraction module for feature extraction processing to obtain the phoneme feature matrix includes:
when the number of phoneme feature extraction submodules is 1, inputting the phoneme vector matrix into the phoneme feature extraction submodule for feature extraction processing to obtain the phoneme feature matrix; alternatively,
when the number of phoneme feature extraction submodules is N, inputting the phoneme vector matrix into the N phoneme feature extraction submodules for sequential feature extraction processing to obtain the phoneme feature matrix, where N is an integer greater than 1. The value of N is not particularly limited and may be configured by those skilled in the art according to actual needs; for example, N may be configured as 2.
In this embodiment, the phoneme feature extraction module may be composed of 1 phoneme feature extraction submodule or of N sequentially connected phoneme feature extraction submodules. When the module is composed of N submodules, each submodule performs the same feature extraction processing. The following takes the feature extraction process of 1 phoneme feature extraction submodule as an example.
Referring to fig. 6, the phoneme feature extraction submodule is composed of 3 sublayers: a first matrix conversion layer, a first multi-head attention layer and a first matrix fusion layer, wherein
the first matrix conversion layer is configured to perform matrix conversion processing on an input matrix, and convert the input matrix into a query matrix, a key matrix and a value matrix respectively;
the first multi-head attention layer is configured to perform attention enhancement processing on the input query matrix, the key matrix and the value matrix to obtain an attention enhancement matrix;
the first matrix fusion layer is configured to perform matrix fusion processing on the input matrix of the first matrix conversion layer and the output matrix of the first multi-head attention layer to obtain a fusion matrix.
Correspondingly, when the number of the phoneme feature extraction submodules is 1, the electronic device may extract the phoneme feature matrix as follows:
inputting the phoneme vector matrix into a first matrix conversion layer for matrix conversion processing to obtain a query matrix, a key matrix and a value matrix which are respectively marked as a first query matrix, a first key matrix and a first value matrix;
inputting the first query matrix, the first key matrix and the first value matrix into a first multi-head attention layer for attention enhancement processing to obtain an attention enhancement matrix, and recording the attention enhancement matrix as a first attention enhancement matrix;
and inputting the first attention enhancing matrix and the phoneme vector matrix into the first matrix fusion layer for matrix fusion processing to obtain a fusion matrix, and recording the fusion matrix as a phoneme feature matrix.
It can be understood that, when the number of phoneme feature extraction submodules is N, the phoneme vector matrix only needs to be input into the 1st of the N sequentially connected submodules; the N submodules then perform the feature extraction processing in sequence as described above, and the fusion matrix output by the Nth submodule is taken as the phoneme feature matrix.
In this embodiment, how the first matrix fusion layer performs matrix fusion is not specifically limited and can be configured by those skilled in the art according to actual needs.
For example, the first matrix fusion layer may include two sublayers, which are an addition layer and a layer normalization layer, respectively, and when matrix fusion is performed, the two input matrices are added by the addition layer to obtain a sum matrix, and then layer normalization processing is performed on the sum matrix by the layer normalization layer to obtain a fusion matrix.
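Under the structure just described, one phoneme feature extraction submodule can be sketched as follows in PyTorch. The model width and head count are assumptions; the matrix fusion layer follows the addition-plus-layer-normalization example above.

```python
import torch
import torch.nn as nn

class PhonemeFeatureExtractionSubmodule(nn.Module):
    """One phoneme feature extraction submodule: a matrix conversion
    layer producing query/key/value matrices, a multi-head attention
    layer, and a matrix fusion layer (addition + layer normalization).
    """

    def __init__(self, d_model: int = 256, num_heads: int = 4):
        super().__init__()
        # First matrix conversion layer: input -> query, key, value
        self.to_q = nn.Linear(d_model, d_model)
        self.to_k = nn.Linear(d_model, d_model)
        self.to_v = nn.Linear(d_model, d_model)
        # First multi-head attention layer
        self.attn = nn.MultiheadAttention(d_model, num_heads,
                                          batch_first=True)
        # First matrix fusion layer: addition + layer normalization
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sequence_length, d_model) phoneme vector matrix
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        attn_out, _ = self.attn(q, k, v)   # attention enhancement matrix
        return self.norm(x + attn_out)     # fusion (phoneme feature) matrix
```

When N submodules are used, they can simply be chained, e.g. `nn.Sequential(*[PhonemeFeatureExtractionSubmodule() for _ in range(2)])` for N = 2.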
In an optional embodiment, referring to fig. 7, the phoneme feature extraction network further includes a first position encoding module and a second matrix fusion layer, and before the phoneme vector matrix is input to the phoneme feature extraction module for feature extraction processing, the method further includes:
inputting the phoneme vector matrix into a first position coding module for position coding processing to obtain a first position coding matrix;
inputting the first position coding matrix and the phoneme vector matrix into a second matrix fusion layer for matrix fusion processing to obtain a phoneme position fusion matrix;
inputting the phoneme vector matrix into a phoneme feature extraction module for feature extraction processing to obtain a phoneme feature matrix, wherein the phoneme feature matrix comprises:
and inputting the phoneme position fusion matrix into a phoneme feature extraction module for feature extraction processing to obtain a phoneme feature matrix.
In order to further improve the accuracy of pronunciation skill detection, in this embodiment the original phoneme vector matrix is not input directly into the phoneme feature extraction module; instead, after position coding, the phoneme vector matrix carrying position information is input into the phoneme feature extraction module for feature extraction.
The electronic device inputs the phoneme vector matrix into the first position coding module for position coding processing to obtain a position coding matrix, recorded as the first position coding matrix. The first position coding matrix represents the position information of each matrix unit in the phoneme vector matrix, which may be relative or absolute position information.
After the first position coding matrix is obtained, the electronic device inputs the first position coding matrix and the phoneme vector matrix into the second matrix fusion layer for matrix fusion processing, and a fusion matrix is obtained and recorded as a phoneme position fusion matrix. And then, the electronic equipment further inputs the phoneme position fusion matrix carrying the position information into a phoneme feature extraction module for feature extraction processing to obtain a phoneme feature matrix. For how the phoneme feature extraction module performs the feature extraction processing, please refer to the related description of the above embodiments, which is not described herein again.
In addition, it should be noted that, in this embodiment, the matrix fusion manner of the second matrix fusion layer is not particularly limited, and may be configured by a person skilled in the art according to actual needs. For example, the second matrix fusion layer is configured to perform addition processing on two input matrices, and output a sum matrix obtained by the addition as a fusion matrix.
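As a sketch of this position coding and fusion step, the snippet below uses the standard sinusoidal encoding; the patent only requires that the first position coding matrix carry relative or absolute position information, so the choice of sinusoidal encoding, the sequence length and the model width are assumptions.

```python
import math
import torch

def sinusoidal_position_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """First position coding module, sketched as the standard
    sinusoidal (absolute) position encoding."""
    position = torch.arange(seq_len).unsqueeze(1)          # (T, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2)
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Second matrix fusion layer configured as simple addition:
phoneme_vectors = torch.randn(1, 12, 256)       # (batch, T, d_model)
pe = sinusoidal_position_encoding(12, 256)
phoneme_position_fusion = phoneme_vectors + pe  # phoneme position fusion matrix
```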
In an optional embodiment, the acoustic feature enhancement network includes a feature coding module and at least one acoustic feature enhancement module, and inputting the acoustic features into the acoustic feature enhancement network for feature enhancement processing to obtain the enhanced acoustic feature matrix includes:
inputting the acoustic features into a feature coding module for feature coding processing to obtain an acoustic feature matrix;
when the number of acoustic feature enhancement modules is 1, inputting the acoustic feature matrix into the acoustic feature enhancement module for feature enhancement processing to obtain the enhanced acoustic feature matrix; alternatively,
when the number of acoustic feature enhancement modules is M, inputting the acoustic feature matrix into the M acoustic feature enhancement modules for sequential feature enhancement processing to obtain the enhanced acoustic feature matrix, where M is an integer greater than 1. The value of M is not particularly limited and may be configured by those skilled in the art according to actual needs; for example, M may be configured as 4.
It should be noted that the Filterbank feature, the fundamental frequency feature and the energy feature obtained in the above embodiment are all presented in the form of feature maps, and correspondingly, the acoustic feature obtained by fusing the Filterbank feature, the fundamental frequency feature and the energy feature is also presented in the form of feature maps.
In order to effectively perform feature enhancement processing on acoustic features, in this embodiment, an acoustic feature enhancement network is composed of 1 feature coding module and at least 1 acoustic feature enhancement module, where the feature coding module is configured to perform coding processing on acoustic features, and compress feature dimensions to obtain a corresponding acoustic feature matrix; the acoustic feature enhancement module is configured to perform feature enhancement processing on the input acoustic feature matrix, and enhance features related to pronunciation skills therein to obtain an enhanced acoustic feature matrix.
Correspondingly, when the acoustic features are input into the acoustic feature enhancement network for feature enhancement processing, the electronic device firstly inputs the acoustic features into the feature coding module for feature coding processing to obtain an acoustic feature matrix.
Referring to fig. 8, the feature coding module is composed of 4 sub-layers, including a first convolutional layer, a first pooling layer, a second convolutional layer, and a second pooling layer, wherein,
the first convolution layer is configured to perform convolution processing on the input feature map to obtain a corresponding convolution result;
the first pooling layer is configured to pool convolution results output by the first convolution layer to obtain corresponding pooling results;
the second convolution layer is configured to perform convolution processing on the pooling result output by the first pooling layer to obtain a corresponding convolution result;
the second pooling layer is configured to pool convolution results output by the second convolution layer to obtain a feature matrix corresponding to the feature map.
Correspondingly, the electronic device can input the acoustic features into the feature coding module for feature coding processing in the following manner:
inputting the acoustic features into the first convolution layer for convolution processing to obtain a convolution result, and recording the convolution result as a first convolution result;
inputting the first convolution result into a first pooling layer for pooling treatment to obtain a pooling result, and recording the pooling result as a first pooling result;
inputting the first pooling result into a second convolution layer for convolution processing to obtain a convolution result, and recording the convolution result as a second convolution result;
and inputting the second convolution result into a second pooling layer for pooling to obtain an acoustic feature matrix.
It should be noted that, in this embodiment, no specific limitation is imposed on the convolution kernel size, the step size, and the padding size of the first convolution layer and the second convolution layer, and the pooling kernel size and the step size of the first pooling layer and the second pooling layer, and those skilled in the art can configure the convolution kernel size, the step size, and the padding size according to actual needs.
For example, in this embodiment, the convolution kernel size of the first convolution layer is [3,3] with step size [1,1] and padding size [1,1]; the convolution kernel size of the second convolution layer is [3,3] with step size [1,1] and padding size [1,1]; the pooling type of the first pooling layer is max pooling with pooling kernel size [2,2] and step size [1,1]; and the pooling type of the second pooling layer is max pooling with pooling kernel size [2,2] and step size [1,1].
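Putting the four sublayers together with these example hyper-parameters, the feature coding module can be sketched as follows; the channel counts are assumptions.

```python
import torch
import torch.nn as nn

class FeatureCodingModule(nn.Module):
    """Feature coding module: conv -> max-pool -> conv -> max-pool,
    using the example hyper-parameters above (3x3 kernels, stride 1,
    padding 1; 2x2 max pooling with stride 1)."""

    def __init__(self, in_channels: int = 1, channels: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, channels, kernel_size=3,
                      stride=1, padding=1),       # first convolution layer
            nn.MaxPool2d(kernel_size=2, stride=1),  # first pooling layer
            nn.Conv2d(channels, channels, kernel_size=3,
                      stride=1, padding=1),       # second convolution layer
            nn.MaxPool2d(kernel_size=2, stride=1),  # second pooling layer
        )

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        # feature_map: (batch, 1, feature_dim, frames), the acoustic
        # feature map treated as a single-channel image
        return self.encoder(feature_map)
```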
Further, when the number of the acoustic feature enhancing modules is 1, the electronic device directly inputs the acoustic feature matrix into the acoustic feature enhancing module to perform feature enhancement processing, so as to obtain an enhanced acoustic feature matrix.
The following description will take the feature enhancement processing procedure of 1 acoustic feature enhancement module as an example.
Referring to fig. 9, the acoustic feature enhancement module is composed of 6 sublayers: a second matrix conversion layer, a second multi-head attention layer, a third matrix fusion layer, a third convolution layer, a deconvolution layer and a fourth matrix fusion layer, wherein
the second matrix conversion layer is configured to perform matrix conversion processing on the input matrix, and convert the input matrix into a query matrix, a key matrix and a value matrix respectively;
the second multi-head attention layer is configured to perform attention enhancement processing on the input query matrix, the key matrix and the value matrix to obtain an attention enhancement matrix;
the third matrix fusion layer is configured to perform matrix fusion processing on the input matrix of the second matrix conversion layer and the output matrix of the second multi-head attention layer to obtain a fusion matrix;
the third convolution layer is configured to perform convolution processing on the input fusion matrix to obtain a convolution result;
the deconvolution layer is configured to perform deconvolution processing on the input convolution result to obtain a matrix-form deconvolution result;
the fourth matrix fusion layer is configured to perform matrix fusion processing on the fusion matrix output by the third matrix fusion layer and the deconvolution result output by the deconvolution layer to obtain a fusion matrix.
Accordingly, when the number of acoustic feature enhancement modules is 1, the electronic device may obtain the enhanced acoustic feature matrix as follows:
inputting the acoustic feature matrix into a second matrix conversion layer for matrix conversion processing to obtain a query matrix, a key matrix and a value matrix, and recording the query matrix, the key matrix and the value matrix as a second query matrix, a second key matrix and a second value matrix respectively;
inputting the second query matrix, the second key matrix and the second value matrix into a second multi-head attention layer for attention enhancement processing to obtain an attention enhancement matrix, and recording the attention enhancement matrix as a second attention enhancement matrix;
inputting the second attention enhancement matrix and the acoustic feature matrix into a third matrix fusion layer for matrix fusion processing to obtain a fusion matrix, and recording the fusion matrix as an acoustic fusion matrix;
inputting the acoustic fusion matrix into a third convolution layer for convolution processing to obtain a third convolution result;
inputting the third convolution result into the deconvolution layer for deconvolution processing to obtain a matrix-form deconvolution result;
and inputting the acoustic fusion matrix and the deconvolution result into a fourth matrix fusion layer for matrix fusion processing to obtain a fusion matrix, and recording as an enhanced acoustic feature matrix.
How to perform matrix fusion on the third matrix fusion layer and the fourth matrix fusion layer is not specifically limited in this embodiment, and can be configured by those skilled in the art according to actual needs.
For example, the third matrix fusion layer and the fourth matrix fusion layer have the same structure and each include two sublayers, namely an addition layer and a layer normalization layer, and when matrix fusion is performed, the addition layer is used for adding the two input matrices to obtain a sum matrix, and then the layer normalization layer is used for performing layer normalization on the sum matrix to obtain a fusion matrix.
It should be noted that, when there are M acoustic feature enhancement modules, each module performs the same feature enhancement processing; the acoustic feature matrix only needs to be input into the 1st of the M sequentially connected modules, the M modules perform the feature enhancement processing in sequence as described above, and the fusion matrix output by the Mth module is taken as the enhanced acoustic feature matrix.
In addition, the present embodiment does not specifically limit the convolution kernel size, stride, and padding size of the third convolution layer and the deconvolution layer; these can be chosen by those skilled in the art according to actual needs.
It can be understood that by introducing convolution and deconvolution processing into the acoustic feature enhancement process, this embodiment can more effectively enhance acoustic features that take the form of feature maps, and ultimately extract features more relevant to pronunciation skills.
In an optional embodiment, the acoustic feature enhancement network further includes a second position encoding module and a fifth matrix fusion layer, and before the acoustic feature matrix is input to the acoustic feature enhancement module for feature enhancement processing, the method further includes:
inputting the acoustic feature matrix into a second position coding module for position coding processing to obtain a second position coding matrix;
inputting the second position coding matrix and the acoustic feature matrix into a fifth matrix fusion layer for matrix fusion processing to obtain an acoustic position fusion matrix;
and the step of inputting the acoustic feature matrix into the acoustic feature enhancement module for feature enhancement processing to obtain an enhanced acoustic feature matrix then includes:
and inputting the acoustic position fusion matrix into an acoustic feature enhancement module for feature enhancement processing to obtain an enhanced acoustic feature matrix.
In order to further improve the accuracy of pronunciation skill detection, in this embodiment the original acoustic feature matrix is not input directly to the acoustic feature enhancement module for feature enhancement; instead, the acoustic feature matrix is first position-coded, and the acoustic feature matrix carrying position information is then input to the acoustic feature enhancement module for feature enhancement.
The electronic device first inputs the acoustic feature matrix into the second position coding module for position coding processing to obtain a position coding matrix, recorded as the second position coding matrix. The second position coding matrix represents the position information of each matrix unit in the acoustic feature matrix, which may be relative or absolute position information.
After the second position coding matrix is obtained, the electronic device inputs the second position coding matrix and the acoustic feature matrix into a fifth matrix fusion layer for matrix fusion processing, and a fusion matrix is obtained and recorded as an acoustic position fusion matrix. And then, the electronic equipment further inputs the acoustic position fusion matrix carrying the position information into an acoustic feature enhancement module for feature enhancement processing to obtain an enhanced acoustic feature matrix. For how the acoustic feature enhancement module performs the feature enhancement processing, please refer to the related description of the above embodiments, which is not described herein again.
In addition, it should be further noted that, in the present embodiment, the matrix fusion manner of the fifth matrix fusion layer is not particularly limited, and may be configured by a person skilled in the art according to actual needs. For example, the fifth matrix fusion layer is configured to perform addition processing on two input matrices, and output a sum matrix obtained by the addition as a fusion matrix.
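As a concrete illustration of the position coding described above, the sketch below builds a sinusoidal absolute position coding matrix and fuses it with the acoustic feature matrix by simple addition. The sinusoidal scheme and an even d_model are assumptions, since this embodiment permits either relative or absolute position information and does not fix the fusion layer's form.

```python
import torch

def sinusoidal_position_encoding(seq_len, d_model):
    """Second position coding matrix (absolute variant); d_model assumed even."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)
    angle = pos / torch.pow(10000.0, i / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

acoustic_features = torch.randn(50, 256)           # stand-in acoustic feature matrix
pe = sinusoidal_position_encoding(50, 256)         # second position coding matrix
acoustic_position_fusion = acoustic_features + pe  # fifth matrix fusion layer (addition)
```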
In an alternative embodiment, referring to fig. 10, the feature fusion network includes a third matrix conversion layer, a fourth matrix conversion layer, a third multi-head attention layer, a sixth matrix fusion layer, a feedforward network layer, and a seventh matrix fusion layer, wherein,
the third matrix conversion layer is configured to perform matrix conversion processing on the input matrix to obtain a key matrix and a value matrix;
the fourth matrix conversion layer is configured to perform matrix conversion processing on the input matrix to obtain a query matrix;
the third multi-head attention layer is configured to perform attention enhancement processing on the key matrix and the value matrix output by the third matrix conversion layer and the query matrix output by the fourth matrix conversion layer to obtain an attention enhancement matrix;
the sixth matrix fusion layer is configured to perform matrix fusion processing on the attention enhancement matrix output by the third multi-head attention layer and the query matrix output by the fourth matrix conversion layer to obtain a fusion matrix;
the feedforward network layer is configured to perform feedforward calculation processing on the fusion matrix output by the sixth matrix fusion layer to obtain a feedforward matrix;
and the seventh matrix fusion layer is configured to perform matrix fusion processing on the feedforward matrix output by the feedforward network layer and the fusion matrix output by the sixth matrix fusion layer to obtain a fusion characteristic matrix.
Correspondingly, the electronic device may input the enhanced acoustic feature matrix and the phoneme feature matrix into the feature fusion network for feature fusion processing as follows:
inputting the enhanced acoustic feature matrix into a third matrix conversion layer for matrix conversion processing to obtain a key matrix and a value matrix which are respectively marked as a third key matrix and a third value matrix;
inputting the phoneme feature matrix into a fourth matrix conversion layer for matrix conversion processing to obtain a query matrix, and recording the query matrix as a third query matrix;
inputting the third query matrix, the third key matrix and the third value matrix into a third multi-head attention layer for attention enhancement processing to obtain an attention enhancement matrix, and recording the attention enhancement matrix as a third attention enhancement matrix;
inputting the third attention enhancing matrix and the third query matrix into a sixth matrix fusion layer for matrix fusion processing to obtain a fusion matrix, and recording the fusion matrix as an acoustic phoneme fusion matrix;
inputting the acoustic phoneme fusion matrix into a feedforward network layer to perform feedforward calculation processing to obtain a feedforward matrix;
and inputting the feedforward matrix and the acoustic phoneme fusion matrix into a seventh matrix fusion layer for matrix fusion processing to obtain a fusion matrix, and recording the fusion matrix as a fusion characteristic matrix.
The manner in which the sixth matrix fusion layer and the seventh matrix fusion layer perform matrix fusion is not specifically limited in this embodiment and can be configured by those skilled in the art according to actual needs.
For example, the sixth matrix fusion layer and the seventh matrix fusion layer share the same structure, each comprising two sublayers: an addition layer and a layer normalization layer. During matrix fusion, the addition layer first adds the two input matrices to obtain a sum matrix, and the layer normalization layer then applies layer normalization to the sum matrix to obtain the fusion matrix.
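The following PyTorch-style sketch mirrors this cross-attention fusion: the query is derived from the phoneme feature matrix, the key and value from the enhanced acoustic feature matrix, followed by the two Add&Norm fusions and the feedforward layer. The dimensions and the hidden size of the feedforward layer are assumptions.

```python
import torch
import torch.nn as nn

class FeatureFusionNetwork(nn.Module):
    """Sketch of the feature fusion network; all sizes are assumptions."""

    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.to_kv = nn.Linear(d_model, 2 * d_model)   # third matrix conversion layer
        self.to_q = nn.Linear(d_model, d_model)        # fourth matrix conversion layer
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)             # sixth matrix fusion layer
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))  # feedforward network layer
        self.norm2 = nn.LayerNorm(d_model)             # seventh matrix fusion layer

    def forward(self, enhanced_acoustic, phoneme_features):
        k, v = self.to_kv(enhanced_acoustic).chunk(2, dim=-1)  # third key/value matrices
        q = self.to_q(phoneme_features)                # third query matrix
        attn_out, _ = self.attn(q, k, v)               # third attention enhancement matrix
        fused = self.norm1(q + attn_out)               # acoustic phoneme fusion matrix
        return self.norm2(fused + self.ffn(fused))     # fusion feature matrix
```

Because the query comes from the phoneme side, the fusion feature matrix follows the phoneme sequence length.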
In an alternative embodiment, referring to fig. 11, the first pronunciation skill detection network includes a first fully-connected layer and a first classification function layer, and the pronunciation skill detection processing is performed by inputting the phoneme feature matrix into the first pronunciation skill detection network to obtain the first detection result, including:
inputting the phoneme feature matrix into a first full-connection layer to perform full-connection processing to obtain a first full-connection result;
and inputting the first full-connection result into a first classification function layer for classification processing to obtain a first detection result.
It should be noted that, since the present embodiment detects multiple types of pronunciation skills, any multi-class classification function may be used in the first classification function layer.
Taking the Softmax function as an example, the dimension of its output vector matches the number of pronunciation skills to be detected. For English, for example, if the pronunciation skills to be detected include linking (continuous reading), loss of plosion, and voicing, the output vector of the Softmax function contains four elements: one element indicates whether the text to be detected needs to be spoken with the pronunciation skill "linking", one indicates "loss of plosion", one indicates "voicing", and one indicates that the text to be detected does not need to be spoken with any pronunciation skill.
Correspondingly, the first full-connection result is input into the Softmax function to obtain its 4-dimensional output vector, which is used as the first detection result. From the first detection result it can be determined whether the text to be detected needs to be spoken with a pronunciation skill and, if so, with which pronunciation skill.
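A minimal sketch of this first detection head follows; the mean pooling over phoneme positions and the layer sizes are assumptions introduced so that the head produces one 4-dimensional probability vector per utterance.

```python
import torch
import torch.nn as nn

class FirstDetectionHead(nn.Module):
    """First fully-connected layer + Softmax; sizes and pooling are assumptions."""

    def __init__(self, d_model=256, num_classes=4):
        # num_classes = 3 skills (linking, loss of plosion, voicing) + "no skill"
        super().__init__()
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, phoneme_features):               # (batch, time, d_model)
        pooled = phoneme_features.mean(dim=1)          # assumed pooling over positions
        return torch.softmax(self.fc(pooled), dim=-1)  # first detection result
```

For example, FirstDetectionHead()(torch.randn(2, 30, 256)) returns a (2, 4) probability matrix whose argmax picks the detected class for each utterance.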
In an optional embodiment, the second pronunciation skill detection network includes L branch detection networks, each branch detection network corresponds to a different pronunciation skill, L is an integer greater than 1, and the fused feature matrix is input to the second pronunciation skill detection network for pronunciation skill detection processing to obtain the second detection result, including:
inputting the fusion characteristic matrix into each branch detection network to carry out pronunciation skill detection to obtain a branch pronunciation skill detection result of each branch detection network, wherein the branch pronunciation skill detection result of each branch detection network represents whether a speaker speaks a text to be detected by adopting pronunciation skill corresponding to each branch detection network;
and obtaining a second detection result according to the branch pronunciation skill detection result of each branch detection network.
Each branch detection network corresponds to one pronunciation skill and is configured to detect whether the speaker spoke the text to be detected using the corresponding pronunciation skill; accordingly, the L branch detection networks yield L branch pronunciation skill detection results.
For example, for English, when the pronunciation skills to be detected include linking, loss of plosion, and voicing, L takes the value 3, that is, the second pronunciation skill detection network includes 3 branch detection networks: 1 branch detection network corresponds to the pronunciation skill "linking", 1 to "loss of plosion", and 1 to "voicing". Correspondingly, the 3 branch detection networks each output 1 branch pronunciation skill detection result, and the 3 branch pronunciation skill detection results are combined into the second detection result. The second detection result then represents whether the speaker spoke the text to be detected using pronunciation skills and, if so, which pronunciation skills were used.
It should be noted that each branch detection network has the same structure, so one branch detection network is taken as an example below. Referring to fig. 12, the branch detection network includes a second full connection layer and a second classification function layer, and inputting the fused feature matrix into each branch detection network to perform pronunciation skill detection, so as to obtain the branch pronunciation skill detection result of each branch detection network, includes:
inputting the fusion characteristic matrix into a second full-connection layer to perform full-connection processing to obtain a second full-connection result;
and inputting the second full-connection result into a second classification function layer for classification processing to obtain a branch pronunciation skill detection result.
It should be noted that the second classification function layer may employ any classification function.
Taking the sigmoid function as an example, its output value lies in [0,1], and after model training the output of the sigmoid function can represent the probability that the speaker spoke the text to be detected using the pronunciation skill corresponding to the branch detection network in which the sigmoid function is located. For example, when the output value of the sigmoid function reaches a preset threshold (an empirical value that can be set by those skilled in the art according to actual needs), it can be determined that the speaker spoke the text to be detected using the pronunciation skill corresponding to that branch detection network.
Correspondingly, the second full-connection result is input into the sigmoid function to obtain an output value of the sigmoid function, and the output value is used as a branch pronunciation skill detection result. According to the branch pronunciation skill detection result, whether the speaker adopts the pronunciation skill corresponding to the branch detection network to speak the text to be detected can be judged.
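The branch heads can be sketched the same way: one fully connected layer plus sigmoid per pronunciation skill, with the L outputs stacked into the second detection result. The pooling, sizes, and the 0.5 threshold below are assumptions.

```python
import torch
import torch.nn as nn

class BranchDetectionHead(nn.Module):
    """Second fully-connected layer + sigmoid for one pronunciation skill."""

    def __init__(self, d_model=256):
        super().__init__()
        self.fc = nn.Linear(d_model, 1)

    def forward(self, fused_features):                  # (batch, time, d_model)
        pooled = fused_features.mean(dim=1)             # assumed pooling over positions
        return torch.sigmoid(self.fc(pooled)).squeeze(-1)

# L = 3 branches for English (linking, loss of plosion, voicing):
heads = nn.ModuleList(BranchDetectionHead() for _ in range(3))
fused = torch.randn(2, 50, 256)                         # stand-in fusion feature matrix
second_result = torch.stack([h(fused) for h in heads], dim=-1)  # (batch, 3)
used_skill = second_result > 0.5                        # assumed empirical threshold
```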
In an optional embodiment, before obtaining the text to be detected and converting the text to be detected into the corresponding phoneme sequence, the method further includes:
acquiring multiple types of first sample texts which are known to need to be spoken by different pronunciation skills, and converting each type of first sample text into a corresponding positive sample phoneme sequence;
acquiring first sample audio of each type of first sample text spoken by a sample user by adopting different pronunciation skills, and extracting positive sample acoustic characteristics of the first sample audio of each type of first sample text;
obtaining a second sample text which is known not to be spoken by adopting pronunciation skills, and converting the second sample text into a corresponding negative sample phoneme sequence;
acquiring a second sample audio of a second sample text spoken by a sample user, and extracting negative sample acoustic features of the second sample audio;
and performing model training according to each type of positive sample phoneme sequence, each type of positive sample acoustic feature, the negative sample phoneme sequence and the negative sample acoustic feature to obtain a pronunciation skill detection model.
In this embodiment, the acoustic feature samples and phoneme sequence samples are not artificially constructed; instead, a data-driven approach is used so that the model learns different pronunciation skills from a large amount of data. The following description takes a specific language as an example.
For that language, the electronic device separately acquires multiple types of first sample texts known to require speaking with different pronunciation skills. The number of first sample texts to be obtained for each pronunciation skill is not specifically limited and can be configured by those skilled in the art according to actual needs.
For the first sample texts obtained for each type of pronunciation skill (hereinafter, each type of first sample text), the electronic device converts each type of first sample text into a corresponding phoneme sequence, recorded as a positive sample phoneme sequence. For how to convert the first sample text into a phoneme sequence, refer to the manner of converting the text to be detected into a phoneme sequence in the above embodiments, which is not repeated here.
The electronic device also obtains audio of each type of first sample text spoken by a sample user with the corresponding pronunciation skill, recorded as the first sample audio, and extracts the acoustic features of the first sample audio of each type of first sample text, recorded as positive sample acoustic features. The sample user may be a real person with pronunciation skills or a virtual person endowed with pronunciation skills. For how to obtain the first sample audio of each type of first sample text and how to extract the positive sample acoustic features, refer to the manner of obtaining the audio to be detected and extracting its acoustic features in the above embodiments, which is not repeated here.
In addition, the electronic device obtains a text known not to require speaking with pronunciation skills, records it as the second sample text, and converts the second sample text into a corresponding phoneme sequence, recorded as the negative sample phoneme sequence. For how to convert the second sample text into a phoneme sequence, refer to the manner of converting the text to be detected into a phoneme sequence in the above embodiments, which is not repeated here.
The electronic device also obtains the audio of the second sample text spoken by the sample user, records it as the second sample audio, and extracts the acoustic features of the second sample audio, recorded as negative sample acoustic features. For how to obtain the second sample audio of the second sample text and how to extract the negative sample acoustic features, refer to the manner of obtaining the audio to be detected and extracting its acoustic features in the above embodiments, which is not repeated here.
It should be noted that the present embodiment does not specifically limit the number of sample users, which may be configured by those skilled in the art according to actual needs. For example, in the present embodiment, the positive sample phoneme sequences, positive sample acoustic features, negative sample phoneme sequence, and negative sample acoustic features are obtained from 500 sample users.
After the positive sample phoneme sequences, positive sample acoustic features, negative sample phoneme sequence, and negative sample acoustic features are obtained, the electronic device performs model training on each type of positive sample phoneme sequence, each type of positive sample acoustic feature, the negative sample phoneme sequence, and the negative sample acoustic features until a preset stopping condition is met, thereby obtaining the pronunciation skill detection model. The preset stopping condition may be that the number of training iterations reaches a preset number, or that the model converges.
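As an illustration of how such training could be wired up, the sketch below pairs the two heads with standard losses: multi-class cross-entropy for the first detection result and binary cross-entropy for the branch outputs. The embodiment does not prescribe loss functions, so this pairing, like the toy tensors, is an assumption.

```python
import torch
import torch.nn.functional as F

# Toy stand-in tensors for one batch of 8 samples (all values hypothetical).
first_logits = torch.randn(8, 4)                     # first head, pre-Softmax scores
first_labels = torch.randint(0, 4, (8,))             # class 3 = "no skill" (negative samples)
branch_probs = torch.rand(8, 3)                      # sigmoid outputs of the 3 branch heads
branch_labels = torch.randint(0, 2, (8, 3)).float()  # per-skill ground truth

# Assumed joint objective over positive and negative samples: cross-entropy
# for the first detection result plus binary cross-entropy per branch.
loss = F.cross_entropy(first_logits, first_labels) \
     + F.binary_cross_entropy(branch_probs, branch_labels)
print(float(loss))
```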
Referring to fig. 13, in order to better implement the pronunciation skill detection method provided by the present application, the present application further provides a pronunciation skill detection apparatus 400, as shown in fig. 13, the pronunciation skill detection apparatus 400 includes:
a first obtaining module 410, configured to obtain a text to be detected, and convert the text to be detected into a corresponding phoneme sequence;
the second obtaining module 420 is configured to obtain a to-be-detected audio obtained by a speaker speaking a to-be-detected text, and extract an acoustic feature of the to-be-detected audio;
the detection module 430 is configured to input the phoneme sequence and the acoustic features into the trained pronunciation skill detection model to perform pronunciation skill detection processing, so as to obtain a first detection result and a second detection result;
the first detection result is used for representing whether the text to be detected needs to be spoken by adopting pronunciation skills, and the second detection result is used for representing whether the speaker speaks the text to be detected by adopting pronunciation skills.
In an alternative embodiment, the pronunciation skill detection model includes a phoneme feature extraction network, an acoustic feature enhancement network, a feature fusion network, a first pronunciation skill detection network, and a second pronunciation skill detection network, and the detection module 430 is configured to:
inputting the phoneme sequence into a phoneme feature extraction network for feature extraction processing to obtain a phoneme feature matrix;
inputting the phoneme feature matrix into a first pronunciation skill detection network to perform pronunciation skill detection processing to obtain a first detection result;
if the first detection result represents that the text to be detected needs to be spoken by adopting pronunciation skills, inputting the acoustic features into an acoustic feature enhancement network for feature enhancement processing to obtain an enhanced acoustic feature matrix;
inputting the enhanced acoustic feature matrix and the phoneme feature matrix into a feature fusion network for feature fusion processing to obtain a fusion feature matrix;
and inputting the fusion feature matrix into a second pronunciation skill detection network to perform pronunciation skill detection processing to obtain a second detection result.
In an alternative embodiment, the phoneme feature extraction network includes a phoneme embedding module and a phoneme feature extraction module, and the detection module 430 is configured to:
inputting the phoneme sequence into a phoneme embedding module for embedding to obtain a phoneme vector matrix;
and inputting the phoneme vector matrix into a phoneme feature extraction module for feature extraction processing to obtain a phoneme feature matrix.
In an alternative embodiment, the phone feature extraction module includes at least 1 phone feature extraction sub-module, and the detection module 430 is configured to:
when the number of the phoneme feature extraction submodules is 1, inputting the phoneme vector matrix into the phoneme feature extraction submodule for feature extraction processing to obtain a phoneme feature matrix; or,
and when the number of the phoneme feature extraction submodules is N, inputting the phoneme vector matrix into the N phoneme feature extraction submodules to sequentially perform feature extraction processing to obtain a phoneme feature matrix, wherein N is an integer greater than 1.
In an alternative embodiment, the phoneme feature extraction submodule includes a first matrix conversion layer, a first multi-head attention layer and a first matrix fusion layer, and the detection module 430 is configured to:
inputting the phoneme vector matrix into a first matrix conversion layer for matrix conversion processing to obtain a first query matrix, a first key matrix and a first value matrix;
inputting the first query matrix, the first key matrix and the first value matrix into a first multi-head attention layer for attention enhancement processing to obtain a first attention enhancement matrix;
and inputting the first attention enhancing matrix and the phoneme vector matrix into the first matrix fusion layer for matrix fusion processing to obtain a phoneme feature matrix.
In an optional embodiment, the phoneme feature extraction network further includes a first position encoding module and a second matrix fusion layer, and before the phoneme vector matrix is input to the phoneme feature extraction module for feature extraction processing, so as to obtain the phoneme feature matrix, the detection module 430 is further configured to:
inputting the phoneme vector matrix into a first position coding module for position coding processing to obtain a first position coding matrix;
inputting the first position coding matrix and the phoneme vector matrix into a second matrix fusion layer for matrix fusion processing to obtain a phoneme position fusion matrix;
when the phoneme vector matrix is input into the phoneme feature extraction module for feature extraction to obtain a phoneme feature matrix, the detection module 430 is configured to input the phoneme position fusion matrix into the phoneme feature extraction module for feature extraction to obtain the phoneme feature matrix.
In an optional embodiment, the acoustic feature enhancement network includes a feature coding module and at least 1 acoustic feature enhancement module, and the detection module 430 is configured to:
inputting the acoustic features into a feature coding module for feature coding processing to obtain an acoustic feature matrix;
when the number of the acoustic feature enhancement modules is 1, inputting the acoustic feature matrix into the acoustic feature enhancement module for feature enhancement processing to obtain an enhanced acoustic feature matrix; or,
and when the number of the acoustic feature enhancement modules is M, inputting the acoustic feature matrix into the M acoustic feature enhancement modules to sequentially perform feature enhancement processing to obtain an enhanced acoustic feature matrix, wherein M is an integer greater than 1.
In an alternative embodiment, the feature encoding module includes a first convolution layer, a first pooling layer, a second convolution layer, and a second pooling layer, and the detection module 430 is configured to:
inputting the acoustic features into the first convolution layer to carry out convolution processing to obtain a first convolution result;
inputting the first convolution result into a first pooling layer for pooling to obtain a first pooling result;
inputting the first pooling result into a second convolution layer for convolution processing to obtain a second convolution result;
and inputting the second convolution result into a second pooling layer for pooling to obtain an acoustic feature matrix.
In an alternative embodiment, the acoustic feature enhancement module includes a second matrix conversion layer, a second multi-head attention layer, a third matrix fusion layer, a third convolution layer, a deconvolution layer, and a fourth matrix fusion layer, and the detection module 430 is configured to:
inputting the acoustic feature matrix into a second matrix conversion layer for matrix conversion processing to obtain a second query matrix, a second key matrix and a second value matrix;
inputting the second query matrix, the second key matrix and the second value matrix into a second multi-head attention layer for attention enhancement processing to obtain a second attention enhancement matrix;
inputting the second attention enhancement matrix and the acoustic feature matrix into a third matrix fusion layer for matrix fusion processing to obtain an acoustic fusion matrix;
inputting the acoustic fusion matrix into a third convolution layer for convolution processing to obtain a third convolution result;
inputting the third convolution result into the deconvolution layer for deconvolution processing to obtain a deconvolution result;
and inputting the acoustic fusion matrix and the deconvolution result into a fourth matrix fusion layer for matrix fusion processing to obtain an enhanced acoustic feature matrix.
In an optional embodiment, the acoustic feature enhancement network further includes a second location coding module and a fifth matrix fusion layer, and before the acoustic feature matrix is input to the acoustic feature enhancement module for feature enhancement processing, so as to obtain an enhanced acoustic feature matrix, the detection module 430 is further configured to:
inputting the acoustic feature matrix into a second position coding module for position coding processing to obtain a second position coding matrix;
inputting the second position coding matrix and the acoustic feature matrix into a fifth matrix fusion layer for matrix fusion processing to obtain an acoustic position fusion matrix;
when the acoustic feature matrix is input to the acoustic feature enhancement module for feature enhancement processing, so as to obtain an enhanced acoustic feature matrix, the detection module 430 is configured to:
and inputting the acoustic position fusion matrix into an acoustic feature enhancement module for feature enhancement processing to obtain an enhanced acoustic feature matrix.
In an alternative embodiment, the feature fusion network includes a third matrix conversion layer, a fourth matrix conversion layer, a third multi-head attention layer, a sixth matrix fusion layer, a feedforward network layer, and a seventh matrix fusion layer, and the detection module 430 is configured to:
inputting the enhanced acoustic feature matrix into a third matrix conversion layer for matrix conversion processing to obtain a third key matrix and a third value matrix;
inputting the phoneme feature matrix into a fourth matrix conversion layer for matrix conversion processing to obtain a third query matrix;
inputting the third query matrix, the third key matrix and the third value matrix into a third multi-head attention layer for attention enhancement processing to obtain a third attention enhancement matrix;
inputting the third attention enhancing matrix and the third query matrix into a sixth matrix fusion layer for matrix fusion processing to obtain an acoustic phoneme fusion matrix;
inputting the acoustic phoneme fusion matrix into a feedforward network layer to perform feedforward calculation processing to obtain a feedforward matrix;
and inputting the feedforward matrix and the acoustic phoneme fusion matrix into a seventh matrix fusion layer for matrix fusion processing to obtain a fusion characteristic matrix.
In an alternative embodiment, the first pronunciation skill detection network comprises a first fully connected layer and a first classification function layer, and the detection module 430 is configured to:
inputting the phoneme feature matrix into a first full-connection layer to perform full-connection processing to obtain a first full-connection result;
and inputting the first full-connection result into a first classification function layer for classification processing to obtain a first detection result.
In an alternative embodiment, the second pronunciation skill detection network includes L branch detection networks, each branch detection network corresponding to a different pronunciation skill, L being an integer greater than 1, the detection module 430 is configured to:
inputting the fusion characteristic matrix into each branch detection network to carry out pronunciation skill detection to obtain a branch pronunciation skill detection result of each branch detection network, wherein the branch pronunciation skill detection result of each branch detection network represents whether a speaker speaks a text to be detected by adopting pronunciation skill corresponding to each branch detection network;
and obtaining a second detection result according to the branch pronunciation skill detection result of each branch detection network.
In an alternative embodiment, the branch detection network includes a second fully-connected layer and a second classification function layer, and the detection module 430 is configured to:
inputting the fusion characteristic matrix into a second full-connection layer to perform full-connection processing to obtain a second full-connection result;
and inputting the second full-connection result into a second classification function layer for classification processing to obtain a branch pronunciation skill detection result.
In an optional embodiment, the pronunciation skill detection apparatus provided by the present application further comprises a training module for:
acquiring multiple types of first sample texts which are known to need to be spoken by different pronunciation skills, and converting each type of first sample text into a corresponding positive sample phoneme sequence;
acquiring first sample audio of each type of first sample text spoken by a sample user by adopting different pronunciation skills, and extracting positive sample acoustic characteristics of the first sample audio of each type of first sample text;
obtaining a second sample text which is known not to be spoken by adopting pronunciation skills, and converting the second sample text into a corresponding negative sample phoneme sequence;
acquiring a second sample audio of a second sample text spoken by a sample user, and extracting negative sample acoustic features of the second sample audio;
and performing model training according to each type of positive sample phoneme sequence, each type of positive sample acoustic feature, the negative sample phoneme sequence and the negative sample acoustic feature to obtain a pronunciation skill detection model.
In an alternative embodiment, the second obtaining module 420 is configured to:
extracting Filterbank characteristics, fundamental frequency characteristics and energy characteristics of the audio to be detected;
and fusing the Filterbank characteristics, the fundamental frequency characteristics and the energy characteristics to obtain the acoustic characteristics.
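For illustration, the sketch below extracts and fuses the three feature types with librosa; the library choice, frame settings, log compression, and concatenation-style fusion are all assumptions rather than the patent's prescribed procedure.

```python
import numpy as np
import librosa

def extract_acoustic_features(path, sr=16000, hop=160, n_mels=80):
    """Extract Filterbank, fundamental frequency, and energy features and
    fuse them frame-wise by concatenation (an assumed fusion scheme)."""
    y, _ = librosa.load(path, sr=sr)
    fbank = librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop,
                                           n_mels=n_mels)         # (n_mels, T)
    fbank = np.log(fbank + 1e-6)                                  # log-Filterbank
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz('C2'),
                            fmax=librosa.note_to_hz('C7'),
                            sr=sr, hop_length=hop)                # fundamental frequency
    f0 = np.nan_to_num(f0)                                        # unvoiced frames -> 0
    energy = librosa.feature.rms(y=y, hop_length=hop)[0]          # energy feature
    T = min(fbank.shape[1], len(f0), len(energy))                 # align frame counts
    return np.concatenate([fbank[:, :T], f0[None, :T],
                           energy[None, :T]], axis=0).T           # (T, n_mels + 2)
```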
In an alternative embodiment, the first obtaining module 410 is configured to:
removing unvoiced text units in the text to be detected to obtain a new text to be detected;
and converting each text unit in the new text to be detected into a corresponding phoneme unit to obtain a phoneme sequence.
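A sketch of this conversion, assuming English text and the third-party g2p_en package for the grapheme-to-phoneme step (both assumptions; the patent does not name a conversion tool):

```python
import re
from g2p_en import G2p  # assumed grapheme-to-phoneme package

def text_to_phoneme_sequence(text):
    """Drop unvoiced text units (here, punctuation and other non-word
    characters), then map each remaining text unit to phoneme units."""
    cleaned = re.sub(r"[^A-Za-z' ]+", " ", text)   # remove unvoiced text units
    g2p = G2p()
    return [p for p in g2p(cleaned) if p.strip()]  # phoneme sequence

# e.g. text_to_phoneme_sequence("Get up!") -> ['G', 'EH1', 'T', 'AH1', 'P']
```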
It should be noted that the pronunciation skill detection apparatus 400 provided in the embodiment of the present application and the pronunciation skill detection method in the above embodiments belong to the same concept, and the specific implementation process thereof is described in the above related embodiments, and is not described herein again.
The embodiment of the present application further provides an electronic device, which includes a memory and a processor, wherein the processor is configured to execute the steps in the pronunciation skill detection method provided in this embodiment by calling a computer program stored in the memory.
Referring to fig. 14, fig. 14 is a schematic structural diagram of an electronic device 100 according to an embodiment of the present disclosure.
The electronic device 100 may include components such as a network interface 110, a memory 120, a processor 130, and a screen assembly. Those skilled in the art will appreciate that the structure shown in FIG. 14 does not limit the electronic device 100, which may include more or fewer components than shown, combine certain components, or adopt a different arrangement of components.
The network interface 110 may be used to make network connections between devices.
The memory 120 may be used to store computer programs and data. The memory 120 stores computer programs having executable code embodied therein. The computer program may be divided into various functional modules. The processor 130 executes various functional applications and data processing by executing computer programs stored in the memory 120.
The processor 130 is a control center of the electronic apparatus 100, connects various parts of the entire electronic apparatus 100 using various interfaces and lines, and performs various functions of the electronic apparatus 100 and processes data by running or executing computer programs stored in the memory 120 and calling data stored in the memory 120, thereby performing overall control of the electronic apparatus 100.
In the embodiment of the present application, the processor 130 in the electronic device 100 loads the executable code corresponding to one or more computer programs into the memory 120 and executes the steps in the pronunciation skill detection method provided by the present application, such as:
acquiring a text to be detected, and converting the text to be detected into a corresponding phoneme sequence;
acquiring a to-be-detected audio obtained by a speaker speaking a to-be-detected text, and extracting acoustic characteristics of the to-be-detected audio;
inputting the phoneme sequence and the acoustic features into a trained pronunciation skill detection model to perform pronunciation skill detection processing to obtain a first detection result and a second detection result;
the first detection result is used for representing whether the text to be detected needs to be spoken by adopting pronunciation skills, and the second detection result is used for representing whether the speaker speaks the text to be detected by adopting pronunciation skills.
It should be noted that the electronic device 100 provided in the embodiment of the present application and the pronunciation skill detection method in the foregoing embodiments belong to the same concept, and specific implementation processes thereof are described in the foregoing related embodiments, and are not described herein again.
The present application further provides a computer-readable storage medium, on which a computer program is stored, which, when executed on a processor of an electronic device provided in an embodiment of the present application, causes the processor of the electronic device to execute any of the above steps in the pronunciation skill detection method suitable for the electronic device. The storage medium may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like.
The pronunciation skill detection method, the pronunciation skill detection device, the pronunciation skill detection storage medium and the electronic device provided by the present application are introduced in detail, and a specific example is applied in the description to explain the principle and the implementation of the present application, and the description of the above embodiment is only used to help understanding the method and the core idea of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (20)

1. A pronunciation skill detection method, comprising:
acquiring a text to be detected, and converting the text to be detected into a corresponding phoneme sequence;
acquiring a to-be-detected audio obtained by a speaker speaking the to-be-detected text, and extracting acoustic characteristics of the to-be-detected audio;
inputting the phoneme sequence and the acoustic features into a trained pronunciation skill detection model to perform pronunciation skill detection processing to obtain a first detection result and a second detection result;
the first detection result is used for representing whether the text to be detected needs to be spoken by pronunciation skills, and the second detection result is used for representing whether the speaker speaks the text to be detected by pronunciation skills.
2. The pronunciation skill detection method according to claim 1, wherein the pronunciation skill detection model comprises a phoneme feature extraction network, an acoustic feature enhancement network, a feature fusion network, a first pronunciation skill detection network and a second pronunciation skill detection network, and the inputting the phoneme sequence and the acoustic features into the trained pronunciation skill detection model for pronunciation skill detection processing to obtain a first detection result and a second detection result comprises:
inputting the phoneme sequence into the phoneme feature extraction network for feature extraction processing to obtain a phoneme feature matrix;
inputting the phoneme feature matrix into the first pronunciation skill detection network to perform pronunciation skill detection processing to obtain the first detection result;
inputting the acoustic features into the acoustic feature enhancement network for feature enhancement processing to obtain an enhanced acoustic feature matrix;
inputting the enhanced acoustic feature matrix and the phoneme feature matrix into the feature fusion network for feature fusion processing to obtain a fusion feature matrix;
and inputting the fusion feature matrix into the second pronunciation skill detection network to perform pronunciation skill detection processing to obtain the second detection result.
3. The pronunciation skill detection method as claimed in claim 2, wherein the phoneme feature extraction network comprises a phoneme embedding module and a phoneme feature extraction module, and the inputting the phoneme sequence into the phoneme feature extraction network for feature extraction to obtain a phoneme feature matrix comprises:
inputting the phoneme sequence into the phoneme embedding module for embedding to obtain a phoneme vector matrix;
and inputting the phoneme vector matrix into the phoneme feature extraction module for feature extraction processing to obtain the phoneme feature matrix.
4. The pronunciation skill detection method as claimed in claim 3, wherein the phone feature extraction module comprises at least 1 phone feature extraction sub-module, and the inputting the phone vector matrix into the phone feature extraction module for feature extraction to obtain the phone feature matrix comprises:
when the number of the phoneme feature extraction submodules is 1, inputting the phoneme vector matrix into the phoneme feature extraction submodule for feature extraction processing to obtain the phoneme feature matrix; or,
and when the number of the phoneme feature extraction submodules is N, inputting the phoneme vector matrix into the N phoneme feature extraction submodules to sequentially perform feature extraction processing to obtain the phoneme feature matrix, wherein N is an integer greater than 1.
5. The pronunciation skill detection method as claimed in claim 4, wherein the phoneme feature extraction submodule comprises a first matrix conversion layer, a first multi-head attention layer and a first matrix fusion layer, and the inputting the phoneme vector matrix into the phoneme feature extraction submodule for feature extraction to obtain the phoneme feature matrix comprises:
inputting the phoneme vector matrix into the first matrix conversion layer for matrix conversion processing to obtain a first query matrix, a first key matrix and a first value matrix;
inputting the first query matrix, the first key matrix and the first value matrix into the first multi-head attention layer for attention enhancement processing to obtain a first attention enhancement matrix;
and inputting the first attention enhancing matrix and the phoneme vector matrix into the first matrix fusion layer for matrix fusion processing to obtain the phoneme feature matrix.
6. The pronunciation skill detection method as claimed in claim 3, wherein the phoneme feature extraction network further comprises a first position coding module and a second matrix fusion layer, and before inputting the phoneme vector matrix into the phoneme feature extraction module for feature extraction to obtain the phoneme feature matrix, the method further comprises:
inputting the phoneme vector matrix into the first position coding module for position coding processing to obtain a first position coding matrix;
inputting the first position coding matrix and the phoneme vector matrix into the second matrix fusion layer for matrix fusion processing to obtain a phoneme position fusion matrix;
the inputting the phoneme vector matrix into the phoneme feature extraction module for feature extraction processing to obtain the phoneme feature matrix includes:
and inputting the phoneme position fusion matrix into the phoneme feature extraction module for feature extraction processing to obtain the phoneme feature matrix.
7. The pronunciation skill detection method according to claim 2, wherein the acoustic feature enhancement network comprises a feature coding module and at least 1 acoustic feature enhancement module, and the inputting the acoustic features into the acoustic feature enhancement network for feature enhancement processing to obtain an enhanced acoustic feature matrix comprises:
inputting the acoustic features into the feature coding module for feature coding processing to obtain an acoustic feature matrix;
when the number of the acoustic feature enhancement modules is 1, inputting the acoustic feature matrix into the acoustic feature enhancement module for feature enhancement processing to obtain the enhanced acoustic feature matrix; or,
and when the number of the acoustic feature enhancement modules is M, inputting the acoustic feature matrix into the M acoustic feature enhancement modules to sequentially perform feature enhancement processing to obtain the enhanced acoustic feature matrix, wherein M is an integer greater than 1.
8. The pronunciation skill detection method as claimed in claim 7, wherein the feature encoding module comprises a first convolution layer, a first pooling layer, a second convolution layer and a second pooling layer, and the inputting the acoustic features into the feature encoding module for feature encoding to obtain an acoustic feature matrix comprises:
inputting the acoustic features into the first convolution layer to carry out convolution processing to obtain a first convolution result;
inputting the first convolution result into the first pooling layer for pooling treatment to obtain a first pooling result;
inputting the first pooling result into the second convolution layer for convolution processing to obtain a second convolution result;
and inputting the second convolution result into the second pooling layer for pooling processing to obtain the acoustic feature matrix.
9. The pronunciation skill detection method according to claim 7, wherein the acoustic feature enhancement module comprises a second matrix conversion layer, a second multi-head attention layer, a third matrix fusion layer, a third convolution layer, a deconvolution layer and a fourth matrix fusion layer, and the inputting the acoustic feature matrix into the acoustic feature enhancement module for feature enhancement processing to obtain the enhanced acoustic feature matrix comprises:
inputting the acoustic feature matrix into the second matrix conversion layer for matrix conversion processing to obtain a second query matrix, a second key matrix and a second value matrix;
inputting the second query matrix, the second key matrix and the second value matrix into the second multi-head attention layer for attention enhancement processing to obtain a second attention enhancement matrix;
inputting the second attention enhancement matrix and the acoustic feature matrix into the third matrix fusion layer for matrix fusion processing to obtain an acoustic fusion matrix;
inputting the acoustic fusion matrix into the third convolution layer for convolution processing to obtain a third convolution result;
inputting the third convolution result into the deconvolution layer for deconvolution processing to obtain a deconvolution result;
and inputting the acoustic fusion matrix and the deconvolution result into the fourth matrix fusion layer for matrix fusion processing to obtain the enhanced acoustic feature matrix.
10. The pronunciation skill detection method according to claim 7, wherein the acoustic feature enhancement network further comprises a second location coding module and a fifth matrix fusion layer, and before inputting the acoustic feature matrix into the acoustic feature enhancement module for feature enhancement processing to obtain the enhanced acoustic feature matrix, the method further comprises:
inputting the acoustic feature matrix into the second position coding module for position coding processing to obtain a second position coding matrix;
inputting the second position coding matrix and the acoustic feature matrix into the fifth matrix fusion layer for matrix fusion processing to obtain an acoustic position fusion matrix;
the inputting the acoustic feature matrix into the acoustic feature enhancement module for feature enhancement processing to obtain the enhanced acoustic feature matrix includes:
and inputting the acoustic position fusion matrix into the acoustic feature enhancement module for feature enhancement processing to obtain the enhanced acoustic feature matrix.
11. The pronunciation skill detection method according to claim 2, wherein the feature fusion network comprises a third matrix conversion layer, a fourth matrix conversion layer, a third multi-head attention layer, a sixth matrix fusion layer, a feedforward network layer and a seventh matrix fusion layer, and the inputting the enhanced acoustic feature matrix and the phoneme feature matrix into the feature fusion network for feature fusion processing to obtain a fusion feature matrix comprises:
inputting the enhanced acoustic feature matrix into the third matrix conversion layer for matrix conversion processing to obtain a third key matrix and a third value matrix;
inputting the phoneme feature matrix into the fourth matrix conversion layer for matrix conversion processing to obtain a third query matrix;
inputting the third query matrix, the third key matrix and the third value matrix into the third multi-head attention layer for attention enhancement processing to obtain a third attention enhancement matrix;
inputting the third attention enhancing matrix and the third query matrix into the sixth matrix fusion layer for matrix fusion processing to obtain an acoustic phoneme fusion matrix;
inputting the acoustic phoneme fusion matrix into the feedforward network layer to perform feedforward calculation processing to obtain a feedforward matrix;
and inputting the feedforward matrix and the acoustic phoneme fusion matrix into the seventh matrix fusion layer for matrix fusion processing to obtain the fusion feature matrix.
12. The pronunciation skill detection method according to claim 2, wherein the first pronunciation skill detection network comprises a first fully connected layer and a first classification function layer, and the inputting the phoneme feature matrix into the first pronunciation skill detection network for pronunciation skill detection processing to obtain the first detection result comprises:
inputting the phoneme feature matrix into the first full-connection layer to perform full-connection processing to obtain a first full-connection result;
and inputting the first full-connection result into the first classification function layer for classification processing to obtain the first detection result.
13. The pronunciation skill detection method according to claim 2, wherein the second pronunciation skill detection network comprises L branch detection networks, each branch detection network corresponding to a different pronunciation skill, L being an integer greater than 1, and the inputting the fused feature matrix into the second pronunciation skill detection network for pronunciation skill detection processing to obtain the second detection result comprises:
inputting the fusion feature matrix into each branch detection network to perform pronunciation skill detection to obtain a branch pronunciation skill detection result of each branch detection network, wherein the branch pronunciation skill detection result of each branch detection network represents whether the speaker speaks the text to be detected by adopting pronunciation skill corresponding to each branch detection network;
and obtaining the second detection result according to the branch pronunciation skill detection result of each branch detection network.
14. The pronunciation skill detection method according to claim 13, wherein the branch detection network comprises a second fully connected layer and a second classification function layer, and the inputting the fused feature matrix into each branch detection network for pronunciation skill detection to obtain the branch pronunciation skill detection result of each branch detection network comprises:
inputting the fusion characteristic matrix into the second full-connection layer to perform full-connection processing to obtain a second full-connection result;
and inputting the second full-connection result into the second classification function layer for classification processing to obtain the branch pronunciation skill detection result.
15. The pronunciation skill detection method as claimed in claim 13, wherein before the obtaining the text to be detected and converting the text to be detected into the corresponding phoneme sequence, the method further comprises:
obtaining multiple types of first sample texts which are known to need to be spoken by different pronunciation skills, and converting each type of first sample text into a corresponding positive sample phoneme sequence;
acquiring first sample audio of each type of first sample text spoken by a sample user by adopting different pronunciation skills, and extracting positive sample acoustic features of the first sample audio of each type of first sample text;
obtaining a second sample text which is known not to be spoken by adopting pronunciation skills, and converting the second sample text into a corresponding negative sample phoneme sequence;
acquiring a second sample audio of the second sample text spoken by the sample user, and extracting negative sample acoustic features of the second sample audio;
and performing model training according to each type of the positive sample phoneme sequence, each type of the positive sample acoustic features, the negative sample phoneme sequence and the negative sample acoustic features to obtain the pronunciation skill detection model.
16. The pronunciation skill detection method according to any one of claims 1-15, wherein the extracting the acoustic features of the audio to be detected comprises:
extracting Filterbank characteristics, fundamental frequency characteristics and energy characteristics of the audio to be detected;
and fusing the Filterbank feature, the fundamental frequency feature and the energy feature to obtain the acoustic feature.
17. The pronunciation skill detection method according to any one of claims 1-15, wherein the converting the text to be detected into a corresponding phoneme sequence comprises:
removing unvoiced text units in the text to be detected to obtain a new text to be detected;
and converting each text unit in the new text to be detected into a corresponding phoneme unit to obtain the phoneme sequence.
18. A pronunciation skill detection apparatus, comprising:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a text to be detected and converting the text to be detected into a corresponding phoneme sequence;
the second acquisition module is used for acquiring the audio to be detected obtained by the speaker speaking the text to be detected and extracting the acoustic characteristics of the audio to be detected;
the detection module is used for inputting the phoneme sequence and the acoustic features into a trained pronunciation skill detection model to carry out pronunciation skill detection processing so as to obtain a first detection result and a second detection result;
the first detection result is used for representing whether the text to be detected needs to be spoken by pronunciation skills, and the second detection result is used for representing whether the speaker speaks the text to be detected by pronunciation skills.
19. A storage medium having stored thereon a computer program, wherein the computer program, when loaded by a processor, performs the steps of the pronunciation skill detection method according to any one of claims 1-17.
20. An electronic device comprising a processor and a memory, the memory storing a computer program, wherein the processor is configured to perform the steps of the pronunciation skill detection method according to any one of claims 1-17 by loading the computer program.
CN202111620731.2A 2021-12-28 2021-12-28 Pronunciation skill detection method, pronunciation skill detection device, storage medium and electronic equipment Pending CN114170997A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111620731.2A CN114170997A (en) 2021-12-28 2021-12-28 Pronunciation skill detection method, pronunciation skill detection device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN114170997A true CN114170997A (en) 2022-03-11

Family

ID=80488185

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111620731.2A Pending CN114170997A (en) 2021-12-28 2021-12-28 Pronunciation skill detection method, pronunciation skill detection device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN114170997A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111418006A (en) * 2017-11-29 2020-07-14 Yamaha Corporation Speech synthesis method, speech synthesis device, and program
CN111418006B (en) * 2017-11-29 2023-09-12 Yamaha Corporation Speech synthesis method, speech synthesis device, and recording medium

Similar Documents

Publication Publication Date Title
CN109817213B (en) Method, device and equipment for performing voice recognition on self-adaptive language
CN111312245B (en) Voice response method, device and storage medium
KR101937778B1 (en) System, method and recording medium for machine-learning based korean language conversation using artificial intelligence
KR20130133858A (en) Speech syllable/vowel/phone boundary detection using auditory attention cues
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
JP2016502140A (en) Combination of auditory attention cue and phoneme posterior probability score for sound / vowel / syllable boundary detection
WO2012064408A2 (en) Method for tone/intonation recognition using auditory attention cues
CN112837669B (en) Speech synthesis method, device and server
CN113707125B (en) Training method and device for multi-language speech synthesis model
CN115662435B (en) Virtual teacher simulation voice generation method and terminal
CN112397056B (en) Voice evaluation method and computer storage medium
CN112786018B (en) Training method of voice conversion and related model, electronic equipment and storage device
CN111916054A (en) Lip-based voice generation method, device and system and storage medium
CN112735404A (en) Ironic detection method, system, terminal device and storage medium
CN115171731A (en) Emotion category determination method, device and equipment and readable storage medium
CN115394287A (en) Mixed language voice recognition method, device, system and storage medium
Qu et al. Lipsound2: Self-supervised pre-training for lip-to-speech reconstruction and lip reading
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
CN114170997A (en) Pronunciation skill detection method, pronunciation skill detection device, storage medium and electronic equipment
WO2023226239A1 (en) Object emotion analysis method and apparatus and electronic device
CN116469374A (en) Speech synthesis method, device, equipment and storage medium based on emotion space
CN116386594A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116959417A (en) Method, apparatus, device, medium, and program product for detecting dialog rounds
CN114974310A (en) Emotion recognition method and device based on artificial intelligence, computer equipment and medium
CN115019137A (en) Method and device for predicting multi-scale double-flow attention video language event

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination