CN112185363B - Audio processing method and device

Publication number
CN112185363B
Authority
CN
China
Prior art keywords
audio
audio file
languages
target
determining
Prior art date
Legal status
Active
Application number
CN202011131544.3A
Other languages
Chinese (zh)
Other versions
CN112185363A (en)
Inventor
高强
王卓然
王宏伟
夏龙
刘前
闫永超
郭常圳
Current Assignee
Beijing Ape Power Future Technology Co Ltd
Original Assignee
Beijing Ape Power Future Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Ape Power Future Technology Co Ltd
Priority to CN202011131544.3A
Publication of CN112185363A
Application granted
Publication of CN112185363B
Legal status: Active

Classifications

    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L15/005: Language recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26: Speech to text systems
    • G10L25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination

Abstract

The present specification provides an audio processing method and apparatus. The audio processing method includes: acquiring an audio file containing at least two languages; determining a feature matrix corresponding to the audio file, and inputting the feature matrix into a speech recognition model for processing to obtain a target text containing language identifiers; determining, according to the language identifiers, the target characters corresponding to each of the at least two languages contained in the target text, and determining the audio duration of the audio file; and calculating the speech rate of the sound source in the audio file based on the per-language target characters and the audio duration. The speech rate of mixed-language audio is thereby determined accurately, meeting the use requirements of different service scenarios.

Description

Audio processing method and device
Technical Field
The present disclosure relates to the field of audio processing technologies, and in particular, to an audio processing method and apparatus.
Background
With the development of internet technology, speech recognition is applied in ever wider scenarios, such as instant messaging, video playback, or audio playback. Speech rate is one mode of emotional expression: it not only reflects a user's speaking rhythm but is also a means by which the user adjusts the manner of expression. In speech processing scenarios, the speech rate of the user speaking is therefore an important handle for processing the speech. In the prior art, the speech rate of a speaking user is generally estimated from the syllable rate; however, in different languages a syllable does not necessarily express one character, so for mixed speech (containing at least two languages) the estimated speech rate deviates from the real speech rate, which affects downstream business processing. An effective scheme to solve this problem is therefore needed.
Disclosure of Invention
In view of this, the present embodiments provide an audio processing method. The present disclosure also relates to an audio processing apparatus, a computing device, and a computer-readable storage medium, so as to address the technical shortcomings of the prior art.
According to a first aspect of embodiments of the present specification, there is provided an audio processing method, including:
acquiring an audio file containing at least two languages;
determining a feature matrix corresponding to the audio file, and inputting the feature matrix into a speech recognition model for processing to obtain a target text containing language identifiers;
determining, according to the language identifiers, the target characters corresponding to each of the at least two languages contained in the target text, and determining the audio duration of the audio file;
and calculating the speech rate of the sound source in the audio file based on the target characters corresponding to each of the at least two languages and the audio duration.
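As an illustration of how the four steps above fit together once the recognizer has produced its tagged output, the following minimal Python sketch computes a speech rate from a target text and an audio duration. The [CN]/[EN]/[E] identifiers follow the convention described later in this specification; the function itself and its character-counting rules are an illustrative assumption, not the patented implementation.

```python
import re

def mixed_language_speech_rate(tagged_text: str, audio_duration_s: float) -> float:
    """Characters per minute for recognizer output such as '[EN]hello[CN]你好[E]'
    (Chinese counts characters, English counts words)."""
    total_characters = 0
    # Walk the target text one language sub-identifier at a time.
    for tag, segment in re.findall(r"\[(CN|EN)\]([^\[]*)", tagged_text):
        if tag == "CN":
            # one target character per Chinese character, punctuation excluded
            total_characters += len(re.sub(r"[\s，。！？]", "", segment))
        else:
            # one target character per English word
            total_characters += len(segment.split())
    return total_characters / (audio_duration_s / 60.0)

print(mixed_language_speech_rate("[EN]hello[CN]你好[E]", 1.0))  # 3 characters -> 180.0
```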
Optionally, inputting the feature matrix into the speech recognition model for processing to obtain the target text containing the language identifiers includes:
inputting the feature matrix into the speech recognition model, performing feature encoding through an encoder in the speech recognition model, and outputting a feature sequence of the audio file;
applying an attention mechanism to the feature sequence, then decoding through a decoder in the speech recognition model, and outputting a target feature sequence of the audio file;
and processing the target feature sequence through an output layer in the speech recognition model, and outputting the target text containing the language identifiers.
Optionally, determining the feature matrix corresponding to the audio file includes:
performing framing processing on the audio file to obtain a plurality of audio frames;
determining the feature vectors corresponding to each of the plurality of audio frames;
and generating the feature matrix corresponding to the audio file based on the feature vectors corresponding to each of the plurality of audio frames.
Optionally, the speech recognition model is trained in the following manner:
acquiring a sample audio file, and performing framing processing on the sample audio file to obtain a plurality of sample audio frames;
determining the sample feature vectors corresponding to each of the plurality of sample audio frames, and forming a sample feature matrix corresponding to the sample audio file based on the sample feature vectors;
determining a sample text corresponding to the sample audio file, and adding language identifiers to the sample text according to the languages contained in the sample text to obtain a sample target text;
and training an initial speech recognition model based on the sample feature matrix and the sample target text to obtain the speech recognition model.
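To make the label-construction step concrete, the sketch below builds a sample target text by prepending each transcript segment with its language sub-identifier and appending the audio end symbol. The segment list is a hypothetical example; only the [CN]/[EN]/[E] identifier convention comes from this specification.

```python
# Hypothetical (language, transcript segment) pairs for one sample audio file.
sample_segments = [("EN", "hello"), ("CN", "把玩具送给你")]

def build_sample_target_text(segments) -> str:
    """Prepend each transcript segment with its language sub-identifier and
    terminate the text with the audio end symbol [E]."""
    return "".join(f"[{lang}]{text}" for lang, text in segments) + "[E]"

print(build_sample_target_text(sample_segments))  # [EN]hello[CN]把玩具送给你[E]
```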
Optionally, determining, according to the language identifiers, the target characters corresponding to each of the at least two languages contained in the target text includes:
determining, among the language identifiers, the language sub-identifiers corresponding to each of the at least two languages contained in the target text;
classifying the target text according to the language sub-identifiers corresponding to each of the at least two languages to obtain the target sub-texts corresponding to each language;
and identifying the characters contained in the target sub-text of each of the at least two languages, and determining the target characters corresponding to each language according to the identification result.
Optionally, determining the audio duration of the audio file includes:
constructing a volume amplitude feature corresponding to the audio file, and determining the silent audio clips in the audio file according to the volume amplitude feature;
determining the silent audio duration of the silent audio clips and the total audio duration of the audio file;
and calculating the difference between the total audio duration and the silent audio duration to obtain the audio duration.
Optionally, calculating the speech rate of the sound source in the audio file based on the target characters corresponding to each of the at least two languages and the audio duration includes:
determining the number of characters of the target characters corresponding to each of the at least two languages, and summing these per-language character counts to obtain the total number of characters;
and calculating the ratio of the total number of characters to the audio duration to obtain the speech rate of the sound source in the audio file.
Optionally, after the step of calculating the speech rate of the sound source in the audio file based on the per-language target characters and the audio duration is performed, the method further includes:
determining the language audio clips corresponding to each of the at least two languages in the audio file;
and adjusting the language audio clips of each of the at least two languages according to the speech rate, and generating a target audio file according to the adjustment result.
Optionally, determining the feature vectors corresponding to each of the plurality of audio frames includes:
windowing the plurality of audio frames, and constructing a first spectrum corresponding to the plurality of audio frames according to the windowing result;
and converting the first spectrum into a second spectrum through a preset filter bank, and performing cepstrum processing on the second spectrum to obtain the feature vectors corresponding to each of the plurality of audio frames.
According to a second aspect of embodiments of the present specification, there is provided an audio processing apparatus comprising:
an acquisition module configured to acquire an audio file containing at least two languages;
a processing module configured to determine a feature matrix corresponding to the audio file and input the feature matrix into a speech recognition model for processing to obtain a target text containing language identifiers;
a determining module configured to determine, according to the language identifiers, the target characters corresponding to each of the at least two languages contained in the target text, and to determine the audio duration of the audio file;
and a calculating module configured to calculate the speech rate of the sound source in the audio file based on the target characters corresponding to each of the at least two languages and the audio duration.
According to a third aspect of embodiments of the present specification, there is provided a computing device comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to:
acquire an audio file containing at least two languages;
determine a feature matrix corresponding to the audio file, and input the feature matrix into a speech recognition model for processing to obtain a target text containing language identifiers;
determine, according to the language identifiers, the target characters corresponding to each of the at least two languages contained in the target text, and determine the audio duration of the audio file;
and calculate the speech rate of the sound source in the audio file based on the target characters corresponding to each of the at least two languages and the audio duration.
According to a fourth aspect of embodiments of the present description, there is provided a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the audio processing method.
According to the audio processing method provided in this specification, after an audio file containing at least two languages is acquired, the feature matrix corresponding to the audio file is determined, and the feature matrix is input into a speech recognition model for processing to obtain a target text containing language identifiers, so that the languages in the audio file are accurately separated by means of the language identifiers. The target characters corresponding to each language are then determined according to the language identifiers, the audio duration of the audio file is determined, and finally the speech rate of the sound source in the audio file is calculated based on the per-language target characters and the audio duration. Errors in speech-rate estimation are thus effectively avoided: by recognizing the characters of each language through the language identifiers, the accuracy of speech-rate calculation for multilingual mixed audio is further improved, which facilitates the effective progress of the subsequent audio processing.
Drawings
FIG. 1 is a flow chart of an audio processing method according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a model processing procedure in an audio processing method according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a model structure in an audio processing method according to an embodiment of the present disclosure;
fig. 4 is a process flow diagram of an audio processing method applied to a mixed audio recognition scene according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of an audio processing device according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of a computing device according to one embodiment of the present disclosure.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present specification. However, this specification can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from its spirit; this specification is therefore not limited by the specific implementations disclosed below.
The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of one or more embodiments of this specification, "first" may also be referred to as "second", and similarly "second" as "first". The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
First, terms related to one or more embodiments of the present specification will be explained.
Speech rate: the speed at which a user speaks.
Phoneme: the smallest phonetic unit, divided according to the natural attributes of speech.
Syllable rate: the rate at which syllables are pronounced while speaking, in syllables/min.
Convolutional neural network (CNN, Convolutional Neural Network): a class of feedforward neural networks that involve convolution computations and have a deep structure.
DNN (Dense Neural Network): a deep neural network.
Recurrent neural network (RNN, Recurrent Neural Network): a class of recursive neural networks that take sequence data as input and recurse along the evolution direction of the sequence, with all nodes (recurrent units) connected in a chain.
CDNN: a neural network implemented by combining CNN and DNN.
BLSTM (Bidirectional Long Short-Term Memory): an RNN composed of a forward LSTM (Long Short-Term Memory) and a backward LSTM.
Encoder-Decoder: a network architecture, used in conjunction with RNNs, that can map one sequence to another; the two sequences may have unequal lengths.
Attention mechanism (Attention): a mechanism for enhancing the effectiveness of neural networks.
Chinese-English mixed audio: audio in which the speaker's speech contains both Chinese and English.
In the present specification, an audio processing method is provided, and the present specification relates to an audio processing apparatus, a computing device, and a computer-readable storage medium, which are described in detail in the following embodiments one by one.
In practical applications, estimating from the syllable rate alone cannot accurately reflect a user's real speaking rate, so a certain error is carried into subsequent audio processing. When mixed audio is involved (audio containing at least two languages), the accuracy of the estimated speech rate drops further, because a single syllable does not necessarily correspond to a single character in different languages, and the growing error affects the normal operation of the subsequent audio processing. Accurately calculating the speech rate of a user speaking in mixed audio is therefore important.
In the audio processing method provided by the present application, in order to improve the accuracy of speech-rate calculation, after an audio file containing at least two languages is acquired, the feature matrix corresponding to the audio file is determined, and the feature matrix is input into a speech recognition model for processing to obtain a target text containing language identifiers, so that the languages in the audio file are accurately separated by means of the language identifiers. The target characters corresponding to each language are then determined according to the language identifiers, the audio duration of the audio file is determined, and finally the speech rate of the sound source in the audio file is calculated based on the per-language target characters and the audio duration. Errors in speech-rate estimation are thus effectively avoided: by recognizing the characters of each language through the language identifiers, the accuracy of speech-rate calculation for multilingual mixed audio is further improved, which facilitates the effective progress of the subsequent audio processing.
Fig. 1 shows a flowchart of an audio processing method according to an embodiment of the present disclosure, which specifically includes the following steps:
step S102, an audio file containing at least two languages is obtained.
In implementation, because different languages map syllables to characters differently, calculating a user's speech rate from syllables leads to inaccurate results; in particular, for mixed audio, if the speech rate is calculated on the assumption that a single syllable corresponds to a single character, the error between the calculated speech rate and the actual speech rate becomes large. For example, in a Chinese-English mixed audio file, one Chinese character generally corresponds to one syllable, whereas an English word may correspond to one syllable or to two or more syllables; a syllable-based speech-rate calculation therefore cannot achieve the required accuracy for Chinese-English mixed audio, which affects the subsequent audio processing.
Based on this, the present application performs language recognition on an audio file containing at least two languages and determines the target characters according to language, so that the speech rate of the sound source in the audio can be calculated accurately; this not only achieves speech-rate recognition for mixed audio but also ensures the normal operation of the subsequent audio processing, further meeting the accuracy requirements of speech-rate calculation in audio processing scenarios.
Further, the audio file specifically refers to audio containing at least two languages, and the at least two languages specifically refer to languages with different forms of expression, such as Chinese, English, Korean, and so on. In this embodiment, the audio processing method is described taking an audio file containing the two languages Chinese and English as an example; the processing of other languages may refer to the corresponding description of this embodiment and is not repeated here.
Step S104, determining a feature matrix corresponding to the audio file, and inputting the feature matrix into a speech recognition model for processing to obtain a target text containing language identifiers.
Specifically, on the basis of the acquired audio file containing at least two languages, in order to obtain an audio file that meets the use requirements, the speaking rate of the user in the audio file needs to be calculated so that the audio file can be adjusted. In this process, since the user speaks in mixed Chinese-English expression, the characters in the audio file must be divided according to language type, so that the speech rate is calculated per language, which effectively improves the accuracy of the speech-rate calculation.
Based on the above, first, the feature matrix corresponding to the audio file is determined, where the feature matrix specifically refers to a matrix composed of the Mel-frequency cepstral coefficient (MFCC, Mel Frequency Cepstrum Coefficient) features of each frame of audio in the audio file. Second, the feature matrix is input into the speech recognition model for processing; while the audio file is converted into text, language identifiers are added according to the types of languages contained in the audio file, so that the sub-texts corresponding to different languages can be distinguished by the language identifiers and the characters of each language can be counted later.
It should be noted that different languages correspond to different language sub-identifiers, and the language sub-identifiers do not repeat, which facilitates the subsequent counting of the characters of each language. The language sub-identifiers may be set according to actual requirements; for example, the sub-identifier for Chinese may be set to [CN], the sub-identifier for English to [EN], and the sub-identifier for Korean to [KR]; correspondingly, the symbol expressing the end of the audio may be set to [E], and so on. The specific language sub-identifier characters can be chosen according to actual requirements.
For example, referring to fig. 2, suppose the content of the user's Chinese-English mixed speech is "greta 小朋友们晚上好，今天我们学习 animals" (roughly, "greta, good evening children, today we will learn animals"). The feature matrix S corresponding to the Chinese-English mixed audio is determined and input into the speech recognition model for processing, yielding the Chinese-English text containing language sub-identifiers "[EN]greta[CN]小朋友们晚上好，今天我们学习[EN]animals[E]", where [EN] represents the English identifier, [CN] the Chinese identifier, and [E] the audio end symbol. After the Chinese-English text containing the language identifiers is obtained, the user's speech rate is calculated subsequently, improving the accuracy of the speech-rate calculation.
Further, since the audio file itself cannot be used directly as model input, the feature matrix corresponding to the audio file must be determined and then fed into the speech recognition model. In this embodiment, the process of determining the feature matrix is as follows:
performing framing processing on the audio file to obtain a plurality of audio frames;
windowing the plurality of audio frames, and constructing a first spectrum corresponding to the plurality of audio frames according to the windowing result;
converting the first spectrum into a second spectrum through a preset filter bank, and performing cepstrum processing on the second spectrum to obtain the feature vectors corresponding to each of the plurality of audio frames;
and generating the feature matrix corresponding to the audio file based on the feature vectors corresponding to each of the plurality of audio frames.
Specifically, because the speech in an audio file changes continuously and is only approximately stationary over a short range, the plurality of audio frames are windowed so as to mitigate signal discontinuity at the edges of each audio frame; in practical applications, the window function used for windowing may be a rectangular window, a Hamming window, or a Hanning window. The first spectrum specifically refers to the spectrum obtained through the Fourier transform; the second spectrum specifically refers to the Mel-frequency spectrum obtained by passing the first spectrum through a Mel filter bank; and the feature vectors specifically refer to the MFCC feature vectors corresponding to each audio frame.
Based on the above, the audio file is first framed to obtain the plurality of audio frames. Each audio frame is then windowed with a window function, and a Fourier transform is applied to each short-time analysis window according to the windowing result, giving the first spectrum corresponding to the plurality of audio frames. The first spectrum is converted through a Mel filter bank to obtain the second spectrum, and cepstral analysis is performed on the second spectrum (taking the logarithm and applying an inverse transform, implemented via the DCT (Discrete Cosine Transform), with the 2nd to 13th DCT coefficients taken as the MFCC coefficients), yielding the Mel-frequency cepstral coefficients (MFCC), which are the feature vectors corresponding to each audio frame. Finally, the feature matrix corresponding to the audio file is generated from the feature vectors of the plurality of audio frames and serves as the model input for language recognition. It should be noted that during framing the length of each frame may be set according to actual requirements, typically in the range of 10 to 30 ms.
In summary, before an audio file containing at least two languages is processed, the Mel-frequency cepstral features of each frame of audio are extracted, and the feature matrix corresponding to the audio file is generated from them to serve as the model input, effectively improving audio processing efficiency.
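A minimal sketch of this feature-extraction step using the librosa library is shown below, assuming a 16 kHz mono recording named speech.wav; the 25 ms window, 10 ms hop, and 40 mel bands are illustrative choices within the ranges discussed above, not values fixed by the patent.

```python
import librosa

# Load the audio (file name and sample rate are assumptions for this sketch).
y, sr = librosa.load("speech.wav", sr=16000)

mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=13,                   # c0..c12; c0 is dropped below
    n_fft=int(0.025 * sr),       # 25 ms analysis window per frame
    hop_length=int(0.010 * sr),  # 10 ms frame shift
    window="hamming",            # windowing before the FFT (the first spectrum)
    n_mels=40,                   # Mel filter bank producing the second spectrum
)
feature_matrix = mfcc[1:13].T    # 2nd..13th DCT coefficients, one row per frame
print(feature_matrix.shape)      # (num_frames, 12)
```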
Further, after the feature matrix corresponding to the audio file is determined, the feature matrix needs to be input into a speech recognition model for processing so as to obtain a target text that meets the subsequent use requirements. To satisfy the requirement of accurately recognizing a multilingual audio file and to output a target text meeting the use requirements, the model must be trained in a targeted manner. In this embodiment, the speech recognition model is trained in the following manner:
acquiring a sample audio file, and performing framing processing on the sample audio file to obtain a plurality of sample audio frames;
determining the sample feature vectors corresponding to each of the plurality of sample audio frames, and forming a sample feature matrix corresponding to the sample audio file based on the sample feature vectors;
determining a sample text corresponding to the sample audio file, and adding language identifiers to the sample text according to the languages contained in the sample text to obtain a sample target text;
and training an initial speech recognition model based on the sample feature matrix and the sample target text to obtain the speech recognition model.
Specifically, the speech recognition model must not only complete the language recognition but also convert the audio into text, so it must map audio to a character sequence through a deep neural network. Referring to fig. 3, the speech recognition model may adopt an Encoder-Decoder architecture built from BLSTM (Bidirectional Long Short-Term Memory) networks, with an attention mechanism (Attention) introduced between the Encoder and the Decoder, so that when the feature matrix corresponding to an audio file is processed, the target text corresponding to the audio file is generated with language identifiers that divide the languages, facilitating the subsequent calculation of the speaking rate of the user in the audio. Here MFCC 1, MFCC 2, …, MFCC m represent the Mel-frequency cepstral features corresponding to each audio frame, serving as the input of the speech recognition model, and Token 1, Token 2, …, Token n represent the character sequence to which the audio file is mapped. It should be noted that several BLSTM layers are stacked in both the Encoder and the Decoder, so as to recognize the characters mapped from the audio file more accurately.
Based on this, after an initial speech recognition model satisfying the use requirements is constructed, the initial speech recognition model is trained as follows. First, a sample audio file is acquired, where the sample audio file is an audio file containing multiple languages; since different audio may contain different combinations of languages, different speech recognition models can be trained for different language combinations, so that the method applies to more speech-rate calculation scenarios. Second, the sample audio file is framed to obtain a plurality of sample audio frames, each of which may be set to, say, 10 ms, and the sample feature vector (the Mel-frequency cepstral features) corresponding to each sample audio frame is extracted to generate a sample feature matrix as the input of the model. The sample text corresponding to the sample audio file is then determined, language sub-identifiers are added to the sample text according to the languages it contains, and the resulting sample target text serves as the expected output of the model; finally, the initial speech recognition model is trained with the sample feature matrix and the sample target text, yielding a speech recognition model that meets the use requirements.
Further, in the process of training the initial speech recognition model, the loss value of the model's loss function can be monitored continuously during training; when the loss value satisfies a preset threshold, the currently trained model can be deemed usable, and this model is then used as the speech recognition model for subsequent audio processing.
For example, suppose the content of the sample audio file is "hello, 把玩具送给你" ("hello, the toy is sent to you"). A Chinese-English speech recognition model is trained based on this sample audio file: first, the sample audio file is framed, and the MFCC features of each frame are extracted to generate a feature matrix as the input of the model; second, language sub-identifiers are added to the content of the sample audio file, giving the sample target text "[EN]hello[CN]把玩具送给你[E]" as the output of the model; finally, the Chinese-English speech recognition model is trained based on the feature matrix and the sample target text, yielding a model that can perform language recognition on Chinese-English mixed audio and convert it into text, facilitating the recognition and conversion of Chinese-English mixed audio.
In conclusion, constructing the speech recognition model with the above neural network architecture improves the accuracy of the model's language recognition and avoids wrong, missing, or extra words in the converted text, thereby effectively improving the accuracy of the subsequently calculated speaking rate of the user in the audio.
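For illustration, the following PyTorch sketch assembles an Encoder-Decoder model of the kind described above, with a BLSTM encoder and an attention step between encoder and decoder. The layer sizes, vocabulary size, and single-layer depth are simplifying assumptions (Fig. 3 stacks several BLSTM layers), and the patent does not specify the model at this level of detail.

```python
import torch
import torch.nn as nn

class SpeechRecognizer(nn.Module):
    """BLSTM encoder + attention + LSTM decoder over MFCC feature matrices."""

    def __init__(self, n_mfcc=12, hidden=256, vocab=5000):
        super().__init__()
        self.encoder = nn.LSTM(n_mfcc, hidden, batch_first=True, bidirectional=True)
        self.embed = nn.Embedding(vocab, hidden * 2)
        self.decoder = nn.LSTM(hidden * 2, hidden * 2, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden * 2, num_heads=1, batch_first=True)
        self.out = nn.Linear(hidden * 2 * 2, vocab)   # decoder state + context

    def forward(self, features, tokens):
        # features: (B, frames, n_mfcc) MFCC matrix; tokens: (B, T) target ids
        enc, _ = self.encoder(features)               # (B, frames, 2*hidden)
        dec, _ = self.decoder(self.embed(tokens))     # (B, T, 2*hidden)
        # attention lets each decoding step look back over all audio frames
        ctx, _ = self.attn(dec, enc, enc)             # (B, T, 2*hidden)
        return self.out(torch.cat([dec, ctx], dim=-1))  # (B, T, vocab) logits

model = SpeechRecognizer()
logits = model(torch.randn(2, 100, 12), torch.randint(0, 5000, (2, 20)))
print(logits.shape)  # torch.Size([2, 20, 5000])
```

During training, the logits would be compared against the token sequence shifted by one position using a cross-entropy loss, with the loss value monitored against a preset threshold as described above.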
Furthermore, after a speech recognition model meeting the use requirements is obtained, it is applied in the audio processing method to achieve accurate recognition of language-mixed audio. In this embodiment, the specific process is as follows:
inputting the feature matrix into the speech recognition model, performing feature encoding through the encoder in the speech recognition model, and outputting the feature sequence of the audio file;
applying the attention mechanism to the feature sequence, then decoding through the decoder in the speech recognition model, and outputting the target feature sequence of the audio file;
and processing the target feature sequence through the output layer in the speech recognition model, and outputting the target text containing the language identifiers.
Specifically, after the speech recognition model meeting the use requirements is obtained through the above training, the model is used to process the feature matrix corresponding to the audio file so as to obtain the target text containing the language identifiers. Based on this, the feature matrix is first input into the speech recognition model, and the encoder in the model encodes the feature vector corresponding to each frame of audio in the feature matrix, giving the feature subsequence corresponding to each audio frame; all the feature subsequences together form the feature sequence of the audio file. After the attention mechanism is applied to the feature sequence, the decoder decodes it and outputs the target feature sequence. Finally, the target feature sequence is processed through the output layer of the speech recognition model, whereby the audio file is mapped to text and language identifiers are added for the different languages, facilitating the subsequent counting of the characters in the text.
That is, after decoding, the decoder first outputs a language sub-identifier (the identifier corresponding to one language) and then outputs the characters corresponding to that language sub-identifier, until another language sub-identifier or the audio end symbol is output; after another language sub-identifier is output, the characters corresponding to it are output in turn, until yet another language sub-identifier or the audio end symbol appears, and so on. When the audio end symbol is finally output, the recognition of the audio file is finished, and the subsequent audio processing can then be carried out.
In addition, the speech recognition model provided in this embodiment may also be formed by combining a plurality of speech recognition sub-models, one per language, with a multilingual classification sub-model: the speech recognition sub-model of each language recognizes only the speech belonging to that language in the audio file and converts it into text, the multilingual classification sub-model then adds language identifiers to the converted texts, and the texts are finally assembled into the target text for use in the subsequent audio processing.
On the other hand, the speech recognition model adopting the Encoder-Decoder architecture can be replaced by other deep neural network architectures, and the BLSTM network can likewise be replaced by other neural network architectures; in a specific implementation, the choice can be made according to the actual application scenario, and this embodiment is not limited in this respect.
In summary, by feeding the attention-weighted feature sequence to the decoder for decoding, the mutual influence between the audio frames is fully taken into account through the attention mechanism, and the richness and precision of the features are improved through feature fusion, so that a target text better satisfying the use requirements is obtained through the model, effectively improving the accuracy of the subsequent speech-rate calculation.
Step S106, determining, according to the language identifiers, the target characters corresponding to each of the at least two languages contained in the target text, and determining the audio duration of the audio file.
Specifically, on the basis of the obtained target text containing the language identifiers, the characters contained in the target text are now counted. Because characters are represented differently in different languages (one character corresponds to a Chinese character in Chinese but to a word in English), the counting must be carried out per language according to the language identifiers; the target characters therefore specifically refer to the characters corresponding to each of the different languages. Correspondingly, the audio duration of the audio file also needs to be determined, where the audio duration specifically refers to the duration of the user's speech in the audio file.
Based on the above, after the target characters corresponding to each of the at least two languages contained in the target text are determined, the number of target characters of each language can be counted, facilitating the subsequent calculation of the speech rate.
Further, in determining the target characters corresponding to the at least two languages in the target text, since the target text may contain many characters belonging to different languages, the characters of each language can be determined by classification. In this embodiment, the specific implementation is as follows:
determining, among the language identifiers, the language sub-identifiers corresponding to each of the at least two languages contained in the target text;
classifying the target text according to the language sub-identifiers corresponding to each of the at least two languages to obtain the target sub-texts corresponding to each language;
and identifying the characters contained in the target sub-text of each of the at least two languages, and determining the target characters corresponding to each language according to the identification result.
Specifically, a language sub-identifier refers to the identifying character corresponding to one language, and a target sub-text refers to the text corresponding to one language, which belongs to the target text. Based on this, the language sub-identifiers corresponding to each of the at least two languages contained in the target text are first determined among the language identifiers; second, the target text is classified according to the language sub-identifiers of the respective languages, so that the target sub-text of each language is obtained, i.e., the parts belonging to the same language form one target sub-text; finally, by identifying the characters contained in the target sub-text of each language, the target characters corresponding to each language can be determined.
Following the above example, when the Chinese-English text "[EN]greta[CN]小朋友们晚上好，今天我们学习[EN]animals[E]" corresponding to the Chinese-English mixed audio is obtained, it is determined that the text contains two language sub-identifiers, [CN] corresponding to Chinese and [EN] corresponding to English. The Chinese-English text is then classified according to the Chinese and English sub-identifiers, giving the Chinese target sub-text "小朋友们晚上好，今天我们学习" and the English target sub-text "greta / animals". The Chinese target characters are thereby determined to be {小, 朋, 友, 们, 晚, 上, 好, 今, 天, 我, 们, 学, 习} and the English target characters to be {greta, animals}, for use in the subsequent speech-rate calculation.
In conclusion, by classifying the languages through the language sub-identifiers, the target characters belonging to each language in the target text can be counted accurately, which improves the accuracy of the subsequent speech-rate calculation.
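A minimal sketch of this classification step is given below: it splits a tagged target text on its language sub-identifiers, groups the segments per language, and applies per-language rules for what counts as a target character (one Chinese character, or one English word). The helper name and the punctuation handling are illustrative assumptions.

```python
import re

def split_by_language(target_text: str) -> dict:
    """Classify the target text by its language sub-identifiers and return
    the target characters of each language ([E] only terminates the text)."""
    characters = {}
    for tag, segment in re.findall(r"\[(CN|EN)\]([^\[]*)", target_text):
        if tag == "CN":
            # Chinese: one target character per Chinese character
            chars = [c for c in segment if not c.isspace() and c not in "，。！？"]
        else:
            # English: one target character per word
            chars = segment.split()
        characters.setdefault(tag, []).extend(chars)
    return characters

text = "[EN]greta[CN]小朋友们晚上好，今天我们学习[EN]animals[E]"
print(split_by_language(text))
# {'EN': ['greta', 'animals'],
#  'CN': ['小', '朋', '友', '们', '晚', '上', '好', '今', '天', '我', '们', '学', '习']}
```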
Furthermore, when the audio duration of the audio file is determined, it should be considered that the audio file contains not only voiced audio clips but possibly also unvoiced or useless clips. If, when the speech rate is calculated, audio clips that do not belong to the user's speech are also counted into the audio duration, the calculated speech rate will be inaccurate; the useless audio clips in the audio file can therefore be removed so as to obtain the real audio duration and improve the accuracy of the subsequent speech-rate calculation. In this embodiment, the specific implementation is as follows:
constructing a volume amplitude feature corresponding to the audio file, and determining the silent audio clips in the audio file according to the volume amplitude feature;
determining the silent audio duration of the silent audio clips and the total audio duration of the audio file;
and calculating the difference between the total audio duration and the silent audio duration to obtain the audio duration.
Specifically, the useless audio clips in the audio file are identified according to the volume amplitude feature of the audio file. The volume amplitude feature reflects the energy of the audio: the smaller the energy, the quieter the corresponding audio clip and the higher the probability that it is a useless clip; accordingly, the silent audio clips are the useless audio clips in the audio file.
Based on the above, after the volume amplitude feature corresponding to the audio file is constructed, the silent audio clips in the audio file can be determined by analyzing the volume amplitude feature; the silent audio duration of the silent clips and the total audio duration of the audio file are then determined, and finally the difference between the total audio duration and the silent audio duration is calculated, so that the duration of the user's speech in the audio file is determined, improving the accuracy of the subsequent speech-rate calculation.
Following the above example, the time the user took to record the Chinese-English mixed audio is determined to be 3 s. By constructing the volume amplitude feature corresponding to the mixed audio, it is determined that the user paused three times, for 0.2 s, 0.1 s, and 0.1 s respectively, so the user's speaking duration is 3 - 0.2 - 0.1 - 0.1 = 2.6 s; that is, the audio duration of the Chinese-English mixed audio is 2.6 s.
In addition, the audio duration may alternatively be determined from the start and end time points of the sound source's speech; in implementation, the manner of determining the audio duration can be selected according to the actual application scenario, which is not limited here.
In summary, when the audio duration of the audio file is determined, removing the useless audio clips in the audio file allows the duration of the user's real speech in the audio to be determined accurately, which effectively improves the accuracy of the subsequent speech-rate calculation.
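The following sketch illustrates one way to implement this duration calculation from a simple per-frame volume (RMS amplitude) profile; the frame length and silence threshold are illustrative assumptions, since the patent does not prescribe how the volume amplitude feature is thresholded.

```python
import numpy as np

def effective_duration(y: np.ndarray, sr: int,
                       frame_ms: float = 25.0, threshold: float = 0.02) -> float:
    """Audio duration with silent clips removed: total duration minus the
    duration of frames whose RMS amplitude falls below the threshold."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(y) // frame_len
    frames = y[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))      # per-frame volume amplitude
    silent_frames = int((rms < threshold).sum())   # frames deemed silent
    total_s = len(y) / sr
    silent_s = silent_frames * frame_len / sr
    return total_s - silent_s

# e.g. a 3 s recording with 0.4 s of pauses -> roughly 2.6 s effective duration
```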
Step S108, calculating the speech rate of the sound source in the audio file based on the target characters corresponding to each of the at least two languages and the audio duration.
Specifically, on the basis of the target characters determined above for each language and the audio duration of the audio file, the speech rate of the sound source in the audio file is calculated from the audio duration and the number of target characters corresponding to each language, where the sound source may be the speaking user or a player that plays back the spoken content.
Further, since characters are represented differently in different languages, the number of characters must be counted per language so as to improve the accuracy of the speech-rate calculation. In this embodiment, the specific implementation is as follows:
determining the number of characters of the target characters corresponding to each of the at least two languages, and summing these per-language character counts to obtain the total number of characters;
and calculating the ratio of the total number of characters to the audio duration to obtain the speech rate of the sound source in the audio file.
Specifically, after the number of characters of the target characters corresponding to each of the at least two languages is determined, the per-language character counts are summed to obtain the total number of characters contained in the target text, and the ratio of the total number of characters to the audio duration is then calculated to determine the speech rate of the sound source in the audio file.
In a specific implementation, the speech rate can be obtained by the following formula:

s = (∑n_cn + ∑n_en + … + ∑n_mn) / d

where s represents the speech rate, d represents the audio duration, and n_mn represents the number of characters of each language; that is, the total number of characters spoken by the user in the audio file is obtained by summation, and the ratio of that total to the audio duration determines the speech rate of the user speaking in the audio file.
Following the above example, after the Chinese target characters {小, 朋, 友, 们, 晚, 上, 好, 今, 天, 我, 们, 学, 习} and the English target characters {greta, animals} are determined, the total number of Chinese characters is 13 and the total number of English characters is 2, i.e., the total number of characters in the Chinese-English text is 15. With the audio duration of the Chinese-English mixed audio being 2.6 s, the calculation gives s = (13 + 2) / 2.6 × 60 = 346.15 characters/min; that is, the user's speaking rate is 346.15 characters per minute.
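Expressed in code, the formula above reduces to a single ratio; the sketch below reproduces the worked example, assuming the per-language character counts have already been determined as in step S106.

```python
def speech_rate(char_counts: dict, audio_duration_s: float) -> float:
    """s = (sum of per-language character counts) / d, in characters/min."""
    return sum(char_counts.values()) / (audio_duration_s / 60.0)

print(round(speech_rate({"CN": 13, "EN": 2}, 2.6), 2))  # 346.15
```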
Furthermore, after the speech-rate calculation is completed, in order to generate a high-quality audio file that is convenient to listen to and meets the user's listening requirements, the audio file can be adjusted according to the speech rate. In this embodiment, the specific implementation is as follows:
determining the language audio clips corresponding to each of the at least two languages in the audio file;
and adjusting the language audio clips of each of the at least two languages according to the speech rate, and generating a target audio file according to the adjustment result.
Specifically, the language audio clips refer to the audio clips corresponding to each language; adjusting them means increasing or decreasing the playback speed of the clips according to actual listening requirements, so as to obtain a target audio file that meets those requirements.
Following the above example, after the user's speaking rate is obtained, to make listening easier for other users, the Chinese audio clips in the Chinese-English mixed audio can be appropriately slowed down and the English audio clips appropriately sped up according to the speaking rate, giving a target Chinese-English mixed audio that meets the playback requirements, which is then played for other users to listen to.
In practical applications, when the language audio clips corresponding to the at least two languages are adjusted according to the speech rate, speeding up, slowing down, or leaving a clip unchanged can be chosen according to actual requirements, so as to obtain audio suited to the listener, and the choice can be made according to the actual application scenario: for example, the playback speed may be increased in a novel-reading scenario or decreased in a teaching scenario; this embodiment does not impose any limitation.
In conclusion, adjusting the audio clips of the different languages in the audio file according to the speech rate yields audio that meets other users' listening requirements, which can effectively improve the users' listening experience and increase user reach.
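As a sketch of this adjustment step, the snippet below slows down a Chinese clip and speeds up an English clip with librosa's time_stretch, then writes the result as the target audio file. The file names, segment boundaries, and stretch rates are illustrative assumptions, not values prescribed by the patent.

```python
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("mixed_cn_en.wav", sr=16000)  # hypothetical input file

# (language, start_s, end_s, stretch rate): rate < 1 slows down, rate > 1 speeds up
segments = [("CN", 0.0, 1.6, 0.9),
            ("EN", 1.6, 2.6, 1.2)]

adjusted = [librosa.effects.time_stretch(y[int(a * sr): int(b * sr)], rate=r)
            for _, a, b, r in segments]

sf.write("target_mixed_cn_en.wav", np.concatenate(adjusted), sr)  # target audio file
```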
According to the audio processing method provided in this specification, after an audio file containing at least two languages is acquired, the feature matrix corresponding to the audio file is determined, and the feature matrix is input into a speech recognition model for processing to obtain a target text containing language identifiers, so that the languages in the audio file are accurately separated by means of the language identifiers. The target characters corresponding to each language are then determined according to the language identifiers, the audio duration of the audio file is determined, and finally the speech rate of the sound source in the audio file is calculated based on the per-language target characters and the audio duration. Errors in speech-rate estimation are thus effectively avoided: by recognizing the characters of each language through the language identifiers, the accuracy of speech-rate calculation for multilingual mixed audio is further improved, which facilitates the effective progress of the subsequent audio processing.
The following describes, with reference to fig. 4, an example of applying the audio processing method provided in this specification to a Chinese-English mixed audio recognition scenario. Fig. 4 shows a processing flow chart of an audio processing method applied to a Chinese-English mixed audio recognition scene according to an embodiment of the present disclosure, specifically including the following steps:
step S402, obtaining Chinese and English mixed audio.
In practical application, the speech speed of the user speaking in the audio cannot be estimated to accurately reflect the real speaking speech speed of the user, so that a certain error exists in subsequent audio processing; when mixed audio is involved (the audio contains at least two languages), because individual syllables in different languages do not necessarily correspond to individual characters, the accuracy of estimating the speech speed is reduced again, and the subsequent error becomes larger, so that the normal operation of the subsequent audio processing process is affected, and therefore, it is important to accurately calculate the speech speed of speaking of a user in the mixed audio.
According to the method, language identification is carried out on the sound pieces containing Chinese and English, and target characters are determined according to the languages, so that the voice speed of a sound source in audio can be accurately calculated, the voice speed identification of mixed audio can be completed, the normal operation of a subsequent audio processing process can be ensured, and the requirement of voice speed accuracy calculation of an audio processing scene is further met.
Step S404, framing the Chinese-English mixed audio to obtain a plurality of audio frames, and determining the MFCC features corresponding to each audio frame.
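As an illustration of step S404, the following short sketch extracts per-frame MFCC features with librosa, which performs the framing internally. The file name is hypothetical, and 13 coefficients per frame is a conventional choice rather than one mandated by this specification.

import librosa

y, sr = librosa.load("zh_en_mixed.wav", sr=16000)        # hypothetical audio file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)   # 25 ms frames, 10 ms hop
print(mfcc.shape)  # (13, num_frames): one 13-dim MFCC feature vector per audio frame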
Step S406, inputting the MFCC features corresponding to the audio frames into the Chinese-English speech recognition model for processing, so as to obtain the target text containing the language identifiers.
Step S408, determining the Chinese language sub-identifier and the English language sub-identifier among the language identifiers.
Step S410, classifying the target text according to the Chinese language sub-identifier and the English language sub-identifier to obtain a Chinese text and an English text.
Step S412, identifying the Chinese characters in the Chinese text and the English characters in the English text, and determining the audio duration of the Chinese-English mixed audio.
Step S414, determining the number of Chinese characters and the number of English characters, and summing the two to obtain the total number of characters.
Step S416, calculating the ratio of the total number of characters to the audio duration to obtain the speech speed of the user in the Chinese-English mixed audio.
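A minimal sketch of steps S408 to S416 is given below, assuming the target text has already been split into a Chinese text and an English text. Whether an English "character" means a letter or a word is a design choice the specification leaves open; letters are assumed here.

import re

def mixed_speech_speed(chinese_text: str, english_text: str, audio_seconds: float) -> float:
    # Steps S412/S414: count Chinese characters (CJK range) and English letters.
    num_chinese = len(re.findall(r"[\u4e00-\u9fff]", chinese_text))
    num_english = len(re.findall(r"[A-Za-z]", english_text))
    total_chars = num_chinese + num_english
    # Step S416: speech speed = total number of characters / audio duration.
    return total_chars / audio_seconds

print(mixed_speech_speed("今天天气很好", "good morning", 4.0))  # (6 + 11) / 4.0 = 4.25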
Specifically, after the user's speech speed is calculated, the audio file can be adjusted according to the speech speed in order to generate a high-quality audio file that is convenient for other users to listen to and meets their listening requirements. For all details not described in this embodiment, reference can be made to the corresponding descriptions in the above embodiments, which are not repeated here.
The audio processing method provided in the present specification divides the multiple languages in the audio file by means of the language identifiers, which effectively avoids errors in speech speed estimation. Since the characters corresponding to different languages are recognized by means of the language identifiers, the accuracy of speech speed calculation for multilingual mixed audio is further improved, facilitating the effective implementation of the subsequent audio processing process.
Corresponding to the above method embodiments, the present disclosure further provides an embodiment of an audio processing apparatus, and fig. 5 shows a schematic structural diagram of an audio processing apparatus according to an embodiment of the present disclosure. As shown in fig. 5, the apparatus includes:
an obtaining module 502 configured to obtain an audio file containing at least two languages;
the processing module 504 is configured to determine a feature matrix corresponding to the audio file, and input the feature matrix into the speech recognition model for processing, so as to obtain a target text containing a language identifier;
a determining module 506, configured to determine target characters respectively corresponding to at least two languages included in the target text according to the language identifier, and determine an audio duration of the audio file;
a calculating module 508 configured to calculate the speech speed of the sound source in the audio file based on the target characters respectively corresponding to the at least two languages and the audio duration.
In an alternative embodiment, the processing module 504 includes:
a feature encoding unit configured to input the feature matrix to the speech recognition model, perform feature encoding by an encoder in the speech recognition model, and output a feature sequence of the audio file;
a feature decoding unit configured to decode the feature sequence by a decoder in the speech recognition model after the feature sequence is introduced into an attention mechanism, and output a target feature sequence of the audio file;
and the output unit is configured to process the target feature sequence through an output layer in the voice recognition model and output the target text containing the language identifier.
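For illustration, the following simplified PyTorch sketch mirrors this encoder / attention / decoder / output-layer flow. All module choices, layer sizes, and the vocabulary size are assumptions for the sketch, not the actual model of this specification.

import torch
import torch.nn as nn

class SpeechRecognizerSketch(nn.Module):
    def __init__(self, feat_dim=13, hidden=256, vocab_size=5000):
        super().__init__()
        # Encoder: encodes the frame-level feature matrix into a feature sequence.
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        # Attention mechanism applied to the feature sequence (single head for brevity).
        self.attention = nn.MultiheadAttention(hidden, num_heads=1, batch_first=True)
        # Decoder: decodes the attended sequence into the target feature sequence.
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        # Output layer: projects onto a vocabulary that includes language identifier tokens.
        self.output = nn.Linear(hidden, vocab_size)

    def forward(self, feature_matrix):
        feature_seq, _ = self.encoder(feature_matrix)                        # (B, T, H)
        attended, _ = self.attention(feature_seq, feature_seq, feature_seq)  # attention
        target_feature_seq, _ = self.decoder(attended)                       # (B, T, H)
        return self.output(target_feature_seq)                               # token logits

# 100 frames of 13-dim MFCC-style features -> per-frame logits over the vocabulary.
logits = SpeechRecognizerSketch()(torch.randn(1, 100, 13))
print(logits.shape)  # torch.Size([1, 100, 5000])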
In an alternative embodiment, the processing module 504 includes:
the framing processing unit is configured to perform framing processing on the audio file to obtain a plurality of audio frames;
a feature vector determining unit configured to determine feature vectors corresponding to the plurality of audio frames, respectively;
and a feature matrix generating unit configured to generate the feature matrix corresponding to the audio file based on the feature vectors respectively corresponding to the plurality of audio frames.
In an alternative embodiment, the speech recognition model is trained by:
acquiring a sample audio file, and performing framing processing on the sample audio file to obtain a plurality of sample audio frames;
determining sample feature vectors corresponding to the plurality of sample audio frames respectively, and forming a sample feature matrix corresponding to the sample audio file based on the sample feature vectors;
determining a sample text corresponding to the sample audio file, and adding a language identifier into the sample text according to the language type contained in the sample text to obtain a sample target text;
and training an initial speech recognition model based on the sample feature matrix and the sample target text to obtain the speech recognition model.
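As an illustration of how a sample target text might be constructed, the following sketch inserts a language identifier token in front of each same-language span of the sample text. The <zh> and <en> tag strings are assumptions for the sketch; the specification does not fix a particular identifier encoding.

def tag_languages(sample_text: str) -> str:
    # Insert a language identifier token wherever the language of the text switches.
    tagged, prev = [], None
    for ch in sample_text:
        lang = "<zh>" if "\u4e00" <= ch <= "\u9fff" else "<en>"
        if lang != prev and not ch.isspace():
            tagged.append(lang)
            prev = lang
        tagged.append(ch)
    return "".join(tagged)

print(tag_languages("我喜欢 machine learning 课程"))
# <zh>我喜欢 <en>machine learning <zh>课程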
In an alternative embodiment, the determining module 506 includes:
a language sub-identifier determining unit configured to determine language sub-identifiers respectively corresponding to at least two languages included in the target text among the language identifiers;
The classifying unit is configured to classify the target text according to the language sub-identifiers respectively corresponding to the at least two languages to obtain target sub-texts respectively corresponding to the at least two languages;
and the character recognition unit is configured to recognize characters contained in the target sub-texts corresponding to the at least two languages respectively, and determine the target characters corresponding to the at least two languages respectively according to recognition results.
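A minimal sketch of this classification is shown below: the target text is split into per-language target sub-texts by its language sub-identifiers, assuming the <zh>/<en> tag format used in the training sketch above.

import re
from collections import defaultdict

def split_by_language(target_text: str) -> dict:
    # Each match pairs a language sub-identifier with the span of text it governs.
    sub_texts = defaultdict(str)
    for tag, span in re.findall(r"<(zh|en)>([^<]*)", target_text):
        sub_texts[tag] += span
    return dict(sub_texts)

print(split_by_language("<zh>我喜欢 <en>machine learning <zh>课程"))
# {'zh': '我喜欢 课程', 'en': 'machine learning '}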
In an alternative embodiment, the determining module 506 includes:
an audio file processing unit configured to process the audio file to obtain the volume amplitude characteristic corresponding to the audio file, and determine the silent audio clips in the audio file according to the volume amplitude characteristic;
an audio duration determining unit configured to determine the silent audio duration of the silent audio clips, and the total audio duration of the audio file;
and an audio duration calculating unit configured to calculate the difference between the total audio duration and the silent audio duration to obtain the audio duration.
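For illustration, the following sketch computes the audio duration as the total duration minus the silent duration, treating frames whose mean absolute amplitude falls below a threshold as silent. The frame size and the threshold are assumed values, not parameters fixed by this specification.

import numpy as np

def effective_duration(samples: np.ndarray, sr: int,
                       frame_len: int = 400, threshold: float = 0.01) -> float:
    # Volume amplitude characteristic: mean absolute amplitude of each fixed-length frame.
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    amplitude = np.abs(frames).mean(axis=1)
    # Silent audio duration, then total audio duration minus silent duration.
    silent_seconds = (amplitude < threshold).sum() * frame_len / sr
    total_seconds = len(samples) / sr
    return total_seconds - silent_seconds

rng = np.random.default_rng(0)
audio = np.concatenate([rng.uniform(-0.5, 0.5, 16000), np.zeros(8000)])  # 1 s speech + 0.5 s silence
print(round(effective_duration(audio, sr=16000), 2))  # ~1.0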
In an alternative embodiment, the computing module 508 includes:
a total character number determining unit configured to determine the character number of the target characters corresponding to the at least two languages respectively, and sum the character numbers of the target characters corresponding to the at least two languages respectively to obtain a total character number;
and a speech speed calculating unit configured to calculate the ratio of the total number of characters to the audio duration to obtain the speech speed of the sound source in the audio file.
In an alternative embodiment, the audio processing apparatus further includes:
a language audio clip determining module configured to determine the language audio clips respectively corresponding to the at least two languages in the audio file;
and an adjusting module configured to adjust the language audio clips respectively corresponding to the at least two languages according to the speech speed, and generate a target audio file according to the adjustment result.
In an alternative embodiment, the feature vector determining unit includes:
a windowing subunit configured to perform windowing on the plurality of audio frames, and construct a first spectrum corresponding to the plurality of audio frames according to the windowing result;
and a spectrum conversion subunit configured to convert the first spectrum into a second spectrum through a preset filter bank, and perform cepstrum processing on the second spectrum to obtain the feature vectors respectively corresponding to the plurality of audio frames.
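An expanded sketch of this windowing, spectrum, filter-bank, and cepstrum chain is given below using numpy, librosa, and scipy. The frame size, the 26-filter mel bank, and the 13 retained coefficients are conventional choices rather than values fixed by this specification.

import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_from_frames(frames: np.ndarray, sr: int = 16000, n_fft: int = 400) -> np.ndarray:
    windowed = frames * np.hamming(frames.shape[1])               # windowing
    power = np.abs(np.fft.rfft(windowed, n=n_fft)) ** 2           # first spectrum (power)
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=26)   # preset filter bank
    mel_spec = power @ mel_fb.T                                   # second spectrum (mel)
    # Cepstrum processing: log followed by a DCT, keeping 13 coefficients per frame.
    return dct(np.log(mel_spec + 1e-10), type=2, norm="ortho", axis=1)[:, :13]

frames = np.random.randn(5, 400)       # 5 audio frames of 400 samples each
print(mfcc_from_frames(frames).shape)  # (5, 13): one feature vector per audio frame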
According to the audio processing apparatus provided in this embodiment, after the audio file containing at least two languages is acquired, the feature matrix corresponding to the audio file is determined and input into the speech recognition model for processing to obtain the target text containing the language identifiers, so that the multiple languages in the audio file are accurately divided by means of the language identifiers. The target characters corresponding to the different languages are then determined according to the language identifiers, the audio duration of the audio file is determined, and finally the speech speed of the sound source in the audio file is calculated based on the target characters corresponding to the different languages and the audio duration. This effectively avoids errors in speech speed estimation, further improves the accuracy of speech speed calculation for multilingual mixed audio, and facilitates the effective implementation of the subsequent audio processing process.
The above is a schematic solution of an audio processing apparatus of the present embodiment. It should be noted that, the technical solution of the audio processing apparatus and the technical solution of the audio processing method belong to the same concept, and details of the technical solution of the audio processing apparatus, which are not described in detail, can be referred to the description of the technical solution of the audio processing method.
Fig. 6 illustrates a block diagram of a computing device 600 provided according to an embodiment of the present specification. The components of computing device 600 include, but are not limited to, a memory 610 and a processor 620. The processor 620 is coupled to the memory 610 via a bus 630, and a database 650 is used to store data.
Computing device 600 also includes an access device 640 that enables computing device 600 to communicate via one or more networks 660. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or combinations of communication networks such as the Internet. The access device 640 may include one or more of any type of wired or wireless network interface (e.g., a Network Interface Card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, or a Near Field Communication (NFC) interface.
In one embodiment of the present description, the above-described components of computing device 600, as well as other components not shown in FIG. 6, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device shown in FIG. 6 is for exemplary purposes only and is not intended to limit the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 600 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smart phone), wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 600 may also be a mobile or stationary server.
The processor 620 is configured to execute the following computer-executable instructions:
acquiring an audio file containing at least two languages;
determining a feature matrix corresponding to the audio file, and inputting the feature matrix into a voice recognition model for processing to obtain a target text containing a language identifier;
determining target characters respectively corresponding to at least two languages contained in the target text according to the language identifier, and determining the audio duration of the audio file;
and calculating the speech speed of a sound source in the audio file based on the target characters and the audio duration which are respectively corresponding to the at least two languages.
The foregoing is a schematic illustration of a computing device of this embodiment. It should be noted that, the technical solution of the computing device and the technical solution of the audio processing method belong to the same concept, and details of the technical solution of the computing device, which are not described in detail, can be referred to the description of the technical solution of the audio processing method.
An embodiment of the present disclosure also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the following steps:
acquiring an audio file containing at least two languages;
determining a feature matrix corresponding to the audio file, and inputting the feature matrix into a voice recognition model for processing to obtain a target text containing a language identifier;
determining target characters respectively corresponding to at least two languages contained in the target text according to the language identifier, and determining the audio duration of the audio file;
and calculating the speech speed of a sound source in the audio file based on the target characters respectively corresponding to the at least two languages and the audio duration.
The above is an exemplary version of a computer-readable storage medium of the present embodiment. It should be noted that, the technical solution of the storage medium and the technical solution of the audio processing method belong to the same concept, and details of the technical solution of the storage medium which are not described in detail can be referred to the description of the technical solution of the audio processing method.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the relevant jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunications signals.
It should be noted that, for the sake of simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the present description is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present description. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily all necessary in the specification.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are merely used to help clarify the present specification. Alternative embodiments are not intended to be exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the disclosure and the practical application, to thereby enable others skilled in the art to best understand and utilize the disclosure. This specification is to be limited only by the claims and the full scope and equivalents thereof.

Claims (10)

1. An audio processing method, comprising:
acquiring an audio file containing at least two languages;
determining a feature matrix corresponding to the audio file, inputting the feature matrix into a voice recognition model for processing to obtain a target text containing a language identifier, wherein the step of inputting the feature matrix into the voice recognition model for processing to obtain the target text containing the language identifier comprises the steps of inputting the feature matrix into the voice recognition model, performing feature coding through an encoder in the voice recognition model, and outputting a feature sequence of the audio file; after the feature sequence is introduced into an attention mechanism, decoding is carried out through a decoder in the voice recognition model, and a target feature sequence of the audio file is output; processing the target feature sequence through an output layer in the voice recognition model, and outputting the target text containing the language identifier;
determining target characters respectively corresponding to at least two languages contained in the target text according to the language identifier, and determining the audio duration of the audio file;
calculating the speech speed of a sound source in the audio file based on the target characters respectively corresponding to the at least two languages and the audio time length, wherein the calculating the speech speed of the sound source in the audio file based on the target characters respectively corresponding to the at least two languages and the audio time length comprises determining the character numbers of the target characters respectively corresponding to the at least two languages, and summing the character numbers of the target characters respectively corresponding to the at least two languages to obtain the total character number; and calculating the ratio of the total character number to the audio duration to obtain the speech speed of the sound source in the audio file.
2. The audio processing method according to claim 1, wherein the determining the feature matrix corresponding to the audio file includes:
performing framing processing on the audio file to obtain a plurality of audio frames;
determining feature vectors corresponding to the plurality of audio frames respectively;
and generating the feature matrix corresponding to the audio file based on the feature vectors respectively corresponding to the plurality of audio frames.
3. The audio processing method according to claim 1, wherein the speech recognition model is trained by:
acquiring a sample audio file, and performing framing processing on the sample audio file to obtain a plurality of sample audio frames;
determining sample feature vectors corresponding to the plurality of sample audio frames respectively, and forming a sample feature matrix corresponding to the sample audio file based on the sample feature vectors;
determining a sample text corresponding to the sample audio file, and adding a language identifier into the sample text according to the language type contained in the sample text to obtain a sample target text;
and training an initial speech recognition model based on the sample feature matrix and the sample target text to obtain the speech recognition model.
4. The audio processing method according to claim 1, wherein said determining, according to the language identifier, target characters respectively corresponding to at least two languages included in the target text includes:
determining language sub-identifiers respectively corresponding to at least two languages contained in the target text in the language identifiers;
classifying the target text according to the language sub-identifiers respectively corresponding to the at least two languages to obtain target sub-texts respectively corresponding to the at least two languages;
and identifying characters contained in the target sub-texts respectively corresponding to the at least two languages, and determining the target characters respectively corresponding to the at least two languages according to the identification result.
5. The audio processing method of claim 1, wherein the determining the audio duration of the audio file comprises:
constructing a volume amplitude characteristic corresponding to the audio file, and determining a silent audio clip in the audio file according to the volume amplitude characteristic;
determining a silent audio duration of the silent audio clip and an audio total duration of the audio file;
and calculating the difference value between the total audio duration and the silent audio duration to obtain the audio duration.
6. The audio processing method according to claim 1, wherein after the step of calculating the speech speed of the sound source in the audio file based on the target characters respectively corresponding to the at least two languages and the audio duration is performed, the method further comprises:
determining language audio clips corresponding to the at least two languages in the audio file respectively;
and adjusting the language audio clips respectively corresponding to the at least two languages according to the speech speed, and generating a target audio file according to an adjustment result.
7. The audio processing method according to claim 2, wherein the determining the feature vectors respectively corresponding to the plurality of audio frames includes:
windowing is carried out on the plurality of audio frames, and a first frequency spectrum corresponding to the plurality of audio frames is constructed according to the windowing result;
and converting the first frequency spectrum into a second frequency spectrum through a preset filter bank, and performing cepstrum processing on the second frequency spectrum to obtain the feature vectors corresponding to the plurality of audio frames respectively.
8. An audio processing apparatus, comprising:
the acquisition module is configured to acquire an audio file containing at least two languages;
The processing module is configured to determine a feature matrix corresponding to the audio file, input the feature matrix into a voice recognition model for processing to obtain a target text containing a language identifier, wherein the processing of inputting the feature matrix into the voice recognition model to obtain the target text containing the language identifier comprises inputting the feature matrix into the voice recognition model, performing feature encoding through an encoder in the voice recognition model, and outputting a feature sequence of the audio file; after the feature sequence is introduced into an attention mechanism, decoding is carried out through a decoder in the voice recognition model, and a target feature sequence of the audio file is output; processing the target feature sequence through an output layer in the voice recognition model, and outputting the target text containing the language identifier;
the determining module is configured to determine target characters respectively corresponding to at least two languages contained in the target text according to the language identifier, and determine audio duration of the audio file;
the computing module is configured to compute the speech speed of the sound source in the audio file based on the target characters respectively corresponding to the at least two languages and the audio time length, wherein the computing the speech speed of the sound source in the audio file based on the target characters respectively corresponding to the at least two languages and the audio time length comprises determining the number of characters of the target characters respectively corresponding to the at least two languages, and summing the number of characters of the target characters respectively corresponding to the at least two languages to obtain the total number of characters; and calculating the ratio of the total character number to the audio duration to obtain the speech speed of the sound source in the audio file.
9. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions and the processor is configured to execute the computer-executable instructions to implement the method of:
acquiring an audio file containing at least two languages;
determining a feature matrix corresponding to the audio file, inputting the feature matrix into a voice recognition model for processing to obtain a target text containing a language identifier, wherein the step of inputting the feature matrix into the voice recognition model for processing to obtain the target text containing the language identifier comprises the steps of inputting the feature matrix into the voice recognition model, performing feature coding through an encoder in the voice recognition model, and outputting a feature sequence of the audio file; after the feature sequence is introduced into an attention mechanism, decoding is carried out through a decoder in the voice recognition model, and a target feature sequence of the audio file is output; processing the target feature sequence through an output layer in the voice recognition model, and outputting the target text containing the language identifier;
determining target characters respectively corresponding to at least two languages contained in the target text according to the language identifier, and determining the audio duration of the audio file;
Calculating the speech speed of a sound source in the audio file based on the target characters respectively corresponding to the at least two languages and the audio time length, wherein the calculating the speech speed of the sound source in the audio file based on the target characters respectively corresponding to the at least two languages and the audio time length comprises determining the character numbers of the target characters respectively corresponding to the at least two languages, and summing the character numbers of the target characters respectively corresponding to the at least two languages to obtain the total character number; and calculating the ratio of the total character number to the audio duration to obtain the speech speed of the sound source in the audio file.
10. A computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the audio processing method of any one of claims 1 to 7.
CN202011131544.3A 2020-10-21 2020-10-21 Audio processing method and device Active CN112185363B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011131544.3A CN112185363B (en) 2020-10-21 2020-10-21 Audio processing method and device

Publications (2)

Publication Number Publication Date
CN112185363A CN112185363A (en) 2021-01-05
CN112185363B true CN112185363B (en) 2024-02-13

Family

ID=73923582

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011131544.3A Active CN112185363B (en) 2020-10-21 2020-10-21 Audio processing method and device

Country Status (1)

Country Link
CN (1) CN112185363B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112687263B (en) * 2021-03-11 2021-06-29 南京硅基智能科技有限公司 Voice recognition neural network model, training method thereof and voice recognition method
CN113674744A (en) * 2021-08-20 2021-11-19 天津讯飞极智科技有限公司 Voice transcription method, device, pickup transcription equipment and storage medium
CN115985282A (en) * 2021-10-14 2023-04-18 北京字跳网络技术有限公司 Method and device for adjusting speech rate, electronic equipment and readable storage medium
CN113782010B (en) * 2021-11-10 2022-02-15 北京沃丰时代数据科技有限公司 Robot response method, device, electronic equipment and storage medium

Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004101727A (en) * 2002-09-06 2004-04-02 Nippon Telegr & Teleph Corp <Ntt> Method, device, and program for multilingual speech recognition, and method, device, and program for multilingual speaker adaptation
JP2004361766A (en) * 2003-06-06 2004-12-24 Kenwood Corp Speaking speed conversion apparatus, speaking speed conversion method, and program
CN1841496A (en) * 2005-03-31 2006-10-04 株式会社东芝 Method and apparatus for measuring speech speed and recording apparatus therefor
JP2008241890A (en) * 2007-03-26 2008-10-09 Denso Corp Speech interactive device and method
CN101472060A (en) * 2007-12-27 2009-07-01 新奥特(北京)视频技术有限公司 Method and device for estimating news program length
EP2133868A1 (en) * 2007-02-28 2009-12-16 NEC Corporation Weight coefficient learning system and audio recognition system
CN101645269A (en) * 2008-12-30 2010-02-10 中国科学院声学研究所 Language recognition system and method
JP2016080863A (en) * 2014-10-16 2016-05-16 日本放送協会 Speech recognition error correction device
KR20160076909A (en) * 2014-12-23 2016-07-01 주식회사 케이티 Method for upgrading speed of voice recognition, device and computer-readable medium
WO2016119604A1 (en) * 2015-01-26 2016-08-04 阿里巴巴集团控股有限公司 Voice information search method and apparatus, and server
CN105869626A (en) * 2016-05-31 2016-08-17 宇龙计算机通信科技(深圳)有限公司 Automatic speech rate adjusting method and terminal
CN107785011A (en) * 2017-09-15 2018-03-09 北京理工大学 Word speed estimates training, word speed method of estimation, device, equipment and the medium of model
WO2018113526A1 (en) * 2016-12-20 2018-06-28 四川长虹电器股份有限公司 Face recognition and voiceprint recognition-based interactive authentication system and method
CN109274831A (en) * 2018-11-01 2019-01-25 科大讯飞股份有限公司 A kind of audio communication method, device, equipment and readable storage medium storing program for executing
CN110060665A (en) * 2019-03-15 2019-07-26 上海拍拍贷金融信息服务有限公司 Word speed detection method and device, readable storage medium storing program for executing
CN110299131A (en) * 2019-08-01 2019-10-01 苏州奇梦者网络科技有限公司 A kind of phoneme synthesizing method, device, the storage medium of controllable rhythm emotion
CN110634487A (en) * 2019-10-24 2019-12-31 科大讯飞股份有限公司 Bilingual mixed speech recognition method, device, equipment and storage medium
WO2020001546A1 (en) * 2018-06-30 2020-01-02 华为技术有限公司 Method, device, and system for speech recognition
CN110675854A (en) * 2019-08-22 2020-01-10 厦门快商通科技股份有限公司 Chinese and English mixed speech recognition method and device
CN111128126A (en) * 2019-12-30 2020-05-08 上海浩琨信息科技有限公司 Multi-language intelligent voice conversation method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20200059703A (en) * 2018-11-21 2020-05-29 삼성전자주식회사 Voice recognizing method and voice recognizing appratus

Also Published As

Publication number Publication date
CN112185363A (en) 2021-01-05

Similar Documents

Publication Publication Date Title
US11837216B2 (en) Speech recognition using unspoken text and speech synthesis
CN110111775B (en) Streaming voice recognition method, device, equipment and storage medium
CN108573693B (en) Text-to-speech system and method, and storage medium therefor
CN112185363B (en) Audio processing method and device
CN111489734B (en) Model training method and device based on multiple speakers
CN111583900B (en) Song synthesis method and device, readable medium and electronic equipment
US11908451B2 (en) Text-based virtual object animation generation method, apparatus, storage medium, and terminal
US11514888B2 (en) Two-level speech prosody transfer
CN108899009B (en) Chinese speech synthesis system based on phoneme
CN112786004A (en) Speech synthesis method, electronic device, and storage device
US11908448B2 (en) Parallel tacotron non-autoregressive and controllable TTS
US11322133B2 (en) Expressive text-to-speech utilizing contextual word-level style tokens
CN114023300A (en) Chinese speech synthesis method based on diffusion probability model
CN112580669B (en) Training method and device for voice information
CN114125506B (en) Voice auditing method and device
CN116597858A (en) Voice mouth shape matching method and device, storage medium and electronic equipment
CN112242134A (en) Speech synthesis method and device
CN112863486B (en) Voice-based spoken language evaluation method and device and electronic equipment
CN114708876A (en) Audio processing method and device, electronic equipment and storage medium
CN114974218A (en) Voice conversion model training method and device and voice conversion method and device
CN114005428A (en) Speech synthesis method, apparatus, electronic device, storage medium, and program product
CN115424616A (en) Audio data screening method, device, equipment and computer readable medium
US20230017892A1 (en) Injecting Text in Self-Supervised Speech Pre-training
US20230013587A1 (en) Advancing the Use of Text and Speech in ASR Pretraining With Consistency and Contrastive Losses
Sohal Bi-LSTM Model for Speech Recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant