CN113823287A - Audio processing method, device and computer readable storage medium - Google Patents
Audio processing method, device and computer readable storage medium Download PDFInfo
- Publication number
- CN113823287A CN113823287A CN202110872240.0A CN202110872240A CN113823287A CN 113823287 A CN113823287 A CN 113823287A CN 202110872240 A CN202110872240 A CN 202110872240A CN 113823287 A CN113823287 A CN 113823287A
- Authority
- CN
- China
- Prior art keywords
- audio
- processed
- voice
- frame
- determining
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000003672 processing method Methods 0.000 title claims abstract description 20
- 238000012545 processing Methods 0.000 claims abstract description 111
- 238000006243 chemical reaction Methods 0.000 claims abstract description 42
- 238000000034 method Methods 0.000 claims abstract description 40
- 230000011218 segmentation Effects 0.000 claims abstract description 30
- 239000013598 vector Substances 0.000 claims description 23
- 230000015572 biosynthetic process Effects 0.000 claims description 20
- 238000003786 synthesis reaction Methods 0.000 claims description 20
- 238000013145 classification model Methods 0.000 claims description 13
- 238000004590 computer program Methods 0.000 claims description 10
- 238000013507 mapping Methods 0.000 claims description 6
- 238000000605 extraction Methods 0.000 claims description 4
- 238000005516 engineering process Methods 0.000 description 38
- 230000008569 process Effects 0.000 description 15
- 238000010586 diagram Methods 0.000 description 13
- 230000002159 abnormal effect Effects 0.000 description 11
- 238000013473 artificial intelligence Methods 0.000 description 10
- 238000002372 labelling Methods 0.000 description 9
- 238000004891 communication Methods 0.000 description 8
- 238000010801 machine learning Methods 0.000 description 7
- 238000013135 deep learning Methods 0.000 description 4
- 230000006870 function Effects 0.000 description 4
- 238000004458 analytical method Methods 0.000 description 3
- 238000013528 artificial neural network Methods 0.000 description 3
- 230000003993 interaction Effects 0.000 description 3
- 230000009286 beneficial effect Effects 0.000 description 2
- 238000011161 development Methods 0.000 description 2
- 230000018109 developmental process Effects 0.000 description 2
- 230000002349 favourable effect Effects 0.000 description 2
- 238000007477 logistic regression Methods 0.000 description 2
- 238000007726 management method Methods 0.000 description 2
- 239000000203 mixture Substances 0.000 description 2
- 238000011176 pooling Methods 0.000 description 2
- 238000007781 pre-processing Methods 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 238000012706 support-vector machine Methods 0.000 description 2
- 238000012549 training Methods 0.000 description 2
- 238000013459 approach Methods 0.000 description 1
- 230000004888 barrier function Effects 0.000 description 1
- 230000006399 behavior Effects 0.000 description 1
- 230000004069 differentiation Effects 0.000 description 1
- 239000000284 extract Substances 0.000 description 1
- 238000001914 filtration Methods 0.000 description 1
- 238000009432 framing Methods 0.000 description 1
- 230000001939 inductive effect Effects 0.000 description 1
- 230000010354 integration Effects 0.000 description 1
- 230000002452 interceptive effect Effects 0.000 description 1
- 238000010295 mobile communication Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000003058 natural language processing Methods 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 230000002787 reinforcement Effects 0.000 description 1
- 230000003997 social interaction Effects 0.000 description 1
- 230000005236 sound signal Effects 0.000 description 1
- 238000013526 transfer learning Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Machine Translation (AREA)
Abstract
The application provides an audio processing method, an audio processing device and a computer-readable storage medium, which relate to the technical field of computers. The method comprises the following steps: acquiring audio to be processed, wherein the audio to be processed comprises one or more audio frames; for any audio frame of the one or more audio frames, performing segmentation processing on the audio frame to obtain a plurality of audio segments, determining the audio category of each of the plurality of audio segments, and determining the voice recognition result of the audio frame according to the audio category of each audio segment; removing, according to the voice recognition result of each audio frame, the audio frames whose voice recognition result is the target recognition result from the audio to be processed to obtain processed audio; and performing style conversion processing on the processed audio to obtain target audio. The embodiments of the application can improve the accuracy of the voice involved in audio style conversion.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular, to an audio processing method and apparatus, and a computer-readable storage medium.
Background
Audio is an important medium in multimedia. The voice in audio is sound that human beings produce with the vocal organs, that carries meaning and that is used for communication. Style conversion of audio refers to converting the language type of the voice in the audio; for example, if the language type of the voice in the audio is the Tibetan dialect, it can be converted into the Kangba dialect.
Abnormal voices, such as humming, hesitation, laughter and shouting, often appear in audio and degrade the accuracy of style conversion: the text information corresponding to the voice in the audio changes between before and after the conversion. For example, the text corresponding to the voice in the original audio asks "did you go", while the text corresponding to the voice in the style-converted audio asks "did you go to eat". Although style conversion can overcome the language barrier, the content is no longer expressed correctly, so it is necessary to improve the accuracy of the voice involved in audio style conversion.
Disclosure of Invention
The embodiment of the application provides an audio processing method, an audio processing device and a computer readable storage medium, which can improve the accuracy of voice related to audio style conversion.
In one aspect, an embodiment of the present application provides an audio processing method, where the method includes:
acquiring audio to be processed, wherein the audio to be processed comprises one or more audio frames;
for any audio frame in the one or more audio frames, performing segmentation processing on the any audio frame to obtain a plurality of audio segments, determining the audio category of each audio segment in the plurality of audio segments, and determining the voice recognition result of the any audio frame according to the audio category of each audio segment;
according to the voice recognition result of each audio frame, eliminating the audio frames with the voice recognition results as target recognition results in the audio to be processed to obtain processed audio;
and performing style conversion processing on the processed audio to obtain a target audio.
In another aspect, an embodiment of the present application provides an audio processing apparatus, where the apparatus includes:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring audio to be processed, and the audio to be processed comprises one or more audio frames;
the processing module is used for segmenting any one of the one or more audio frames to obtain a plurality of audio segments, determining the audio category of each of the plurality of audio segments, and determining the voice recognition result of any one of the audio frames according to the audio category of each of the plurality of audio segments;
the processing module is further used for removing the audio frames with the voice recognition results as the target recognition results in the audio to be processed according to the voice recognition results of the audio frames to obtain processed audio;
and the processing module is also used for carrying out style conversion processing on the processed audio to obtain a target audio.
Accordingly, an embodiment of the present application provides a computer device, which includes a processor, a communication interface, and a memory, where the processor, the communication interface, and the memory are connected to each other, where the memory stores a computer program, and the processor is configured to invoke the computer program to execute the audio processing method according to any one of the foregoing possible implementation manners.
Accordingly, an embodiment of the present application provides a computer-readable storage medium which stores a computer program; when the computer program is executed by a processor, the audio processing method according to any one of the foregoing possible implementations is carried out.
Accordingly, embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the audio processing method according to any one of the possible implementation manners.
In the embodiment of the application, firstly, any audio frame in one or more audio frames included in audio to be processed is segmented to obtain a plurality of audio segments, the audio category of each audio segment in the plurality of audio segments is determined, the voice recognition result of any audio frame is determined according to the audio category of each audio segment, then the audio frames with the voice recognition result as the target recognition result in the audio to be processed are removed according to the voice recognition result of each audio frame to obtain processed audio, and finally the processed audio is subjected to style conversion processing to obtain the target audio; the audio processing method can remove the non-voice audio in the audio to be processed, thereby reducing external interference and improving the audio quality of the audio, and being beneficial to improving the accuracy of the voice related to the audio style conversion.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic diagram of an audio processing system according to an embodiment of the present application;
fig. 2 is a schematic flowchart of an audio processing method according to an embodiment of the present application;
fig. 3 is a schematic flowchart of another audio processing method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of determining audio category according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a model structure of an x-vector model provided in an embodiment of the present application;
fig. 6 is a schematic flowchart of another audio processing method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a speech recognition technique provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of a speech synthesis technique provided by an embodiment of the present application;
FIG. 9 is a process diagram of a speech recognition technique according to an embodiment of the present application;
fig. 10 is a schematic flowchart of another audio processing method according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In order to improve the accuracy of audio style conversion, the embodiments of the application provide an audio processing method based on cloud technology and artificial intelligence technology.
Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology and the like that are based on the cloud computing business model; it can form a resource pool that is used on demand and is flexible and convenient. Cloud computing technology will become an important support for it. Background services of technical network systems, such as video websites, picture websites and other web portals, require large amounts of computing and storage resources. With the development of the internet industry, every item may carry its own identification mark that has to be transmitted to a background system for logical processing; data at different levels is processed separately, and all kinds of industry data need strong backend system support, which can only be provided through cloud computing.
Cloud computing is a computing model that distributes computing tasks over a resource pool formed by a large number of computers, enabling application systems to obtain computing power, storage space and information services as needed. The network that provides the resources is referred to as the "cloud". To the user, the resources in the "cloud" appear infinitely expandable: they can be obtained at any time, used on demand, expanded at any time and paid for according to use.
Artificial Intelligence (AI) technology is a comprehensive discipline that covers a wide range of fields, involving both hardware technologies and software technologies. Basic AI technologies generally include sensors, dedicated AI chips, cloud computing, cloud storage, big data processing, operation/interaction systems, mechatronics and the like. AI software technologies mainly include computer vision, speech processing, natural language processing and machine learning/deep learning.
The key technologies of speech technology are automatic speech recognition, speech synthesis and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most promising modes of human-computer interaction.
Machine Learning (ML) is a multi-domain interdisciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer can simulate or realize human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from instruction.
With the research and progress of cloud technology and artificial intelligence technology, they have been studied and applied in many fields. In realizing audio style conversion, the embodiments of the present application involve the cloud computing technology of cloud technology and the machine learning and other technologies of artificial intelligence, which are described in detail in the following embodiments.
Audio annotation is the work of converting the voice appearing in audio into text, and is one of the tasks of annotation staff. Even under the same writing system, regional differences give rise to different languages; for example, the Tibetan system includes three main dialects, the Tibetan dialect, the Kangba dialect and the Anduo dialect. Because the languages an annotator has mastered are generally limited (an annotator usually cannot master the Tibetan dialect, the Kangba dialect and the Anduo dialect at the same time), the language type of the voice in the audio must be determined before the voice is converted into text so that a corresponding annotator can carry out the audio annotation; for example, if the language type of the voice in the audio is the Tibetan dialect, an annotator who has mastered the Tibetan dialect performs the audio annotation. As a result, annotators cannot annotate across languages, the choice of annotators for a given audio is small, the output of audio annotation is unbalanced, and how annotators are assigned also has to be considered. If style conversion can be performed on the audio, then whatever the original language type of the audio, the language type of its voice can be changed into one the annotator is good at, and the annotator can carry out the audio annotation in that language. This solves the difficulty of cross-language annotation, the unbalanced output caused by language barriers and the unreasonable assignment of annotators, allows the progress and delivery of audio annotation to be controlled effectively, and reduces schedule management and input costs.
Referring to fig. 1, fig. 1 is a schematic diagram of an audio processing system according to an embodiment of the present disclosure. The audio processing system may specifically include a terminal device 101 and a server 102, where the terminal device 101 and the server 102 are connected through a network, for example, a wireless network connection.
The server 102 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
In an embodiment, the terminal device 101 may send the audio to be processed to the server 102. The server 102 acquires the audio to be processed and, for any audio frame of the one or more audio frames it comprises, performs segmentation processing on the audio frame to obtain a plurality of audio segments, obtains the audio category of each of the plurality of audio segments, and determines the speech recognition result of the audio frame according to the audio category of each audio segment. According to the speech recognition result of each audio frame, the server removes the audio frames whose speech recognition result is the target recognition result from the audio to be processed to obtain processed audio, and performs style conversion processing on the processed audio to obtain target audio. In this embodiment, the speech recognition result of any audio frame can be determined from the audio categories of the plurality of audio segments it contains, and the non-speech audio in the audio to be processed can be screened out according to the speech recognition result of each audio frame. This improves the audio quality of the audio, allows the style conversion to be realized accurately, and lets an annotator performing cross-language audio annotation hear audio whose text content does not differ from that of the original audio, which alleviates the difficulty of cross-language annotation.
It should be understood that the architecture diagram of the system described in the embodiment of the present application is for more clearly illustrating the technical solution of the embodiment of the present application, and does not constitute a limitation to the technical solution provided in the embodiment of the present application, and as a person having ordinary skill in the art knows that along with the evolution of the system architecture and the appearance of a new service scenario, the technical solution provided in the embodiment of the present application is also applicable to similar technical problems.
Fig. 2 shows an audio processing method provided by the audio processing system of fig. 1 according to an embodiment of the present application. Take the example of application to the server 102 mentioned in fig. 1. The method of the embodiments of the present application is described below with reference to fig. 2.
S201, obtaining audio to be processed, wherein the audio to be processed comprises one or more audio frames.
The audio to be processed is audio that requires style conversion. Style conversion refers to converting the language type of the speech in the audio; for example, speech in the Tibetan dialect can be converted into the Kangba dialect. An audio frame is obtained by segmenting the audio to be processed.
In an embodiment, the server may segment the audio to be processed in a timed manner to obtain the one or more audio frames, for example cutting the audio once for every 5 seconds of playback. Alternatively, an equal-division manner may be used: for example, if the audio to be processed is to be divided into 4 audio frames and its playing time is 20 seconds, each of the 4 audio frames lasts 5 seconds. The server may also obtain audio waveform data of the audio to be processed (e.g., a time-domain graph, a frequency-domain graph or a spectrogram) and segment the audio according to the waveform data, for example cutting the audio at positions where the frequency in the waveform data is below a frequency threshold (e.g., 150 Hz), to obtain the one or more audio frames.
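The following Python sketch illustrates the timed-division and equal-division strategies described above; the 5-second window and the use of raw sample arrays are illustrative assumptions rather than requirements of the application.

import numpy as np

def split_into_frames(samples: np.ndarray, sample_rate: int,
                      frame_seconds: float = 5.0) -> list:
    """Timed division: cut the audio into frames of frame_seconds each."""
    frame_len = int(frame_seconds * sample_rate)
    return [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]

def split_into_n_frames(samples: np.ndarray, n_frames: int) -> list:
    """Equal division: cut the audio into n_frames frames of equal duration."""
    return list(np.array_split(samples, n_frames))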
S202, aiming at any audio frame in the one or more audio frames, carrying out segmentation processing on the any audio frame to obtain a plurality of audio segments, determining the audio category of each audio segment in the plurality of audio segments, and determining the voice recognition result of the any audio frame according to the audio category of each audio segment.
An audio segment is obtained by segmenting an audio frame. The audio category refers to the nature of the audio: for example, audio may be divided into speech, laughter, song, pure music and noise (e.g., background murmur), in which case the audio categories are speech, laughter, song, pure music and noise. The audio categories may also be normal speech and abnormal speech. Normal speech is sound that an annotator normally needs to annotate; for example, when the nature of an audio segment is speech (e.g., dialogue or reading aloud), the speech usually needs to be annotated, so the segment is normal speech. Abnormal speech is sound that an annotator normally does not need to annotate; for example, when the nature of an audio segment is laughter, song, pure music or noise, audio annotation is usually not required, so the segment is abnormal speech.
The speech recognition result has two possible values: speech audio and non-speech audio. Speech audio is the sound of normal speaking contained in the audio (e.g., dialogue or reading aloud) and is audio that an annotator usually needs to annotate. Non-speech audio is the abnormal sound contained in the audio (e.g., laughter, song, pure music or noise) and is audio that an annotator usually does not need to annotate. The speech recognition result therefore also reflects the nature of the audio: when the recognition result of an audio frame is speech audio, the annotator needs to annotate that frame, and when the recognition result is non-speech audio, the annotator does not.
In an embodiment, to obtain the speech recognition result of any audio frame of the one or more audio frames included in the audio to be processed, the server first segments the audio frame to obtain the plurality of audio segments it contains. The server may obtain audio waveform data of the audio frame (e.g., a time-domain graph, a frequency-domain graph or a spectrogram) and use that waveform data to segment the frame into one or more audio segments.
Optionally, the server may segment the audio frame using the audio silence intervals (time spans whose amplitude is zero or near zero) in its audio waveform data. For example, if the audio waveform data is a time-domain graph, i.e., a two-dimensional graph of amplitude (sound intensity) over time, with playing time on the abscissa and amplitude on the ordinate, then the playing-time intervals whose amplitude is zero or near zero can be determined as audio silence intervals. The silence intervals are then used as cut points to segment the audio frame into a plurality of audio segments, each silence interval itself also being an audio segment.
Optionally, the server may also segment the audio frame at low-frequency positions (where the frequency is below a frequency threshold) in its audio waveform data to obtain a plurality of audio segments.
In an embodiment, when determining the speech recognition result of an audio frame from the audio categories of the audio segments it contains, the server may compute the proportion of audio segments whose audio category is the target category (e.g., normal speech, or speech) and determine the speech recognition result from that proportion. For example, if the proportion of segments whose category is normal speech is greater than or equal to 50%, the speech recognition result of the frame is speech audio; if it is less than 50%, the result is non-speech audio. Suppose a frame contains 5 audio segments: if the audio category of 3 of them is normal speech, the proportion is 60% and the recognition result of the frame is speech audio; if the audio category of only 2 of them is normal speech, the proportion is 40% and the recognition result is non-speech audio.
S203, according to the voice recognition result of each audio frame, eliminating the audio frame with the voice recognition result as the target recognition result in the audio to be processed to obtain the processed audio.
The target recognition result means that the speech recognition result is non-speech audio. When the speech recognition result of an audio frame is the target recognition result, i.e., non-speech audio, the frame contains abnormal sound (such as abnormal voices, laughter, song, pure music or noise) and the annotator does not need to annotate it.
In an embodiment, the server obtains the voice recognition result of each audio frame, and removes the audio frame from the audio to be processed when the voice recognition result of the audio frame is determined to be the target recognition result.
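A minimal sketch of this removal step, assuming the frames and their recognition results are held in parallel Python lists (the data layout is an assumption for illustration):

def remove_non_speech_frames(frames, results):
    """Keep only the audio frames whose speech recognition result is 'speech audio';
    frames whose result is the target recognition result are discarded."""
    return [frame for frame, result in zip(frames, results)
            if result != "non-speech audio"]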
And S204, performing style conversion processing on the processed audio to obtain a target audio.
Style conversion processing refers to converting the language type of the voice in the processed audio; for example, if the language type of the voice is the Tibetan dialect, it can be converted into the Kangba dialect.
In an embodiment, the server may perform style conversion processing on the processed audio to obtain the target audio: for example, it obtains the text information of the processed audio using speech recognition technology and then processes that text information using speech synthesis technology to obtain the target audio. Because the non-speech audio has been removed, the text information obtained from the audio is more accurate, and the synthesized audio is therefore more accurate.
Speech recognition technology is the process of recognizing speech to obtain text information, and Text-To-Speech (TTS) technology is the process of converting arbitrary text information (such as help files or web pages) into standard, fluent speech in real time and reading it out.
In the embodiment of the application, the server obtains the audio to be processed, segments any audio frame of the one or more audio frames it comprises into a plurality of audio segments, determines the audio category of each of the plurality of audio segments, and determines the speech recognition result of the audio frame according to the audio category of each segment. According to the speech recognition result of each audio frame, the server removes the audio frames whose result is the target recognition result from the audio to be processed to obtain processed audio, and performs style conversion processing on the processed audio to obtain the target audio. This embodiment uses the speech recognition result of each audio frame to remove the non-speech audio from the audio to be processed, which reduces external interference, improves the audio quality of the audio and helps improve the accuracy of the voice involved in audio style conversion. At the same time, because the style-converted audio is more accurate, annotators can realize cross-language audio annotation more accurately.
Fig. 3 shows another audio processing method provided by the audio processing system of fig. 1 according to the embodiment of the present application. Take the example of application to the server 102 mentioned in fig. 1. The method of the embodiments of the present application is described below with reference to fig. 3.
S301, obtaining audio to be processed, wherein the audio to be processed comprises one or more audio frames.
For specific implementation of S301, reference may be made to related description of S201 in the foregoing embodiment, which is not described herein again.
S302, aiming at any audio frame in the one or more audio frames, carrying out segmentation processing on the any audio frame to obtain a plurality of audio segments.
In an embodiment, a server performs segmentation processing on any audio frame to obtain a plurality of audio segments, including the following steps:
(1) audio waveform data for any one audio frame is acquired.
(2) According to the audio waveform data, an audio silence interval in any audio frame is determined as a slicing point.
(3) And carrying out segmentation processing on any audio frame according to the determined segmentation points to obtain a plurality of audio segments.
Optionally, the server may obtain audio waveform data of the audio frame (e.g., a time-domain graph, a frequency-domain graph or a spectrogram), determine the audio silence intervals in the frame (time spans whose amplitude is zero or near zero) according to the waveform data, and use the silence intervals as cut points to segment the frame into a plurality of audio segments, each silence interval itself also being an audio segment.
Optionally, the server may also segment the audio frame at low-frequency positions (where the frequency is below a frequency threshold) in its audio waveform data to obtain a plurality of audio segments.
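A minimal Python sketch of the silence-interval cutting described in steps (1) to (3), working directly on a sample array; the amplitude threshold and minimum silence length are assumptions chosen only for illustration.

import numpy as np

def split_on_silence(frame: np.ndarray, sample_rate: int,
                     amp_threshold: float = 0.01,
                     min_silence_seconds: float = 0.2) -> list:
    """Cut one audio frame into audio segments at its audio silence intervals.

    A run of samples whose absolute amplitude stays below amp_threshold for at
    least min_silence_seconds is treated as a silence interval and used as a
    cut point; the silence interval itself is also kept as a segment."""
    is_silent = np.abs(frame) < amp_threshold
    min_run = int(min_silence_seconds * sample_rate)

    segments, start, i = [], 0, 0
    while i < len(frame):
        if is_silent[i]:
            j = i
            while j < len(frame) and is_silent[j]:
                j += 1
            if j - i >= min_run:                     # long enough to count as silence
                if i > start:
                    segments.append(frame[start:i])  # preceding non-silent segment
                segments.append(frame[i:j])          # the silence interval itself
                start = j
            i = j
        else:
            i += 1
    if start < len(frame):
        segments.append(frame[start:])
    return segments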
S303, determining the audio category of each audio clip in the plurality of audio clips.
As shown in fig. 4, when determining the audio category of each of the plurality of audio segments included in an audio frame, the server first performs feature extraction on each audio segment to obtain the speech feature of that segment.
The speech feature is an acoustic feature, i.e., a physical quantity representing the acoustic characteristics of the audio; for example, it may be Mel-Frequency Cepstral Coefficients (MFCC), Fbank (filter bank) features or Linear Prediction Coefficients (LPC). Each audio segment is digital audio, i.e., a digital audio signal composed of binary 1s and 0s, which is convenient for a machine to process.
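As an illustration of this feature extraction step, the sketch below computes MFCC features with the librosa library; the choice of library and the number of coefficients are assumptions, since the embodiment only requires that some acoustic feature (MFCC, Fbank, LPC, etc.) be extracted.

import librosa

def extract_mfcc(segment, sample_rate, n_mfcc=24):
    """Return an (n_frames, n_mfcc) MFCC matrix for one audio segment."""
    mfcc = librosa.feature.mfcc(y=segment, sr=sample_rate, n_mfcc=n_mfcc)
    return mfcc.T  # one acoustic feature vector per short-time analysis frame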
And then the server processes the voice characteristics of any audio segment by using a characteristic processing module of the audio classification model to obtain the voice characteristic vector of any audio segment.
The feature processing module is configured to extract the speech feature vector and may be a machine learning model, such as a Gaussian Mixture Model (GMM), a Time-Delay Neural Network (TDNN), a Gaussian Mixture Model-Universal Background Model (GMM-UBM) or an x-vector model.
Optionally, the application uses an x-vector model from the field of speaker recognition (i.e., recognizing who is speaking) to process the speech feature of an audio segment and obtain its speech feature vector. As shown in fig. 5, the x-vector model comprises a frame processing layer, a statistics pooling layer and a segment processing layer. The frame processing layer consists of 5 TDNN layers; the statistics pooling layer computes the mean and standard deviation of the frame processing layer's output and concatenates them; the segment processing layer then extracts segment-level vectors through 2 forward DNN (Deep Neural Network) layers to represent the speaker. In this application, the output of either of the 2 DNN layers can be used as the speech feature vector of the audio segment; for example, speech feature vector 1 and speech feature vector 2 in fig. 5 can each serve as the speech feature vector of the segment.
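A simplified PyTorch sketch of the structure just described (5 TDNN layers, statistics pooling of mean and standard deviation, 2 segment-level DNN layers whose outputs can serve as the speech feature vector); the channel widths and kernel settings are assumptions and not taken from the application.

import torch
import torch.nn as nn

class XVector(nn.Module):
    """Simplified x-vector network: 5 TDNN (1-D convolution) layers, statistics
    pooling (mean and standard deviation) and 2 segment-level DNN layers."""

    def __init__(self, feat_dim=24, embed_dim=512):
        super().__init__()
        self.tdnn = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        self.segment1 = nn.Linear(2 * 1500, embed_dim)   # speech feature vector 1
        self.segment2 = nn.Linear(embed_dim, embed_dim)  # speech feature vector 2

    def forward(self, feats):                  # feats: (batch, time, feat_dim)
        x = self.tdnn(feats.transpose(1, 2))   # frame processing layer
        stats = torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)  # statistics pooling
        emb1 = torch.relu(self.segment1(stats))
        emb2 = self.segment2(emb1)
        return emb1, emb2  # either output can be used as the segment's feature vector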
And finally, the server processes the voice feature vector of any audio segment by using a classification processing module of the audio classification model to obtain the audio category of any audio segment.
The classification processing module is configured to perform the classification task and may be a classification model, for example Probabilistic Linear Discriminant Analysis (PLDA), Logistic Regression (LR) or a Support Vector Machine (SVM).
In an embodiment, the server may obtain the speech feature vectors of multiple audio samples using the x-vector model, train the parameters of an initial classification model with those speech feature vectors, and, after training is completed, use the trained classification model as the classification processing module of the audio classification model.
In an embodiment, the server processes the speech feature vector of an audio segment with the classification processing module of the audio classification model to obtain the probability that the segment belongs to each audio category. For example, if the audio categories are speech, laughter, song, pure music and noise, and the probabilities that the segment belongs to them are 0.5, 0.1, 0.2, 0 and 0.2 respectively, the server can take the audio category with the highest probability as the audio category of the segment, i.e., speech.
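A minimal sketch of such a classification processing module, here assuming scikit-learn logistic regression over x-vector embeddings and the example category set above; the choice of library is an assumption.

import numpy as np
from sklearn.linear_model import LogisticRegression

CLASSES = ["speech", "laughter", "song", "pure music", "noise"]

def train_classifier(X_train: np.ndarray, y_train: np.ndarray) -> LogisticRegression:
    """Fit the classification module on labelled speech feature vectors.
    X_train: (n_samples, embed_dim) embeddings; y_train: indices into CLASSES."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    return clf

def classify_segment(clf: LogisticRegression, embedding: np.ndarray) -> str:
    """Return the audio category with the highest predicted probability."""
    probs = clf.predict_proba(embedding.reshape(1, -1))[0]  # e.g. [0.5, 0.1, 0.2, 0.0, 0.2]
    return CLASSES[int(np.argmax(probs))]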
S304, according to the audio category of each audio clip, determining the proportion of the audio clip of which the audio category is the target category in any audio frame, and determining the voice recognition result of any audio frame according to the proportion.
The target category is the audio category that an annotator needs to annotate. For example, if the audio categories are normal speech and abnormal speech, the target category may be normal speech; if the audio categories are speech, laughter, song, pure music and noise, the target category may be speech.
In an embodiment, the server obtains, from the audio category of each of the plurality of audio segments included in an audio frame, the proportion of segments whose audio category is the target category, and determines the speech recognition result of the frame from this proportion. For example, suppose the audio categories are speech, laughter, song, pure music and noise, the target category is speech, and the frame contains 5 audio segments of which 3 have the category speech; then the proportion of segments whose category is the target category is 60%. Likewise, if the audio categories are normal speech and abnormal speech, the target category is normal speech, and 3 of the 5 segments are normal speech, the proportion is again 60%. The server may set a proportion threshold (which can be set manually) expressed as a percentage. When the obtained proportion is below the threshold, for example a threshold of 70% against a proportion of 60%, the speech recognition result of the frame is determined to be the target recognition result, i.e., the frame is non-speech audio that the annotator does not need to annotate.
In an embodiment, when determining the speech recognition result of an audio frame from the proportion, the server may also obtain a predicted value that the frame is speech, combine the proportion and the predicted value into a reference probability that the frame is speech audio, and, when the reference probability is below a probability threshold, determine the speech recognition result of the frame to be the target recognition result, i.e., the frame is non-speech audio that the annotator does not need to annotate.
The predicted value that an audio frame is speech is the predicted probability that the audio category of the frame is speech. When obtaining this predicted value, the server divides audio frames into two classes: speech and non-speech. It should be noted that speech in this sense differs in nature from the speech category of the previous embodiments: when the audio category of an audio frame is speech in this sense, it covers the speech, laughter, song, pure music and noise of the aforementioned audio categories, and sounds other than those are regarded as non-speech; likewise, it covers both the normal speech and the abnormal speech of the aforementioned audio categories, and sounds other than those are regarded as non-speech.
In an embodiment, the predicted value that the frame is speech can be treated as a weight on the proportion to obtain the reference probability that the frame is speech audio. For example, if the predicted value is 0.8 and the proportion is 0.6, the two can be multiplied to give a reference probability of 0.48. When the reference probability is below the probability threshold (which can be set manually), for example a threshold of 0.5, the server determines the speech recognition result of the frame to be the target recognition result, i.e., the frame is non-speech audio that the annotator does not need to annotate.
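The two decision rules above can be sketched as follows; the threshold values simply restate the illustrative 70% and 0.5 figures from the description and are not fixed by the application.

def result_by_proportion(segment_categories, target_category="speech",
                         proportion_threshold=0.7):
    """Decide the frame's result from the proportion of target-category segments."""
    proportion = sum(c == target_category for c in segment_categories) / len(segment_categories)
    return "non-speech audio" if proportion < proportion_threshold else "speech audio"

def result_by_reference_probability(segment_categories, speech_pred,
                                    target_category="speech", probability_threshold=0.5):
    """Weight the proportion by the frame-level speech prediction, e.g. 0.6 * 0.8 = 0.48."""
    proportion = sum(c == target_category for c in segment_categories) / len(segment_categories)
    reference_probability = proportion * speech_pred
    return "non-speech audio" if reference_probability < probability_threshold else "speech audio"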
S305, according to the voice recognition result of each audio frame, eliminating the audio frame with the voice recognition result as the target recognition result in the audio to be processed to obtain the processed audio.
For specific implementation of S305, reference may be made to the related description of S203 in the foregoing embodiment, which is not described herein again.
S306, carrying out style conversion processing on the processed audio to obtain a target audio.
The target language is the language type of the voice in the style-converted audio; for example, if the language type of the target audio is the Kangba dialect, the target language is the Kangba dialect.
In an embodiment, as shown in fig. 6, the server may call the speech recognition interface to obtain the speech recognition model and perform speech recognition processing on the processed audio to obtain its text information, and then call the speech synthesis interface to obtain the speech synthesis model corresponding to the target language (for example, the Kangba speech synthesis model corresponding to the Kangba dialect) and perform speech synthesis processing on the text information of the processed audio to obtain the target audio.
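A sketch of this recognition-then-synthesis flow is given below; the interface objects, their method names and the language key are hypothetical stand-ins for the speech recognition interface and the per-language speech synthesis interfaces, not a real API.

def style_convert(processed_audio, target_language: str,
                  asr_interface, tts_interfaces: dict):
    """Convert the processed audio into the target language (hypothetical interfaces)."""
    text = asr_interface.recognize(processed_audio)   # speech recognition: audio -> text
    synthesizer = tts_interfaces[target_language]     # e.g. the "Kangba" synthesis model
    return synthesizer.synthesize(text)               # speech synthesis: text -> target audio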
The speech recognition model may be based on speech recognition technology. Fig. 7 is a schematic diagram of the speech recognition technique: the processed audio is first preprocessed, for example by filtering, framing and extracting speech features, and a decoder then decodes it using the acoustic model, the language model (built from speech data and text data) and the pronunciation dictionary to obtain the most likely word sequence, i.e., the text information of the processed audio. The speech recognition model may also be a model trained end to end with deep learning.
The speech synthesis model may be based on speech synthesis technology. Fig. 8 is a schematic diagram of the speech synthesis technique: its core idea is to store natural speech waveforms in a large-scale sound library, perform text analysis on the text information of the processed audio, and then select suitable waveforms from the library and splice them together. The speech synthesis model may also be a model trained end to end with deep learning.
In an embodiment, when the user clicks the "online recognition" function key on the display interface of the terminal device, the intelligent terminal obtains the processed audio and calls the speech recognition interface, so that the speech recognition model produces the text information of the processed audio. As shown in fig. 9, after the intelligent terminal calls the Tibetan speech recognition interface to recognize Tibetan audio, the text information is added to the text box area below the waveform. When the user then clicks the "text-to-speech" function key, a "please select a target language" prompt is displayed on the interface, for example with the three sub-items "Tibetan-Tibet", "Tibetan-Anduo" and "Tibetan-Kangba". The user selects one sub-item as the target language, and the intelligent terminal accesses the corresponding speech synthesis interface, which calls the speech synthesis model for that language to synthesize the text information of the processed audio into the target audio; if the target language is the Kangba dialect, the speech synthesis model converts the text information into Kangba-dialect audio. The server can play the target audio on the display interface of the intelligent terminal and also store it, so that the user can listen to it repeatedly.
In an embodiment, the server may obtain a basic pronunciation vocabulary containing mapping relationships between written words and their pronunciations in one or more languages. For example, Table 1 below is a partial example of a basic Tibetan pronunciation vocabulary that maps common Tibetan words to their Tibet, Anduo and Kangba pronunciations. The server performs speech recognition processing on the processed audio to obtain the language pronunciation of each word segment in it. The pronunciation obtained for a word segment may be the pronunciation of a single character, such as the Mandarin pronunciation "xi" of the character for "west"; of a word, such as the Mandarin pronunciation "xi an" of "Xi'an"; or of a sentence, such as the Mandarin pronunciation "xi an shi gu du" of "Xi'an is an ancient capital". After obtaining the language pronunciation of each word segment in the processed audio, the server can look up the basic pronunciation vocabulary to obtain the text information of the processed audio; for example, if the language pronunciation of a word segment in the processed audio is "gafkaf", the Tibetan text mapped to that pronunciation in the vocabulary is the text information of the processed audio.
TABLE 1 (partial example of the basic Tibetan pronunciation vocabulary)
In an embodiment, after obtaining the text information of the processed audio with the basic pronunciation vocabulary, the server may obtain the target audio according to the target language, the text information of the processed audio and the basic pronunciation vocabulary. For example, if the target language is the Kangba dialect, the Kangba pronunciation corresponding to the text information, e.g., "g-a k-a", can be obtained from the basic pronunciation vocabulary, and the target audio can then be generated according to that Kangba pronunciation.
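A minimal sketch of these two vocabulary lookups (pronunciation to text, then text to target-language pronunciation) is shown below; the vocabulary entries are invented placeholders, since Table 1 itself is not reproduced here.

# Hypothetical fragment of the basic pronunciation vocabulary:
# each written word maps to its pronunciation in several languages.
BASIC_PRONUNCIATION_VOCAB = {
    "word_1": {"Tibet": "g-a", "Anduo": "g'a", "Kangba": "g-a k-a"},
}

def text_from_pronunciations(pronunciations, vocab, source_language):
    """Map the recognized per-word pronunciations back to written text."""
    reverse = {entry[source_language]: word for word, entry in vocab.items()}
    return [reverse[p] for p in pronunciations if p in reverse]

def pronunciations_for_text(text_words, vocab, target_language):
    """Look up the target-language pronunciation of each written word."""
    return [vocab[word][target_language] for word in text_words if word in vocab]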
In an embodiment, considering that the same written word can have different pronunciations even within one language type, the application may also build a special pronunciation vocabulary in which a word maps to multiple pronunciations of the same language type, i.e., several pronunciations of the same word can coexist within one language type, which improves the accuracy of speech recognition and speech synthesis.
As shown in fig. 10, the server may first preprocess the Tibetan audio in the manner of the foregoing steps S201, S202 and S203, or S301, S302, S303 and S304, which is not repeated in this embodiment. The server then uses Tibetan speech recognition technology to obtain the text information of the Tibetan audio (for example, the audio included in a video), i.e., the speech recognition result. Finally, the server uses Tibetan speech synthesis technology to generate the target audio from the text information of the Tibetan audio. In this embodiment, the Tibetan audio can be converted into multiple language types for playback, the language of the audio can be chosen freely, and audio annotation can be carried out without the annotator having to judge the language type of the audio, which reduces the difficulty of cross-language annotation and makes the language type irrelevant to the annotation work.
In this application, the server segments any audio frame of the one or more audio frames included in the audio to be processed to obtain a plurality of audio segments, determines the audio category of each of the plurality of audio segments, and determines the speech recognition result of the audio frame according to the audio category of each segment. According to the speech recognition result of each audio frame, the server removes the audio frames whose result is the target recognition result from the audio to be processed to obtain the processed audio, and finally performs style conversion processing on the processed audio to obtain the target audio. The non-speech audio in the audio to be processed is thus removed and the audio quality improved; compared with converting the audio directly, the accuracy of the style conversion is improved, the style-converted audio is more accurate, and annotators can realize cross-language audio annotation more accurately.
While the method of the embodiments of the present application has been described in detail above, to facilitate better implementation of the above-described aspects of the embodiments of the present application, the apparatus of the embodiments of the present application is provided below accordingly. Referring to fig. 11, fig. 11 is a schematic structural diagram of an audio processing apparatus according to an exemplary embodiment of the present application, where the apparatus 110 may include:
an obtaining module 1101, configured to obtain a to-be-processed audio, where the to-be-processed audio includes one or more audio frames;
a processing module 1102, configured to segment any audio frame of the one or more audio frames to obtain multiple audio segments, determine an audio category of each audio segment of the multiple audio segments, and determine a speech recognition result of the any audio frame according to the audio category of each audio segment;
the processing module 1102 is further configured to remove, according to the speech recognition result of each audio frame, an audio frame in which the speech recognition result in the audio to be processed is the target recognition result, so as to obtain a processed audio;
the processing module 1102 is further configured to perform style conversion processing on the processed audio to obtain a target audio.
In an embodiment, the processing module 1102 is specifically configured to:
performing feature extraction on any audio clip in the multiple audio clips to obtain a voice feature of the any audio clip;
processing the voice characteristics of any audio segment by using a characteristic processing module of an audio classification model to obtain a voice characteristic vector of any audio segment;
and processing the voice feature vector of any audio segment by using the classification processing module of the audio classification model to obtain the audio category of any audio segment.
In an embodiment, the processing module 1102 is specifically configured to:
determining the proportion of the audio clips of which the audio categories are target categories in any audio frame according to the audio categories of each audio clip;
and determining a voice recognition result of any audio frame according to the ratio.
In an embodiment, the processing module 1102 is specifically configured to:
and when the ratio is smaller than a ratio threshold value, determining that the voice recognition result of any audio frame is the target recognition result, wherein the target recognition result is used for indicating that any audio frame is non-voice audio.
In an embodiment, the processing module 1102 is specifically configured to:
for any audio frame, determining a predicted value that the audio frame is speech;
wherein the determining a speech recognition result of any audio frame according to the proportion comprises:
determining the reference probability of any audio frame as a voice audio according to the occupation ratio and the predicted value;
and when the reference probability is smaller than a probability threshold value, determining that the voice recognition result of any audio frame is the target recognition result, wherein the target recognition result is used for indicating that any audio frame is non-voice audio.
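How the proportion and the predicted value are combined into a reference probability is not specified above; the sketch below assumes a simple weighted average, with the weight `alpha` and the probability threshold chosen arbitrarily for the example.

```python
def reference_probability(speech_proportion, frame_speech_score, alpha=0.5):
    """Combine the segment-level speech proportion with a frame-level predicted value."""
    return alpha * speech_proportion + (1.0 - alpha) * frame_speech_score

def frame_recognition_result(speech_proportion, frame_speech_score, prob_threshold=0.5):
    """Return the target recognition result ('non-speech') when the probability is too low."""
    prob = reference_probability(speech_proportion, frame_speech_score)
    return "non-speech" if prob < prob_threshold else "speech"
```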
In an embodiment, the processing module 1102 is specifically configured to:
performing voice recognition processing on the processed audio to obtain text information of the processed audio;
and determining a target language, and performing voice synthesis processing on the text information of the processed audio according to the target language to obtain the target audio.
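The style conversion step can be read as a recognize-then-resynthesize pipeline. The sketch below shows only that control flow: `recognize` and `synthesize` are hypothetical callables standing in for whatever voice recognition and voice synthesis components an implementation actually uses, and the default `target_language` value is an arbitrary assumption.

```python
from typing import Callable
import numpy as np

def style_convert(processed_audio: np.ndarray,
                  recognize: Callable[[np.ndarray], str],
                  synthesize: Callable[[str, str], np.ndarray],
                  target_language: str = "en") -> np.ndarray:
    """Recognize the cleaned audio into text, then re-synthesize that text in the target language."""
    text = recognize(processed_audio)             # voice recognition -> text information
    return synthesize(text, target_language)      # voice synthesis in the target language
```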
In an embodiment, the processing module 1102 is specifically configured to:
acquiring a basic pronunciation word list, wherein the basic pronunciation word list comprises a mapping relation between characters and pronunciations of one or more languages;
performing voice recognition processing on the processed audio, and determining the language pronunciation of each segmented word in the processed audio;
and determining the text information of the processed audio according to the language pronunciation of each segmented word and the basic pronunciation word list.
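A minimal sketch of how a basic pronunciation word list could be used is given below. The vocabulary entries, the pronunciation strings, and the `<unk>` fallback are invented for the example; a real word list would cover the characters and pronunciations of every supported language.

```python
# Toy word list: (language, pronunciation) -> written form. Entries are illustrative only.
BASE_PRONUNCIATION_VOCAB = {
    ("zh", "ni3 hao3"): "你好",
    ("en", "HH AH L OW"): "hello",
}

def text_from_pronunciations(word_pronunciations, vocab=BASE_PRONUNCIATION_VOCAB):
    """Map the recognized (language, pronunciation) of each segmented word to text."""
    return "".join(vocab.get((lang, pron), "<unk>") for lang, pron in word_pronunciations)

# Example: text_from_pronunciations([("en", "HH AH L OW")]) returns "hello".
```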
In an embodiment, the processing module 1102 is specifically configured to:
acquiring audio waveform data of any audio frame;
according to the audio waveform data, determining an audio silence interval in any audio frame as a segmentation point;
and carrying out segmentation processing on any audio frame according to each determined segmentation point to obtain a plurality of audio segments.
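A simple way to realize the silence-based segmentation is to threshold short-time energy computed from the audio waveform data, as sketched below. The sampling rate, window length, energy threshold, and minimum silence length are assumptions made for the example; the application only requires that silence intervals in the waveform serve as segmentation points.

```python
import numpy as np

def split_at_silence(frame, sr=16000, win_ms=20, energy_thresh=1e-4, min_silence_wins=5):
    """Split one audio frame (1-D float waveform) at sufficiently long silent intervals."""
    win = int(sr * win_ms / 1000)                        # samples per analysis window
    n_win = len(frame) // win
    energy = np.array([np.mean(frame[i * win:(i + 1) * win] ** 2) for i in range(n_win)])
    silent = energy < energy_thresh                      # per-window silence decision
    segments, start, run = [], 0, 0
    for i, is_silent in enumerate(silent):
        if is_silent:
            run += 1
            if run == min_silence_wins:                  # silence long enough: close the segment
                end = (i - min_silence_wins + 1) * win
                if end > start:
                    segments.append(frame[start:end])
            if run >= min_silence_wins:
                start = (i + 1) * win                    # next segment starts after the silence
        else:
            run = 0
    tail = frame[start:]
    if tail.size and np.mean(tail.astype(float) ** 2) >= energy_thresh:
        segments.append(tail)                            # keep a trailing non-silent segment
    return segments
```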
In the embodiment of the application, the server acquires the audio to be processed, performs segmentation processing on any audio frame among the one or more audio frames included in the audio to be processed to obtain a plurality of audio segments, determines the audio category of each of the plurality of audio segments, and determines the voice recognition result of the audio frame according to the audio category of each audio segment. According to the voice recognition result of each audio frame, the server removes from the audio to be processed the audio frames whose voice recognition result is the target recognition result to obtain the processed audio, and performs style conversion processing on the processed audio to obtain the target audio. By using the voice recognition result of each audio frame included in the audio to be processed, this embodiment can remove the non-speech audio from the audio to be processed, which reduces external interference, improves the audio quality, and helps improve the accuracy of the speech involved in the audio style conversion.
Fig. 12 is a schematic structural diagram of a computer device provided in an embodiment of the present application. As shown in fig. 12, the internal structure of the computer device 120 includes: one or more processors 1201, a memory 1202, and a communication interface 1203. The processor 1201, the memory 1202, and the communication interface 1203 may be connected by a bus 1204 or by other means; the embodiment of the present application takes connection by the bus 1204 as an example.
The processor 1201 (or CPU) is the computing core and control core of the computer device 120, and can parse various instructions in the computer device 120 and process various data of the computer device 120. For example, the CPU may parse a power on/off instruction sent by the user to the computer device 120 and control the computer device 120 to perform a power on/off operation; as another example, the CPU may transfer various types of interactive data between the internal components of the computer device 120, and so on. The communication interface 1203 may optionally include a standard wired interface or a wireless interface (e.g., Wi-Fi, a mobile communication interface, etc.), and is controlled by the processor 1201 to transmit and receive data. The memory 1202 is the memory device of the computer device 120, and is used to store programs and data. It is understood that the memory 1202 may include the built-in memory of the computer device 120 and may also include expansion memory supported by the computer device 120. The memory 1202 provides storage space that stores the operating system of the computer device 120, which may include, but is not limited to, a Windows system, a Linux system, and the like; this is not limited in this application.
In an embodiment, the processor 1201 is specifically configured to:
acquiring audio to be processed, wherein the audio to be processed comprises one or more audio frames;
for any audio frame in the one or more audio frames, performing segmentation processing on the any audio frame to obtain a plurality of audio segments, determining the audio category of each audio segment in the plurality of audio segments, and determining the voice recognition result of the any audio frame according to the audio category of each audio segment;
according to the voice recognition result of each audio frame, eliminating the audio frames with the voice recognition results as target recognition results in the audio to be processed to obtain processed audio;
and performing style conversion processing on the processed audio to obtain a target audio.
In an embodiment, the processor 1201 is specifically configured to:
performing feature extraction on any audio segment of the plurality of audio segments to obtain a voice feature of the audio segment;
processing the voice feature of the audio segment by using the feature processing module of an audio classification model to obtain a voice feature vector of the audio segment;
and processing the voice feature vector of the audio segment by using the classification processing module of the audio classification model to obtain the audio category of the audio segment.
In an embodiment, the processor 1201 is specifically configured to:
determining, according to the audio category of each audio segment, the proportion of audio segments whose audio category is the target category among all audio segments of the audio frame;
and determining the voice recognition result of the audio frame according to the proportion.
In an embodiment, the processor 1201 is specifically configured to:
and when the proportion is smaller than a proportion threshold, determining that the voice recognition result of the audio frame is the target recognition result, where the target recognition result indicates that the audio frame is non-voice audio.
In an embodiment, the processor 1201 is specifically configured to:
determining, for any audio frame, a predicted value that the audio frame is voice;
wherein the determining the voice recognition result of the audio frame according to the proportion comprises:
determining, according to the proportion and the predicted value, a reference probability that the audio frame is voice audio;
and when the reference probability is smaller than a probability threshold, determining that the voice recognition result of the audio frame is the target recognition result, where the target recognition result indicates that the audio frame is non-voice audio.
In an embodiment, the processor 1201 is specifically configured to:
performing voice recognition processing on the processed audio to obtain text information of the processed audio;
and determining a target language, and performing voice synthesis processing on the text information of the processed audio according to the target language to obtain the target audio.
In an embodiment, the processor 1201 is specifically configured to:
acquiring a basic pronunciation word list, wherein the basic pronunciation word list comprises a mapping relation between characters and pronunciations of one or more languages;
performing voice recognition processing on the processed audio, and determining the language pronunciation of each segmented word in the processed audio;
and determining the text information of the processed audio according to the language pronunciation of each segmented word and the basic pronunciation word list.
In an embodiment, the processor 1201 is specifically configured to:
acquiring audio waveform data of any audio frame;
according to the audio waveform data, determining an audio silence interval in any audio frame as a segmentation point;
and carrying out segmentation processing on any audio frame according to each determined segmentation point to obtain a plurality of audio segments.
In the embodiment of the application, the server acquires the audio to be processed, performs segmentation processing on any audio frame among the one or more audio frames included in the audio to be processed to obtain a plurality of audio segments, determines the audio category of each of the plurality of audio segments, and determines the voice recognition result of the audio frame according to the audio category of each audio segment. According to the voice recognition result of each audio frame, the server removes from the audio to be processed the audio frames whose voice recognition result is the target recognition result to obtain the processed audio, and performs style conversion processing on the processed audio to obtain the target audio. By using the voice recognition result of each audio frame included in the audio to be processed, this embodiment can remove the non-speech audio from the audio to be processed, which reduces external interference, improves the audio quality, and helps improve the accuracy of the speech involved in the audio style conversion.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the program is executed, the processes of the above embodiments of the audio processing method may be performed. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
One or more embodiments of the present application also provide a computer program product or computer program that comprises computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the steps performed in the embodiments of the methods described above.
The above-mentioned embodiments express only several implementations of the present application, and although their description is specific and detailed, they should not be construed as limiting the scope of the claims. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and such variations and modifications fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (10)
1. A method of audio processing, the method comprising:
acquiring audio to be processed, wherein the audio to be processed comprises one or more audio frames;
for any audio frame in the one or more audio frames, performing segmentation processing on the any audio frame to obtain a plurality of audio segments, determining the audio category of each audio segment in the plurality of audio segments, and determining the voice recognition result of the any audio frame according to the audio category of each audio segment;
according to the voice recognition result of each audio frame, eliminating the audio frames with the voice recognition results as target recognition results in the audio to be processed to obtain processed audio;
and performing style conversion processing on the processed audio to obtain a target audio.
2. The method of claim 1, wherein the determining the audio category for each of the plurality of audio segments comprises:
performing feature extraction on any audio segment of the plurality of audio segments to obtain a voice feature of the audio segment;
processing the voice feature of the audio segment by using a feature processing module of an audio classification model to obtain a voice feature vector of the audio segment;
and processing the voice feature vector of the audio segment by using a classification processing module of the audio classification model to obtain the audio category of the audio segment.
3. The method of claim 1, wherein the determining the voice recognition result of the any audio frame according to the audio category of each audio segment comprises:
determining, according to the audio category of each audio segment, the proportion of audio segments whose audio category is the target category among the audio segments of the audio frame;
and determining the voice recognition result of the audio frame according to the proportion.
4. The method of claim 3, wherein the determining the voice recognition result of the audio frame according to the proportion comprises:
when the proportion is smaller than a proportion threshold, determining that the voice recognition result of the audio frame is the target recognition result, wherein the target recognition result is used for indicating that the audio frame is non-voice audio.
5. The method of claim 3, further comprising:
determining, for any audio frame, a predicted value that the audio frame is voice;
wherein the determining the voice recognition result of the audio frame according to the proportion comprises:
determining, according to the proportion and the predicted value, a reference probability that the audio frame is voice audio;
and when the reference probability is smaller than a probability threshold, determining that the voice recognition result of the audio frame is the target recognition result, wherein the target recognition result is used for indicating that the audio frame is non-voice audio.
6. The method of claim 1, wherein performing the style conversion on the processed audio to obtain a target audio comprises:
performing voice recognition processing on the processed audio to obtain text information of the processed audio;
and determining a target language, and performing voice synthesis processing on the text information of the processed audio according to the target language to obtain the target audio.
7. The method of claim 6, wherein performing speech recognition processing on the processed audio to obtain text information of the processed audio comprises:
acquiring a basic pronunciation word list, wherein the basic pronunciation word list comprises a mapping relation between characters and pronunciations of one or more languages;
performing voice recognition processing on the processed audio, and determining the language pronunciation of each segmented word in the processed audio;
and determining the text information of the processed audio according to the language pronunciation of each segmented word and the basic pronunciation word list.
8. The method according to claim 1, wherein the performing segmentation processing on the any audio frame to obtain a plurality of audio segments comprises:
acquiring audio waveform data of any audio frame;
according to the audio waveform data, determining an audio silence interval in any audio frame as a segmentation point;
and carrying out segmentation processing on any audio frame according to each determined segmentation point to obtain a plurality of audio segments.
9. An audio processing apparatus, characterized in that the apparatus comprises:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring audio to be processed, and the audio to be processed comprises one or more audio frames;
the processing module is used for segmenting any one of the one or more audio frames to obtain a plurality of audio segments, determining the audio category of each of the plurality of audio segments, and determining the voice recognition result of any one of the audio frames according to the audio category of each of the plurality of audio segments;
the processing module is further used for removing the audio frames with the voice recognition results as the target recognition results in the audio to be processed according to the voice recognition results of the audio frames to obtain processed audio;
and the processing module is also used for carrying out style conversion processing on the processed audio to obtain a target audio.
10. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, implements the audio processing method according to any one of claims 1 to 8.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202110872240.0A | 2021-07-30 | 2021-07-30 | Audio processing method, device and computer readable storage medium
Publications (1)

Publication Number | Publication Date
---|---
CN113823287A (en) | 2021-12-21
Family
ID=78923987
Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202110872240.0A | Audio processing method, device and computer readable storage medium | 2021-07-30 | 2021-07-30
Country Status (1)

Country | Link
---|---
CN | CN113823287A (en)
Legal Events

Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination