CN116721662B - Audio processing method and device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN116721662B
CN116721662B (application CN202310053892.0A)
Authority
CN
China
Prior art keywords
audio
sub
audio data
voice
initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310053892.0A
Other languages
Chinese (zh)
Other versions
CN116721662A (en)
Inventor
刘艳鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Intengine Technology Co Ltd
Original Assignee
Beijing Intengine Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Intengine Technology Co Ltd filed Critical Beijing Intengine Technology Co Ltd
Priority to CN202310053892.0A priority Critical patent/CN116721662B/en
Publication of CN116721662A publication Critical patent/CN116721662A/en
Application granted granted Critical
Publication of CN116721662B publication Critical patent/CN116721662B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The embodiment of the application discloses an audio processing method and device, a storage medium, and an electronic device. The method comprises the following steps: acquiring initial audio data and converting its sampling rate; judging whether the initial audio data contains an associated subtitle file; if not, performing voice extraction processing on the sample-rate-converted audio data; segmenting the extracted voice audio according to a preset audio duration interval and a mute duration threshold to obtain a plurality of sub-audios containing start-stop timestamps; and acquiring the subtitle files respectively corresponding to the plurality of sub-audios. According to the embodiment of the application, the non-human-voice part of the audio can be removed, only the human-voice audio is segmented, and the corresponding subtitle files are acquired, so that the accuracy of the finally obtained subtitle files is higher.

Description

Audio processing method and device, storage medium and electronic equipment
Technical Field
The present application relates to the field of audio data processing technologies, and in particular, to an audio processing method, an audio processing device, a storage medium, and an electronic device.
Background
In recent years, with the popularization of smart speakers, voice assistants, and the like, speech recognition has gained increasing acceptance, and the technology is applied in ever more scenarios, such as controlling devices and searching content by voice; it has become an important part of daily life. However, training a commercially viable speech recognition system is difficult, because it requires a very large labeled corpus (tens of thousands of hours), and the cost of acquiring such a corpus is prohibitive.
At present, one method of acquiring a training corpus is for a data company to recruit and organize users to collect data; the collected data then needs to be cleaned and annotated. This process has many limitations and requires a large investment of funds; moreover, because manual participation is needed, the acquisition cycle is very long and timeliness cannot be guaranteed. Another method is to obtain a massive corpus from the Internet at low cost, but the quality of such a corpus cannot be guaranteed: for example, it contains a large amount of non-human-voice noise, and the subtitles and audio may not correspond exactly.
Disclosure of Invention
The embodiment of the application provides an audio processing method and device, a storage medium, and an electronic device, which can remove the non-human-voice part of audio, segment only the human-voice audio, and acquire the corresponding subtitle files, so that the accuracy of the finally obtained subtitle files is higher.
The embodiment of the application provides an audio processing method, which comprises the following steps:
acquiring initial audio data and converting the sampling rate of the initial audio data;
judging whether the initial audio data contains an associated subtitle file;
if not, performing voice extraction processing on the audio data after the sampling rate conversion;
segmenting the extracted voice audio according to a preset audio duration interval and a mute duration threshold to obtain a plurality of sub-audios containing start-stop timestamps;
and acquiring the subtitle files respectively corresponding to the plurality of sub-audios.
In an embodiment, the acquiring the initial audio data and converting the sampling rate of the initial audio data includes:
acquiring the source code of the current webpage, and obtaining URL information of the target audio from the source code using a regular expression;
downloading the target audio through the URL information to obtain initial audio data;
and converting the sampling rate of the initial audio data according to a preset sampling rate.
In an embodiment, the method further comprises:
if the initial audio data contains the associated subtitle file, inputting the initial audio data and the associated subtitle file into a pre-constructed word level alignment model for operation;
and outputting a word-level alignment result corresponding to the initial audio data.
In an embodiment, the mute duration threshold includes a plurality of thresholds that decrease in sequence, and the segmenting of the extracted voice audio according to a preset audio duration interval and the mute duration threshold includes:
selecting first segmentation points in the voice audio according to a first mute duration threshold, and performing a first segmentation of the voice audio based on the first segmentation points to obtain a plurality of audio segments;
among the plurality of audio segments, segmenting again, based on a second mute duration threshold, those audio segments whose audio duration is greater than the preset audio duration interval;
and segmenting sequentially according to the remaining mute duration thresholds until a plurality of sub-audios are obtained after segmentation according to the minimum mute duration threshold.
In an embodiment, the method further comprises:
judging, among the plurality of sub-audios, whether the audio duration of each sub-audio is smaller than the preset audio duration interval;
if so, determining a target sub-audio among the adjacent preceding and following sub-audios according to the preset audio duration interval, and combining the current sub-audio with the target sub-audio.
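As a minimal sketch under stated assumptions (sub-audios are modeled as (start, end) timestamp pairs in seconds; the helper name and the preference for merges that stay within the upper bound are illustrative, not taken from the patent), the combining step might look like:

```python
def merge_short_subaudios(segments, min_len, max_len):
    """Merge sub-audios shorter than min_len seconds into an adjacent
    (preceding or following) sub-audio. Preferring the neighbor whose
    combined duration stays within max_len is an assumption."""
    segs = list(segments)
    changed = True
    while changed:
        changed = False
        for i, (start, end) in enumerate(segs):
            if end - start >= min_len:
                continue
            # Candidate neighbors: previous and next sub-audio.
            neighbors = [j for j in (i - 1, i + 1) if 0 <= j < len(segs)]
            if not neighbors:
                break  # only one (short) segment, nothing to merge with

            def combined(j):
                lo, hi = min(i, j), max(i, j)
                return segs[hi][1] - segs[lo][0]

            # Prefer merges that do not exceed max_len; among those,
            # prefer the shorter combined result.
            neighbors.sort(key=lambda j: (combined(j) > max_len, combined(j)))
            j = neighbors[0]
            lo, hi = min(i, j), max(i, j)
            segs[lo] = (segs[lo][0], segs[hi][1])
            del segs[hi]
            changed = True
            break  # re-scan from the start after each merge
    return segs
```

The loop re-scans after each merge because a merge can itself produce a new segment whose duration changes which neighbor is preferable.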
In an embodiment, the acquiring of the subtitle files respectively corresponding to the plurality of sub-audios includes:
determining language information corresponding to the plurality of sub-audios;
performing voice recognition on the plurality of sub-audios respectively according to the language information;
and generating the subtitle files respectively corresponding to the plurality of sub-audios according to the voice recognition results.
In an embodiment, the determining of the language information corresponding to the plurality of sub-audios includes:
extracting voice features of the sub-audios;
inputting the voice features into a language classification model, and outputting probability values corresponding to multiple candidate languages;
and determining the target language information according to the probability values.
The embodiment of the application also provides an audio processing device, which comprises:
the conversion module is used for acquiring initial audio data and converting the sampling rate of the initial audio data;
the judging module is used for judging whether the initial audio data contains associated subtitle files or not;
the extraction module is used for performing voice extraction processing on the audio data after the sampling rate conversion when the judging module determines that the initial audio data does not contain an associated subtitle file;
the segmentation module is used for segmenting the extracted voice audio according to a preset audio duration interval and a mute duration threshold value to obtain a plurality of sub-audio containing start-stop time stamps;
and the acquisition module is used for acquiring the subtitle files respectively corresponding to the plurality of sub-audios.
Embodiments of the present application also provide a storage medium storing a computer program adapted to be loaded by a processor to perform the steps of the audio processing method according to any of the embodiments above.
The embodiment of the application also provides an electronic device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor executes the steps in the audio processing method according to any embodiment by calling the computer program stored in the memory.
According to the audio processing method and device, storage medium, and electronic device described above, initial audio data can be acquired and its sampling rate converted; whether the initial audio data contains an associated subtitle file is judged; if not, voice extraction processing is performed on the sample-rate-converted audio data, the extracted voice audio is segmented according to a preset audio duration interval and a mute duration threshold to obtain a plurality of sub-audios containing start-stop timestamps, and the subtitle files respectively corresponding to the sub-audios are acquired. According to the embodiment of the application, the non-human-voice part of the audio can be removed, only the human-voice audio is segmented, and the corresponding subtitle files are acquired, so that the accuracy of the finally obtained subtitle files is higher.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic system diagram of an audio processing apparatus according to an embodiment of the present application.
Fig. 2 is a schematic flow chart of an audio processing method according to an embodiment of the present application.
Fig. 3 is a schematic flow chart of another audio processing method according to an embodiment of the present application.
Fig. 4 is a schematic diagram of audio alignment according to an embodiment of the present application.
Fig. 5 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application.
Fig. 6 is a schematic diagram of another structure of an audio processing apparatus according to an embodiment of the present application.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.
The embodiment of the application provides an audio processing method and device, a storage medium, and an electronic device. Specifically, the audio processing method of the embodiment of the present application may be performed by an electronic device or a server, where the electronic device may be a terminal. The terminal can be a terminal device such as a smartphone, a tablet computer, a notebook computer, a touch screen, a game machine, a personal computer (PC), or a personal digital assistant (PDA), and the terminal can also include a client, which can be a media playing client, an instant messaging client, or the like.
For example, when the audio processing method runs on the electronic device, the electronic device may acquire initial audio data, convert the sampling rate of the initial audio data, and judge whether the initial audio data contains an associated subtitle file; if not, it performs voice extraction processing on the sample-rate-converted audio data, segments the extracted voice audio according to a preset audio duration interval and a mute duration threshold to obtain a plurality of sub-audios containing start-stop timestamps, and acquires the subtitle files respectively corresponding to the plurality of sub-audios. The terminal device may interact with the user through a graphical user interface, which may be presented in a variety of ways; for example, the graphical user interface may be rendered for display on a display screen of the terminal device, or presented by holographic projection. For example, the terminal device may include a touch display screen for presenting the graphical user interface and receiving operation instructions generated by the user acting on it, and a processor.
Referring to fig. 1, fig. 1 is a schematic system diagram of an audio processing apparatus according to an embodiment of the application. The system may include at least one electronic device 1000 and at least one server or personal computer 2000. The electronic device 1000 held by the user may be connected to different servers or personal computers through a network. The electronic device 1000 may be a terminal device having computing hardware capable of supporting and executing software products corresponding to multimedia. In addition, the electronic device 1000 may have one or more multi-touch-sensitive screens for sensing and obtaining input from a user through touch or slide operations performed at multiple points of the one or more touch-sensitive display screens. The electronic device 1000 may be connected to the server or personal computer 2000 through a network, which may be a wireless network or a wired network, such as a Wireless Local Area Network (WLAN), a Local Area Network (LAN), a cellular network, a 2G network, a 3G network, a 4G network, or a 5G network. In addition, different electronic devices 1000 may be connected to other embedded platforms or to a server or a personal computer using their own Bluetooth network or hotspot network. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, big data, and artificial intelligence platforms.
The embodiment of the application provides an audio processing method which can be executed by an electronic device or a server. The embodiment of the application is described by taking the audio processing method executed by an electronic device as an example. The electronic device comprises a touch display screen and a processor, wherein the touch display screen is used for presenting a graphical user interface and receiving operation instructions generated by a user acting on the graphical user interface. When a user operates the graphical user interface through the touch display screen, the graphical user interface can control local content of the electronic device in response to a received operation instruction, and can also control content on the server side in response to a received operation instruction. For example, the operation instructions generated by the user acting on the graphical user interface include instructions for processing the initial audio data, and the processor is configured to launch the corresponding application upon receiving such an instruction. Further, the processor is configured to render and draw a graphical user interface associated with the application on the touch display screen. The touch display screen is a multi-touch-sensitive screen capable of sensing touch or slide operations performed simultaneously at a plurality of points on the screen. The user performs a touch operation on the graphical user interface with a finger, and when the graphical user interface detects the touch operation, it controls the application's graphical user interface to display the corresponding content.
Referring to fig. 2, the specific flow of the method may be as follows:
step 101, obtaining initial audio data and converting the sampling rate of the initial audio data.
In an embodiment, the initial audio data may come from a pure audio file or a video file: it may be an audio file extracted from a video file, an audio file uploaded by a user, or audio obtained by recording. The audio file or video file may be stored locally, downloaded from a network, or obtained from another device.
For example, a user may upload an audio file or a video file to be processed through a client, and after the client receives the uploaded audio file or video file, the client transmits the received audio file or video file to a server. The server may take the received audio file as the initial audio data, or the server may extract the initial audio data from the received audio file or the video file.
In an embodiment, the electronic device may directly obtain the initial audio data from the network, a local storage medium, or an external storage medium; alternatively, it may obtain to-be-processed raw audio data from the network, a local storage medium, or an external storage medium, and preprocess the raw audio data to obtain the corresponding initial audio data.
Taking obtaining initial audio data from the network as an example, the electronic device can obtain audio data in a webpage: for example, obtaining the source code of the current webpage through an incoming URL link, then obtaining the URL of the required video or audio using a regular expression, and saving information such as the ID and URL of the video or audio. The corresponding video or audio is then downloaded via the saved URL. Specifically, downloaded video files may all be converted into the same format, such as mp4, avi, or rmvb; downloaded audio files may all be converted to a sampling rate of 16 kHz. The conversion tool may be ffmpeg, sox, or the like, and the video or audio format may be adjusted as required; the present application does not further limit this. That is, the step of acquiring the initial audio data and converting the sampling rate of the initial audio data may include: acquiring the source code of the current webpage, obtaining URL information of the target audio from the source code using a regular expression, downloading the target audio through the URL information to obtain initial audio data, and converting the sampling rate of the initial audio data according to a preset sampling rate.
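The scraping and resampling steps above might be sketched as follows. The URL pattern, the recognized file extensions, and the temporary file path are illustrative assumptions; the patent only names regular expressions and tools such as ffmpeg or sox:

```python
import re
import subprocess
import urllib.request

def extract_media_urls(page_source: str):
    """Pull candidate audio/video URLs out of raw page source with a
    regular expression (a simplified stand-in for the patent's step)."""
    pattern = re.compile(r'https?://[^\s"\']+\.(?:mp4|avi|rmvb|mp3|wav)')
    return pattern.findall(page_source)

def download_and_resample(url: str, out_path: str, rate: int = 16000):
    """Download the media file, then convert it to mono audio at the
    preset sampling rate (16 kHz by default) with ffmpeg."""
    tmp = "download.tmp"  # illustrative temporary path
    urllib.request.urlretrieve(url, tmp)
    subprocess.run(
        ["ffmpeg", "-y", "-i", tmp, "-ar", str(rate), "-ac", "1", out_path],
        check=True,
    )
```

Because the regex uses only a non-capturing group, `findall` returns the full matched URLs rather than the extension alone.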
Step 102, determining whether the initial audio data contains an associated subtitle file, and if not, executing step 103.
In an embodiment, if the initial audio data includes an associated subtitle file, the subtitle file may be acquired directly and combined with the initial audio data as a subsequent training corpus. If the initial audio data does not include an associated subtitle file, it may further be determined whether the current initial audio data includes embedded subtitles; if so, the subsequent step 103 may be performed.
If the initial audio data does not contain an associated subtitle file but contains embedded subtitles, the embedded subtitles can be recognized to extract a subtitle file. Specifically, for an audio or video file containing embedded subtitles, flashing words, scrolling words, keywords, and specific tags can be discarded; further words can be discarded based on a classifier; finally, the text-box records that were not discarded are saved as the subtitle file. The keywords may include common text information, for example: CCTV, Hunan TV, Zhejiang TV, and the like. If a text-box record contains a character in this list and the ratio of the duration of its occurrence (t2 - t1) to the total number of frames is greater than a threshold β (which may be adjusted as desired), the text-box record is discarded. A specific tag refers to text for which the ratio of the text-box record's occurrence duration (t2 - t1) to the total number of frames is greater than a threshold λ (which may be adjusted as desired and is greater than β). For example, if a text-box record appears continuously for 80% of a video's total duration, the characters in that record are considered a specific tag rather than speech, and the record is discarded.
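A minimal sketch of this text-box filtering, assuming records are dicts with `text`, `t1`, and `t2` frame fields and with default thresholds chosen purely for illustration (β and λ are left adjustable, as the description requires):

```python
def filter_text_boxes(records, total_frames, keywords, beta=0.5, lam=0.8):
    """Filter embedded-subtitle text-box records.

    A record is dropped if its on-screen duration ratio exceeds lam
    (a 'specific tag', e.g. a watermark shown ~80% of the video), or if
    it contains a known keyword (e.g. a station name) and its duration
    ratio exceeds beta. Field names and defaults are assumptions.
    """
    kept = []
    for rec in records:
        ratio = (rec["t2"] - rec["t1"]) / total_frames
        if ratio > lam:
            continue  # specific tag: persistent across most of the video
        if any(kw in rec["text"] for kw in keywords) and ratio > beta:
            continue  # persistent keyword such as a channel logo
        kept.append(rec)
    return kept
```

Records that survive both tests would then be saved as subtitle lines, as the paragraph above describes.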
Step 103, performing voice extraction processing on the audio data after the conversion of the sampling rate.
In this embodiment, the human voice in the audio data is separated from the environmental sound to obtain the human-voice audio. In one implementation, the audio data can be input into an existing voice separation model to separate the human-voice audio from the environmental audio; the voice separation model may be a deep-neural-network model based on PIT (Permutation Invariant Training). In another implementation, a separation tool is used to separate the voice audio from the accompaniment audio to obtain the voice audio; for example, voice extraction can be performed according to the spectral or frequency characteristics of the audio data.
And 104, segmenting the extracted voice audio according to a preset audio duration interval and a mute duration threshold value to obtain a plurality of sub-audio containing start-stop time stamps.
In one embodiment, after the voice audio is extracted, it may be segmented by a VAD (voice activity detection) tool to obtain a plurality of sub-audios containing start-stop timestamps. Specifically, an audio duration interval [min, max] and a continuous mute duration threshold θ1 may be preset, and the VAD segmentation criterion segments according to them: a segmentation point in the voice audio is selected wherever the mute duration is greater than the threshold θ1, the segmentation point may be located at the midpoint of the mute segment, and the voice audio is segmented at these points to obtain multiple sub-audios. It should be noted that the duration of each sub-audio after segmentation should fall within the audio duration interval [min, max]; if a sub-audio is shorter than the lower bound min, several sub-audios may be combined.
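The single-threshold segmentation above can be sketched as follows. Frame-level speech/non-speech decisions are assumed to come from a VAD tool; representing them as a boolean list, and measuring θ1 in frames, are illustrative simplifications:

```python
def find_cut_points(is_speech, theta):
    """Return cut points (frame indices) at the midpoint of every
    silence run longer than theta frames."""
    cuts = []
    run_start = None
    for i, speech in enumerate(list(is_speech) + [True]):  # sentinel
        if not speech and run_start is None:
            run_start = i
        elif speech and run_start is not None:
            if i - run_start > theta:
                cuts.append((run_start + i) // 2)  # silence midpoint
            run_start = None
    return cuts

def split_segments(n_frames, cuts):
    """Turn cut points into (start, end) sub-audio frame spans."""
    bounds = [0] + sorted(cuts) + [n_frames]
    return list(zip(bounds[:-1], bounds[1:]))
```

Each resulting span carries its own start and end frame, from which the start-stop timestamps follow directly given the frame rate.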
Step 105, acquiring subtitle files corresponding to the plurality of sub-audios respectively.
In one embodiment, after the plurality of sub-audios are acquired, the text information in each sub-audio may be extracted separately. The text information refers to the text contained in the audio, that is, the content of the audio displayed in textual form. Finally, the text information of each sub-audio is made into a corresponding subtitle file, which contains the start-stop timestamps.
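As an illustration of attaching start-stop timestamps to the recognized text, the subtitle file can be rendered in SRT format; the triple-based input and the choice of SRT are assumptions, since the patent does not fix a subtitle format:

```python
def to_srt(entries):
    """Render (start_sec, end_sec, text) triples as an SRT subtitle
    file, preserving each sub-audio's start-stop timestamps."""
    def fmt(t):
        ms = int(round(t * 1000))
        h, rem = divmod(ms, 3600000)
        m, rem = divmod(rem, 60000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, (start, end, text) in enumerate(entries, 1):
        blocks.append(f"{i}\n{fmt(start)} --> {fmt(end)}\n{text}\n")
    return "\n".join(blocks)
```

The same triples could equally be serialized as WebVTT or JSON; only the timestamp formatting changes.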
In one embodiment, the text information in each sub-audio may be extracted by speech recognition, which converts the speech signal into the corresponding text information. Optionally, the input speech may be recognized using a hidden Markov model (HMM) to determine the corresponding text information; or the obtained speech signal may be compared with the speech in a speech database to find identical speech, and the text information corresponding to that speech in the database is taken as the text information of the input speech. This embodiment is not limited in this respect.
As can be seen from the above, the audio processing method provided by the embodiment of the present application can acquire initial audio data, convert its sampling rate, and judge whether the initial audio data contains an associated subtitle file; if not, it performs voice extraction processing on the sample-rate-converted audio data, segments the extracted voice audio according to a preset audio duration interval and a mute duration threshold to obtain a plurality of sub-audios containing start-stop timestamps, and acquires the subtitle files respectively corresponding to the plurality of sub-audios. According to the embodiment of the application, the non-human-voice part of the audio can be removed, only the human-voice audio is segmented, and the corresponding subtitle files are acquired, so that the accuracy of the finally obtained subtitle files is higher.
Fig. 3 is a schematic flow chart of an audio processing method according to an embodiment of the application. The specific flow of the method can be as follows:
step 201, obtaining initial audio data and converting the sampling rate of the initial audio data.
For example, the electronic device may obtain audio data in a webpage: obtaining the source code of the current webpage through an incoming URL link, then obtaining the URL of the required video or audio using a regular expression, and saving information such as the ID and URL of the video or audio. The corresponding video or audio is then downloaded via the saved URL. Specifically, downloaded video files may all be converted into the same format, such as mp4, avi, or rmvb, and downloaded audio files may all be converted to a sampling rate of 16 kHz.
Step 202, determining whether the initial audio data contains an associated subtitle file, and if not, executing step 203.
In an embodiment, if the initial audio data includes an associated subtitle file, the subtitle file may be acquired directly and combined with the initial audio data as a subsequent training corpus. If the initial audio data does not include an associated subtitle file, it may further be determined whether the current initial audio data includes embedded subtitles; if so, the subsequent step 203 may be performed.
If the initial audio data contains an associated subtitle file, the initial audio data and the associated subtitle file may further be input into a pre-constructed word-level alignment model for processing, and a word-level alignment result corresponding to the initial audio data is output. The word-level alignment model may be a pre-built model, for example a pre-built end-to-end neural network model. On this basis, a phoneme-level alignment result corresponding to the voice data can further be obtained through a phoneme-level alignment model. Fig. 4 illustrates the effect of a phoneme-level aligned transcript obtained by the steps of the method according to the embodiment of the present application; it can be seen that phoneme-level alignment can be fully achieved by this process.
And 203, performing voice extraction processing on the audio data after the conversion of the sampling rate.
In this embodiment, the human voice in the audio data is separated from the environmental sound to obtain the human-voice audio. In one implementation, the audio data may be input into an existing voice separation model to separate the human-voice audio from the environmental audio; this is not further described in this embodiment.
Step 204, selecting a first segmentation point in the voice audio according to the first mute duration threshold, and performing first segmentation on the voice audio based on the first segmentation point to obtain a plurality of audio segments.
Step 205, among the plurality of audio segments, segmenting again, based on a second mute duration threshold, the audio segments whose audio duration is greater than the preset audio duration interval.
Step 206, segmenting sequentially according to the remaining mute duration thresholds until a plurality of sub-audios are obtained after segmentation according to the minimum mute duration threshold.
In the embodiment of the present application, the mute duration threshold includes a plurality of thresholds that decrease in sequence, for example four mute duration thresholds θ1, θ2, θ3, and θ4, with θ1 > θ2 > θ3 > θ4. In addition, in this embodiment, an audio duration interval [min, max] may be preset. When slicing, the first round selects slicing points in the voice audio wherever a mute duration is greater than the threshold θ1, each slicing point being located at the midpoint of the mute segment. A second round of slicing is then performed on the audio segments whose duration is still greater than max after the first round, selecting slicing points within those segments wherever a mute duration is greater than the threshold θ2. Next, a third round of slicing is performed on the audio segments whose duration is still greater than max after the second round, selecting slicing points wherever a mute duration is greater than the threshold θ3. Finally, a fourth round of slicing is performed on the audio segments whose duration is still greater than max after the third round, selecting slicing points wherever a mute duration is greater than the threshold θ4.
Further, after all four rounds of slicing are completed, if an audio segment with an audio duration greater than max still exists, that audio is treated as a continuous voice part and is not sliced further. If an audio segment with an audio duration less than min exists, two or more audio segments may be combined until the audio duration falls within the interval [min, max], for example by merging the current audio segment with the adjacent preceding or following segment. That is, after the four rounds of segmentation produce a plurality of sub-audios, the method further includes: judging, for each of the plurality of sub-audios, whether its audio duration is smaller than the preset audio duration interval; if so, determining a target sub-audio among the adjacent preceding and following sub-audios according to the preset audio duration interval, and combining the current sub-audio with the target sub-audio.
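The multi-round slicing of steps 204-206 can be sketched as follows. This is an illustration, not the embodiment's implementation: the audio is represented by a per-frame silence mask, the thresholds are given in frames, and the function names are hypothetical. The final merge of too-short segments (described above) is omitted for brevity.

```python
def silent_runs(mask):
    """Return (start, end) index pairs of consecutive silent frames."""
    runs, start = [], None
    for i, silent in enumerate(mask):
        if silent and start is None:
            start = i
        elif not silent and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(mask)))
    return runs

def split_segment(seg, mask, theta):
    """Split one (start, end) segment at the midpoint of every silent
    run longer than theta frames, as the description specifies."""
    start, end = seg
    cuts = [(a + b) // 2 for a, b in silent_runs(mask[start:end])
            if b - a > theta]
    points = [start] + [start + c for c in cuts] + [end]
    return [(points[i], points[i + 1]) for i in range(len(points) - 1)]

def multi_round_split(mask, thetas, max_len):
    """Each round re-splits only segments still longer than max_len,
    using the next (smaller) silence threshold."""
    segments = [(0, len(mask))]
    for theta in thetas:
        nxt = []
        for seg in segments:
            if seg[1] - seg[0] > max_len:
                nxt.extend(split_segment(seg, mask, theta))
            else:
                nxt.append(seg)
        segments = nxt
    return segments

# speech(5) silence(4) speech(5) silence(2) speech(4), thresholds 3 then 1
mask = [False] * 5 + [True] * 4 + [False] * 5 + [True] * 2 + [False] * 4
print(multi_round_split(mask, thetas=[3, 1], max_len=8))
# [(0, 7), (7, 8), (8, 15), (15, 20)]
```

The tiny (7, 8) piece in the output is exactly the kind of under-min segment the subsequent merge step would fold into a neighbor.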
Step 207, determining language information corresponding to the plurality of sub-audio, and respectively performing voice recognition on the plurality of sub-audio according to the language information.
In an embodiment, before performing speech recognition on the plurality of sub-audios, the language information corresponding to each sub-audio may be determined. For example, features are extracted from each segmented sub-audio and input into a trained language recognition model to identify the language of the audio; if the audio language is Korean, the model returns "Korean". That is, the step of determining the language information corresponding to the plurality of sub-audios may include: extracting voice features of the sub-audio, inputting the voice features into a language classification model, outputting probability values corresponding to a plurality of language candidates, and determining the target language information according to the probability values.
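The "probability values, then pick the target language" step amounts to a softmax over classifier scores followed by an argmax. A minimal sketch, assuming a hypothetical three-language label set and raw scores already produced by some classifier:

```python
import math

LANGUAGES = ["Mandarin", "Korean", "English"]  # illustrative label set

def softmax(logits):
    """Convert raw classifier scores into probabilities summing to 1."""
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_language(logits):
    """Return (language, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LANGUAGES[best], probs[best]

lang, p = pick_language([0.2, 2.1, -0.5])
print(lang)  # Korean
```

The actual model in the embodiment is a trained neural classifier; only its final probability-to-decision step is shown here.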
For feature extraction, preprocessing such as pre-emphasis, framing, and windowing may first be applied to the raw audio data of the sub-audio to obtain preprocessed audio data. This is because the audio signal is time-varying, yet can be considered stationary over a very short time range, so the raw audio data may be divided into frames for convenient subsequent processing. For example, the raw audio data can be divided into frames of 20-40 milliseconds (ms); this avoids both the problem that the spectrum cannot be estimated because each frame holds too little data when the frame is too short, and the problem that the stationarity assumption no longer holds when the frame is too long. In addition, pre-emphasis and windowing can be performed during framing; for example, the offset of the time window can be set to half the frame length, which mitigates spectral leakage and overly large feature changes between adjacent frames, and removes the discontinuity of each short-time frame at its two edges.
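These preprocessing steps can be sketched directly (a plain-Python illustration under the stated assumptions: 16 kHz audio, 25 ms frames, a half-frame hop, a Hamming window, and the common pre-emphasis coefficient 0.97, which the description does not fix):

```python
import math

def preemphasize(signal, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]; boosts the high-frequency part."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_signal(signal, frame_len, hop):
    """Split into overlapping frames; hop = frame_len // 2 gives the
    half-frame-length window offset described above."""
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]

def hamming(frame):
    """Taper frame edges to suppress the end-point discontinuity."""
    n = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]

# 16 kHz audio, 25 ms frames (400 samples), 50% overlap (hop 200)
sig = preemphasize([math.sin(0.01 * n) for n in range(1600)])
frames = [hamming(f) for f in frame_signal(sig, 400, 200)]
print(len(frames), len(frames[0]))  # 7 400
```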
Furthermore, the initial audio data can be converted from the time domain to the frequency domain; that is, a Fourier transform is performed on the initial audio data to obtain initial frequency domain data. The initial frequency domain data consists of floating-point spectrum amplitudes with a wide numerical range; if frequency domain features were extracted and computed directly on it, the numerical range of the spectrum amplitudes would expand further, requiring more storage space and slowing computation. Therefore, the electronic device may further perform integer conversion on the initial frequency domain data, converting the floating-point data into integer data to obtain intermediate frequency domain data. The electronic device may then perform feature extraction on the intermediate frequency domain data, such as mel filtering, taking logarithms, or a discrete cosine transform (Discrete Cosine Transform, DCT), to obtain the frequency domain features, i.e., the target frequency domain data.
In some embodiments, the step of performing integer conversion processing on the initial frequency domain data to generate intermediate frequency domain data may include: and carrying out normalization processing on the initial frequency domain data, and carrying out integer conversion processing on the normalized initial frequency domain data to generate intermediate frequency domain data.
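A minimal sketch of this normalize-then-convert step (the 8-bit target range and the function name are assumptions for illustration; the embodiment does not fix a bit width):

```python
def quantize_spectrum(amplitudes, bits=8):
    """Normalize floating-point spectrum amplitudes to [0, 1], then map
    them to integers in [0, 2**bits - 1] (the 'integer conversion' step)."""
    lo, hi = min(amplitudes), max(amplitudes)
    span = hi - lo or 1.0          # avoid division by zero on flat input
    scale = (1 << bits) - 1
    return [round((a - lo) / span * scale) for a in amplitudes]

print(quantize_spectrum([0.0, 0.5, 1.0, 2.0]))  # [0, 64, 128, 255]
```

Working on small integers instead of wide-range floats is what yields the storage and speed benefit described above; mel filtering, log, or DCT would then run on these integer values.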
Step 208, respectively generating caption files corresponding to the plurality of sub-audio according to the voice recognition result.
In one embodiment, the text information in each sub-audio may be obtained after speech recognition. The text information refers to the text contained in the audio, that is, the content of the audio presented in textual form. Finally, the text information of each sub-audio is made into a corresponding subtitle file, and the subtitle file contains start-stop timestamps.
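Assembling recognized text plus start-stop timestamps into a subtitle file can be sketched as below. SRT is used here as one common subtitle format; the embodiment does not mandate a specific format, and the function names are illustrative.

```python
def fmt_time(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments) -> str:
    """Render (start_sec, end_sec, text) triples as an SRT subtitle file."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(f"{i}\n{fmt_time(start)} --> {fmt_time(end)}\n{text}\n")
    return "\n".join(blocks)

print(to_srt([(0.0, 2.5, "hello"), (2.5, 4.0, "world")]))
```

Each sub-audio's recognized text becomes one numbered block, with the sub-audio's start-stop timestamps carried over from the segmentation step.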
All the above technical solutions may be combined to form an optional embodiment of the present application, and will not be described in detail herein.
As can be seen from the foregoing, the audio processing method provided by the embodiment of the present application may obtain initial audio data and convert its sampling rate; determine whether the initial audio data contains an associated subtitle file and, if not, perform voice extraction processing on the audio data after sampling rate conversion; select a first division point in the voice audio according to a first mute duration threshold and perform a first division of the voice audio based on that point to obtain a plurality of audio segments; among the plurality of audio segments, divide again, based on a second mute duration threshold, the audio segments whose duration is greater than the preset audio duration interval; divide sequentially according to the remaining mute duration thresholds until a plurality of sub-audios are obtained after division according to the minimum mute duration threshold; determine the language information corresponding to the plurality of sub-audios; perform voice recognition on the plurality of sub-audios according to the language information; and generate subtitle files corresponding to the plurality of sub-audios according to the voice recognition results. According to the embodiment of the application, the non-human-voice part of the audio can be removed so that only the human voice audio is segmented and the corresponding subtitle files are acquired, making the finally obtained subtitle files more accurate.
In order to facilitate better implementation of the audio processing method according to the embodiment of the present application, the embodiment of the present application further provides an audio processing device. Referring to fig. 5, fig. 5 is a schematic structural diagram of an audio processing device according to an embodiment of the application. The audio processing apparatus may include:
a conversion module 301, configured to obtain initial audio data, and convert a sampling rate of the initial audio data;
a judging module 302, configured to judge whether the initial audio data includes an associated subtitle file;
an extracting module 303, configured to perform voice extraction processing on the audio data after the sampling rate conversion when the judging module 302 judges that the initial audio data does not contain an associated subtitle file;
the segmentation module 304 is configured to segment the extracted voice audio according to a preset audio duration interval and a mute duration threshold, so as to obtain a plurality of sub-audio including start-stop time stamps;
and the acquiring module 305 is configured to acquire subtitle files corresponding to the plurality of sub-audio respectively.
In an embodiment, please further refer to fig. 6, fig. 6 is a schematic diagram illustrating another structure of an audio processing apparatus according to an embodiment of the present application. The segmentation module 304 may specifically include:
a first splitting module 3041, configured to select a first splitting point in the voice audio according to a first mute duration threshold, and perform first splitting on the voice audio based on the first splitting point, so as to obtain a plurality of audio segments;
A second segmentation module 3042, configured to segment, among the plurality of audio segments, an audio segment having an audio time length greater than the preset audio time length interval again based on a second mute time length threshold;
and a third segmentation module 3043, configured to sequentially segment according to the remaining mute duration threshold until a plurality of sub-audios are obtained after segmentation according to the minimum mute duration threshold.
In an embodiment, the obtaining module 305 may specifically include:
a determining submodule 3051, configured to determine language information corresponding to the plurality of sub-audio frequencies;
the recognition submodule 3052 is used for respectively carrying out voice recognition on the plurality of sub-audio according to the language information;
and the generating submodule 3053 is used for respectively generating subtitle files corresponding to the plurality of sub-audios according to the voice recognition result.
All the above technical solutions may be combined to form an optional embodiment of the present application, and will not be described in detail herein.
As can be seen from the above, in the audio processing apparatus provided by the embodiment of the present application, by acquiring initial audio data, converting the sampling rate of the initial audio data, determining whether the initial audio data includes an associated subtitle file, if not, performing a human voice extraction process on the audio data after converting the sampling rate, and slicing the extracted human voice audio according to a preset audio duration interval and a mute duration threshold, so as to obtain a plurality of sub-audios including start-stop time stamps, and acquiring subtitle files corresponding to the plurality of sub-audios respectively. According to the embodiment of the application, the non-human voice part in the audio can be removed, only the human voice audio is segmented, and the corresponding subtitle file is acquired, so that the accuracy of the finally obtained subtitle file is higher.
Correspondingly, the embodiment of the application further provides an electronic device, which may be a terminal or a server; the terminal may be a terminal device such as a smart phone, a tablet computer, a notebook computer, a touch screen, a game machine, a personal computer (PC, Personal Computer), or a personal digital assistant (Personal Digital Assistant, PDA). Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 400 includes a processor 401 having one or more processing cores, a memory 402 having one or more storage media, and a computer program stored on the memory 402 and executable on the processor. The processor 401 is electrically connected to the memory 402. It will be appreciated by those skilled in the art that the electronic device structure shown in the figure does not limit the electronic device, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
The processor 401 is a control center of the electronic device 400, connects various parts of the entire electronic device 400 using various interfaces and lines, and performs various functions of the electronic device 400 and processes data by running or loading software programs and/or modules stored in the memory 402, and calling data stored in the memory 402, thereby performing overall monitoring of the electronic device 400.
In the embodiment of the present application, the processor 401 in the electronic device 400 loads the instructions corresponding to the processes of one or more application programs into the memory 402 according to the following steps, and the processor 401 executes the application programs stored in the memory 402, so as to implement various functions:
acquiring initial audio data and converting the sampling rate of the initial audio data;
judging whether the initial audio data contains associated subtitle files or not;
if not, performing voice extraction processing on the audio data after the sampling rate conversion;
dividing the extracted voice audio according to a preset audio duration interval and a mute duration threshold value to obtain a plurality of sub-audio containing start-stop time stamps;
and acquiring the subtitle files respectively corresponding to the plurality of sub-audio.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
Optionally, as shown in fig. 7, the electronic device 400 further includes: a touch display 403, a radio frequency circuit 404, an audio circuit 405, an input unit 406, and a power supply 407. The processor 401 is electrically connected to the touch display 403, the radio frequency circuit 404, the audio circuit 405, the input unit 406, and the power supply 407, respectively. It will be appreciated by those skilled in the art that the electronic device structure shown in fig. 7 is not limiting of the electronic device and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
The touch display 403 may be used to display a graphical user interface and receive operation instructions generated by a user acting on the graphical user interface. The touch display screen 403 may include a display panel and a touch panel. The display panel may be used to display information entered by or provided to the user, as well as various graphical user interfaces of the electronic device, which may be composed of graphics, text, icons, video, and any combination thereof. Alternatively, the display panel may be configured in the form of a liquid crystal display (LCD, Liquid Crystal Display), an organic light-emitting diode (OLED, Organic Light-Emitting Diode), or the like. The touch panel may be used to collect touch operations on or near it (such as operations by the user with a finger, stylus, or any other suitable object or accessory on or near the touch panel) and generate corresponding operation instructions that execute corresponding programs. Alternatively, the touch panel may include two parts: a touch detection device and a touch controller. The touch detection device detects the position touched by the user, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and sends the coordinates to the processor 401, and can also receive and execute commands sent from the processor 401. The touch panel may overlay the display panel; upon detecting a touch operation on or near it, the touch panel passes the operation to the processor 401 to determine the type of touch event, and the processor 401 then provides a corresponding visual output on the display panel according to the type of touch event.
In the embodiment of the present application, the touch panel and the display panel may be integrated into the touch display screen 403 to realize the input and output functions. In some embodiments, however, the touch panel and the display panel may be implemented as two separate components to perform the input and output functions; that is, the touch display 403 may also implement an input function as part of the input unit 406.
In an embodiment of the present application, the graphical user interface is generated on the touch display 403 by the processor 401 executing an application program. The touch display 403 is used for presenting a graphical user interface and receiving an operation instruction generated by a user acting on the graphical user interface.
The radio frequency circuitry 404 may be used to transceive radio frequency signals to establish wireless communication with a network device or other electronic device via wireless communication.
The audio circuitry 405 may be used to provide an audio interface between the user and the electronic device through a speaker and a microphone. The audio circuit 405 may convert received audio data into an electrical signal and transmit it to the speaker, which converts it into a sound signal for output; conversely, the microphone converts collected sound signals into electrical signals, which the audio circuit 405 receives and converts into audio data. The audio data is processed by the processor 401 and then sent, for example via the radio frequency circuit 404, to another electronic device, or output to the memory 402 for further processing. The audio circuit 405 may also include an earbud jack to provide communication between peripheral headphones and the electronic device.
The input unit 406 may be used to receive input numbers, character information, or user characteristic information (e.g., fingerprint, iris, facial information, etc.), and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
The power supply 407 is used to power the various components of the electronic device 400. Alternatively, the power supply 407 may be logically connected to the processor 401 through a power management system, so as to implement functions of managing charging, discharging, and power consumption management through the power management system. The power supply 407 may also include one or more of any of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
Although not shown in fig. 7, the electronic device 400 may further include a camera, a sensor, a wireless fidelity module, a bluetooth module, etc., which are not described herein.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
As can be seen from the above, the electronic device provided in this embodiment may obtain initial audio data, convert the sampling rate of the initial audio data, determine whether the initial audio data includes an associated subtitle file, if not, perform a voice extraction process on the audio data after converting the sampling rate, segment the extracted voice audio according to a preset audio duration interval and a mute duration threshold, so as to obtain a plurality of sub-audios including start-stop time stamps, and obtain subtitle files corresponding to the plurality of sub-audios respectively. According to the embodiment of the application, the non-human voice part in the audio can be removed, only the human voice audio is segmented, and the corresponding subtitle file is acquired, so that the accuracy of the finally obtained subtitle file is higher.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions or by controlling associated hardware, which may be stored in a storage medium and loaded and executed by a processor.
To this end, an embodiment of the present application provides a storage medium in which a plurality of computer programs are stored, the computer programs being capable of being loaded by a processor to perform the steps of any of the audio processing methods provided by the embodiments of the present application. For example, the computer program may perform the steps of:
acquiring initial audio data and converting the sampling rate of the initial audio data;
judging whether the initial audio data contains associated subtitle files or not;
if not, performing voice extraction processing on the audio data after the sampling rate conversion;
dividing the extracted voice audio according to a preset audio duration interval and a mute duration threshold value to obtain a plurality of sub-audio containing start-stop time stamps;
and acquiring the subtitle files respectively corresponding to the plurality of sub-audio.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
Wherein the storage medium may include: a read-only memory (ROM, Read Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk or optical disk, and the like.
Since the computer program stored in the storage medium can execute the steps of any audio processing method provided by the embodiments of the present application, it can achieve the beneficial effects of any such method; for details, refer to the foregoing embodiments, which are not repeated here.
The foregoing describes in detail an audio processing method, apparatus, storage medium and electronic device provided in the embodiments of the present application, and specific examples are applied to illustrate the principles and implementations of the present application, where the foregoing examples are only used to help understand the method and core idea of the present application; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope in light of the ideas of the present application, the present description should not be construed as limiting the present application.

Claims (9)

1. An audio processing method, comprising:
acquiring initial audio data and converting the sampling rate of the initial audio data;
Judging whether the initial audio data contains associated subtitle files or not;
if not, performing voice extraction processing on the audio data after the conversion of the sampling rate according to a voice separation model of a PIT-based deep neural network;
dividing the extracted voice audio according to a preset audio duration interval and a mute duration threshold to obtain a plurality of sub-audios comprising start-stop time stamps, wherein the mute duration threshold comprises a plurality of thresholds which are gradually decreased, a first dividing point is selected in the voice audio according to a first mute duration threshold, the voice audio is divided for the first time based on the first dividing point to obtain a plurality of audio segments, among the plurality of audio segments, the audio segment which is longer than the preset audio duration interval in terms of the audio is divided again based on a second mute duration threshold, and the dividing is sequentially performed according to the rest mute duration threshold until a plurality of sub-audios are obtained after the dividing according to the minimum mute duration threshold, and the plurality of sub-audios are located in the preset audio duration interval;
and acquiring the subtitle files respectively corresponding to the plurality of sub-audio.
2. The audio processing method of claim 1, wherein the acquiring initial audio data and converting the sampling rate of the initial audio data comprises:
Acquiring source codes of a current webpage, and acquiring URL information of target audio in the source codes by a regular expression method;
downloading the target audio through the URL information to obtain initial audio data;
and converting the sampling rate of the initial audio data according to a preset sampling rate.
3. The audio processing method of claim 1, wherein the method further comprises:
if the initial audio data contains the associated subtitle file, inputting the initial audio data and the associated subtitle file into a pre-constructed word level alignment model for operation;
outputting word level alignment results corresponding to the initial audio data.
4. The audio processing method of claim 1, wherein the method further comprises:
respectively judging whether the audio duration of each sub-audio is smaller than the preset audio duration interval in the plurality of sub-audios;
if so, determining a target sub-audio among the adjacent preceding and following sub-audios according to the preset audio duration interval, and combining the current sub-audio with the target sub-audio.
5. The audio processing method according to claim 1, wherein the acquiring subtitle files corresponding to the plurality of sub-audio respectively includes:
Determining language information corresponding to the plurality of sub-audio frequencies;
respectively carrying out voice recognition on the plurality of sub-audio according to the language information;
and respectively generating the subtitle files corresponding to the plurality of sub-audio according to the voice recognition result.
6. The audio processing method of claim 5, wherein said determining language information corresponding to the plurality of sub-audio comprises:
extracting the voice characteristics of the sub-audio;
inputting the voice characteristics into a language classification model, and outputting probability values corresponding to the various language information;
and determining the target language information according to the probability value.
7. An audio processing apparatus, comprising:
the conversion module is used for acquiring initial audio data and converting the sampling rate of the initial audio data;
the judging module is used for judging whether the initial audio data contains associated subtitle files or not;
the extraction module is used for carrying out voice extraction processing on the audio data after the sampling rate conversion according to a voice separation model of a PIT-based deep neural network when the judging module judges that the initial audio data does not contain an associated subtitle file;
the audio processing device comprises a segmentation module, a segmentation module and a processing module, wherein the segmentation module is used for segmenting the extracted voice audio according to a preset audio duration interval and a mute duration threshold to obtain a plurality of sub-audios comprising start-stop time stamps, the mute duration threshold comprises a plurality of thresholds which are gradually decreased, a first segmentation point is selected in the voice audio according to a first mute duration threshold, the voice audio is segmented for the first time based on the first segmentation point to obtain a plurality of audio segments, the audio segments which are longer than the preset audio duration interval in terms of audio duration are segmented again based on a second mute duration threshold, and segmentation is sequentially carried out according to the rest mute duration thresholds until a plurality of sub-audios are obtained after segmentation according to a minimum mute duration threshold, and the plurality of sub-audios are located in the preset audio duration interval;
And the acquisition module is used for acquiring the subtitle files corresponding to the plurality of sub-audio respectively.
8. A storage medium storing a computer program adapted to be loaded by a processor to perform the steps of the audio processing method according to any of claims 1-6.
9. An electronic device comprising a memory in which a computer program is stored and a processor that performs the steps in the audio processing method according to any one of claims 1-6 by calling the computer program stored in the memory.
CN202310053892.0A 2023-02-03 2023-02-03 Audio processing method and device, storage medium and electronic equipment Active CN116721662B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310053892.0A CN116721662B (en) 2023-02-03 2023-02-03 Audio processing method and device, storage medium and electronic equipment


Publications (2)

Publication Number Publication Date
CN116721662A CN116721662A (en) 2023-09-08
CN116721662B true CN116721662B (en) 2023-12-01

Family

ID=87868434

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310053892.0A Active CN116721662B (en) 2023-02-03 2023-02-03 Audio processing method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116721662B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103647979A (en) * 2013-12-27 2014-03-19 乐视致新电子科技(天津)有限公司 Method and device for processing remote subtitles of intelligent television and media player thereof
CN106878805A (en) * 2017-02-06 2017-06-20 广东小天才科技有限公司 A kind of mixed languages subtitle file generation method and device
CN109257547A (en) * 2018-09-21 2019-01-22 南京邮电大学 The method for generating captions of Chinese online audio-video
CN110853618A (en) * 2019-11-19 2020-02-28 腾讯科技(深圳)有限公司 Language identification method, model training method, device and equipment
CN112735476A (en) * 2020-12-29 2021-04-30 北京声智科技有限公司 Audio data labeling method and device

Also Published As

Publication number Publication date
CN116721662A (en) 2023-09-08

Similar Documents

Publication Publication Date Title
CN110853618B (en) Language identification method, model training method, device and equipment
CN110460872B (en) Information display method, device and equipment for live video and storage medium
CN102568478B (en) Video play control method and system based on voice recognition
CN110517689B (en) Voice data processing method, device and storage medium
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
CN104252861A (en) Video voice conversion method, video voice conversion device and server
CN113488024B (en) Telephone interrupt recognition method and system based on semantic recognition
EP3593346B1 (en) Graphical data selection and presentation of digital content
CN113035199B (en) Audio processing method, device, equipment and readable storage medium
CN112309365A (en) Training method and device of speech synthesis model, storage medium and electronic equipment
CN113129867B (en) Training method of voice recognition model, voice recognition method, device and equipment
CN113823323B (en) Audio processing method and device based on convolutional neural network and related equipment
CN112151015A (en) Keyword detection method and device, electronic equipment and storage medium
CN115798459B (en) Audio processing method and device, storage medium and electronic equipment
CN114095782A (en) Video processing method and device, computer equipment and storage medium
CN115273840A (en) Voice interaction device and voice interaction method
CN114242064A (en) Speech recognition method and device, and training method and device of speech recognition model
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
CN113225624A (en) Time-consuming determination method and device for voice recognition
CN108989551B (en) Position prompting method and device, storage medium and electronic equipment
CN116721662B (en) Audio processing method and device, storage medium and electronic equipment
WO2019228140A1 (en) Instruction execution method and apparatus, storage medium, and electronic device
CN112261321B (en) Subtitle processing method and device and electronic equipment
CN113516963B (en) Audio data generation method and device, server and intelligent sound box
CN114049875A (en) TTS (text to speech) broadcasting method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Audio processing methods, devices, storage media, and electronic devices

Granted publication date: 20231201

Pledgee: Jiang Wei

Pledgor: BEIJING INTENGINE TECHNOLOGY Co.,Ltd.

Registration number: Y2024980019734