CN112397102A - Audio processing method and device and terminal


Info

Publication number
CN112397102A
CN112397102A
Authority
CN
China
Prior art keywords
audio
file
target
clip
audio file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910749571.8A
Other languages
Chinese (zh)
Other versions
CN112397102B (en)
Inventor
胡贝 (Hu Bei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201910749571.8A
Publication of CN112397102A
Application granted
Publication of CN112397102B
Legal status: Active

Classifications

    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11B - INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B20/00 - Signal processing not specific to the method of recording or reproducing; Circuits therefor
    • G11B20/10 - Digital recording or reproducing
    • G11B20/10527 - Audio or video recording; Data buffering arrangements
    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11B - INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B20/00 - Signal processing not specific to the method of recording or reproducing; Circuits therefor
    • G11B20/10 - Digital recording or reproducing
    • G11B20/18 - Error detection or correction; Testing, e.g. of drop-outs
    • G11B20/1803 - Error detection or correction; Testing, e.g. of drop-outs by redundancy in data representation
    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11B - INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B20/00 - Signal processing not specific to the method of recording or reproducing; Circuits therefor
    • G11B20/10 - Digital recording or reproducing
    • G11B20/10527 - Audio or video recording; Data buffering arrangements
    • G11B2020/10537 - Audio or video recording
    • G11B2020/10546 - Audio or video recording specifically adapted for audio data
    • G11B2020/10555 - Audio or video recording specifically adapted for audio data wherein the frequency, the amplitude, or other characteristics of the audio signal is taken into account

Abstract

The application discloses an audio processing method, an audio processing apparatus, and a terminal. A corresponding audio clip can be acquired from the audio that does not meet the requirement of the first audio signal and added to the detected audio data, so that the loss of part of the audio data around the first-audio-signal detection point is avoided. If the first audio signal represents a voiced signal, the loss of voice data is thereby reduced and the user's requirements are met.

Description

Audio processing method and device and terminal
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to an audio processing method, an audio processing apparatus, and a terminal.
Background
Voice activity detection identifies and eliminates long silent periods from a sound signal stream, so that the finally retained audio file contains only the identified voiced audio. However, owing to factors such as the environment, part of the voice data may be lost from the finally retained audio file, and the user's requirements therefore cannot be met.
Disclosure of Invention
In view of this, the present application provides an audio processing method, an audio processing apparatus, and a terminal, so that loss of audio data can be reduced, and user requirements are met.
To achieve the above object, in one aspect, the present application provides an audio processing method, including:
starting audio recording, and generating a first audio file from the recorded audio in response to detecting a first audio signal;
acquiring a target audio clip from a first audio file, and storing the target audio clip to a second audio file;
and writing the detected audio data matched with the first audio signal into the second audio file to obtain a target audio file.
In one possible implementation manner, the acquiring a target audio clip from a first audio file includes:
acquiring audio recording duration corresponding to a first audio file and audio duration of audio data matched with the first audio signal;
obtaining backtracking audio time length according to the audio recording time length and the audio time length;
and extracting an audio clip corresponding to the backtracking audio duration from the first audio file to obtain a target audio clip.
In another possible implementation manner, the extracting an audio clip corresponding to the backtracking audio duration from the first audio file to obtain a target audio clip includes:
obtaining the backtracking audio length according to the backtracking audio duration;
acquiring an ending audio frame of the first audio file;
and selecting an audio clip with the length being the backtracking audio length forward from the ending audio frame to obtain a target audio clip.
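The three claimed steps (deriving a backtracking length, locating the ending audio frame, and selecting backward) can be sketched on raw PCM byte buffers. This is an illustrative sketch under assumed representations, not the patented implementation; the function names and the byte-level view of the audio are assumptions.

```python
def extract_target_clip(first_file: bytes, backtrack_len: int) -> bytes:
    """Select a clip of backtrack_len bytes, counting backward from the
    ending audio frame of the first audio file."""
    if backtrack_len >= len(first_file):
        return first_file  # the clip cannot exceed the recorded audio
    return first_file[len(first_file) - backtrack_len:]


def build_target_file(first_file: bytes, backtrack_len: int,
                      matched_audio: bytes) -> bytes:
    # The second audio file starts empty, receives the target clip, and then
    # the detected audio data matched with the first audio signal.
    second_file = extract_target_clip(first_file, backtrack_len)
    return second_file + matched_audio
```

For example, with a six-byte first file and a backtracking length of two bytes, the last two bytes are prepended to the matched audio data.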
In another aspect, the present application further provides an audio processing apparatus, including:
the generating unit is used for starting audio recording and responding to the detection of the first audio signal to generate a first audio file from the recorded audio;
the device comprises an acquisition unit, a storage unit and a processing unit, wherein the acquisition unit is used for acquiring a target audio clip in a first audio file and storing the target audio clip into a second audio file;
and the writing unit is used for writing the detected audio data matched with the first audio signal into the second audio file to obtain a target audio file.
In another aspect, the present application further provides a terminal, including:
a processor and a memory;
wherein the processor is configured to execute a program stored in the memory;
the memory is to store a program to at least:
starting audio recording, and generating a first audio file from the recorded audio in response to detecting a first audio signal;
acquiring a target audio clip from a first audio file, and storing the target audio clip to a second audio file;
and writing the detected audio data matched with the first audio signal into the second audio file to obtain a target audio file.
It can be seen that the audio recorded before the first audio signal is detected is stored in the first audio file, and the target audio file is obtained by combining the target audio clip from the first audio file with the detected audio data matched with the first audio signal. A corresponding audio clip can be acquired from the audio that does not meet the requirement of the first audio signal and added to the detected audio data, so that the loss of part of the audio data around the first-audio-signal detection point is avoided. If the first audio signal represents a voiced signal, the loss of voice data is thereby reduced and the user's requirements are met.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those skilled in the art can obtain other drawings based on the provided drawings without creative effort.
FIG. 1 illustrates a block diagram of an audio processing system according to an embodiment of the present application;
FIG. 2 is a schematic flow chart diagram illustrating an audio processing method according to an embodiment of the present application;
FIG. 3 illustrates a waveform diagram of audio data according to an embodiment of the present application;
FIG. 4 is a diagram illustrating an example of an audio application display interface according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating a method for obtaining a target audio segment according to an embodiment of the present application;
FIG. 6 is a diagram illustrating an example of audio data backtracking according to an embodiment of the present application;
FIG. 7 is a diagram illustrating an example of an audio detection scenario in accordance with an embodiment of the present application;
FIG. 8 is a flowchart illustrating a further method for obtaining a target audio clip according to an embodiment of the present application;
FIG. 9 is a block diagram of an embodiment of an audio processing apparatus according to an embodiment of the present application;
fig. 10 is a schematic diagram illustrating a configuration of a terminal according to an embodiment of the present application.
Detailed Description
According to the scheme, the audio file meeting the requirement can be accurately acquired in the audio application processes of audio recording, audio identification and the like, so that the loss of certain audio data is reduced.
In the embodiments of the present application, audio refers to sound existing in a scene, whether or not it can be recognized by human ears. Such sound includes sound signals with specific characteristics and may also include signals such as noise.
In order to facilitate understanding of the audio processing method of the present application, a system to which the method is applied is described below. Referring to fig. 1, a schematic diagram of the architecture of an audio processing system according to the present application is shown.
As shown in fig. 1, an audio processing system provided in an embodiment of the present application includes: an audio terminal 10 and a server 11. The audio terminal 10 and the server 11 are connected in communication via a network 12.
The audio terminal 10 may be a mobile terminal such as a mobile phone and a tablet computer, or a fixed terminal such as a personal computer with an audio acquisition function.
In the embodiment of the present application, the audio terminal 10 may be configured or connected with an audio collecting component such as a microphone to collect audio to be recorded, and transmit the recorded audio file to the server 11 through the network 12.
Accordingly, the server 11 may generate a final target audio file according to the audio collected by the audio terminal, and the final target audio file is output or stored through the audio terminal 10, or stored in a database of the server 11, so as to facilitate subsequent applications.
It should be noted that, during audio capture or audio recording, the audio terminal 10 adopts a segmented capture mode: the audio terminal first performs audio recording to obtain a first audio file, which contains the audio recorded before the first audio signal is detected. If the audio terminal records sound emitted by the user, this segmented capture mode is imperceptible to the user; that is, the user only needs to emit the sound to be recorded and will not notice that the audio terminal records it in segments.
The first audio signal characterizes a detection threshold, which may represent a decibel value of a sound, or a sound having a particular characteristic, such as a woman's voice or a child's voice. Because the first audio signal is a signal parameter or signal characteristic related to the detection threshold, some audio data that meets or approximates the acquisition or recording standard is lost when audio data is detected, owing to the threshold setting or to acquisition delay. For example, suppose the user's speech needs to be recorded and a sound decibel threshold is set: the current sound data is recorded, as the final target audio file, only when the detected sound level is above the threshold, so if the user speaks quietly, that part of the sound data is lost. Conversely, if the sound decibel threshold is set very low, such as close to zero, the recorded audio file will include a large number of unnecessary audio segments, occupy excessive resources during storage or transmission, and yield a poor result.
Therefore, in the embodiments of the present application, the audio data before the first audio signal is detected is stored to obtain the first audio file, and the detected audio data matched with the first audio signal is then stored, so that the target audio file can be obtained from the previously stored first audio file together with the actually detected audio data matched with the first audio signal. Specifically, a target audio clip is acquired from the first audio file and stored into a second audio file, and then the detected audio data matched with the first audio signal is continuously stored into the second audio file until audio data matched with the first audio signal can no longer be detected; the second audio file is then the target audio file. The target audio clip acquired from the first audio file may be audio data having characteristics similar to the first audio signal: if the first audio signal represents a sound decibel threshold, the target audio clip is sound data within a certain decibel range below that threshold. The target audio clip may also be sound data lost because of detection delay; for example, when the first audio signal is detected and storage begins, a recording delay may cause part of the sound data not to be recorded into the corresponding audio file. Since that part of the audio data is stored in the first audio file, the corresponding target audio clip can be extracted from it and combined with the recorded sound data to form the final target audio file.
The first recording module of the audio terminal, which records the audio before the first audio signal is detected, may or may not be the same module as the second recording module that records the detected audio data matched with the first audio signal. As shown in fig. 1, after the audio terminal 10 records the first audio file, it sends the first audio file to the server 11; the server 11 stores the first audio file in its audio database 110, then obtains the target audio clip from the first audio file and stores it into the second audio file, which is empty before the target audio clip is stored. After detecting the first audio signal, the audio terminal 10 continues to write the subsequent audio data into the second audio file. It should be noted that the audio data written into the second audio file is audio data matched with the first audio signal, where matching refers to audio data having the same characteristic as the first audio signal or falling within a certain deviation range of it. For example, if the first audio signal represents a sound decibel threshold, then sound data detected at that threshold, or fluctuating within a certain range around it, may be stored as audio data matched with the first audio signal.
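The "deviation range" notion of matching described above can be illustrated with a toy level check. The tolerance value, the decibel representation, and the function name are assumptions for illustration, not details from the patent.

```python
def matches_first_signal(frame_db: float, threshold_db: float,
                         tolerance_db: float = 3.0) -> bool:
    """A frame matches when its level reaches the decibel threshold or
    fluctuates within an assumed tolerance band below it."""
    return frame_db >= threshold_db - tolerance_db
```

With a 60 dB threshold and the assumed 3 dB band, a 58 dB frame still matches while a 50 dB frame does not.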
Optionally, an application, such as an audio recording application, an english learning application, and the like, may be run in the audio terminal, where the application is used for communicating with the server, and the audio terminal performs information interaction with the server through the application.
The following describes the interaction process between the terminal and the server in detail.
Referring to fig. 2, which shows a schematic flow interaction diagram of an embodiment of an audio processing method according to the present application, the method of the embodiment may include:
s201, the terminal starts audio recording, and responds to the detection of a first audio signal, the recorded audio is generated into a first audio file.
The terminal may record audio according to a start instruction. The start instruction may be generated along with the start of a corresponding audio application in the terminal; for example, for a dubbing application, when a user taps to enter the application, a start instruction is generated to instruct the terminal to start the audio recording function, that is, to start recording the current audio data. The start instruction may also be a timed instruction: audio recording starts a preset time after the user selects a certain audio application, for example 5 s after the user starts the application. Of course, the start instruction may also be input by the user, such as the user tapping a start-recording button to generate an audio recording instruction. In that case, when the user taps the start-recording button, the user does not necessarily speak immediately, and the decibel level of the user's speech does not necessarily reach the detection threshold; since audio recording is already in progress, the detected audio data can be recorded into the first audio file in real time.
In one possible case, before the first audio signal is detected, the audio recording function started by the terminal may be one initiated by an application after the terminal runs that application; that is, the application of the terminal performs the audio recording before the first audio signal is detected.
In another possible case, before the first audio signal is detected, the audio data is recorded by an audio recording function module of the terminal; that is, the audio application in the terminal only records audio data after the first audio signal is detected, and the audio data before the first audio signal is recorded by a recording function module of the terminal outside the audio application. This reduces the response scope of the audio application and minimizes the number of files cached in it, facilitating subsequent use and management.
S202, the terminal sends the first audio file to a server;
s203, the server acquires a target audio clip from the first audio file;
s204, the server stores the target audio clip into a second audio file;
s205, the server sends the second audio file to the terminal.
The first audio file is audio data recorded before the first audio signal is detected, for example, the first audio signal represents a signal spoken by a user, and the first audio file records the audio data before the user speaks.
In this embodiment of the application, after the terminal generates the first audio file, it may send the first audio file directly to the server, send the audio data of the first audio file after format conversion, or generate a file identifier for the first audio file and send the file with that identifier. The file identifier may be a time identifier of the audio recording, a terminal identifier, or other identification information that distinguishes the audio file from other audio files.
After the server acquires the first audio file and the terminal has detected the first audio signal, the server may acquire a target audio clip in the first audio file, where the target audio clip at least includes audio data of a part of the first audio file, and the server may store audio clip extraction rules for different audio applications in advance, or extract the target audio clip based on sampling parameters of a current audio application.
In one possible case, if the server stores an audio clip extraction rule for a certain audio application, then when it detects that the first audio file sent by the terminal belongs to that application, it invokes the corresponding extraction rule to extract the target audio clip from the first audio file. The extraction rule may directly define the position of the initial audio frame of the extracted clip and the length of the clip to be extracted; it may also define only the clip length, in which case, by default, audio frames of that length are extracted from the last audio frame of the first audio file forward to serve as the target audio clip. Extraction proceeds from back to front because audio frames closer to the detected first audio signal are more likely to contain audio features matching it and are more consistent with the audio data following the detection point, which facilitates subsequent use and analysis.
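The two rule shapes just described (start frame plus clip length, or clip length alone with a backward-from-the-end default) can be sketched as a lookup table. The application names, the rule table, and the assumed default rule are all hypothetical.

```python
# Hypothetical per-application extraction rules: a rule may define a
# start-frame position and a clip length, or only a clip length.
RULES = {
    "dubbing_app": {"clip_frames": 3},                    # length only
    "reading_app": {"start_frame": 1, "clip_frames": 2},  # position + length
}


def extract_clip(frames: list, app: str) -> list:
    rule = RULES.get(app, {"clip_frames": 2})  # assumed default rule
    n = rule["clip_frames"]
    if "start_frame" in rule:
        start = rule["start_frame"]
        return frames[start:start + n]
    # Default: take n frames backward from the ending audio frame.
    return frames[-n:]
```

The backward default mirrors the rationale above: the frames nearest the detection point are the ones most likely to match the first audio signal.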
In another possible case, if the server does not store an extraction rule for the audio application, a default extraction rule is adopted: the server analyzes the audio sampling parameters of the audio application currently running in the terminal, calculates the length of the audio clip to be extracted from those sampling parameters, and then performs extraction according to that length to obtain the target audio clip.
For another example, if the first audio signal represents audio data with a specific timbre, the server may obtain audio data matching the specific timbre in the first audio file through an extraction model, where the extraction model is trained by a large amount of sound data with different timbres and different decibels, so that audio data with the same timbre as the target specific timbre but possibly different sound decibels may be identified in the audio data in the first audio file.
Of course, the extraction condition of the audio segment may have other possibilities, and is not limited herein.
For convenience of subsequent analysis and use, after the target audio clip is extracted, the server stores it into a second audio file, which is empty before the target audio clip is stored. The target audio clip may be the starting clip of the second audio file or, depending on requirements, a part of the second audio file after other audio data has been stored; in the latter case it must be distinguishable from the other audio data, for example by setting a storage identifier for it, or by recording its storage start frame and storage end frame.
And after the server stores the target audio clip into the second audio file, the second audio file is sent to the terminal, so that the subsequent application of the terminal is facilitated.
S206, the terminal writes the detected audio data matched with the first audio signal into a second audio file to obtain a target audio file.
The terminal then continuously writes the detected audio data matched with the first audio signal into the second audio file until audio data matched with the first audio signal can no longer be detected. At that point, the second audio file contains not only the target audio clip but also the detected audio data meeting the expected conditions, and the second audio file is saved or output as the target audio file.
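Step S206 can be sketched as a loop that appends matched frames until one fails to match. The matcher predicate is an assumed stand-in for the terminal's detection logic, and the byte-buffer representation is illustrative.

```python
def write_matched_audio(second_file: bytearray, frames, matcher) -> bytes:
    """Append detected frames matched with the first audio signal until a
    frame no longer matches; the second file then holds the target file."""
    for frame in frames:
        if not matcher(frame):
            break  # matched audio can no longer be detected
        second_file.extend(frame)
    return bytes(second_file)
```

Here the bytearray stands for the second audio file, already holding the target audio clip before matched frames are appended.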
For convenience of understanding, the audio processing method provided by the embodiment of the present application is described with reference to the waveform diagram of audio data in fig. 3. Fig. 3(a) shows the waveform of a segment of complete audio data. If a prior-art audio processing method is adopted, that is, an audio detection threshold is set (for example, the first audio signal is used as the audio detection threshold) and recording starts only after the first audio signal is detected, the waveform of the audio data in the generated audio file is as shown in fig. 3(b): part of the audio data is lost, and the lost part is shown in fig. 3(c). The purpose of the embodiment of the present application is to restore the lost audio data and splice it in front of the audio data shown in fig. 3(b), obtaining the audio data of the target audio file, whose waveform is shown in fig. 3(d).
The prior-art method for processing audio data may be Voice Activity Detection (VAD), also called voice endpoint detection or voice boundary detection. Its purpose is to identify and eliminate long silent periods from a sound signal stream so as to save speech channel resources without reducing service quality; silence suppression saves precious bandwidth and helps reduce the end-to-end delay perceived by the user. When VAD performs forward silence detection, however, environmental noise, the loudness of the user's speech, the speaking rate, and other factors introduce a judgment threshold and a delay between the moment speech actually starts and the moment voice is detected, so errors occur in judging whether the user has started speaking, and part of the user's voice is lost in forward silence detection. The audio processing method provided in the embodiment of the present application addresses this problem: the target audio clip can be extracted from the first audio file stored beforehand and restored into the detected audio data to form complete audio data, thereby avoiding the loss of part of the user's voice.
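The patent does not specify the VAD internals, so the following toy energy-based forward silence detector is only meant to illustrate why quiet onset frames below the threshold would be dropped. The energy measure, threshold, and frame representation are all assumptions.

```python
def first_voiced_index(frames, energy_threshold):
    """Return the index of the first frame whose mean squared amplitude
    reaches the threshold, or None if every frame looks silent."""
    for i, frame in enumerate(frames):
        energy = sum(s * s for s in frame) / len(frame)
        if energy >= energy_threshold:
            return i
    return None
```

Frames before the returned index are exactly the data a threshold-only recorder would discard, and they are what the backtracking clip recovers.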
The audio processing method of the present application is described below with a specific application example. Referring to fig. 4, an exemplary audio application display interface is shown: the audio application displayed in the terminal is a children's English read-aloud application. The interface displays the English word the child needs to read and a corresponding picture, such as "Dog". When a microphone icon appears on the display interface, the child is expected to read the word. Suppose the child does not speak for a while after the icon appears and then starts speaking at the current position of the progress bar in fig. 4. At this point, VAD forward silence detection can be used to collect the voice data from the moment speech starts, while the earlier audio data is stored as a first audio file. A target audio clip is extracted from the first audio file; it may be an audio clip with a low decibel level, or a clip lost because of delay. The clip is then stored in front of the collected voice data, so that the specific situation of the child's reading can be known during playback. Because the extracted target audio clip may contain the child's quieter speech, analyzing it reveals shortcomings in the child's reading, such as inaccurate pronunciation or weak mastery, helping the child learn English better.
On the other hand, the earlier data in which nothing is spoken is not collected in full; that is, the audio segments that VAD detection regards as silence are not added directly to the collected sound data, so a large number of silent segments do not occupy bandwidth and storage space. In an application like the English read-aloud example above, the student's reading audio must be uploaded to a server or a corresponding external education terminal. With the audio processing method provided by this application, the slow transmission caused by sending audio data containing many silent segments is avoided, and the loss of effective audio segments is also avoided.
A possible implementation of the present application is explained below.
In another embodiment of the present application, a method for obtaining a target audio segment is provided, and referring to fig. 5, a flowchart of the method for obtaining a target audio segment is shown. The method can comprise the following steps:
s501, acquiring audio recording duration corresponding to a first audio file and audio duration of audio data matched with a first audio signal;
s502, obtaining backtracking audio time according to the audio recording time and the audio time;
s503, extracting an audio clip corresponding to the backtracking audio duration from the first audio file to obtain a target audio clip.
In this embodiment, the target audio clip in the first audio file is extracted by setting time parameters. In one possible implementation, the audio recording duration of the first audio file and the audio duration of the audio data matched with the first audio signal are obtained. If the first audio signal represents a signal that the user starts speaking, the audio recording duration may be understood as the silent duration, and the audio duration as the voiced duration.
In another implementation, the extraction may be based on the total duration and the voiced audio duration. Taking VAD detection as an example, when the voiced time within one detection-unit window (e.g., represented by a parameter `begin_confirm_window`) reaches the duration corresponding to the audio duration (e.g., represented by a parameter `begin_confirm`), it is determined that the user has started speaking and is no longer silent.
The backtracking audio duration is then calculated from the obtained time parameters. Deriving it from time parameters better reflects the continuity of the audio acquisition time and the accuracy of backtracking the lost segments.
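The patent leaves the exact combination of the two durations open. One plausible reading, shown here purely as an assumption, is that the voiced duration already sitting inside the first audio file is the amount to trace back, capped by the total recording duration.

```python
def backtrack_duration(recording_s: float, voiced_s: float) -> float:
    # Assumed rule: trace back the voiced time already captured in the
    # first audio file, never more than was actually recorded.
    return min(voiced_s, recording_s)
```

Under this assumption, 0.3 s of voiced audio inside a 5 s silent recording yields a 0.3 s backtracking duration.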
The backtracking audio duration indicates how much audio data needs to be traced back in the first audio file. Because this is a backtrack, the audio data corresponding to that duration is selected forward from the tail audio of the first audio file to serve as the target audio clip.
It should be noted that the backtracking audio duration is a time parameter, while extracting an audio clip from the first audio file requires extracting the corresponding audio bytes, so the backtracking audio length must be obtained from the backtracking audio duration. Specifically, the backtracking audio length may be determined from the correspondence between duration and audio byte length, or the number of bytes of the backtracking audio length may be calculated from the backtracking audio duration together with the sampling data of the audio application operated by the current terminal.
After the backtracking audio length, namely the number of bytes of audio to trace back, is determined, the ending audio frame of the first audio file is obtained, and an audio segment whose length equals the backtracking audio length is selected forward from the ending audio frame to obtain the target audio clip.
Referring to fig. 6, which shows an example of backtracking audio data: in fig. 6, 601 denotes the audio length of the audio data in the first audio file, 602 denotes the ending audio frame of the first audio file, 603 denotes the backtracking audio length, and 604 denotes the target audio clip. After the backtracking audio length is determined, the audio bytes corresponding to it are selected forward from the ending audio frame of the first audio file to serve as the target audio clip.
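The tail selection illustrated by fig. 6 amounts to taking the last N bytes of the first file; `backtrack_clip` below is a hypothetical helper, not code from the patent:

```python
def backtrack_clip(first_audio: bytes, backtrack_len: int) -> bytes:
    """Select an audio segment of `backtrack_len` bytes, counted forward
    (toward the beginning) from the ending audio frame of the first file."""
    if backtrack_len >= len(first_audio):
        # Less audio has been buffered than we need: trace back all of it.
        return first_audio
    return first_audio[len(first_audio) - backtrack_len:]
```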
The embodiment of the application also provides a method for obtaining the backtracking audio length by calculation from the backtracking audio duration. First, a preset sampling parameter is obtained; it represents the sampling parameter of the audio application currently running on the terminal, or the audio sampling parameter of the terminal itself. In essence it is the sampling parameter of the audio file to be obtained, i.e., it is determined by the service of the audio application product.
Specifically, the preset sampling parameters may include the sampling rate, the sampling bit number, and the number of channels. The backtracking audio length, that is, the byte size to trace back, is therefore calculated from these preset sampling parameters, so that the backtracked audio and the detected audio data corresponding to the first audio signal share uniform sampling parameters. The backtracking audio length = backtracking audio duration × sampling rate × sampling bit number × number of channels; the byte size to backtrack can be calculated by this formula.
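As a sketch of this formula: the text multiplies the sampling bit number directly, while a byte count divides the bit depth by 8, which the illustrative helper below does (all names are assumptions):

```python
def backtrack_bytes(duration_s: float, sample_rate: int,
                    bits_per_sample: int, channels: int) -> int:
    """Byte size to trace back for `duration_s` seconds of PCM audio:
    duration × sample rate × (bits per sample / 8) × channel count."""
    return int(duration_s * sample_rate * (bits_per_sample // 8) * channels)
```

For 0.5 s of 16 kHz, 16-bit, mono audio this yields 16000 bytes.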
The above audio processing method is described below with a specific application example. Referring to fig. 7, an exemplary diagram of an audio detection scenario is shown. In this scenario, in order to capture the data from the moment the user starts speaking, the user's voice data is received by a microphone and analyzed by means of VAD detection.
The voice data acquired by the microphone, which is real-time audio data that may or may not contain sound, is detected. Forward silence detection is then performed with VAD to determine from which point the user's voice is present. All audio data before speech is stored in a first audio file, and audio data is then retrieved from the first audio file and added to the recording file.
When it is detected that the user's voice is present, the duration of the audio that needs to be backtracked is calculated from the detection-unit duration parameter set for VAD detection.
For example, when begin_confirm_window denotes the detection-unit duration, then:

backtracking duration = begin_confirm_window

After the duration to backtrack is obtained, the byte size to backtrack is obtained as: backtracking duration × sampling rate × sampling bit number × number of channels. Data of the calculated byte length is then read from back to front out of the audio data stored in the first audio file and written into the record file to be stored.
The data detected after the user starts speaking is then continuously written into the recording file until the user finishes speaking or the recording ends. The audio file at this point is the file obtained after forward-silence-detection optimization, that is, an audio file with no lost voice data.
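The overall flow of this example, buffering audio while silent and then tracing back the buffer's tail into the record file once speech is detected, might be sketched as follows. All names and the `is_voiced` callback are illustrative assumptions standing in for the VAD:

```python
import io

def record_with_backtrack(frames, is_voiced, backtrack_len):
    """frames: iterable of byte chunks from the microphone.
    is_voiced(frame) -> bool stands in for VAD detection.
    Returns the contents of the record file as bytes."""
    pre_speech = io.BytesIO()   # the "first audio file": audio before speech
    record = io.BytesIO()       # the record file being assembled
    speaking = False
    for frame in frames:
        if not speaking:
            if is_voiced(frame):
                speaking = True
                buffered = pre_speech.getvalue()
                record.write(buffered[-backtrack_len:])  # trace back the tail
                record.write(frame)
            else:
                pre_speech.write(frame)  # still silent: keep buffering
        else:
            record.write(frame)          # after speech starts: write through
    return record.getvalue()
```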
In the embodiment of the application, the silent segments contained in the target audio file are reduced through forward silence detection, which reduces the traffic when uploading and downloading the audio data; and in subsequent activities that use the audio file as material, better results can be achieved after forward silence detection.
In another embodiment of the present application, there is also provided a method for obtaining a target audio segment, referring to fig. 8, which shows a flowchart of a method for obtaining a target audio segment, the method includes:
S801, acquiring a second audio signal corresponding to the first audio signal;
S802, extracting an audio clip matched with the second audio signal from the first audio file to obtain a target audio clip.
The decibel value of the second audio signal is smaller than the decibel value of the first audio signal; that is, the method obtains the target audio clip by lowering the detection threshold. Taking VAD detection of a user's speech as an example: because there is a judgment threshold and a delay between the actual speech and its detection, the beginning and ending parts of the speech waveform are sometimes discarded as silence, so the restored speech is distorted. It is therefore necessary to prepend a speech packet to the burst of speech to smooth over the problem; this speech packet is the target audio clip, and it is obtained by lowering the detection threshold. That is, if the decibel value of the first audio signal is A and the decibel value of the second audio signal is B, then B is smaller than A, with the difference lying within a floating proportional range of A.
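One way to realize a lowered-threshold extraction is to scan the first file's tail chunk by chunk and keep the trailing audio whose level still exceeds the lower threshold B. This is an illustrative sketch assuming 16-bit PCM, not the patented method:

```python
import array
import math

def rms_db(samples) -> float:
    """Approximate level in dBFS of 16-bit samples (full scale 32768 assumed)."""
    if len(samples) == 0:
        return -math.inf
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(max(rms, 1e-9) / 32768.0)

def clip_above_lower_threshold(pcm: bytes, threshold_db: float,
                               chunk_samples: int = 160) -> bytes:
    """Walk backward from the end of the first file and keep the trailing
    chunks whose level exceeds the lowered threshold B (threshold_db)."""
    samples = array.array("h", pcm)
    start = len(samples)
    while (start - chunk_samples >= 0
           and rms_db(samples[start - chunk_samples:start]) >= threshold_db):
        start -= chunk_samples
    return samples[start:].tobytes()
```

With a quiet lead-in around -70 dBFS followed by louder audio around -24 dBFS, a -40 dB threshold keeps only the louder tail.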
It should be noted that when the recorded audio is turned into the first audio file in the embodiment of the present application, it needs to be converted into a target format, and the target format is the same as the format of the first audio signal, that is, essentially the same format in which the target audio is captured, which facilitates the splicing of the subsequent audio.
For example, the audio data in the first audio file is saved into a binary array, i.e., stored in binary format. Since the sound-collection module, such as a microphone device, converts sound into binary data, the binary data is stored directly for later use. Other formats may also be used, although the data must then be converted to that format for storage and converted back to binary before the audio data is used.
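Storing the microphone's raw binary output directly might look like this minimal sketch (the callback name is a hypothetical illustration):

```python
# The "first audio file" kept as an in-memory binary array. Raw PCM already
# arrives as binary, so it is appended without any format conversion.
first_audio_buffer = bytearray()

def on_mic_data(chunk: bytes) -> None:
    """Hypothetical microphone callback: append raw binary audio as-is."""
    first_audio_buffer.extend(chunk)

on_mic_data(b"\x01\x02")
on_mic_data(b"\x03\x04")
```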
On the basis of the above embodiment, the audio processing method further includes:
and carrying out noise reduction processing on the target audio file to enable the audio attribute of the target audio file to be matched with the audio attribute of the first audio signal.
By performing noise reduction on the audio data in the target audio file, louder noise can be removed, so that the ambient sound in the target file matches the ambient sound of the first audio signal, that is, the ambient sound when the user speaks.
Naturally, the noise reduction processing may also be performed on the first audio file, so that an effective target audio clip can be extracted from heavy background noise. The specific noise reduction method is not limited in the embodiment of the present application; for example, open-source solutions such as Speex or WebRTC may be used.
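Speex and WebRTC provide production-grade noise suppression; as a purely illustrative stand-in, a toy amplitude gate on 16-bit PCM could look like this:

```python
import array

def noise_gate(pcm: bytes, floor: int = 200) -> bytes:
    """Zero out 16-bit samples whose amplitude is below `floor`.
    A toy gate for illustration, not Speex/WebRTC noise suppression."""
    samples = array.array("h", pcm)
    for i, s in enumerate(samples):
        if abs(s) < floor:
            samples[i] = 0
    return samples.tobytes()
```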
In another aspect, the present application also provides an audio processing apparatus, as shown in fig. 9, which shows a schematic composition diagram of an embodiment of an audio processing apparatus of the present application, where the apparatus of the present embodiment may be applied to a terminal, and the apparatus may include:
a generating unit 901, configured to start audio recording, and in response to detecting a first audio signal, generate a first audio file from the recorded audio;
an obtaining unit 902, configured to obtain a target audio clip in a first audio file, and store the target audio clip in a second audio file;
a writing unit 903, configured to write the detected audio data matched with the first audio signal into the second audio file, so as to obtain a target audio file.
In one possible case, the obtaining unit includes:
the first acquisition subunit is used for acquiring the audio recording duration corresponding to the first audio file and the audio duration of the audio data matched with the first audio signal;
the second obtaining subunit is configured to obtain a backtracking audio time length according to the audio recording time length and the audio time length;
and the first extraction subunit is used for extracting the audio clip corresponding to the backtracking audio duration from the first audio file to obtain a target audio clip.
Optionally, the first extraction subunit includes:
the length obtaining subunit is configured to obtain a backtracking audio length according to the backtracking audio duration;
an audio frame acquisition subunit, configured to acquire an ending audio frame of the first audio file;
and the selecting subunit is used for selecting the audio clip with the length equal to the backtracking audio length forward from the ending audio frame to obtain a target audio clip.
Optionally, the length obtaining subunit is specifically configured to:
acquiring preset sampling parameters, wherein the preset sampling parameters comprise a sampling rate, a sampling digit and a channel number;
and calculating the backtracking audio time length according to the preset sampling parameters to obtain the backtracking audio length.
In yet another possible case, the obtaining unit includes:
a third obtaining subunit, configured to obtain a second audio signal corresponding to the first audio signal, where a decibel value of the second audio signal is smaller than a decibel value of the first audio signal;
and the second extraction subunit is used for extracting the audio clip matched with the second audio signal from the first audio file to obtain a target audio clip.
Optionally, the generating unit includes:
and the format conversion subunit is used for converting the recorded audio into a target format to obtain a first audio file, wherein the target format is the same as the format of the first audio signal.
Optionally, in an embodiment of any one of the above apparatuses, the apparatus further includes:
and the noise reduction unit is used for carrying out noise reduction processing on the target audio file so that the audio attribute of the target audio file is matched with the audio attribute of the first audio signal.
On the other hand, the present application also provides a terminal, as shown in fig. 10, which shows a schematic structural diagram of the terminal of the present application, and the terminal 1000 of this embodiment may include: a processor 1001 and a memory 1002.
Optionally, the terminal may further include a communication interface 1003, an input unit 1004, and a display 1005 and communication bus 1006.
The processor 1001, the memory 1002, the communication interface 1003, the input unit 1004, and the display 1005 communicate with each other via the communication bus 1006.
In the embodiment of the present application, the processor 1001 may be a central processing unit (CPU), an application-specific integrated circuit, a digital signal processor, a field-programmable gate array, or another programmable logic device.
The processor may call a program stored in the memory 1002. In particular, the processor may perform the operations in the above embodiments of the audio processing method.
The memory 1002 is used for storing one or more programs, which may include program codes including computer operation instructions, and in this embodiment, the memory stores at least the programs for implementing the following functions:
starting audio recording, and generating a first audio file from the recorded audio in response to detecting a first audio signal;
acquiring a target audio clip from a first audio file, and storing the target audio clip to a second audio file;
and writing the detected audio data matched with the first audio signal into the second audio file to obtain a target audio file.
Further, the obtaining a target audio clip in the first audio file includes:
acquiring audio recording duration corresponding to a first audio file and audio duration of audio data matched with the first audio signal;
obtaining backtracking audio time length according to the audio recording time length and the audio time length;
and extracting an audio clip corresponding to the backtracking audio duration from the first audio file to obtain a target audio clip.
Further, the extracting an audio clip corresponding to the backtracking audio duration from the first audio file to obtain a target audio clip includes:
obtaining the backtracking audio length according to the backtracking audio duration;
acquiring an ending audio frame of the first audio file;
and selecting an audio clip with the length being the backtracking audio length forward from the ending audio frame to obtain a target audio clip.
Further, obtaining a backtracking audio length according to the backtracking audio duration includes:
acquiring preset sampling parameters, wherein the preset sampling parameters comprise a sampling rate, a sampling digit and a channel number;
and calculating the backtracking audio time length according to the preset sampling parameters to obtain the backtracking audio length.
Further, the obtaining a target audio clip in the first audio file includes:
acquiring a second audio signal corresponding to the first audio signal, wherein the decibel value of the second audio signal is smaller than the decibel value of the first audio signal;
and extracting an audio clip matched with the second audio signal from the first audio file to obtain a target audio clip.
Further, the generating the recorded audio into a first audio file includes:
and converting the recorded audio into a target format to obtain a first audio file, wherein the target format is the same as the format of the first audio signal.
Further, the method further comprises:
and carrying out noise reduction processing on the target audio file to enable the audio attribute of the target audio file to be matched with the audio attribute of the first audio signal.
In one possible implementation, the memory 1002 may include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function (such as an information output function, etc.), and the like; the storage data area may store data created during use of the computer, such as audio extraction rules, noise reduction models, and the like.
Further, the memory 1002 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device or another non-volatile solid-state storage device.
The communication interface 1003 may be an interface of a communication module, such as an interface of a GSM module.
The terminal may also include the display 1005, the input unit 1004, and the like.
Of course, the structure of the terminal shown in fig. 10 does not limit the terminal in the embodiment of the present application; in practical applications the terminal may include more or fewer components than those shown in fig. 10, or combine certain components.
On the other hand, the embodiment of the present application further provides a storage medium, where computer-executable instructions are stored in the storage medium, and when the computer-executable instructions are loaded and executed by a processor, the audio processing method in any one of the above embodiments is implemented.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that it is obvious to those skilled in the art that various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should also be considered as the protection scope of the present invention.

Claims (10)

1. An audio processing method, comprising:
starting audio recording, and generating a first audio file from the recorded audio in response to detecting a first audio signal;
acquiring a target audio clip from a first audio file, and storing the target audio clip to a second audio file;
and writing the detected audio data matched with the first audio signal into the second audio file to obtain a target audio file.
2. The method of claim 1, wherein obtaining the target audio clip in the first audio file comprises:
acquiring audio recording duration corresponding to a first audio file and audio duration of audio data matched with the first audio signal;
obtaining backtracking audio time length according to the audio recording time length and the audio time length;
and extracting an audio clip corresponding to the backtracking audio duration from the first audio file to obtain a target audio clip.
3. The method according to claim 2, wherein the extracting an audio segment corresponding to the backtracking audio duration from the first audio file to obtain a target audio segment includes:
obtaining the backtracking audio length according to the backtracking audio duration;
acquiring an ending audio frame of the first audio file;
and selecting an audio clip with the length being the backtracking audio length forward from the ending audio frame to obtain a target audio clip.
4. The method according to claim 3, wherein obtaining the trace-back audio length according to the trace-back audio duration comprises:
acquiring preset sampling parameters, wherein the preset sampling parameters comprise a sampling rate, a sampling digit and a channel number;
and calculating the backtracking audio time length according to the preset sampling parameters to obtain the backtracking audio length.
5. The method of claim 1, wherein obtaining the target audio clip in the first audio file comprises:
acquiring a second audio signal corresponding to the first audio signal, wherein the decibel value of the second audio signal is smaller than the decibel value of the first audio signal;
and extracting an audio clip matched with the second audio signal from the first audio file to obtain a target audio clip.
6. The method of claim 1, wherein generating the recorded audio into a first audio file comprises:
and converting the recorded audio into a target format to obtain a first audio file, wherein the target format is the same as the format of the first audio signal.
7. The method of claim 1, further comprising:
and carrying out noise reduction processing on the target audio file to enable the audio attribute of the target audio file to be matched with the audio attribute of the first audio signal.
8. An audio processing apparatus, comprising:
the generating unit is used for starting audio recording and responding to the detection of the first audio signal to generate a first audio file from the recorded audio;
the device comprises an acquisition unit, a storage unit and a processing unit, wherein the acquisition unit is used for acquiring a target audio clip in a first audio file and storing the target audio clip into a second audio file;
and the writing unit is used for writing the detected audio data matched with the first audio signal into the second audio file to obtain a target audio file.
9. The apparatus of claim 8, wherein the obtaining unit comprises:
the first acquisition subunit is used for acquiring the audio recording duration corresponding to the first audio file and the audio duration of the audio data matched with the first audio signal;
the second obtaining subunit is configured to obtain a backtracking audio time length according to the audio recording time length and the audio time length;
and the extracting subunit is used for extracting the audio clip corresponding to the backtracking audio duration from the first audio file to obtain a target audio clip.
10. A terminal, comprising:
a processor and a memory;
wherein the processor is configured to execute a program stored in the memory;
the memory is to store a program to at least:
starting audio recording, and generating a first audio file from the recorded audio in response to detecting a first audio signal;
acquiring a target audio clip from a first audio file, and storing the target audio clip to a second audio file;
and writing the detected audio data matched with the first audio signal into the second audio file to obtain a target audio file.
CN201910749571.8A 2019-08-14 2019-08-14 Audio processing method and device and terminal Active CN112397102B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910749571.8A CN112397102B (en) 2019-08-14 2019-08-14 Audio processing method and device and terminal

Publications (2)

Publication Number Publication Date
CN112397102A true CN112397102A (en) 2021-02-23
CN112397102B CN112397102B (en) 2022-07-08

Family

ID=74601390

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910749571.8A Active CN112397102B (en) 2019-08-14 2019-08-14 Audio processing method and device and terminal

Country Status (1)

Country Link
CN (1) CN112397102B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113612808A (en) * 2021-10-09 2021-11-05 腾讯科技(深圳)有限公司 Audio processing method, related device, storage medium, and program product
CN113689862A (en) * 2021-08-23 2021-11-23 南京优飞保科信息技术有限公司 Quality inspection method and system for customer service seat voice data
CN114242105A (en) * 2022-02-24 2022-03-25 麒麟软件有限公司 Method and system for implementing recording and noise reduction on Android application

Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020152082A1 (en) * 2000-04-05 2002-10-17 Harradine Vincent Carl Audio/video reproducing apparatus and method
CN1717720A (en) * 2003-09-05 2006-01-04 松下电器产业株式会社 Acoustic processing system, acoustic processing device, acoustic processing method, acoustic processing program, and storage medium
WO2009052428A1 (en) * 2007-10-19 2009-04-23 Rebelvox, Llc Method and system for real-time media synchronisation across a network
KR101382356B1 (en) * 2013-07-05 2014-04-10 대한민국 Apparatus for forgery detection of audio file
WO2015184861A1 (en) * 2014-06-03 2015-12-10 华为技术有限公司 Method and device for processing audio and image information, and terminal device
CN106205652A (en) * 2016-07-11 2016-12-07 广东小天才科技有限公司 A kind of audio frequency is with reading evaluating method and device
CN106205607A (en) * 2015-05-05 2016-12-07 联想(北京)有限公司 Voice information processing method and speech information processing apparatus
US20170092320A1 (en) * 2015-09-30 2017-03-30 Apple Inc. Automatic music recording and authoring tool
WO2017101260A1 (en) * 2015-12-15 2017-06-22 广州酷狗计算机科技有限公司 Method, device, and storage medium for audio switching
CN107018443A (en) * 2017-02-16 2017-08-04 乐蜜科技有限公司 Video recording method, device and electronic equipment
CN107910024A (en) * 2017-10-10 2018-04-13 深圳市金立通信设备有限公司 A kind of method of data recording, terminal and computer-readable recording medium
CN108449497A (en) * 2018-03-12 2018-08-24 广东欧珀移动通信有限公司 Voice communication data processing method, device, storage medium and mobile terminal
CN108470571A (en) * 2018-03-08 2018-08-31 腾讯音乐娱乐科技(深圳)有限公司 A kind of audio-frequency detection, device and storage medium
CN108597498A (en) * 2018-04-10 2018-09-28 广州势必可赢网络科技有限公司 A kind of multi-microphone voice acquisition method and device
CN108831424A (en) * 2018-06-15 2018-11-16 广州酷狗计算机科技有限公司 Audio splicing method, apparatus and storage medium
CN108874904A (en) * 2018-05-24 2018-11-23 平安科技(深圳)有限公司 Speech message searching method, device, computer equipment and storage medium
CN108962283A (en) * 2018-01-29 2018-12-07 北京猎户星空科技有限公司 A kind of question terminates the determination method, apparatus and electronic equipment of mute time
CN108965757A (en) * 2018-08-02 2018-12-07 广州酷狗计算机科技有限公司 video recording method, device, terminal and storage medium
CN109065017A (en) * 2018-07-24 2018-12-21 Oppo(重庆)智能科技有限公司 Voice data generation method and relevant apparatus
CN109510890A (en) * 2018-11-19 2019-03-22 深圳市品声科技有限公司 A kind of built-in method and apparatus recorded of bluetooth conversation
CN109903751A (en) * 2017-12-08 2019-06-18 阿里巴巴集团控股有限公司 Keyword Verification and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王良鸣 (Wang Liangming) et al.: "Detection and automatic repair of digital audio tapes based on a signal matching algorithm", Computer Applications and Software (《计算机应用与软件》) *

Also Published As

Publication number Publication date
CN112397102B (en) 2022-07-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant