CN112509538A - Audio processing method, device, terminal and storage medium - Google Patents

Audio processing method, device, terminal and storage medium

Info

Publication number
CN112509538A
CN112509538A (application number CN202011511470.6A)
Authority
CN
China
Prior art keywords
audio
user
information
synthetic
defined information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011511470.6A
Other languages
Chinese (zh)
Inventor
李琳
高山
张弛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Migu Cultural Technology Co Ltd
China Mobile Communications Group Co Ltd
Original Assignee
Migu Cultural Technology Co Ltd
China Mobile Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Migu Cultural Technology Co Ltd, China Mobile Communications Group Co Ltd filed Critical Migu Cultural Technology Co Ltd
Priority to CN202011511470.6A priority Critical patent/CN112509538A/en
Publication of CN112509538A publication Critical patent/CN112509538A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H: ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00: Details of electrophonic musical instruments
    • G10H1/0008: Associated control or indicating means
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003: Changing voice quality, e.g. pitch or formants
    • G10L21/007: Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10H2210/00: Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/101: Music Composition or musical creation; Tools or processes therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The embodiment of the invention relates to the field of media playback and discloses an audio processing method, an audio processing device, a terminal and a storage medium. The audio processing method comprises the following steps: obtaining a role-voice-removed audio track of a first audio in a first mode, wherein the first audio is audio in an original file played online; obtaining user-defined information of a user, the user-defined information comprising text information or voice information; and obtaining a synthetic audio for replacing the first audio according to the user-defined information of the user and the role-voice-removed audio track. In the embodiment of the invention, a user instruction is obtained in the first mode, the first audio that requires secondary creation is determined, and the synthetic audio is obtained by combining user-defined input. The user does not need to download the original file in which the first audio is located; secondary creation can be completed simply by selecting the first audio and providing custom input during playback. The execution process is convenient and fast, and user participation is greatly improved.

Description

Audio processing method, device, terminal and storage medium
Technical Field
The present invention relates to the field of media playback, and in particular, to an audio processing method, an audio processing apparatus, a terminal, and a storage medium.
Background
When watching movies or listening to music files, users often wish to participate in content creation. At present, user participation is basically limited to posting text bullet comments (danmaku) on a bullet screen.
If user-defined voice or text information is to be used as dialogue for secondary creation, the original file to be operated on must first be downloaded locally and processed before the user's synthesized audio can be obtained, whether for video files such as movies and TV shows or for audio files such as songs and broadcasts. This secondary-creation process is complex and inefficient, and the synthesized result does not match the character voice in the original file well.
Disclosure of Invention
The embodiments of the invention aim to provide an audio processing method, an audio processing device, a terminal, and a storage medium that make it convenient for a user to perform secondary creation on the audio of an existing file.
In order to solve the above technical problem, an embodiment of the present invention provides an audio processing method, including:
obtaining a role-voice-removed audio track of a first audio in a first mode, wherein the first audio is audio in an original file played online;
acquiring user-defined information of a user; the user-defined information comprises: text information or voice information;
and acquiring a synthetic audio for replacing the first audio according to the user-defined information of the user and the role-voice-removed audio track.
An embodiment of the present invention further provides an audio processing apparatus, including:
an acquisition module: obtaining a role-voice-removed audio track of a first audio in a first mode, wherein the first audio is audio in an original file played online;
a definition module: acquiring user-defined information of a user; the user-defined information comprises: text information or voice information;
a synthesis module: acquiring a synthetic audio for replacing the first audio according to the user-defined information of the user and the role-voice-removed audio track.
An embodiment of the present invention further provides a terminal, including: at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the audio processing method described above.
Embodiments of the present invention also provide a computer-readable storage medium storing a computer program, which when executed by a processor implements the above-described audio processing method.
Compared with the related art, the embodiments of the present invention obtain a user instruction in the first mode, determine the first audio to be secondarily created, and obtain the synthesized audio by combining user-defined input. The user does not need to download the original file in which the first audio is located; secondary creation can be completed simply by determining the first audio and providing custom input during online playback. The execution process is convenient and fast, and user participation is greatly improved.
In addition, after the synthetic audio for replacing the first audio is obtained, the method further includes: providing a plurality of synthetic audios in a second mode; determining a target synthetic audio among the plurality of synthetic audios; and replacing the first audio with the target synthetic audio for playback. After a user completes the secondary creation of the first audio, the result can be offered to other users as a selectable alternative while the original file is played online, which widens the propagation path of the secondarily created synthetic audio, makes playback of the file more interesting, and increases user participation.
In addition, before the synthetic audio is obtained according to the user-defined information and the role-voice-removed audio track, the method includes: detecting the matching degree between the user-defined information and the first audio. If the user-defined information is text information, its word count is queried, and the match between that word count and the word count of the line corresponding to the first audio is taken as the matching degree between the user-defined information and the first audio. If the user-defined information is voice information, its duration is queried, and the match between that duration and the duration of the first audio is taken as the matching degree. Before automatic secondary creation is performed, the custom information input by the user is checked in this way; if it differs too much from the first audio, the secondary-creation command is not executed, which guarantees the quality of the finished synthetic audio.
In addition, acquiring the synthetic audio for replacing the first audio according to the user-defined information and the role-voice-removed audio track includes: if the user-defined information is text information, converting the text information into standard audio and obtaining the feature vector of the standard audio; obtaining the feature vector of the role voice in the first audio; and obtaining the synthetic audio from the feature vector of the standard audio, the feature vector of the role voice in the first audio, and the role-voice-removed audio track. If the user-defined information is voice information, the synthetic audio is obtained from the voice information and the role-voice-removed audio track. The generation process thus differs by input type: synthetic audio can be obtained whichever kind of custom information the user supplies, synthetic audio built from text information keeps the character's timbre, synthetic audio built from voice information keeps the user's own timbre, and the user's personalized needs are met.
Additionally, providing a plurality of synthetic audios in the second mode includes: in the second mode, displaying them ranked by their scores. Sorting the synthetic audios by a preset rule before recommendation lets high-quality works be discovered by more users and, at the same time, presents users with the better synthetic works first.
In addition, after the first audio is replaced with the target synthetic audio for playback, the method includes: updating the score of the synthetic audio according to its playing state, complete playing time, and actually played time. The score is not fixed: after a user selects and plays a synthetic audio, the score is adjusted according to the playing state, the complete playing time, and the played time, so the scoring system better matches user habits and becomes more accurate.
Drawings
One or more embodiments are illustrated by the corresponding figures in the drawings, which are not meant to be limiting.
Fig. 1 is a flowchart of an audio processing method provided according to a first embodiment of the present invention;
fig. 2 is a flowchart of an implementation of a step in an audio processing method provided in accordance with a first embodiment of the invention;
fig. 3 is a flowchart of an audio processing method provided according to a second embodiment of the present invention;
fig. 4 is a schematic diagram of an audio processing apparatus provided according to a third embodiment of the present invention;
fig. 5 is a schematic diagram of a terminal provided according to a fourth embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the embodiments are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth in the embodiments to help the reader understand the present application; the claimed technical solution can nevertheless be implemented without these details, and with various changes and modifications based on the following embodiments. The division into embodiments is for convenience of description only and does not limit the specific implementation of the invention; the embodiments may be combined with and refer to one another where there is no contradiction.
The audio processing method of the invention is suitable for the secondary creation process of audio parts in video files such as movies, TV shows and the like and audio files such as songs, broadcasts and the like.
A first embodiment of the present invention relates to an audio processing method applied to a terminal, and a specific flow is shown in fig. 1.
Step 101, obtaining a role-voice-removed audio track of a first audio in a first mode, wherein the first audio is audio in an original file played online;
Step 102, acquiring user-defined information of a user; the user-defined information comprises: text information or voice information;
Step 103, acquiring a synthetic audio for replacing the first audio according to the user-defined information of the user and the role-voice-removed audio track.
The implementation details of the audio processing method of this embodiment are described below. The details are provided only for ease of understanding and are not required for implementing this embodiment.
For step 101, the first mode is selected before the original file played online starts playing; the first mode is an authoring mode for secondary creation. During playback of the online original file, a user instruction for determining the first audio is obtained; the instruction selects, from the original file, the first audio for secondary creation. The first audio is then denoised to obtain its role-voice-removed audio track, together with the time node information of the first audio. Alternatively, a complete role-voice-removed audio track of the original file can be prepared before the original file is played, the role-voice-removed audio track of the first audio determined directly from the user instruction, and the time node information of the first audio obtained. For example, if the original file played online is a movie, the movie's SRT-format subtitle file is read to obtain the start and stop times of all the lines; 1 second before the start of each line segment, the system monitors whether the user inputs an instruction, and if so, that line segment corresponds to a first audio and its start-stop time is recorded. If the original file played online is a song, the start-stop time of the first audio corresponding to the lyrics is obtained through the song's .lrc-format lyric file.
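The SRT-based selection just described (read the start and stop time of every line, then watch for a user instruction 1 second before each line starts) can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function names and the sample cue are ours.

```python
import re

# Hypothetical minimal parser: extracts (start, end) times, in seconds,
# for each subtitle cue in an SRT-format string. The patent only says the
# SRT file is read for line start/stop times; the exact parser is not given.
TIME_RE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2}),(\d{3})"
)

def parse_srt_cues(srt_text):
    """Return a list of (start_seconds, end_seconds) for every cue."""
    cues = []
    for m in TIME_RE.finditer(srt_text):
        h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
        start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000.0
        end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000.0
        cues.append((start, end))
    return cues

def monitor_window(cue_start, lead=1.0):
    """The instruction-monitoring window opens `lead` seconds before the cue."""
    return max(0.0, cue_start - lead)

sample = """1
00:00:05,500 --> 00:00:08,000
Hello there.
"""
cues = parse_srt_cues(sample)
```

For an .lrc lyric file the same idea applies, with `[mm:ss.xx]` timestamps instead of the `-->` ranges.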
In one example, after the first audio is determined, the role-voice voiceprint features are extracted with a Mel cepstrum algorithm, and an AI (artificial intelligence) vocal-elimination technique is invoked to obtain the music track with the role voice removed. This role-voice-removed track serves as the basis for secondary creation of the first audio, and using it keeps the result consistent with the playback environment of the first audio. For example, suppose the first audio is a character's inner monologue A delivered beside a waterfall; besides the role voice A, the first audio contains waterfall sound B1, insect sound B2, bird calls B3, other noise B4, and so on. The voiceprint feature of the role voice A is extracted with the Mel cepstrum algorithm, and the AI vocal-elimination technique is invoked to obtain an audio track with A removed and B1, B2, B3, and B4 retained. Because the largest variable in the secondary creation is the role voice A, the remaining invariant part, namely the role-voice-removed track, must be retained to ensure that the secondary creation has the same background effect as the first audio.
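The Mel cepstrum front end named above is built on the mel frequency scale. As a hedged aside (the patent gives no formulas; this is only the standard HTK-style mel mapping that such a front end rests on, not the patent's code):

```python
import math

# Standard mel-scale mapping (HTK convention). A Mel cepstrum / MFCC
# extractor spaces its filter bank evenly on this scale.
def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(f_min, f_max, n_bands):
    """Band edges equally spaced on the mel scale, as an MFCC filter bank uses."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]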
For step 102, after the role-voice-removed audio track is obtained, the user-defined information of the user must also be obtained; it comprises text information or voice information. The user-defined information is received as the material for secondary creation, either as typed text content or as a recorded voice segment; recording a voice segment requires the user to grant the relevant authorization. For example, the user-defined information might be the lyric "the sky's blue awaits the misty rain, and I am waiting for you": the terminal can receive the directly typed characters, or, once authorization has been granted in the application, it can record the user speaking the passage as the synthesis material.
For step 103, the matching degree between the user-defined information and the first audio is detected first; the creation can proceed only if the matching degree meets the requirement. If the user-defined information is text information, its word count is queried, and the match between this word count and the word count of the line corresponding to the first audio is taken as the matching degree; the requirement is met if the user-defined word count is within 80% to 120% of the line's word count. If the user-defined information is voice information, its duration is queried, and the match between this duration and the duration of the first audio is taken as the matching degree; the requirement is met if the user-defined duration is within 80% to 120% of the first audio's duration. For example, if the line corresponding to the first audio has 10 words and the user-defined input has 12 words, the 80% to 120% requirement (i.e., 8 to 12 words) is met. This step constrains the user-defined input so that the secondarily created synthetic audio fits the playback environment of the first audio.
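A minimal sketch of the 80% to 120% matching rule above. The thresholds come from the text; the function names and the dict shape of the input are our assumptions, and for Chinese text the "word count" is effectively the character count, which `len` approximates here:

```python
# Matching-degree check as described in the text: the candidate count must
# fall within 80%-120% of the reference line's count.
def within_band(candidate, reference, low=0.8, high=1.2):
    return low * reference <= candidate <= high * reference

def matches_first_audio(user_info, *, line_word_count=None, line_duration=None):
    """user_info is {'text': str} or {'speech_seconds': float} (our shape)."""
    if 'text' in user_info:
        return within_band(len(user_info['text']), line_word_count)
    return within_band(user_info['speech_seconds'], line_duration)
```

With a 10-word line, a 12-character input passes and a 13-character input is rejected, matching the worked example in the text.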
From the role-voice-removed audio track and the user-defined information, the synthetic audio of the secondarily created work is obtained; this synthetic audio is used to replace the first audio. If the user-defined information is text information, the text information is converted into standard audio and the feature vector of the standard audio is obtained; the feature vector of the role voice in the first audio is obtained; and the synthetic audio is obtained from the feature vector of the standard audio, the feature vector of the role voice in the first audio, and the role-voice-removed audio track. If the user-defined information is voice information, the synthetic audio is obtained from the voice information and the role-voice-removed audio track.
In one example, the user-defined information is text information, and the synthetic audio for replacing the first audio is obtained from the user-defined information and the role-voice-removed audio track as follows; the specific flow is shown in fig. 2.
Step 1041: converting the text information into standard audio to obtain a feature vector of the standard audio;
Step 1042: acquiring a feature vector of the role voice in the first audio;
Step 1043: obtaining the synthetic audio according to the feature vector of the standard audio, the feature vector of the role voice in the first audio, and the role-voice-removed audio track.
For example, suppose the user-defined information is text information C1. The role-voice voiceprint feature VF0 extracted in step 101 is input to the encoding module, discrete vectors of the voiceprint feature are extracted, and these are converted into a continuous vector VE0 through an embedding operation. A text-to-speech (TTS) model is invoked to convert the text C1 input by the user into standard speech Ca with a preset standard timbre, and the continuous feature vector VEa of the standard speech is obtained at the same time. VE0 and VEa are fed together into the encoding module, a recurrent neural network (RNN) generates the final audio prediction vector autoregressively, and after multiple iterations the final discrete audio mel-spectrum vector VE3 is produced. A neural-network-based high-speed audio synthesis (WaveRNN) program is then invoked to perform result inference on each discrete sample of VE3, where the prediction of each frame depends on the result of the previous frame, following the conditional-probability product formula

p(x_1, ..., x_T) = prod over t = 1..T of p(x_t | x_1, ..., x_{t-1}).

After multiple iterations, continuous and smooth spectrum information is generated from the discrete audio spectrum information, yielding the final speech C1' that has the timbre of the original video's line but the text input by the user. C1' is then mixed with the role-voice-removed track V0 to obtain the secondarily created synthetic audio V1.
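The per-frame WaveRNN inference above relies on the autoregressive factorisation p(x_1..x_T) = product over t of p(x_t | x_<t). A toy sketch of that product (our illustration with a stand-in conditional; a real WaveRNN predicts each conditional with a neural network):

```python
import math

# Toy illustration of the autoregressive factorisation
#   p(x_1..x_T) = prod_t p(x_t | x_1..x_{t-1})
# used by per-frame inference above. The "model" here is a stand-in that
# returns fixed conditionals; a real WaveRNN computes them from an RNN state.
def joint_log_prob(seq, conditional):
    """Sum of log p(x_t | x_<t) over the sequence."""
    total = 0.0
    for t in range(len(seq)):
        total += math.log(conditional(seq[t], seq[:t]))
    return total

uniform = lambda x, history: 0.5   # two equally likely symbols, ignores history
seq = [0, 1, 1, 0]
lp = joint_log_prob(seq, uniform)
```

Working in log space, as here, is the usual way to evaluate such products without underflow.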
In another example, the user-defined information is voice information, and the synthetic audio is obtained from the voice information and the role-voice-removed audio track. Optionally, method one: directly mix the voice information with the role-voice-removed track to obtain the synthetic audio. Method two: convert the voice information into text information and then follow the process described above for text-information input.
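"Method one" (direct mixing) can be sketched as sample-wise addition with clipping, assuming 16-bit integer PCM samples. This sketch is ours, not the patent's; real mixing code would also handle alignment and gain balancing:

```python
# Minimal direct-mix sketch: add the recorded voice to the role-voice-removed
# track sample by sample, zero-padding the shorter signal and clipping the sum
# to the 16-bit range.
INT16_MIN, INT16_MAX = -32768, 32767

def mix(voice, backing):
    n = max(len(voice), len(backing))
    voice = voice + [0] * (n - len(voice))        # zero-pad the shorter signal
    backing = backing + [0] * (n - len(backing))
    return [max(INT16_MIN, min(INT16_MAX, v + b)) for v, b in zip(voice, backing)]
```

Clipping is the simplest overflow policy; a production mixer would more likely attenuate both inputs before summing.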
In this embodiment, a user instruction is obtained in the first mode, the first audio to be secondarily created is determined, and the synthetic audio is obtained by combining user-defined input. The user does not need to download the original file in which the first audio is located; secondary creation can be completed simply by selecting the first audio and providing custom input during playback. The execution process is convenient and fast, and user participation is greatly improved.
A second embodiment of the present invention relates to an audio processing method, and a flowchart is shown in fig. 3.
Step 201, obtaining a role-voice-removed audio track of a first audio in a first mode, wherein the first audio is audio in an original file played online;
Step 202, acquiring user-defined information of a user; the user-defined information comprises: text information or voice information;
Step 203, acquiring a synthetic audio for replacing the first audio according to the user-defined information of the user and the role-voice-removed audio track;
step 204, providing a plurality of synthetic audios in a second mode;
step 205, determining a target synthetic audio in a plurality of synthetic audios;
and step 206, replacing the first audio with the target synthetic audio for playing.
Steps 201 to 203 in this embodiment are substantially the same as steps 101 to 103 in the first embodiment and are not described again. The main differences lie in steps 204 to 206, which are analyzed below.
For step 204, before the original file played online starts playing, a user instruction is received and the file is switched to a second mode. The second mode is a replacement mode, in which a synthetic audio is selected to replace the corresponding portion of the original file. If the original file played online is a movie, then during video playback, 5 seconds before each line occurs, a plurality of synthetic audios corresponding to the current line are recommended at the bottom of the picture.
In addition, in the second mode, the synthetic audios are displayed ranked by score. If the user-defined information behind a synthetic audio is text information, the synthetic audio is scored according to the number of words it contains versus the number of words in the line corresponding to the first audio: the closer its word count is to that of the line, the higher the score. If the user-defined information is audio information, the synthetic audio is scored by comparing the voiceprint features of the user-defined information with the voiceprint features of the role voice in the first audio (features such as amplitude and loudness): the higher the similarity between the synthetic audio and the first audio, the higher the score.
For example, if the synthetic audio was generated from text information and the line corresponding to the first audio has 10 words, with synthetic audio p1 containing 8 words of text and synthetic audio p2 containing 9, then p2 scores higher than p1. If the synthetic audio was obtained by mixing voice information with the role-voice-removed track, voiceprint extraction is performed on the user-recorded voice information C2 to obtain the user's voiceprint feature VF2. The amplitude and loudness of each word are detected; the smaller the difference in amplitude and loudness between the role voice C0 and the user-recorded voice C2, the higher the score. After each word is scored, the scores of all words are summed to give a similarity score S1. The user-recorded voiceprint feature VF2 is then compared with the voiceprint feature VF0 of the role voice C0. First, emotion-feature recognition is performed on the role voice C0: the signal is split into frames at a fixed interval using a Hamming window, each frame is Fourier-transformed to obtain a spectrogram (computed from the signal amplitudes), and the spectrogram is passed through a Mel filter bank to obtain a mel spectrogram. The spectra of all frames are stacked into a matrix, which is input into an AudioSet model for inference and comparison to obtain a comprehensive emotion matching value; an emotion matching value is then obtained for the user-recorded voice C2 by the same process. The smaller the difference between the two values, the higher the score, giving an emotion score S2.
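The framing step just described (fixed-interval frames, Hamming window, Fourier transform to a spectrogram) can be sketched with a naive DFT. A real implementation would use an FFT followed by the mel filter bank, and the function names here are ours:

```python
import math

# Sketch of the framing pipeline from the text: split the signal into frames,
# apply a Hamming window, and compute the magnitude spectrum of each frame.
def hamming(n_len):
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n_len - 1))
            for i in range(n_len)]

def frames(signal, frame_len, hop):
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 of one windowed frame."""
    win = hamming(len(frame))
    x = [s * w for s, w in zip(frame, win)]
    n = len(x)
    spec = []
    for k in range(n // 2 + 1):
        re = sum(x[i] * math.cos(2 * math.pi * k * i / n) for i in range(n))
        im = -sum(x[i] * math.sin(2 * math.pi * k * i / n) for i in range(n))
        spec.append(math.hypot(re, im))
    return spec
```

Feeding a pure tone through this pipeline puts the spectral peak at the tone's bin, which is what the downstream mel filtering and emotion comparison operate on.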
The final score is a × S1 + b × S2, where a + b = 1.
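The combined score is just a convex combination of S1 and S2; a one-line sketch (the actual weight values a and b are not specified in the text, so the default here is an assumption):

```python
# Convex combination of the similarity score S1 and emotion score S2,
# with a + b = 1 as stated in the text. a = 0.5 is our placeholder default.
def final_score(s1, s2, a=0.5):
    b = 1.0 - a
    return a * s1 + b * s2
```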
For steps 205 and 206, after the user selects a recommended synthetic audio, the audio track of the original video file is read, the original sound for the corresponding playback interval is muted, and the target synthetic audio clip is played instead. At the same time, a button is provided on the screen; with one click, the user can exit playback of the target synthetic audio and switch back to the original file played online.
In one example, after the target synthetic audio is played, the score of the synthetic audio selected by the user is updated according to its playing state, complete playing time, and actually played time. If the user clicks the restore-original-line button before the target synthetic audio finishes playing, the played time is recorded and a preference proportion F smaller than 1 is obtained as played time / complete line duration; the score of the target synthetic audio is then updated as: updated score = original score − original score × F × 0.01. If the user clicks to switch to the line bullet comment and it plays to completion, this is recorded as an effective click, and: updated score = original score × (100 + click count) / 100.
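The two update rules above can be sketched as follows. The formulas are our reading of the garbled source text (early exit: subtract original score × F × 0.01; effective click: multiply by (100 + click count)/100), so treat this as an interpretation rather than the patent's exact arithmetic:

```python
# Score-update sketch. `played` and `full` are the played time and the
# complete line duration; F = played / full is the preference proportion.
def update_on_early_exit(score, played, full):
    f = played / full                  # preference proportion F < 1
    return score - score * f * 0.01

def update_on_complete_play(score, click_count):
    # each fully played ("effective") click lifts the score by 1% per click
    return score * (100 + click_count) / 100
```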
And reordering the synthesized audio according to the updated score.
After the synthetic audios are obtained, they are scored according to a preset rule. In the second mode, i.e., the recommendation-replacement mode, the synthetic audios are recommended in descending order of score so that the user can select one to replace the audio played from the original file. This expands the propagation range of the secondarily created synthetic audio; at the same time, because a scoring mechanism is introduced, the quality of the synthetic files recommended to users is guaranteed and user experience is improved.
The steps of the above methods are divided for clarity of description; in implementation, steps may be combined into a single step or a step may be split into multiple steps, and as long as the same logical relationship is preserved, such variations fall within the protection scope of this patent. Adding insignificant modifications to the algorithms or processes, or introducing insignificant design changes without altering the core design of the algorithms or processes, likewise falls within the protection scope of this patent.
A third embodiment of the present invention relates to an audio processing apparatus, as shown in fig. 4, including:
the acquisition module 301: acquiring a character-removed audio track of a first audio in a first mode, wherein the first audio is audio in an original file played online;
the definition module 302: acquiring user-defined information of a user; the user-defined information comprises: text information or voice information;
the synthesis module 303: acquiring a synthetic audio for replacing the first audio according to the user's user-defined information and the character-removed audio track.
For the acquisition module 301: a user instruction is obtained in the first mode to determine the first audio, and the first audio is denoised to obtain its character-removed audio track together with the time node information of the first audio; alternatively, a complete character-removed audio track of the original file is prepared before the original file is played, so that the character-removed audio track of the first audio can be determined directly from the user instruction, together with the time node information of the first audio.
For the definition module 302: after the character-removed audio track is obtained, user-defined information must also be received, such as input text content or a recorded voice segment, as material for secondary creation; recording a voice segment requires the user to grant the relevant authorization.
For the synthesis module 303: the matching degree between the user-defined information and the first audio is detected; creation can proceed only if the matching degree meets the requirement. A secondarily created synthetic audio is obtained from the character-removed audio track and the user-defined information, and is used to replace the first audio. After the synthetic audio is obtained, it is scored, and the multiple synthetic audios provided are ranked according to their scores.
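The matching-degree check performed before synthesis can be sketched as below. The patent only states that text is matched by word count and voice by duration and that the degree must "reach the standard"; the ratio computation, threshold value, and field names here are illustrative assumptions.

```python
def matching_degree(user_info: dict, first_audio: dict) -> float:
    """Compare user-defined information with the first audio.

    Text input is matched by character count against the line's text;
    voice input is matched by duration. Returns a ratio in [0, 1].
    """
    if user_info["type"] == "text":
        a = len(user_info["text"])
        b = len(first_audio["line_text"])
    else:  # voice information
        a = user_info["duration"]
        b = first_audio["duration"]
    if max(a, b) == 0:
        return 0.0
    return min(a, b) / max(a, b)

# Illustrative threshold; the patent does not specify the "standard".
THRESHOLD = 0.8

def can_synthesize(user_info: dict, first_audio: dict) -> bool:
    """Creation may proceed only when the matching degree reaches the standard."""
    return matching_degree(user_info, first_audio) >= THRESHOLD
```

With this sketch, a 4-character text against a 5-character line yields a matching degree of 0.8 and passes, while a 3-second recording against a 6-second line yields 0.5 and is rejected.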
After the synthesis module 303, the apparatus further includes: a providing module: providing a plurality of synthetic audios in a second mode; and a replacement module: acquiring a selection instruction of the user, determining a target synthetic audio among the plurality of synthetic audios, and replacing the first audio with the target synthetic audio for playing.
For the providing module: if the original file played online is a movie, then during video playback, 5 seconds before each line appears, a plurality of synthetic audios corresponding to the current line are recommended at the bottom of the picture, in descending order of their previous scores. For the replacement module: after the user selects a recommended synthetic audio, the audio track of the original video file is read, the original sound for the corresponding playing duration is muted, and the target synthetic audio clip is played. After the target synthetic audio is played, the score of the synthetic audio selected by the user is updated according to its playing state, its complete playing duration, and the time for which it was actually played.
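The providing module's "recommend 5 seconds before the line" behavior can be sketched as a simple polling function over the playback position. This is a hedged illustration: the data structures, the `top_n` cutoff, and the function name are assumptions; only the 5-second lead and the score-descending order come from the description.

```python
def due_recommendations(lines, synth_index, now_seconds, lead=5.0, top_n=3):
    """Return top-scored synthetic audios for any line whose start time
    falls within `lead` seconds of the current playback position."""
    recommendations = []
    for line in lines:
        # Recommend only in the window [start - lead, start].
        if 0.0 <= line["start"] - now_seconds <= lead:
            candidates = synth_index.get(line["id"], [])
            # Descending order of previous scores, as the description requires.
            ranked = sorted(candidates, key=lambda s: s["score"], reverse=True)
            recommendations.extend(ranked[:top_n])
    return recommendations
```

A player loop would call this with the current timestamp on each tick and render the returned candidates at the bottom of the picture.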
It should be understood that this embodiment is a system example corresponding to the above embodiment, and that this embodiment can be implemented in cooperation with the above embodiment. The related technical details mentioned in the first embodiment are still valid in this embodiment, and are not described herein again in order to reduce repetition. Accordingly, the related-art details mentioned in the present embodiment can also be applied to the first embodiment.
It should be noted that each module in this embodiment is a logical module; in practical applications, a logical unit may be a single physical unit, a part of a physical unit, or a combination of multiple physical units. In addition, in order to highlight the innovative part of the present invention, elements not closely related to solving the technical problem addressed by the present invention are not introduced in this embodiment, but this does not mean that no other elements exist in this embodiment.
A fourth embodiment of the invention is directed to a terminal, as shown in fig. 5, comprising at least one processor 401; and
a memory 402 communicatively coupled to the at least one processor 401; wherein,
the memory 402 stores instructions executable by the at least one processor to enable the at least one processor to perform the audio processing method described above.
Where the memory and processor are connected by a bus, the bus may comprise any number of interconnected buses and bridges, the buses connecting together one or more of the various circuits of the processor and the memory. The bus may also connect various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor is transmitted over a wireless medium via an antenna, which further receives the data and transmits the data to the processor.
The processor is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And the memory may be used to store data used by the processor in performing operations.
A fifth embodiment of the present invention relates to a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method embodiments described above.
That is, as those skilled in the art will understand, all or part of the steps of the methods in the above embodiments may be implemented by a program instructing related hardware; the program is stored in a storage medium and includes several instructions to enable a device (which may be a microcontroller, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples for carrying out the invention, and that various changes in form and details may be made therein without departing from the spirit and scope of the invention in practice.

Claims (10)

1. An audio processing method, comprising:
obtaining a character-removed audio track of a first audio in a first mode, wherein the first audio is audio in an original file played online;
acquiring user-defined information of a user; the user-defined information comprises: text information or voice information;
and acquiring a synthetic audio for replacing the first audio according to the user-defined information of the user and the character-removed audio track.
2. The audio processing method according to claim 1, further comprising, after the acquiring of the synthetic audio for replacing the first audio:
providing a plurality of the synthetic audios in a second mode;
determining a target synthetic audio among a plurality of the synthetic audios;
and replacing the first audio with the target synthetic audio for playing.
3. The audio processing method according to claim 2, wherein the providing of the plurality of the synthetic audios in the second mode comprises:
in the second mode, displaying the plurality of synthetic audios according to a ranking of their scores.
4. The audio processing method according to claim 3, wherein the synthetic audio is scored by:
if the user-defined information of the user is text information, scoring the synthetic audio according to the number of words contained in the user-defined information of the user and the number of words contained in the line corresponding to the first audio;
and if the user-defined information of the user is voice information, scoring the synthetic audio according to the voiceprint features contained in the user-defined information of the user and the voiceprint features of the character sound in the first audio; wherein the voiceprint features include: amplitude, loudness, and emotional characteristics.
5. The audio processing method according to claim 3, comprising, after the replacing of the first audio with the target synthetic audio for playing:
updating the score of the synthetic audio according to the playing state, the complete playing duration, and the played time of the synthetic audio.
6. The audio processing method according to any one of claims 1 to 5, further comprising, before the acquiring of the synthetic audio according to the user-defined information and the character-removed audio track:
detecting the matching degree between the user-defined information of the user and the first audio, and if the matching degree reaches the standard, performing the acquiring of the synthetic audio according to the user-defined information of the user and the character-removed audio track;
wherein, if the user-defined information of the user is text information, the number of words of the text information is queried, and the matching degree between the number of words of the text information and the number of words of the line corresponding to the first audio is taken as the matching degree between the user-defined information and the first audio;
and if the user-defined information of the user is voice information, the duration of the voice information is queried, and the matching degree between the duration of the voice information and the duration of the first audio is taken as the matching degree between the user-defined information and the first audio.
7. The audio processing method according to any one of claims 1 to 5, wherein the acquiring of the synthetic audio for replacing the first audio according to the user-defined information of the user and the character-removed audio track comprises:
if the user-defined information of the user is text information, converting the text information into standard audio to obtain a characteristic vector of the standard audio;
acquiring a feature vector of a character sound in the first audio;
obtaining the synthetic audio according to the feature vector of the standard audio, the feature vector of the character sound in the first audio, and the character-removed audio track;
and if the user-defined information is voice information, obtaining the synthetic audio according to the voice information and the character-removed audio track.
8. An audio processing apparatus comprising:
an acquisition module: obtaining a character-removed audio track of a first audio in a first mode, wherein the first audio is audio in an original file played online;
a definition module: acquiring user-defined information of a user; the user-defined information comprises: text information or voice information;
a synthesis module: acquiring a synthetic audio for replacing the first audio according to the user-defined information of the user and the character-removed audio track.
9. A terminal, comprising:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the audio processing method of any of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the audio processing method of any one of claims 1 to 7.
CN202011511470.6A 2020-12-18 2020-12-18 Audio processing method, device, terminal and storage medium Pending CN112509538A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011511470.6A CN112509538A (en) 2020-12-18 2020-12-18 Audio processing method, device, terminal and storage medium


Publications (1)

Publication Number Publication Date
CN112509538A true CN112509538A (en) 2021-03-16

Family

ID=74922498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011511470.6A Pending CN112509538A (en) 2020-12-18 2020-12-18 Audio processing method, device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN112509538A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030046071A1 (en) * 2001-09-06 2003-03-06 International Business Machines Corporation Voice recognition apparatus and method
CN105828220A (en) * 2016-03-23 2016-08-03 乐视网信息技术(北京)股份有限公司 Method and device of adding audio file in video file
CN108305636A (en) * 2017-11-06 2018-07-20 腾讯科技(深圳)有限公司 A kind of audio file processing method and processing device
CN108924583A (en) * 2018-07-19 2018-11-30 腾讯科技(深圳)有限公司 Video file generation method and its equipment, system, storage medium
US20180349495A1 (en) * 2016-05-04 2018-12-06 Tencent Technology (Shenzhen) Company Limited Audio data processing method and apparatus, and computer storage medium
CN110189741A (en) * 2018-07-05 2019-08-30 腾讯数码(天津)有限公司 Audio synthetic method, device, storage medium and computer equipment


Similar Documents

Publication Publication Date Title
JP6855527B2 (en) Methods and devices for outputting information
CN107918653B (en) Intelligent playing method and device based on preference feedback
CN106898340B (en) Song synthesis method and terminal
CN107464555B (en) Method, computing device and medium for enhancing audio data including speech
CN110211556B (en) Music file processing method, device, terminal and storage medium
CN103597543A (en) Semantic audio track mixer
CN113691909B (en) Digital audio workstation with audio processing recommendations
CN111147871B (en) Singing recognition method and device in live broadcast room, server and storage medium
CN112165647B (en) Audio data processing method, device, equipment and storage medium
CN109710799B (en) Voice interaction method, medium, device and computing equipment
CN115938338A (en) Speech synthesis method, device, electronic equipment and readable storage medium
Venkatesh et al. Artificially synthesising data for audio classification and segmentation to improve speech and music detection in radio broadcast
CN113781989B (en) Audio animation playing and rhythm stuck point identifying method and related device
CN111859008A (en) Music recommending method and terminal
CN109829075A (en) Method and device for intelligently playing music
CN117558259A (en) Digital man broadcasting style control method and device
CN114694629B (en) Voice data amplification method and system for voice synthesis
CN112509538A (en) Audio processing method, device, terminal and storage medium
CN111627417B (en) Voice playing method and device and electronic equipment
CN113032616A (en) Audio recommendation method and device, computer equipment and storage medium
CN114697689A (en) Data processing method and device, electronic equipment and storage medium
CN111429878A (en) Self-adaptive speech synthesis method and device
US20240169962A1 (en) Audio data processing method and apparatus
CN114078464B (en) Audio processing method, device and equipment
US20240325907A1 (en) Method For Generating A Sound Effect

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination