CN115250367A - Method and apparatus for mixing multimedia files

Method and apparatus for mixing multimedia files

Info

Publication number
CN115250367A
Authority
CN
China
Prior art keywords
file
video
audio
multimedia files
multimedia
Prior art date
Legal status
Pending
Application number
CN202111341817.1A
Other languages
Chinese (zh)
Inventor
李林超
林炳河
Current Assignee
Gaoding Xiamen Technology Co Ltd
Original Assignee
Gaoding Xiamen Technology Co Ltd
Application filed by Gaoding Xiamen Technology Co Ltd
Priority to CN202111341817.1A
Publication of CN115250367A

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23: Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/236: Assembling of a multiplex stream, e.g. transport stream, by combining a video stream with other content or additional data, e.g. inserting a URL [Uniform Resource Locator] into a video stream, multiplexing software data into a video stream; Remultiplexing of multiplex streams; Insertion of stuffing bits into the multiplex stream, e.g. to obtain a constant bit-rate; Assembling of a packetised elementary stream
    • H04N 21/2368: Multiplexing of audio and video streams
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/42: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H04N 19/436: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation, using parallelised computational arrangements
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23: Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/231: Content storage operation, e.g. caching movies for short term storage, replicating data over plural servers, prioritizing data for deletion
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23: Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/233: Processing of audio elementary streams
    • H04N 21/2335: Processing of audio elementary streams involving reformatting operations of audio signals, e.g. by converting from one coding standard to another
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23: Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/234: Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N 21/23424: Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs, involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

Embodiments of the present disclosure provide a method, an apparatus, and a computer-readable storage medium storing a computer program for mixing multimedia files. In the method, a plurality of multimedia files are obtained, the plurality of multimedia files comprising a plurality of audio files. The plurality of multimedia files are then decoded in parallel. The audio portions of the decoded multimedia files are format-processed in parallel. The format-processed audio portions are stored in parallel in corresponding buffer queues. The number of buffer queues that store audio data exceeding a threshold size is determined. If that number is greater than a threshold number, the data in the buffer queues are superimposed according to a target output file format to form a superimposed multimedia file.

Description

Method and apparatus for mixing multimedia files
Technical Field
Embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a method and apparatus for mixing multimedia files.
Background
With the development of multimedia technology, the need to edit and adapt multimedia files (e.g., video files and audio files) arises in more and more application fields, such as teaching, entertainment, and communication. It is sometimes desirable to superimpose audio and video to form a mixed multimedia file (also known as an audio-video mix file).
Disclosure of Invention
Embodiments described herein provide a method, apparatus, and computer-readable storage medium storing a computer program for mixing multimedia files.
According to a first aspect of the present disclosure, a method for mixing multimedia files is provided. In the method, a plurality of multimedia files are obtained, the plurality of multimedia files comprising a plurality of audio files. The plurality of multimedia files are then decoded in parallel. The audio portions of the decoded multimedia files are format-processed in parallel. The format-processed audio portions are stored in parallel in corresponding buffer queues. The number of buffer queues that store audio data exceeding a threshold size is determined. If that number is greater than a threshold number, the data in the buffer queues are superimposed according to a target output file format to form a superimposed multimedia file.
In some embodiments of the present disclosure, the plurality of multimedia files further comprises at least one video file. When decoding the plurality of multimedia files in parallel, it is determined whether the target output file format is a video file format. If the target output file format is a video file format, all of the plurality of multimedia files are decoded in parallel. If the target output file format is not a video file format, the audio files among the plurality of multimedia files and the audio portions of the video files are decoded in parallel.
In some embodiments of the present disclosure, one of the plurality of multimedia files is designated as a reference file, and the duration for which the plurality of multimedia files are decoded is determined according to the duration of the reference file.
In some embodiments of the present disclosure, in the decoding of the plurality of multimedia files in parallel, if the target output file format is a video file format, it is determined whether the reference file includes a video part. If the reference file includes a video portion, the video portion is decoded and the decoded video portion is stored in a video buffer queue.
In some embodiments of the present disclosure, in the decoding of the plurality of multimedia files in parallel, if the target output file format is a video file format, it is determined whether a non-reference file other than the reference file among the plurality of multimedia files includes a video part. If the non-reference file includes a video portion, only the audio portion of the non-reference file is decoded.
In some embodiments of the present disclosure, in the step of superimposing data in the buffer queue according to the target output file format, audio data in the buffer queue is mixed, and the mixed audio data is encoded. If the reference file includes a video portion, the video data in the buffer queue is encoded, and the encoded video data is packaged with the encoded audio data.
In some embodiments of the disclosure, the method further comprises: determining, according to the target output file format, whether the superimposed multimedia file needs to be transcoded; and if it is determined that the superimposed multimedia file needs to be transcoded, transcoding the superimposed multimedia file according to the target output file format.
In some embodiments of the present disclosure, the reference file is a file that is obtained first among the plurality of multimedia files.
In some embodiments of the disclosure, in the step of format-processing the audio portions of the decoded multimedia files in parallel, the plurality of multimedia files are resampled such that the following properties are the same across the audio files and the audio portions of the video files among the plurality of multimedia files: sample format, sample rate, number of channels, channel arrangement, and sample size.
In some embodiments of the present disclosure, in the step of decoding the plurality of multimedia files in parallel, independent decoding threads are employed to decode the plurality of multimedia files in parallel.
According to a second aspect of the present disclosure, an apparatus for mixing multimedia files is provided. The apparatus includes at least one processor and at least one memory storing a computer program. The computer program, when executed by the at least one processor, causes the apparatus to obtain a plurality of multimedia files, the plurality of multimedia files comprising a plurality of audio files. The plurality of multimedia files are then decoded in parallel. The audio portions of the decoded multimedia files are format-processed in parallel. The format-processed audio portions are stored in parallel in corresponding buffer queues. The number of buffer queues that store audio data exceeding a threshold size is determined. If that number is greater than a threshold number, the data in the buffer queues are superimposed according to a target output file format to form a superimposed multimedia file.
In some embodiments of the present disclosure, the plurality of multimedia files further comprises at least one video file. The computer program, when executed by the at least one processor, causes the apparatus to decode a plurality of multimedia files in parallel by determining whether a target output file format is a video file format; decoding all multimedia files of the plurality of multimedia files in parallel if the target output file format is a video file format; if the target output file format is not a video file format, decoding an audio file of the plurality of multimedia files and an audio portion of the video file in parallel.
In some embodiments of the disclosure, the computer program, when executed by the at least one processor, causes the apparatus to decode the plurality of multimedia files in parallel by: determining whether the reference file includes a video portion if the target output file format is a video file format; if the reference file includes a video portion, the video portion is decoded and the decoded video portion is stored in a video buffer queue.
In some embodiments of the disclosure, the computer program, when executed by the at least one processor, causes the apparatus to decode the plurality of multimedia files in parallel by: determining whether a non-reference file other than the reference file among the plurality of multimedia files includes a video part if the target output file format is a video file format; if the non-reference file includes a video portion, only the audio portion of the non-reference file is decoded.
In some embodiments of the disclosure, the computer program, when executed by the at least one processor, causes the apparatus to overlay data in the buffer queue according to the target output file format by: mixing the audio data in the buffer queue; encoding the audio data after the audio mixing; if the reference file includes a video portion, the video data in the buffer queue is encoded, and the encoded video data and the encoded audio data are packaged together.
In some embodiments of the disclosure, the computer program, when executed by the at least one processor, further causes the apparatus to: determine, according to the target output file format, whether the superimposed multimedia file needs to be transcoded; and if it is determined that the superimposed multimedia file needs to be transcoded, transcode the superimposed multimedia file according to the target output file format.
In some embodiments of the disclosure, the computer program, when executed by the at least one processor, causes the apparatus to format-process the audio portions of the decoded multimedia files in parallel by: resampling the plurality of multimedia files such that the following properties are the same across the audio files and the audio portions of the video files among the plurality of multimedia files: sample format, sample rate, number of channels, channel arrangement, and sample size.
In some embodiments of the disclosure, the computer program, when executed by the at least one processor, causes the apparatus to decode the plurality of multimedia files in parallel by: independent decoding threads are employed to decode the plurality of multimedia files in parallel.
According to a third aspect of the present disclosure, a computer-readable storage medium is provided, in which a computer program is stored, which computer program, when executed by a processor, performs the steps of the method according to the first aspect of the present disclosure.
Drawings
To more clearly illustrate the technical aspects of the embodiments of the present disclosure, reference will now be made in brief to the accompanying drawings of the embodiments, it being understood that the drawings described below relate only to some embodiments of the disclosure and are not limiting thereof, and wherein:
fig. 1 is an exemplary flow diagram of a method for mixing multimedia files according to an embodiment of the present disclosure;
FIG. 2 is an exemplary block diagram of mixing one video portion and three audio portions according to an embodiment of the present disclosure;
FIG. 3 is an exemplary flow diagram of a process of decoding multiple multimedia files in parallel in the embodiment shown in FIG. 1; and
fig. 4 is a schematic block diagram of an apparatus for mixing multimedia files according to an embodiment of the present disclosure.
The elements in the drawings are schematic and not drawn to scale.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings. It is to be understood that the described embodiments are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the disclosure without any inventive step, are also within the scope of protection of the disclosure.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the presently disclosed subject matter belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the specification and relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Conventional audiovisual editing software is typically used to mix two audio files or to mix one audio file and one video file. Embodiments of the present disclosure provide a method for mixing multimedia files that is capable of quickly mixing more than two audio files and/or video files.
Fig. 1 is an exemplary flowchart of a method for mixing multimedia files according to an embodiment of the present disclosure. A method for mixing multimedia files is described below with reference to fig. 1.
In the method 100, at block S102, a plurality of multimedia files are obtained. The plurality of multimedia files may include a plurality of audio files. In some embodiments, the plurality of multimedia files may further include at least one video file. In the context of the present disclosure, a video file may include a video portion and an audio portion. In some embodiments, a video file, such as a silent video, may include only a video portion. A video portion here refers to a digitized representation of a set of consecutive images, for example more than 24 images per second. Audio files and audio portions here refer to digitized representations of sound signals, which may include digitized representations of speech, music, and the like, as well as combinations thereof. In embodiments of the present disclosure, after a video file is obtained, the video file may be separated into a video portion (also called a video stream) and an audio portion (also called an audio stream). In the context of the present disclosure, "audio portion" may refer to the audio portion of a video file as well as to an audio file. Likewise, in some embodiments of the present disclosure, "audio file" may refer to an audio file in the ordinary sense as well as to the audio portion of a video file. Thus, "the plurality of multimedia files may comprise a plurality of audio files" may also be understood to mean that the plurality of multimedia files may comprise a plurality of audio portions, such as the audio portions of a plurality of video files.
Fig. 2 schematically shows an exemplary block diagram of mixing one video portion and three audio portions.
The processing of the video portion is drawn with a dashed box to indicate that, in this disclosure, the plurality of multimedia files does not necessarily include a video file. The numbers of video portions and audio portions are also merely illustrative; those skilled in the art will appreciate that the methods of the present disclosure apply to other numbers of video portions and audio portions as well.
At block S104 of fig. 1, the plurality of multimedia files are decoded in parallel. In some embodiments of the present disclosure, in decoding the plurality of multimedia files in parallel, independent decoding threads are employed to decode the plurality of multimedia files in parallel.
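By way of illustration, the following Python sketch shows one possible arrangement of independent decoding threads, one per multimedia file, each writing its decoded frames to its own buffer queue. The `decode_packets` callable is a hypothetical stand-in for an actual decoder (for example, an FFmpeg/libav binding) and is not part of the disclosure; the shared stop event is used further below in connection with the reference file.

```python
import threading
import queue

def decode_file(path, out_queue, stop_event, decode_packets):
    """Decode one multimedia file on its own thread.

    decode_packets is a hypothetical callable that yields decoded frames
    (e.g., raw PCM chunks) for the given file; in practice it would wrap
    an FFmpeg/libav decoder.
    """
    for frame in decode_packets(path):
        if stop_event.is_set():      # the reference file has finished decoding
            break
        out_queue.put(frame)

def decode_in_parallel(paths, decode_packets):
    """Start one independent decoding thread per multimedia file."""
    stop_event = threading.Event()
    queues = {p: queue.Queue() for p in paths}   # one buffer queue per file
    threads = [
        threading.Thread(target=decode_file,
                         args=(p, queues[p], stop_event, decode_packets))
        for p in paths
    ]
    for t in threads:
        t.start()
    return queues, threads, stop_event
```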
Fig. 3 shows an exemplary flow diagram of a process of decoding multiple multimedia files in parallel in the embodiment shown in fig. 1. At block S302, it is determined whether the target output file format is a video file format. For example, whether the target output file format is a video file format may be determined by a suffix of the target output file name.
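For example, such a suffix check could be sketched as follows; the extension set is an illustrative assumption, not a list specified by the disclosure.

```python
from pathlib import Path

# Illustrative extension set; the disclosure does not enumerate suffixes.
VIDEO_SUFFIXES = {".mp4", ".mov", ".mkv", ".avi", ".flv"}

def is_video_output(target_name: str) -> bool:
    """Return True if the target output file name has a video suffix."""
    return Path(target_name).suffix.lower() in VIDEO_SUFFIXES
```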
If it is determined at block S304 that the target output file format is a video file format ("yes" at block S304), all of the plurality of multimedia files are decoded in parallel at block S306.
In some embodiments, one of the plurality of multimedia files may be designated as a reference file. In one example, the reference file may be a multimedia file arranged at the top in a User Interface (UI) of a device to which the method 100 according to an embodiment of the present disclosure is applied, or a multimedia file that a user first loads. In another example, the location of the reference file may also be marked on the user interface, with the user loading the desired reference file into the marked location to specify the reference file.
If the target output file format is a video file format, it is further determined whether the reference file includes a video portion, i.e., whether the reference file is a video file. In the example shown in fig. 2, at block 202, the format of the obtained multimedia file is parsed. Whether the reference file includes a video part is determined by the format of the parsed reference file.
If the reference file includes a video portion, a video decoding thread is started to decode the video portion of the reference file. Fig. 2 shows an example in which the reference file includes a video portion. In FIG. 2, the video decoding thread 212 is employed to decode video portion A of the reference file to obtain video data 214, and the audio decoding thread 222_1 is employed to decode audio portion B of the reference file to obtain the original audio data 224_1. For multimedia file C, which is parsed into an audio file at block 202, audio decoding thread 222_2 is employed to decode the audio file C to obtain original audio data 224_2. For multimedia file D, which is parsed into an audio file at block 202, audio decoding thread 222_3 is employed to decode the audio file D to obtain original audio data 224_3. The original audio data 224_1, 224_2, and 224_3 may be raw Pulse Code Modulation (PCM) data.
Since the video portion and each audio portion are decoded by independent decoding threads, the video portion and the plurality of audio portions (i.e., the plurality of multimedia files) can be decoded in parallel, which shortens the overall decoding time of the plurality of multimedia files and improves decoding efficiency.
In some embodiments, the duration of decoding each of the plurality of multimedia files is determined from the duration of the reference file. For example, when the decoding of the reference file is completed, the decoding of the other files is stopped.
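A minimal sketch of this behavior, reusing the shared stop event from the threading sketch above: the reference file's decoding thread signals the event when it finishes, and the other decoding threads observe it and stop. This is only one possible way to bound the decoding duration by the duration of the reference file.

```python
def decode_reference_file(path, out_queue, stop_event, decode_packets):
    """Decode the reference file in full, then tell the other threads to stop."""
    for frame in decode_packets(path):   # decode_packets is a hypothetical decoder
        out_queue.put(frame)
    stop_event.set()                      # other decoding threads observe this and stop
```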
In some embodiments of the present disclosure, when decoding the plurality of multimedia files in parallel, if the target output file format is a video file format, it may further be determined whether a non-reference file (i.e., a file other than the reference file among the plurality of multimedia files) includes a video portion. In one example, if a non-reference file includes a video portion, only the audio portion of that non-reference file is decoded, and its video portion is not decoded. In this way, the output file will include only the video portion of the reference file.
Returning to FIG. 3, if it is determined at block S304 that the target output file format is not a video file format ("NO" at block S304), then the audio files among the plurality of multimedia files and the audio portions of the video files are decoded in parallel at block S308. In this way, the output file will comprise only audio.
Commonly used audio and video editing software is designed for professional operation; it is difficult for novice users, and training a novice to master such software is costly. The method 100 for mixing multimedia files according to the embodiments of the present disclosure simplifies the operation of mixing multimedia files by introducing a reference file. By simply specifying the reference file (e.g., loading the desired file first so that it becomes the reference file) and the target output file format, the user determines how long the plurality of multimedia files are decoded and how the video files among them are processed. For example, if a user wants to mix a plurality of audio files and video files into one video file, the user only needs to take a certain video file as the reference file and set the target output file format to a video file format. Alternatively, if the user wants to mix a plurality of audio files and video files into one audio file, the user only needs to set the target output file format to an audio file format.
Returning to fig. 1, at block S106, the audio portions of the decoded multimedia files are format-processed in parallel. For example, the plurality of multimedia files may be resampled using the same parameters, such that the following properties are the same across the audio files and the audio portions of the video files among the plurality of multimedia files: sample format, sample rate, number of channels, channel arrangement, sample size, and so on.
In one example, the sample format of every audio portion among the plurality of multimedia files may be converted to S16. In another example, the sample format of every audio portion may be converted to S16P. In one example, if the audio portions of one or more of the multimedia files include N channels and the audio portion of another multimedia file includes (N + 1) channels, the audio portions with N channels may be expanded to (N + 1) channels, with the audio data of the added channel set to 0.
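A minimal sketch of the zero-padding channel expansion described above, assuming interleaved signed 16-bit (S16) samples:

```python
import array

def expand_channels(pcm: array.array, n_channels: int, target_channels: int) -> array.array:
    """Expand interleaved S16 PCM from n_channels to target_channels.

    The samples of the added channels are set to 0, as described above.
    """
    assert target_channels >= n_channels
    out = array.array("h")
    for i in range(0, len(pcm), n_channels):
        out.extend(pcm[i:i + n_channels])                 # existing channels
        out.extend([0] * (target_channels - n_channels))  # zero-filled new channels
    return out
```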
Methods of making the sample rate of each audio portion the same include, but are not limited to: nearest-neighbor interpolation, bilinear interpolation, and cubic convolution interpolation.
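As an illustration of the simplest of these options, the sketch below resamples one channel by linear interpolation. A production implementation would typically rely on a dedicated resampler (for example, libswresample); this sketch is only a readable approximation, not the method prescribed by the disclosure.

```python
def resample_linear(samples, src_rate: int, dst_rate: int):
    """Resample one channel of PCM samples to dst_rate by linear interpolation."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    out_len = int(len(samples) * dst_rate / src_rate)
    out = []
    for j in range(out_len):
        pos = j * src_rate / dst_rate              # fractional position in the source
        i = int(pos)
        frac = pos - i
        a = samples[i]
        b = samples[min(i + 1, len(samples) - 1)]
        out.append(int(round(a + (b - a) * frac)))
    return out
```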
In the example shown in FIG. 2, the original audio data 224_1 is format-processed at block 226_1, the original audio data 224_2 is format-processed at block 226_2, and the original audio data 224_3 is format-processed at block 226_3. Format-processing the three audio streams (original audio data 224_1, 224_2, and 224_3) in parallel improves processing efficiency and saves processing time.
At block S108 of fig. 1, the format-processed audio portions are stored in parallel in corresponding buffer queues. In the example of fig. 2, one audio buffer queue is set up for each audio stream. For example, the three audio streams are stored in audio buffer queues 228_1, 228_2, and 228_3, respectively. Because the three audio streams are stored in separate audio buffer queues, they do not put memory pressure on one another, and a misplaced write for one stream cannot corrupt the stored data of the other streams.
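A minimal sketch of such a per-stream buffer queue is given below; it also tracks the number of buffered bytes so that the threshold check in the next step can be performed. The class name and the byte accounting are illustrative assumptions, not elements of the disclosure.

```python
import collections
import threading

class AudioBufferQueue:
    """One buffer queue per audio stream; tracks how many bytes it currently holds."""

    def __init__(self):
        self._chunks = collections.deque()
        self._bytes = 0
        self._lock = threading.Lock()

    def put(self, chunk: bytes) -> None:
        with self._lock:
            self._chunks.append(chunk)
            self._bytes += len(chunk)

    def get(self) -> bytes:
        with self._lock:
            chunk = self._chunks.popleft()
            self._bytes -= len(chunk)
            return chunk

    def size_bytes(self) -> int:
        with self._lock:
            return self._bytes
```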
Further, as shown in fig. 2, in the case where the reference file includes a video part, the decoded video data in the video part is independently stored in the video buffer queue 216.
At block S110 of fig. 1, a number N of buffer queues in which audio data exceeding a threshold size is stored is determined. In some embodiments, if the size of the audio data in the audio buffer queue exceeds the threshold size, it indicates that the audio data in the audio buffer queue is enough, and the mixing (superimposing) operation can be performed. In one example, the threshold size may be 10B, 20B, 30B, or other values.
If it is determined at block S112 that the number N of buffer queues in which audio data exceeding the threshold size is stored is greater than the threshold number Nth ("yes" at block S112), the data in the buffer queues are superimposed according to the target output file format at block S114 to form a superimposed multimedia file. In the case where the reference file includes a video portion and the target output file format is a video file format, as shown in fig. 2, superimposing the data in the buffer queues according to the target output file format may further include superimposing the video data in the video buffer queue 216.
In one example, the threshold number may be 1. That is, when audio data in one audio buffer queue has exceeded a threshold size, data in the buffer queues may begin to be superimposed. In another example, the threshold number may also be the total number of audio buffer queues. In the example of fig. 2, the total number of audio buffer queues is 3, so the threshold number may also be 3. That is, when the audio data in all the audio buffer queues has exceeded the threshold size, the data in the buffer queues may begin to be superimposed. In yet another example, the threshold number may be any value between 1 and the total number of audio buffer queues.
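Using the per-stream queue sketched above, the readiness check described here can be written as follows; the two threshold values are illustrative and would be chosen per application.

```python
THRESHOLD_SIZE = 20    # bytes of buffered audio per queue; illustrative value
THRESHOLD_NUMBER = 1   # illustrative value between 1 and the number of queues

def ready_to_superimpose(buffer_queues) -> bool:
    """Return True when the count of queues holding more than THRESHOLD_SIZE
    bytes exceeds THRESHOLD_NUMBER."""
    n = sum(1 for q in buffer_queues if q.size_bytes() > THRESHOLD_SIZE)
    return n > THRESHOLD_NUMBER
```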
In some embodiments of the present disclosure, as described above, audio data in the buffer queue is mixed in the process of superimposing data in the buffer queue according to the target output file format. Mixing may include normalizing the audio data in the buffer queue and adding values of the audio data after the normalization. The mixed audio data may then be encoded in a target output file format. If the reference file includes a video portion and the target output file is in the format of a video file, the process of overlaying the data in the buffer queue may further include: and coding the video data in the buffer queue, and packaging the coded video data and the coded audio data together according to a target output file format.
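A minimal sketch of the audio-mixing step: format-aligned S16 chunks of equal length are summed sample by sample after a simple normalization (here, averaging), and the result is clipped to the 16-bit range. The specific normalization scheme and the subsequent encoding and packaging are not prescribed by this sketch; it only illustrates the superimposing of audio data.

```python
import array

def mix_chunks(chunks):
    """Mix equally sized, format-aligned S16 PCM byte chunks sample by sample."""
    n = len(chunks)
    mixed = array.array("h")
    for samples in zip(*(array.array("h", c) for c in chunks)):
        value = int(sum(samples) / n)                 # simple average as normalization
        mixed.append(max(-32768, min(32767, value)))  # clip to the signed 16-bit range
    return mixed.tobytes()
```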
In some embodiments of the present disclosure, the method 100 may further comprise: determining, according to the target output file format, whether the superimposed multimedia file needs to be transcoded; and if it is determined that the superimposed multimedia file needs to be transcoded, transcoding the superimposed multimedia file according to the target output file format.
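As a rough illustration of this optional step, the sketch below compares file suffixes and, when they differ, invokes the ffmpeg command-line tool to transcode. The suffix comparison is an illustrative stand-in for a real container/codec check, and the availability of the ffmpeg CLI is an assumption; the disclosure does not specify a particular transcoder.

```python
import subprocess
from pathlib import Path

def transcode_if_needed(superimposed_path: str, target_path: str) -> str:
    """Transcode the superimposed file when its suffix differs from the target's."""
    if Path(superimposed_path).suffix.lower() == Path(target_path).suffix.lower():
        return superimposed_path
    # ffmpeg infers the target container and codecs from the output file suffix.
    subprocess.run(["ffmpeg", "-y", "-i", superimposed_path, target_path], check=True)
    return target_path
```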
Fig. 4 shows a schematic block diagram of an apparatus 400 for mixing multimedia files according to an embodiment of the present disclosure. As shown in fig. 4, the apparatus 400 may include a processor 410 and a memory 420 in which a computer program is stored. The computer program, when executed by the processor 410, causes the apparatus 400 to perform the steps of the method 100 shown in fig. 1. In one example, the apparatus 400 may be a computer device or a cloud computing node. The apparatus 400 may obtain a plurality of multimedia files, the plurality of multimedia files comprising a plurality of audio files. The apparatus 400 may then decode the plurality of multimedia files in parallel. The apparatus 400 may format-process the audio portions of the decoded multimedia files in parallel. The apparatus 400 may store the format-processed audio portions in parallel in corresponding buffer queues. The apparatus 400 may determine the number of buffer queues in which audio data exceeding a threshold size is stored. If that number is greater than a threshold number, the apparatus 400 may superimpose the data in the buffer queues according to the target output file format to form a superimposed multimedia file.
In some embodiments of the present disclosure, the plurality of multimedia files further comprises at least one video file. The apparatus 400 may determine whether the target output file format is a video file format. If the target output file format is a video file format, the apparatus 400 may decode all of the plurality of multimedia files in parallel. If the target output file format is not a video file format, the apparatus 400 may decode audio files in the plurality of multimedia files and audio portions in the video files in parallel.
In some embodiments of the present disclosure, if the target output file format is a video file format, the apparatus 400 may determine whether the reference file includes a video portion. If the reference file includes a video portion, the apparatus 400 may decode the video portion and store the decoded video portion in a video buffer queue.
In some embodiments of the present disclosure, if the target output file format is a video file format, the apparatus 400 may determine whether a non-reference file of the plurality of multimedia files other than the reference file includes a video portion. If the non-reference file includes a video portion, the apparatus 400 may decode only the audio portion of the non-reference file.
In some embodiments of the present disclosure, the apparatus 400 may mix audio data in the buffer queue and encode the mixed audio data. If the reference file includes a video portion, the apparatus 400 may encode the video data in the buffer queue and encapsulate the encoded video data with the encoded audio data.
In some embodiments of the present disclosure, the apparatus 400 may determine whether the superimposed multimedia file needs to be transcoded according to the target output file format. If it is determined that the superimposed multimedia file needs to be transcoded, the apparatus 400 may transcode the superimposed multimedia file according to the target output file format.
In some embodiments of the disclosure, the apparatus 400 may resample the plurality of multimedia files such that the following properties of the audio portion of the audio file and the video file of the plurality of multimedia files are the same: sample format, sample rate, number of channels, channel arrangement, and sample size.
In some embodiments of the present disclosure, the apparatus 400 may employ independent decoding threads to decode the plurality of multimedia files in parallel.
In an embodiment of the present disclosure, the processor 410 may be, for example, a Central Processing Unit (CPU), a microprocessor, a Digital Signal Processor (DSP), a processor based on a multi-core processor architecture, or the like. The memory 420 may be any type of memory implemented using data storage technology including, but not limited to, random access memory, read only memory, semiconductor-based memory, flash memory, disk memory, and the like.
Further, in an embodiment of the present disclosure, the apparatus 400 may also include an input device 430, such as a microphone, a keyboard, or a mouse, for inputting the plurality of multimedia files to be mixed. In addition, the apparatus 400 may further comprise an output device 440, such as a speaker or a display, for outputting the mixed multimedia file.
In other embodiments of the present disclosure, a computer-readable storage medium is also provided, in which a computer program is stored, wherein the computer program, when executed by a processor, is capable of implementing the steps of the method as shown in fig. 1 to 3.
Because the embodiments of the present disclosure decode multiple multimedia files (data streams) in parallel using multiple threads, and each multimedia file uses its own buffer queue to relieve memory pressure, the method and apparatus for mixing multimedia files can increase the speed at which multimedia files are mixed.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus and methods according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
As used herein and in the appended claims, the singular forms of words include the plural and vice versa, unless the context clearly dictates otherwise. Thus, when reference is made to the singular, it is generally intended to include the plural of the corresponding term. Similarly, the terms "comprising" and "including" are to be construed as being inclusive rather than exclusive. Likewise, the terms "include" and "or" should be construed as inclusive unless such interpretation is explicitly prohibited herein. Where the term "example" is used herein, particularly when it comes after a set of terms, it is merely exemplary and illustrative and should not be considered exclusive or extensive.
Further aspects and ranges of adaptability will become apparent from the description provided herein. It should be understood that various aspects of the present application may be implemented alone or in combination with one or more other aspects. It should also be understood that the description and specific examples are intended for purposes of illustration only and are not intended to limit the scope of the present disclosure.
Several embodiments of the present disclosure have been described in detail above, but it is apparent that various modifications and variations can be made to the embodiments of the present disclosure by those skilled in the art without departing from the spirit and scope of the present disclosure. The scope of the present disclosure is defined by the appended claims.

Claims (10)

1. A method for mixing multimedia files, comprising:
obtaining a plurality of multimedia files, wherein the plurality of multimedia files comprises a plurality of audio files;
decoding the plurality of multimedia files in parallel;
formatting the audio portion of the decoded multimedia file in parallel;
storing the audio parts subjected to format processing in corresponding buffer queues respectively in parallel;
determining a number of buffer queues in which audio data exceeding a threshold size is stored; and
in response to the number of buffer queues in which audio data exceeding the threshold size is stored being greater than a threshold number, superimposing the data in the buffer queues according to a target output file format to form a superimposed multimedia file.
2. The method of claim 1, wherein the plurality of multimedia files further comprises at least one video file, and decoding the plurality of multimedia files in parallel comprises:
determining whether the target output file format is a video file format;
in response to the target output file format being a video file format, decoding all of the plurality of multimedia files in parallel; and
in response to the target output file format not being a video file format, decoding an audio file of the plurality of multimedia files and an audio portion of a video file in parallel.
3. The method of claim 2, wherein one of the plurality of multimedia files is designated as a reference file, and the time duration for decoding the plurality of multimedia files is determined according to the time duration of the reference file.
4. The method of claim 3, wherein decoding the plurality of multimedia files in parallel further comprises:
in response to the target output file format being a video file format, determining whether the reference file includes a video portion;
in response to the reference file including a video portion, decoding the video portion, and storing the decoded video portion in a video buffer queue.
5. The method of any of claims 3-4, wherein decoding the plurality of multimedia files in parallel further comprises:
in response to the target output file format being a video file format, determining whether a non-reference file of the plurality of multimedia files other than the reference file includes a video portion;
in response to the non-reference file including a video portion, decoding only an audio portion of the non-reference file.
6. The method of claim 4, wherein overlaying data in the buffer queue according to a target output file format comprises:
mixing the audio data in the buffer queue;
encoding the audio data after the audio mixing; and
in response to the reference file including a video portion, encoding video data in the video buffer queue, and packaging the encoded video data with encoded audio data.
7. The method of any of claims 1-4, further comprising:
determining whether the superimposed multimedia file needs to be transcoded according to the target output file format; and
in response to determining that the superimposed multimedia file needs to be transcoded, transcoding the superimposed multimedia file according to the target output file format.
8. The method of claim 3, wherein the reference file is a file that is obtained first among the plurality of multimedia files.
9. An apparatus for mixing multimedia files, comprising:
at least one processor; and
at least one memory storing a computer program;
wherein the computer program, when executed by the at least one processor, causes the apparatus to perform the steps of the method according to any one of claims 1-8.
10. A computer-readable storage medium storing a computer program, wherein the computer program realizes the steps of the method according to any one of claims 1-8 when executed by a processor.
CN202111341817.1A (filed 2021-11-12, priority date 2021-11-12) Method and apparatus for mixing multimedia files. Status: Pending. Publication: CN115250367A.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111341817.1A CN115250367A (en) 2021-11-12 2021-11-12 Method and apparatus for mixing multimedia files

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111341817.1A CN115250367A (en) 2021-11-12 2021-11-12 Method and apparatus for mixing multimedia files

Publications (1)

Publication Number Publication Date
CN115250367A true CN115250367A (en) 2022-10-28

Family

ID=83697901

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111341817.1A Pending CN115250367A (en) 2021-11-12 2021-11-12 Method and apparatus for mixing multimedia files

Country Status (1)

Country Link
CN (1) CN115250367A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101261864A (en) * 2008-04-21 2008-09-10 中兴通讯股份有限公司 A method and system for mixing recording voice at a mobile terminal
CN106782576A (en) * 2017-02-15 2017-05-31 合网络技术(北京)有限公司 audio mixing method and device
CN108184079A (en) * 2017-12-29 2018-06-19 北京奇虎科技有限公司 The merging method and device of a kind of multimedia file
CN110430330A (en) * 2019-08-08 2019-11-08 北京云中融信网络科技有限公司 A kind of audio data processing method and device based on call
CN111246243A (en) * 2020-01-15 2020-06-05 天脉拓道(北京)科技有限公司 File encoding and decoding method and device, terminal and storage medium


Similar Documents

Publication Publication Date Title
US11463831B2 (en) Apparatus and method for efficient object metadata coding
US7072726B2 (en) Converting M channels of digital audio data into N channels of digital audio data
US20120183148A1 (en) System for multichannel multitrack audio and audio processing method thereof
KR20150028147A (en) Apparatus for encoding audio signal, apparatus for decoding audio signal, and apparatus for replaying audio signal
US11705144B2 (en) Methods and systems for encoding frequency-domain data
CN115250367A (en) Method and apparatus for mixing multimedia files
US20220225055A1 (en) Spatial Audio Augmentation and Reproduction
US20080130941A1 (en) Watermark detection device
CN111405354A (en) Optimization method and system for player channel switching, storage medium and player
CN114339389B (en) Audio-video system
JP2003066994A (en) Apparatus and method for decoding data, program and storage medium
CN114203190A (en) Matrix-based audio packet format metadata and generation method, device and storage medium
CN114339389A (en) Audio-video system
TW201528251A (en) Apparatus and method for efficient object metadata coding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination