CN117079671A - Audio processing method, device, computer equipment and storage medium - Google Patents
- Publication number
- CN117079671A (application CN202311257571.9A)
- Authority
- CN
- China
- Prior art keywords
- audio
- processed
- language model
- segments
- key frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application discloses an audio processing method, an audio processing device, a computer device and a storage medium. The audio processing method includes: dividing audio to be processed into a plurality of audio fragments; performing key frame identification on the audio to be processed to obtain key frames in the audio to be processed; compressing the audio to be processed based on the key frames and the plurality of audio fragments; generating, based on first prompt information and the compressed audio to be processed, first input information to be input into a large language model, the first prompt information being used to prompt the large language model to perform audio understanding; and inputting the first input information into the large language model to obtain an audio understanding result corresponding to the audio to be processed. The method can improve the accuracy of understanding long-duration audio.
Description
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to an audio processing method, an apparatus, a computer device, and a storage medium.
Background
With the rapid development of artificial intelligence (Artificial Intelligence, AI) technology, the large language models (Large Language Model, LLM) that have emerged in recent years have received much attention for their ability to understand and answer questions posed by users more intelligently. Large language models are therefore also commonly used for audio understanding, but the related art is not accurate when understanding audio of long duration.
Disclosure of Invention
The application provides an audio processing method, an audio processing device, computer equipment and a storage medium, which can improve the accuracy of long-time audio understanding.
In a first aspect, an embodiment of the present application provides an audio processing method, including: dividing audio to be processed into a plurality of audio fragments; performing key frame identification on the audio to be processed to obtain key frames in the audio to be processed; compressing the audio to be processed based on the key frame and the plurality of audio fragments; generating first input information for inputting a large language model based on first prompt information and the audio to be processed after the compression processing, wherein the first prompt information is used for prompting the large language model to perform audio understanding; and inputting the first input information into the large language model to obtain an audio understanding result corresponding to the audio to be processed.
In a second aspect, an embodiment of the present application provides an audio processing apparatus, including: the system comprises an audio segmentation module, a key frame identification module, an audio compression module, an information generation module and an audio understanding module, wherein the audio segmentation module is used for dividing audio to be processed into a plurality of audio fragments; the key frame identification module is used for carrying out key frame identification on the audio to be processed to obtain key frames in the audio to be processed; the audio compression module is used for compressing the audio to be processed based on the key frame and the plurality of audio fragments; the information generation module is used for generating first input information for inputting a large language model based on first prompt information and the audio to be processed after the compression processing, and the first prompt information is used for prompting the large language model to understand the audio; the audio understanding module is used for inputting the first input information into the large language model to obtain an audio understanding result corresponding to the audio to be processed.
In a third aspect, an embodiment of the present application provides a computer apparatus, including: one or more processors; a memory; one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications configured to perform the audio processing method provided in the first aspect above.
In a fourth aspect, an embodiment of the present application provides a computer readable storage medium having stored therein program code that is callable by a processor to perform the audio processing method provided in the first aspect described above.
According to the solution provided by the application, the audio to be processed is divided into a plurality of audio fragments, key frame identification is performed on the audio to be processed to obtain key frames in the audio to be processed, and the audio to be processed is compressed based on the key frames and the audio fragments. First input information for the large language model is then generated based on the compressed audio to be processed and first prompt information that prompts the large language model to perform audio understanding, and the first input information is input into the large language model to obtain an audio understanding result corresponding to the audio to be processed. In this way, when the audio to be processed is understood through the large language model, the input information for the large language model is determined from the audio to be processed after it has been compressed according to the identified key frames and the divided audio fragments. When long-duration audio is understood, the audio therefore does not need to be segmented and fed into the large language model multiple times, so the large language model can better understand the long-duration audio and the accuracy of long-duration audio understanding is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 shows a flow diagram of an audio processing method according to an embodiment of the application.
Fig. 2 shows a flow diagram of an audio processing method according to another embodiment of the application.
Fig. 3 shows a flow diagram of an audio processing method according to a further embodiment of the application.
Fig. 4 shows a flow diagram of an audio processing method according to a further embodiment of the application.
Fig. 5 shows a flow diagram of an audio processing method according to yet another embodiment of the application.
Fig. 6 shows a flow diagram of an audio processing method according to yet another embodiment of the application.
Fig. 7 shows a flow diagram of an audio processing method according to yet another embodiment of the application.
Fig. 8 shows a block diagram of an audio processing device according to an embodiment of the application.
Fig. 9 is a block diagram of a computer apparatus for performing an audio processing method according to an embodiment of the present application.
Fig. 10 is a memory unit for storing or carrying program codes for implementing an audio processing method according to an embodiment of the present application.
Detailed Description
In order to enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings.
With the advent of AI technology, which can help computers accurately identify and understand the content in audio, audio understanding based on deep learning has emerged, in which useful features are extracted from large amounts of data by training neural network models. With the further development of artificial intelligence, audio understanding can also be performed through a large language model (Large Language Model, LLM). A large language model is a complex artificial neural network trained on large amounts of data and computing resources; it can learn rich language patterns and knowledge and thus generate accurate responses to natural language input.
In the related art, when audio understanding is performed through a large language model, features are extracted from the original audio, converted into the feature space of the large language model, and input into the large language model, which then outputs an audio understanding result for the input content. However, when the original audio is long, it contains a large number of audio frames, while the large language model is limited in how much content it can take per input. The original audio therefore has to be divided into multiple segments, and each segment is separately feature-extracted, converted into the feature space of the large language model, and input into the model to obtain the audio understanding result corresponding to that segment. Because the audio is split into multiple segments whose input information is determined and fed to the large language model separately, the large language model loses information about the original audio as a whole when performing audio understanding. Long-duration audio is thus perceived and understood only in a limited way, and the granularity of content understanding is relatively coarse, so the content in the audio cannot be understood in fine detail.
In order to solve the above problems, the inventor proposes the audio processing method, apparatus, computer device and storage medium provided by the embodiments of the application. When the audio to be processed is understood through the large language model, the input information for the large language model is determined from the audio to be processed after it has been compressed according to the identified key frames and the divided audio segments. When long-duration audio is understood, the audio therefore does not need to be segmented and fed into the large language model multiple times, so the large language model can better understand the long-duration audio and the accuracy of long-duration audio understanding is improved. The specific audio processing method is described in detail in the following embodiments.
The following describes the audio processing method according to the embodiment of the present application in detail with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is a flow chart illustrating an audio processing method according to an embodiment of the application. In a specific embodiment, the audio processing method is applied to an audio processing apparatus 700 as shown in fig. 8 and a computer device 100 (fig. 9) configured with the audio processing apparatus 700. The specific flow of the present embodiment will be described below by taking a computer device as an example, and it will be understood that the computer device applied in the present embodiment may be a server, a smart phone, a tablet computer, a smart watch, a notebook computer, etc., which is not limited herein. The following details about the flow shown in fig. 1, the audio processing method may specifically include the following steps:
Step S110: the audio to be processed is divided into a plurality of audio pieces.
In the embodiment of the application, when a computer device performs audio understanding on audio to be processed whose duration is long, the related art has to divide the audio into multiple segments, convert each segment into information for the large language model, and input the segments separately to obtain an audio understanding result for each segment. The large language model thereby loses information about the original audio as a whole, and the understanding of the audio to be processed is not accurate enough. Therefore, the computer device may divide the audio to be processed into a plurality of audio fragments, so that the audio to be processed can later be compressed according to these fragments and the subsequently identified key frames. This reduces the number of audio frames in the audio to be processed, and the large language model no longer needs to be invoked multiple times on separately segmented audio.
In some embodiments, when the computer device divides the audio to be processed into audio segments, the audio to be processed may be divided according to different sound events in the audio segments, such as a human voice, a bird song, a car whistle, a musical instrument sound, and the like.
In one possible implementation, since audio understanding may be generally performed for speech, multiple audio segments may also be partitioned by identifying a human voice portion, time stamping the human voice portion, and then taking the audio portion composed of adjacent time stamps as one audio segment.
In some embodiments, the computer device may also divide the audio to be processed into a plurality of audio segments according to the target time length, for example, may divide the audio to be processed into a plurality of audio segments having a time length of 10 seconds.
Of course, the specific manner of dividing the audio segments of the audio to be processed may not be limited in the embodiment of the present application.
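For illustration only, a fixed-duration split such as the 10-second example above might be sketched in Python as follows; the in-memory sample array, the sample-rate argument and the helper name are assumptions rather than part of the described method.

```python
import numpy as np

def split_fixed_length(samples: np.ndarray, sample_rate: int,
                       segment_seconds: float = 10.0) -> list[np.ndarray]:
    """Split raw audio samples into consecutive fixed-duration segments."""
    segment_len = int(segment_seconds * sample_rate)
    # The final segment may be shorter than segment_seconds.
    return [samples[i:i + segment_len]
            for i in range(0, len(samples), segment_len)]
```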
In some embodiments, when the execution body of the audio processing method provided by the embodiment of the present application is a server, the audio to be processed may be audio uploaded by an electronic device, or target audio selected, according to a selection instruction sent by the electronic device, from audio stored in the server; when the execution body is an electronic device such as a smart phone, a tablet computer or a notebook computer, the audio to be processed may be audio stored locally on the electronic device or audio downloaded from a server.
Step S120: and carrying out key frame identification on the audio to be processed to obtain key frames in the audio to be processed.
In the embodiment of the application, when the computer device performs audio understanding on the audio to be processed, it may also perform key frame identification on the audio to be processed, so as to compress the audio to be processed according to the identified key frames. This reduces the number of audio frames in the audio to be processed, so the audio no longer needs to be segmented and input into the large language model multiple times. The key frames in the audio to be processed may be audio frames at which the pitch period, energy or similar parameters change abruptly, so that a key frame is representative of its neighboring audio frames.
In some embodiments, audio feature extraction may be performed on the audio to be processed, and a pitch parameter value of each audio frame is determined from the extracted audio features; the pitch parameter values may include pitch period, pitch energy, and the like. Then, for every two adjacent audio frames, the degree of change between their pitch parameter values is obtained. If the degree of change of the pitch parameter value of the later audio frame relative to the pitch parameter value of the previous audio frame is greater than a target threshold, the later audio frame may be determined to be a key frame; if it is not greater than the target threshold, the later audio frame is not a key frame.
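As a rough sketch of the threshold-based identification just described, the following Python snippet marks key frames from frame-to-frame pitch-parameter changes; the change measure, the helper extract_pitch_params and the threshold value are assumptions for illustration.

```python
def detect_key_frames(frames, extract_pitch_params, target_threshold: float):
    """Mark a frame as a key frame when its pitch parameters change sharply
    relative to the previous frame.

    `extract_pitch_params` is an assumed helper returning a NumPy vector of
    pitch parameter values (e.g. pitch period, energy) for one frame.
    """
    key_frame_indices = []
    prev = extract_pitch_params(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        cur = extract_pitch_params(frame)
        # Degree of change between adjacent frames (relative difference).
        change = abs(cur - prev).sum() / (abs(prev).sum() + 1e-8)
        if change > target_threshold:
            key_frame_indices.append(i)
        prev = cur
    return key_frame_indices
```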
In some embodiments, the key frame recognition is performed on the audio to be processed through a pre-trained key frame recognition model, so that the key frame in the audio to be processed is obtained. When key frames are identified for the audio to be processed, each frame of audio frame of the audio to be processed can be input into the key frame identification model so as to obtain an identification result of the key frame identification model on each frame of audio frame; and according to the identification result of each frame of audio frame, determining the key frame in the audio to be processed. The specific type of the key frame recognition model may not be limited, and for example, the key frame recognition model may be a deep neural network (Deep Neural Network, DNN), a Long Short-Term Memory (LSTM), a Transformer network, and the like.
In one possible implementation, the recognition result output by the key frame recognition model for each audio frame may indicate whether that audio frame is a key frame. For example, the recognition result may be 1 or 0: a result of 1 indicates that the audio frame is a key frame, and a result of 0 indicates that it is not. Therefore, according to the recognition result of the key frame recognition model for each audio frame, it can be determined which audio frames in the audio to be processed are key frames.
In one possible implementation, the recognition result of the key frame recognition model for each frame of audio frame may include a key frame probability that characterizes the probability that the audio frame is a key frame, and the greater the key frame probability, the greater the likelihood that the audio frame is a key frame. In this manner, the key frame probability corresponding to each frame of audio frame may be compared to a key frame threshold; according to the comparison result, if the key frame probability is greater than the key frame threshold, the audio frame can be determined to be the key frame; if the key frame probability is not greater than the key frame threshold, it may be determined that the audio frame is not a key frame. Therefore, the key frames in the audio to be processed can be determined by further judging the corresponding identification result of each frame of audio frame.
In one possible implementation, the above key frame recognition model may be trained by: acquiring sample audio, wherein each frame of audio frame in the sample audio can be marked with a key frame label, and the key frame label is used for representing whether the audio frame is a key frame or not; when a key frame recognition model is obtained based on sample audio training, each frame of sample audio can be input into an initial recognition model to obtain a recognition result of the initial recognition model for the sample audio frame, and then a loss value corresponding to the initial recognition model is determined according to the recognition result of each frame of sample audio frame and a key frame label marked by each frame of sample audio frame; after determining the loss value corresponding to the initial recognition model, performing iterative training on the initial recognition model according to the loss value to obtain a final key frame recognition model. When determining the loss value corresponding to the initial recognition model, the loss value can be determined according to the difference between the recognition result corresponding to each frame of sample audio frame and the marked key frame label.
Optionally, when the initial recognition model is iteratively trained according to the determined loss value, the model parameters of the initial recognition model can be adjusted according to the calculated loss value. Sample audio frames of the sample audio are then repeatedly input into the initial recognition model to obtain recognition results for the input sample audio frames, the loss value corresponding to the initial recognition model is determined based on the recognition results and the labelled key frame labels, and the model parameters are adjusted according to the loss value, until a training end condition is met and the trained key frame recognition model is obtained. The training end condition of the iterative training may include: the number of iterative training rounds reaches a target number; or the total loss value of the initial recognition model satisfies a set condition.
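A minimal PyTorch-style sketch of the iterative training described above is given below; the optimizer, the binary cross-entropy loss and the data-loader format are assumptions, since the description only requires a loss computed from the difference between the recognition results and the labelled key frame labels.

```python
import torch
import torch.nn as nn

def train_key_frame_model(model: nn.Module, loader, epochs: int = 10, lr: float = 1e-4):
    """Iteratively train an initial recognition model on labelled sample audio frames."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # Loss measuring the difference between the prediction and the key-frame label.
    criterion = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for frame_features, key_frame_labels in loader:
            logits = model(frame_features).squeeze(-1)
            loss = criterion(logits, key_frame_labels.float())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()  # adjust model parameters according to the loss value
    return model
```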
Of course, the specific manner of performing the key frame recognition on the audio to be processed may not be limited in the embodiment of the present application.
Step S130: and compressing the audio to be processed based on the key frame and the plurality of audio fragments.
In the embodiment of the application, after the key frames in the audio to be processed have been identified and the audio to be processed has been divided into a plurality of audio fragments, the audio to be processed can be compressed according to the divided audio fragments and the identified key frames. Compression processing in the embodiment of the application refers to reducing the number of audio frames in the audio to be processed. It can be appreciated that, since a key frame is representative of its neighboring audio frames, the plurality of audio fragments can be compressed based on the key frames, so that the number of audio frames is reduced while the information in the audio to be processed can still be effectively extracted during subsequent audio understanding.
In some implementations, a determination may be made for each of the above plurality of audio segments as to whether a key frame is included in the audio segment; if the audio segment includes a key frame, the key frame and a first target number of audio frames adjacent to the key frame can be obtained from the audio segment as part of the audio frames in the audio segment; if no key frames are included in the audio segment, a second target number of audio frames may be extracted from the audio segment according to specified extraction rules. The specific values of the first target number and the second target number may not be limited, and may be, for example, 5 frames, 10 frames, 20 frames, or the like.
In one possible implementation, the second target number may be smaller than the first target number; that is, more partial audio frames may be extracted from an audio segment that includes a key frame than from an audio segment that does not. It will be appreciated that audio segments without key frames show little variation in pitch period, energy, and so on, so the information in their frames is largely similar, and fewer audio frames can be extracted from them to further reduce the number of audio frames used in subsequent audio understanding.
In one possible implementation, in the case that no key frame is included in any audio segment, when the second target number of audio frames is extracted from the audio segment according to the specified extraction rule, the second target number of audio frames may be randomly extracted from the audio segment.
In a possible implementation, in the case that no key frame is included in an audio segment, when the second target number of audio frames is extracted from the audio segment according to the specified extraction rule, a second target number of audio frames satisfying a target distribution condition may also be extracted as the partial audio frames of the audio segment. The target distribution condition may be that the extracted frames are uniformly distributed within the audio segment, so that information can still be effectively extracted from the audio segment during subsequent audio understanding.
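A possible sketch of the per-segment frame selection described above, assuming per-segment frame lists; the exact extraction rule and the target numbers are illustrative assumptions.

```python
import numpy as np

def select_frames(segment_frames, key_frame_indices,
                  first_target: int = 20, second_target: int = 10):
    """Keep frames from one audio segment.

    If the segment contains key frames, keep each key frame plus its neighboring
    frames; otherwise keep a smaller number of frames spread uniformly over the
    segment (the target distribution condition).
    """
    n = len(segment_frames)
    if key_frame_indices:
        keep = set()
        half = first_target // 2
        for k in key_frame_indices:
            keep.update(range(max(0, k - half), min(n, k + half + 1)))
        return [segment_frames[i] for i in sorted(keep)]
    # No key frame: uniformly distributed extraction.
    idx = np.linspace(0, n - 1, num=min(second_target, n), dtype=int)
    return [segment_frames[i] for i in idx]
```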
Of course, in the embodiment of the present application, the specific manner of compressing the audio to be processed based on the key frames and the plurality of audio fragments is not limited, as long as the necessary audio frames are retained from the audio to be processed based on the key frames, so that the information in the audio to be processed can be effectively extracted.
Step S140: and generating first input information for inputting a large language model based on the first prompt information and the audio to be processed after the compression processing, wherein the first prompt information is used for prompting the large language model to carry out audio understanding.
In the embodiment of the application, after the audio to be processed has been compressed, the input information for the large language model can be determined from all of the audio frames retained in the audio to be processed. Input information for the large language model can be generated from the first prompt information and the retained audio frames. The first prompt information may be a prompt for prompting the large language model to perform audio understanding; it conveys the user's intention, so that after the large language model interprets the first prompt information it knows the user's intention and processes the audio accordingly. It can be understood that, because the audio to be processed is compressed using the identified key frames and the divided audio fragments before the input information for the large language model is generated, the number of audio frames required to generate the input information is reduced while the retained audio frames can still provide the information necessary for audio understanding. The input information for the large language model therefore only needs to be generated once for the audio to be processed, and this input information retains the overall information of the audio to be processed, so the audio understanding result output by the large language model is more accurate.
In some embodiments, feature extraction may be performed on the compressed audio to be processed, and the extracted features are then converted into the feature space of the large language model, that is, into tokens of the kind the large language model normally receives as input. After these tokens are spliced with the above first prompt information (which is also tokenized text), the first input information for the large language model is obtained.
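A hedged sketch of this step, assuming a simple linear adapter maps audio features into the model's embedding space and that the prompt has already been tokenized and embedded; the actual projection and the splicing order are not specified in the description.

```python
import torch
import torch.nn as nn

class AudioToLLMInput(nn.Module):
    """Project compressed-audio features into the large language model's
    embedding space and splice them with the prompt tokens (illustrative)."""

    def __init__(self, audio_feat_dim: int, llm_hidden_dim: int):
        super().__init__()
        # Linear adapter mapping audio features into the LLM feature space.
        self.projector = nn.Linear(audio_feat_dim, llm_hidden_dim)

    def forward(self, audio_features: torch.Tensor,
                prompt_embeddings: torch.Tensor) -> torch.Tensor:
        # audio_features: (num_kept_frames, audio_feat_dim)
        # prompt_embeddings: (num_prompt_tokens, llm_hidden_dim) — the first prompt information
        audio_tokens = self.projector(audio_features)
        # One possible splicing order: prompt tokens followed by audio tokens.
        return torch.cat([prompt_embeddings, audio_tokens], dim=0)
```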
Step S150: and inputting the first input information into the large language model to obtain an audio understanding result corresponding to the audio to be processed.
In the embodiment of the application, after the first input information is obtained, it can be input into the large language model to obtain an audio understanding result corresponding to the audio to be processed. The audio understanding result may include a description of the audio, an answer produced by understanding the audio in light of the first prompt information, and so on; the specific content of the audio understanding result is not limited. The large language model may be a pre-trained model capable of audio understanding, for example an AIGC-based generative large language model, and it generates the audio understanding result from the input information. According to the first input information, the large language model interprets the first prompt information to obtain the user's intention, and then outputs the audio understanding result based on the information extracted from the compressed audio to be processed in the first input information.
In some embodiments, considering that the large language model has broad capabilities and that there are numerous audio understanding scenarios, a Low-Rank Adaptation (LoRA) model can be added to the large language model and trained with sample audio of the corresponding scenario, so that the large language model can better understand audio in that scenario. The LoRA model is obtained by fine-tuning a cross attention layer in a UNet module in the large language model; the LoRA model is applied to a cross attention layer of the large language model and, with the parameters of the large language model frozen, is trained with sample audio of the corresponding scenario, thereby obtaining a LoRA model that better understands audio in that scenario.
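For illustration, attaching a LoRA adapter with the parameters of the base model frozen might be sketched with the `peft` library as follows; the target module names and hyperparameters are assumptions that depend on the attention-layer naming of the specific large language model.

```python
from peft import LoraConfig, get_peft_model

def add_lora_adapter(large_language_model):
    """Attach a LoRA adapter and freeze the base large language model (sketch)."""
    # Freeze the parameters of the large language model.
    for param in large_language_model.parameters():
        param.requires_grad = False

    lora_config = LoraConfig(
        r=8,                 # low-rank dimension (illustrative)
        lora_alpha=16,
        target_modules=["q_proj", "v_proj"],  # assumed attention projection names
        lora_dropout=0.05,
    )
    # Only the LoRA parameters remain trainable; they are then trained on
    # sample audio from the target scenario.
    return get_peft_model(large_language_model, lora_config)
```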
In some embodiments, when the execution body of the audio processing method provided by the embodiment of the present application is an electronic device, some of the above steps S110 to S150 may be executed locally by the electronic device, while others may be submitted by the electronic device to a server, which returns the result after execution. For example, the electronic device may submit a key frame identification request to a corresponding server and upload the audio to be processed, so that the server performs key frame identification on the audio to be processed and feeds the identification result back to the electronic device. The electronic device then locally divides the audio to be processed into a plurality of audio fragments and compresses the audio to be processed based on the key frames and the divided audio fragments. The compressed audio to be processed and the first prompt information are uploaded to the server; after the server generates the first input information for the large language model based on the first prompt information and the compressed audio to be processed, it inputs the first input information into the large language model to obtain the audio understanding result corresponding to the audio to be processed, and returns the audio understanding result to the electronic device.
According to the audio processing method provided by the embodiment of the application, when the audio to be processed is understood through the large language model, the audio to be processed is first compressed according to the identified key frames and the divided audio fragments, and the input information for the large language model is then determined from the compressed audio to be processed. When long-duration audio is understood, the audio therefore does not need to be segmented and input into the large language model multiple times, so the large language model can better understand the long-duration audio and the accuracy of understanding long-duration audio is improved.
Referring to fig. 2, fig. 2 is a flow chart illustrating an audio processing method according to another embodiment of the application. The audio processing method is applied to the computer device, and will be described in detail with respect to the flowchart shown in fig. 2, and the audio processing method specifically may include the following steps:
step S210: the audio to be processed is divided into a plurality of audio pieces.
In some embodiments, when the audio to be processed is divided into a plurality of audio segments, voice activity detection (VAD) may be performed on the audio to be processed to obtain a detection result, and the audio to be processed is then divided into a plurality of audio segments according to the detection result. That is, the division of the audio to be processed into audio segments is determined from the portions of the audio in which a speech signal is present.
In one possible implementation, when the audio to be processed is divided into a plurality of audio segments according to the detection result of voice activity detection, a plurality of target timestamps can be marked on the audio to be processed according to the detection result, where the audio between two adjacent target timestamps forms one audio segment, and the target timestamps are located at the starting audio frame and the ending audio frame of each portion in which the detection result indicates a speech signal. That is, the audio to be processed may be divided by marking the timestamps of the portions containing speech, so that every two adjacent target timestamps bound one audio segment and a plurality of audio segments is obtained. When the audio to be processed is processed further, it can be compressed according to the marked timestamps and the audio segments formed between adjacent timestamps.
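A sketch of the timestamp-marking logic described above, assuming a per-frame voice-activity detector `vad_is_speech`; the detector itself and the frame-level granularity are assumptions.

```python
def segment_by_vad(frames, vad_is_speech):
    """Mark target timestamps at the start and end of each detected speech
    portion and treat the audio between adjacent marks as one segment."""
    timestamps = []
    in_speech = False
    for i, frame in enumerate(frames):
        speech = vad_is_speech(frame)
        if speech and not in_speech:
            timestamps.append(i)          # starting audio frame of a speech portion
        elif not speech and in_speech:
            timestamps.append(i)          # ending audio frame of a speech portion
        in_speech = speech
    if in_speech:
        timestamps.append(len(frames))
    # Each pair of adjacent target timestamps bounds one audio segment.
    return [(timestamps[j], timestamps[j + 1]) for j in range(len(timestamps) - 1)]
```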
Step S220: and carrying out key frame identification on the audio to be processed to obtain key frames in the audio to be processed.
In the embodiment of the present application, step S220 may refer to the content of the foregoing embodiment, and is not described herein.
Step S230: and based on the key frame, performing at least one of audio fragment filtering and downsampling on the plurality of audio fragments to obtain the audio to be processed after compression processing, wherein the audio fragment filtering comprises filtering at least one audio fragment in the plurality of audio fragments.
In the embodiment of the application, when the audio to be processed is compressed based on the identified key frame and the plurality of audio fragments, at least one of audio fragment filtering and downsampling can be performed on the plurality of audio fragments based on the key frame, so that the number of audio frames is reduced, and the audio to be processed after the compression processing is obtained.
In some embodiments, when audio segment filtering is performed on the plurality of audio segments based on the identified key frames in the audio to be processed, the first audio segments, i.e. those of the plurality of audio segments that do not include a key frame, may be determined and then filtered out. That is, audio segments of the audio to be processed that do not include key frames may be deleted directly. It will be appreciated that audio segments without key frames show little variation in pitch period, energy, and so on; they may be silence or meaningless audio whose frames contain largely similar information, so they can be deleted directly to reduce the number of audio frames in the audio to be processed to a greater extent.
In some implementations, a second audio segment of the plurality of audio segments that includes at least one frame of key frames may be determined when the plurality of audio segments are downsampled based on the identified key frames in the audio to be processed; the second audio segment is downsampled. It will be appreciated that for audio clips that include key frames, they may be downsampled to reduce the number of audio frames in the audio clips so that the necessary audio frames can be preserved to effectively extract information in the audio to be processed.
It should be noted that this embodiment may be combined with the previous embodiment, that is, the compressing processing of the audio to be processed may be performing audio segment filtering on a plurality of audio segments and downsampling the plurality of audio segments, where in combination, it is determined whether to directly filter or downsample according to whether the audio segments include key frames.
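Combining the two operations as described, a possible sketch is given below; the segment and key-frame data structures and the downsampling factor are illustrative assumptions.

```python
def compress_segments(segments, key_frames_per_segment, downsample_factor: int = 4):
    """Filter out segments with no key frame and downsample the rest.

    `segments` is a list of per-segment frame lists; `key_frames_per_segment`
    holds the key-frame indices for each segment.
    """
    compressed = []
    for frames, key_idx in zip(segments, key_frames_per_segment):
        if not key_idx:
            continue                      # first audio segment (no key frame): drop it
        key_set = set(key_idx)
        # Second audio segment (has key frames): keep every Nth frame,
        # but always keep the key frames themselves.
        kept = [f for i, f in enumerate(frames)
                if i % downsample_factor == 0 or i in key_set]
        compressed.append(kept)
    return compressed
```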
In one possible implementation manner, the second audio segment is divided based on the key frames in the second audio segment, so as to obtain a plurality of audio sub-segments corresponding to the second audio segment; and downsampling each audio sub-segment in the plurality of audio sub-segments, wherein the sampling rate corresponding to the audio sub-segment in which the key frame is positioned is higher than the sampling rate corresponding to other audio sub-segments, and the other audio sub-segments are audio sub-segments except the audio sub-segment in which the key frame is positioned in the plurality of audio sub-segments.
In the above embodiment, the audio sub-segments containing the key frames and the audio sub-segments not containing the key frames may be divided from the second audio segment based on the key frames in the second audio segment. That is, for the second audio segment, the audio segment may be further divided according to the position of the key frame in the audio segment, and multiple audio sub-segments may be obtained, so that a part of the audio sub-segments include the key frame, a part of the audio sub-segments do not include the key frame, then the audio sub-segments where the key frame is located are sampled according to the sampling rate corresponding to the audio sub-segments where the key frame is located, and the audio sub-segments where the key frame is not included are sampled according to the sampling rate corresponding to the audio sub-segments where the key frame is not included. Also, since the pitch parameter values of the audio segments not including the key frames are not changed much, the information included in each audio frame is relatively close, so that a relatively low sampling rate can be used to extract relatively few audio frames, while for the audio segments including the key frames, a relatively higher sampling rate can be used to preserve the necessary audio frames to effectively extract the information in the audio to be processed.
Optionally, after the audio sub-segments including the key frame and the audio sub-segments not including the key frame are divided from the second audio segment based on the key frame in the second audio segment, if the audio sub-segments including the key frame are two adjacent audio sub-segments, the duration of the distance between the key frames in the two audio sub-segments may be determined, and if the duration of the distance is less than the duration threshold, the two audio sub-segments may be combined into one audio sub-segment, so that after sampling, effective information for audio understanding can be better retained.
In one possible implementation, the above identified key frames may include a primary key frame and a secondary key frame, where the primary key frame is an audio frame having a pitch parameter value that is greater than a first threshold with respect to a previous frame audio frame, the secondary key frame is an audio frame having a pitch parameter value that is greater than a second threshold with respect to a previous frame audio frame, and the first threshold is greater than the second threshold, that is, the pitch parameter value of the primary key frame is greater than a threshold with respect to a previous frame audio frame, and the pitch parameter value of the secondary key frame is greater than a threshold with respect to a previous frame audio frame. In the above embodiments, the sampling rate of the audio sub-segments including the primary key frames may be greater than the sampling rate of the audio sub-segments including the secondary key frames, thereby enabling better retention of effective information for audio understanding.
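A possible sketch of the sub-segment sampling described above, where frames near key frames are kept at a higher rate than the rest; the neighbourhood width and step sizes are assumptions, since the description only requires the key-frame sub-segments to use a higher sampling rate.

```python
def downsample_second_segment(frames, key_idx, high_step: int = 2, low_step: int = 8,
                              neighbourhood: int = 10):
    """Divide a key-frame-containing segment into sub-segments and sample the
    sub-segments around key frames more densely than the others."""
    key_set = set(key_idx)
    near_key = set()
    for k in key_idx:
        near_key.update(range(max(0, k - neighbourhood),
                              min(len(frames), k + neighbourhood + 1)))
    kept = []
    for i, frame in enumerate(frames):
        step = high_step if i in near_key else low_step
        if i % step == 0 or i in key_set:
            kept.append(frame)
    return kept
```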
In some embodiments, considering that the audio duration of the audio to be processed, which needs to be subjected to audio understanding each time, is not fixed, in order to ensure that input information for inputting a large language model can be generated at one time for the audio to be processed after each compression processing, the total number of audio frames obtained after the compression processing for the audio to be processed may be a fixed value, for example, 400 frames, 500 frames, 700 frames, etc.
Step S240: and generating first input information for inputting a large language model based on the first prompt information and the audio to be processed after the compression processing, wherein the first prompt information is used for prompting the large language model to carry out audio understanding.
Step S250: and inputting the first input information into the large language model to obtain an audio understanding result corresponding to the audio to be processed.
In the embodiment of the present application, step S240 and step S250 may refer to the content of the foregoing embodiment, and are not described herein.
According to the audio processing method provided by the embodiment of the application, when the audio to be processed is understood through the large language model, at least one of audio segment filtering and downsampling is performed on the plurality of audio segments according to the identified key frames, and the input information for the large language model is then determined from the compressed audio to be processed. When long-duration audio is understood, the audio therefore does not need to be segmented and input into the large language model multiple times, so the large language model can better understand the long-duration audio and the accuracy of understanding long-duration audio is improved. In addition, because the compressed audio to be processed is obtained by filtering and/or downsampling the audio segments according to the key frames, the necessary audio frames are retained; after the input information determined from the compressed audio to be processed is input into the large language model, the model can effectively extract the information in the audio to be processed, which further guarantees the accuracy of audio understanding.
Referring to fig. 3, fig. 3 is a flow chart illustrating an audio processing method according to another embodiment of the application. The audio processing method is applied to the computer device, and will be described in detail with respect to the flowchart shown in fig. 3, and the audio processing method specifically may include the following steps:
step S310: the audio to be processed is divided into a plurality of audio pieces.
Step S320: and carrying out key frame identification on the audio to be processed to obtain key frames in the audio to be processed.
Step S330: and compressing the audio to be processed based on the key frame and the plurality of audio fragments.
In the embodiment of the present application, the steps S310 to S330 may refer to the content of other embodiments, which are not described herein.
Step S340: and extracting the characteristics of the audio to be processed after the compression processing to obtain the characteristics to be input.
In the embodiment of the application, after the audio to be processed has been compressed based on the identified key frames and the divided audio fragments, when the first input information for the large language model is generated based on the first prompt information and the compressed audio to be processed, feature extraction can be performed on the compressed audio to be processed, and the resulting audio features can be converted into information suitable for the large language model, that is, into the feature space of the large language model, so that the audio features can be input into the large language model.
In some embodiments, when the feature extraction is performed on the audio to be processed to obtain the feature to be input, feature extraction may be performed on a spectrogram corresponding to each audio segment in the audio to be processed after the compression processing to obtain a global feature of the spectrogram corresponding to each audio segment and block features (patch features) corresponding to a plurality of image blocks; splicing the global features corresponding to the audio to be processed according to the sequence of the audio fragments in the audio to be processed to obtain first features; carrying out average pooling on the block features corresponding to the audio to be processed to obtain second features; and splicing the first feature and the second feature to obtain the feature to be input.
Global features generally refer to features extracted from the whole image, such as color, texture and shape, which describe the content and context of the entire image; patch features are features extracted from local regions (patches) of the image and are typically used to express local detail information such as color, texture or shape.
In the above embodiment, when the audio to be processed is compressed, a spectrogram corresponding to each audio piece may be determined for each audio piece remaining in the audio to be processed after the compression processing, so as to use the spectrogram as input information for extracting features. The corresponding spectrogram can be obtained by carrying out spectrum analysis on the reserved audio signal of each audio fragment.
In one possible implementation, the spectrograms corresponding to all the retained audio segments can be input into a pre-trained feature extraction model to obtain the image features corresponding to each spectrogram, namely the global features and the patch features. The specific type of feature extraction model used for image feature extraction is not limited; for example, it may be a visual encoder such as a Vision Transformer (e.g. ViT-L).
In the above embodiment, after the image features corresponding to the spectrograms are extracted, the image features may be pooled to reduce the data volume and thus the subsequent computation and memory consumption, and the pooled image features are then spliced to obtain the feature to be input. The image features can be fed into a spatial pooling branch and a temporal pooling branch, so that the image features corresponding to the spectrograms are pooled spatially and temporally, and the pooled features from the two branches are then spliced to obtain the feature to be input. Specifically, in the temporal branch, the global features of the spectrograms of all retained audio segments are spliced in the order of the audio segments within the audio to be processed to obtain the first feature; in the spatial branch, the block features of all the spectrograms are average-pooled to obtain the second feature; the first feature and the second feature are then spliced to obtain the feature to be input. Because the temporal pooling follows the chronological order of the retained audio segments, the time information is preserved and the accuracy of audio understanding is guaranteed.
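A hedged sketch of the pooling and splicing described above, assuming the global and block features have already been batched per retained segment; the tensor shapes and the pooled dimensions are illustrative assumptions.

```python
import torch

def pool_spectrogram_features(global_feats: torch.Tensor,
                              patch_feats: torch.Tensor) -> torch.Tensor:
    """Temporal/spatial pooling of spectrogram features.

    global_feats: (num_segments, dim)              — one global feature per retained segment
    patch_feats:  (num_segments, num_patches, dim) — block (patch) features per segment
    """
    # Temporal branch: splice global features in the order of the audio segments.
    first_feature = global_feats.reshape(-1)          # (num_segments * dim,)
    # Spatial branch: average-pool the block features.
    second_feature = patch_feats.mean(dim=(0, 1))      # (dim,)
    # Splice the first and second features into the feature to be input.
    return torch.cat([first_feature, second_feature], dim=0)
```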
In some embodiments, since the input being understood is audio, an audio encoder for extracting audio features may instead be used, in the usual audio processing manner, to extract audio features from the audio; audio tokens are generated from the extracted audio features and spliced with the first prompt information to obtain the first input information for the large language model.
Step S350: and splicing the features to be input with the first prompt information to obtain first input information for inputting a large language model, wherein the first prompt information is used for prompting the large language model to carry out audio understanding.
Step S360: and inputting the first input information into the large language model to obtain an audio understanding result corresponding to the audio to be processed.
In the embodiment of the present application, step S350 and step S360 may refer to the content of other embodiments, which are not described herein.
According to the audio processing method provided by the embodiment of the application, when the audio to be processed is understood through the large language model, the audio to be processed is compressed according to the identified key frames and the divided audio fragments, feature extraction is performed on the compressed audio to be processed, and the input information for the large language model is determined from the extracted features and the first prompt information. The audio therefore does not need to be segmented and input into the large language model multiple times when long-duration audio is understood, so the large language model can better understand the audio to be processed and the accuracy of long-duration audio understanding is improved.
Referring to fig. 4, fig. 4 is a flow chart illustrating an audio processing method according to still another embodiment of the application. The audio processing method is applied to the computer device, and will be described in detail with respect to the flowchart shown in fig. 4, and the audio processing method specifically may include the following steps:
step S410: the audio to be processed is divided into a plurality of audio pieces.
Step S420: and carrying out key frame identification on the audio to be processed to obtain key frames in the audio to be processed.
Step S430: and compressing the audio to be processed based on the key frame and the plurality of audio fragments.
Step S440: and generating first input information for inputting a large language model based on the first prompt information and the audio to be processed after the compression processing, wherein the first prompt information is used for prompting the large language model to carry out audio understanding.
Step S450: and inputting the first input information into the large language model to obtain an audio understanding result corresponding to the audio to be processed.
In the embodiment of the present application, steps S410 to S450 may refer to the contents of other embodiments, which are not described herein.
Step S460: And inputting second prompt information and the audio understanding result into a generative model to obtain generated target data of a target type, wherein the second prompt information is used for prompting the generative model to generate data of the target type.
In the embodiment of the application, after audio understanding is performed on the audio to be processed through the large language model to obtain the audio understanding result, data of a target type, such as video data or image data, can be further generated according to the audio understanding result, so that multimodal data is generated from the understanding of the audio to be processed. Second prompt information for prompting the generative model to generate data of the target type can be acquired, and the second prompt information and the audio understanding result are then input into the generative model to obtain the generated target data of the target type. The generative model may be a pre-trained model capable of generating data of the corresponding type, for example an AIGC-based generative large language model; when the target type is video, the generative model may also be VideoComposer or the like. Of course, the specific type of the generative model is not limited in the embodiment of the present application.
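As an illustrative sketch only, invoking the generative model with the second prompt information and the audio understanding result might look as follows; the generate() call and the output_type parameter are assumptions of this sketch and do not correspond to the API of any specific model:

```python
def generate_target_data(generative_model, second_prompt, audio_result, target_type):
    """Produce target data (e.g. an image or a video) from the understanding result.

    generative_model is assumed to expose generate(text, output_type=...);
    in practice it could be any pre-trained text-to-image or text-to-video model.
    """
    # The second prompt states which type of data to generate; the audio
    # understanding result supplies the content to depict.
    request = f"{second_prompt}\n{audio_result}"
    return generative_model.generate(request, output_type=target_type)
```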
In some embodiments, the generated target data may be of various types, thereby enriching the data generated from the audio understanding result. For example, an image may be generated for the audio understanding result of a first time period in the audio to be processed, and a video may be generated for the audio understanding result of a second time period in the audio to be processed.
According to the audio processing method provided by this embodiment of the application, when the audio to be processed is understood through the large language model, the audio to be processed is compressed according to the identified key frames and the divided audio fragments, and the input information for the large language model is determined from the compressed audio. Long-term audio therefore does not need to be segmented and input into the large language model multiple times; the large language model can better understand the long-term audio, and the accuracy of long-term audio understanding is improved. In addition, data of corresponding types can be further generated according to the audio understanding result, so that multimodal data is generated from the understanding of the audio to be processed, which enriches the application scenarios.
Referring to fig. 5, fig. 5 is a flow chart illustrating an audio processing method according to still another embodiment of the application. The audio processing method is applied to the computer device, and will be described in detail with respect to the flowchart shown in fig. 5, and the audio processing method specifically may include the following steps:
Step S510: And acquiring the audio time length corresponding to the audio to be processed.
In the embodiment of the application, when the audio to be processed is to be understood, the audio time length corresponding to the audio to be processed can first be acquired, so as to determine whether the audio to be processed that needs to be understood is long-term audio.
Step S520: and if the audio time length is smaller than or equal to the target time length, generating third input information for inputting a large language model based on the audio to be processed and the first prompt information.
In the embodiment of the application, after the audio time length of the audio to be processed is acquired, the audio time length can be compared with the target time length. If the comparison shows that the audio time length is less than or equal to the target time length, the audio to be processed that needs to be understood this time is not long-term audio. Therefore, following a conventional processing manner, the input information determined for the audio to be processed can be input into the large language model at one time without segmenting the audio and inputting the large language model multiple times, and the third input information for inputting the large language model can be generated directly based on the audio to be processed and the first prompt information.
Specifically, feature extraction may be performed on the audio to be processed to obtain the feature to be input, and the feature to be input may be spliced with the first prompt information to obtain the third input information for inputting the large language model. The manner of extracting features from the audio to be processed may refer to the content of the previous embodiments, which is not described herein.
Step S530: and inputting the third input information into the large language model to obtain an audio understanding result corresponding to the audio to be processed.
In the embodiment of the application, after the third input information is obtained, the third input information can be input into the large language model, so that an audio understanding result corresponding to the audio to be processed is obtained. Thus, the input information can be input into the large language model once, and the audio understanding of the audio to be processed can be completed.
Step S540: if the audio time length is longer than the target time length, dividing the audio to be processed into a plurality of audio fragments.
In the embodiment of the application, after the audio time length of the audio to be processed is acquired, the audio time length is compared with the target time length. If the comparison shows that the audio time length is longer than the target time length, the audio to be processed that needs to be understood this time is long-term audio, so the audio to be processed can be divided into a plurality of audio fragments and the subsequent steps can be performed, thereby improving the accuracy of audio understanding for long-term audio.
Step S550: and carrying out key frame identification on the audio to be processed to obtain key frames in the audio to be processed.
Step S560: and compressing the audio to be processed based on the key frame and the plurality of audio fragments.
Step S570: and generating first input information for inputting a large language model based on the first prompt information and the audio to be processed after the compression processing, wherein the first prompt information is used for prompting the large language model to carry out audio understanding.
Step S580: and inputting the first input information into the large language model to obtain an audio understanding result corresponding to the audio to be processed.
In the embodiment of the present application, the steps S550 to S580 can refer to the content of other embodiments, which are not described herein.
According to the audio processing method provided by this embodiment of the application, when the audio to be processed is understood through the large language model, the audio time length of the audio to be processed is first checked. If the audio time length is longer than the target time length, the audio to be processed is compressed according to the identified key frames and the divided audio fragments, and the input information for the large language model is determined from the compressed audio, so that for long-term audio the input information can be determined and input into the model at one time; the large language model can therefore better understand the long-term audio, and the accuracy of long-term audio understanding is improved. If the audio time length is not longer than the target time length, the input information for the large language model is determined directly from the audio to be processed, so that not only long-term audio but also short-term audio can be understood accurately.
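The branching of this embodiment can be outlined, purely as a sketch, as follows; the helper names (split_by_vad, detect_key_frames, compress_audio, extract_features, splice) and the threshold value stand in for the steps described above and are not defined by the embodiment:

```python
TARGET_TIME_LENGTH_S = 60.0  # assumed threshold separating long-term audio

def understand_audio(audio, first_prompt, large_language_model):
    """Route short audio directly to the LLM; compress long-term audio first."""
    audio_time_length = audio.num_samples / audio.sample_rate

    if audio_time_length <= TARGET_TIME_LENGTH_S:
        # Short audio: build the third input information directly and query once.
        features = extract_features(audio)
        third_input = splice(features, first_prompt)
        return large_language_model(third_input)

    # Long-term audio: segment, identify key frames, compress, then query once.
    segments = split_by_vad(audio)
    key_frames = detect_key_frames(audio)
    compressed = compress_audio(audio, key_frames, segments)
    features = extract_features(compressed)
    first_input = splice(features, first_prompt)
    return large_language_model(first_input)
```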
The audio processing method provided by the foregoing embodiment will be described by way of example.
Referring to fig. 6, when audio understanding is performed on the audio to be processed, key frame identification and audio segment division may first be performed, so that the audio to be processed is marked with time stamps for dividing the audio segments and with key frames. The audio to be processed is then compressed based on the marked time stamps and key frames. The spectrograms corresponding to the audio segments in the compressed audio are input into an audio encoder to obtain the corresponding features. The extracted features are input into a space pool and a time pool: in the time pool, the global features in the image features corresponding to all the spectrograms are spliced according to the temporal order of the audio segments, and in the space pool, the patch features in the image features corresponding to all the spectrograms are average-pooled. The features obtained from the space pool and the time pool are then spliced, mapped to the feature space of the large language model, spliced with the first prompt information, and input into the large language model, so that the large language model outputs an audio understanding result according to the input information, for example: from 3s to 5.6s, two people converse rapidly, with Sichuan accents and a mixture of Chinese and English; from 5min30s to 8min15s, four people engage in a car chase through scenes of streets, bridges, and a river, with cicada sounds in the background.
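The spectrogram step of this workflow could be sketched as follows, using a log-mel spectrogram computed with librosa purely as one possible representation; the mel parameters and the assumption that the audio encoder returns one global feature plus patch features are choices of this sketch, not requirements of the embodiment:

```python
import librosa

def encode_segment(segment_waveform, sample_rate, audio_encoder, n_mels=128):
    """Turn one retained audio segment into spectrogram features.

    audio_encoder is assumed to take an image-like spectrogram and return
    (global_feature, patch_features), in the style of a visual encoder.
    """
    # Compute a log-mel spectrogram as the image-like encoder input.
    mel = librosa.feature.melspectrogram(y=segment_waveform, sr=sample_rate,
                                         n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)

    # Extract the global feature and the block (patch) features.
    global_feature, patch_features = audio_encoder(log_mel)
    return global_feature, patch_features
```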
Referring to fig. 7, after audio understanding has been performed on the audio to be processed by the audio processing method provided in the embodiment of the present application, the result output by the large language model may be input into the generative model together with the second prompt information for prompting the generative model to generate data of the target type, and the generative model generates the target data of the target type according to the input information, for example, generating an image for the audio understanding result of 3s-5.6s and generating a corresponding video for the audio understanding result of 5min30s-8min15s.
Referring to fig. 8, a block diagram of an audio processing apparatus 700 according to an embodiment of the application is shown. The audio processing apparatus 700 is applied to the above-described computer device, and the audio processing apparatus 700 includes: an audio segmentation module 710, a key frame identification module 720, an audio compression module 730, an information generation module 740, and an audio understanding module 750. The audio segmentation module 710 is configured to divide audio to be processed into a plurality of audio segments; the key frame identification module 720 is configured to perform key frame identification on the audio to be processed to obtain key frames in the audio to be processed; the audio compression module 730 is configured to compress the audio to be processed based on the key frames and the plurality of audio segments; the information generation module 740 is configured to generate first input information for inputting a large language model based on first prompt information and the audio to be processed after the compression processing, where the first prompt information is used to prompt the large language model to perform audio understanding; and the audio understanding module 750 is configured to input the first input information into the large language model to obtain an audio understanding result corresponding to the audio to be processed.
In some embodiments, the audio compression module 730 may be specifically configured to perform at least one of audio segment filtering and downsampling on the plurality of audio segments based on the key frame, where the audio segment filtering includes filtering out at least one audio segment of the plurality of audio segments.
In a possible implementation, the audio compression module 730 may be further specifically configured to determine a first audio segment of the plurality of audio segments that does not include the key frame; the first audio segment is filtered from the plurality of audio segments.
In a possible implementation, the audio compression module 730 may be further specifically configured to determine a second audio segment of the plurality of audio segments that includes at least one of the key frames; downsampling the second audio segment.
Optionally, the audio compression module 730 may be further specifically configured to divide the second audio segment based on the key frame in the second audio segment to obtain a plurality of audio sub-segments corresponding to the second audio segment; and downsampling each audio sub-segment in the plurality of audio sub-segments, wherein the sampling rate corresponding to the audio sub-segment in which the key frame is positioned is higher than the sampling rate corresponding to other audio sub-segments, and the other audio sub-segments are audio sub-segments in the plurality of audio sub-segments except the audio sub-segment in which the key frame is positioned.
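A hedged sketch of this key-frame-guided compression is given below; the frame representation, the keep-ratio values, and the splitting logic are assumptions made for illustration and are not the concrete parameters of the embodiment:

```python
def compress_segments(segments, key_frame_times, high_keep=1.0, low_keep=0.25):
    """Filter segments without key frames and downsample the remaining ones.

    segments:        list of dicts {"start": s, "end": e, "frames": [(t, data), ...]}
    key_frame_times: times in seconds of the identified key frames
    """
    retained = []
    for seg in segments:
        keys = sorted(t for t in key_frame_times if seg["start"] <= t < seg["end"])
        if not keys:
            # First audio segment: contains no key frame, filter it out entirely.
            continue

        # Second audio segment: split it at the key frames into sub-segments.
        bounds = [seg["start"]] + keys + [seg["end"]]
        for lo, hi in zip(bounds[:-1], bounds[1:]):
            sub = [f for f in seg["frames"] if lo <= f[0] < hi]
            # Sub-segments beginning at a key frame keep more frames, i.e. they
            # are downsampled at a higher sampling rate than the other sub-segments.
            keep = high_keep if lo in keys else low_keep
            step = max(1, round(1 / keep))
            retained.extend(sub[::step])
    return retained
```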
In some embodiments, the audio segmentation module 710 may be specifically configured to perform voice activity detection on the audio to be processed, so as to obtain a detection result; and dividing the audio to be processed into a plurality of audio fragments according to the detection result.
In a possible implementation manner, the audio segmentation module 710 may be further specifically configured to mark the audio to be processed with a plurality of target time stamps according to the detection result, where the audio between two adjacent target time stamps forms an audio segment, and the target time stamps are located at the starting audio frame and the cut-off audio frame of each audio segment in which a voice signal exists in the detection result.
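As one possible, hedged illustration of voice-activity-based time stamping, the sketch below uses the open-source webrtcvad package as the detector; the embodiment does not mandate a particular detector, and the frame length, aggressiveness, and 16-bit mono PCM input format are assumptions of this example:

```python
import webrtcvad

def mark_target_timestamps(pcm16_bytes, sample_rate=16000, frame_ms=30):
    """Return the target time stamps delimiting voiced audio segments."""
    vad = webrtcvad.Vad(2)  # aggressiveness from 0 (least) to 3 (most)
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 16-bit samples
    timestamps, in_speech = [], False

    for offset in range(0, len(pcm16_bytes) - frame_bytes + 1, frame_bytes):
        t = offset / 2 / sample_rate
        voiced = vad.is_speech(pcm16_bytes[offset:offset + frame_bytes], sample_rate)
        if voiced and not in_speech:
            timestamps.append(t)      # starting audio frame of a voiced segment
            in_speech = True
        elif not voiced and in_speech:
            timestamps.append(t)      # cut-off audio frame of the voiced segment
            in_speech = False

    if in_speech:  # close a segment that runs to the end of the audio
        timestamps.append(len(pcm16_bytes) / 2 / sample_rate)
    return timestamps
```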
In some embodiments, the information generating module 740 may be specifically configured to perform feature extraction on the audio to be processed after the compression processing, so as to obtain features to be input; and splicing the feature to be input with the first prompt information to obtain first input information for inputting a large language model.
In a possible implementation manner, the information generating module 740 may be further specifically configured to perform feature extraction on a spectrogram corresponding to each audio segment in the audio to be processed after the compression processing, so as to obtain global features of the spectrogram corresponding to each audio segment and block features corresponding to a plurality of image blocks; splicing the global features corresponding to the audio to be processed according to the sequence of the audio fragments in the audio to be processed to obtain first features; carrying out average pooling on the block features corresponding to the audio to be processed to obtain second features; and splicing the first feature and the second feature to obtain the feature to be input.
In some implementations, the audio processing apparatus 700 may also include a data generation module. The data generation module is configured to, after the first input information is input into the large language model to obtain the audio understanding result corresponding to the audio to be processed, input second prompt information and the audio understanding result into the generative model to obtain generated target data of a target type, where the second prompt information is used for prompting the generative model to generate data of the target type.
In some embodiments, the audio processing apparatus 700 may further include a duration obtaining module, where the duration obtaining module is configured to obtain the audio time length corresponding to the audio to be processed before the audio to be processed is divided into the plurality of audio segments; the audio segmentation module 710 may be specifically configured to divide the audio to be processed into a plurality of audio segments if the audio time length is longer than a target time length.
In a possible implementation manner, the information generating module 740 may be further configured to, after the audio time length corresponding to the audio to be processed is obtained, generate third input information for inputting a large language model based on the audio to be processed and the first prompt information if the audio time length is less than or equal to the target time length; the audio understanding module 750 may be further configured to input the third input information into the large language model to obtain the audio understanding result corresponding to the audio to be processed.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus and modules described above may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
In several embodiments provided by the present application, the coupling of the modules to each other may be electrical, mechanical, or other.
In addition, each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules.
In summary, according to the scheme provided by the application, the audio to be processed is divided into a plurality of audio fragments; key frame identification is performed on the audio to be processed to obtain the key frames in the audio to be processed; the audio to be processed is compressed based on the key frames and the plurality of audio fragments; first input information for inputting the large language model is generated based on the first prompt information for prompting the large language model to perform audio understanding and the audio to be processed after the compression processing; and the first input information is then input into the large language model to obtain the audio understanding result corresponding to the audio to be processed. In this way, when the audio to be processed is understood through the large language model, the input information for the large language model is determined from the audio compressed according to the identified key frames and the divided audio fragments, so that long-term audio does not need to be segmented and input into the large language model multiple times; the large language model can better understand the long-term audio, and the accuracy of long-term audio understanding is improved.
Referring to fig. 9, a block diagram of a computer device according to an embodiment of the application is shown. The computer device 100 may be a smart phone, tablet, smart watch, e-book, etc. capable of running applications. The computer device 100 of the present application may include one or more of the following components: a processor 110, a memory 120, and one or more applications, wherein the one or more applications may be stored in the memory 120 and configured to be executed by the one or more processors 110, the one or more applications configured to perform the method as described in the foregoing method embodiments.
Processor 110 may include one or more processing cores. The processor 110 uses various interfaces and lines to connect various parts of the entire computer device 100, and performs various functions of the computer device 100 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 120 and invoking data stored in the memory 120. Optionally, the processor 110 may be implemented in at least one hardware form of digital signal processing (Digital Signal Processing, DSP), field programmable gate array (Field-Programmable Gate Array, FPGA), and programmable logic array (Programmable Logic Array, PLA). The processor 110 may integrate one or a combination of a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is responsible for rendering and drawing display content; and the modem is used to handle wireless communication. It will be appreciated that the modem may also not be integrated into the processor 110 and may be implemented solely by a single communication chip.
The memory 120 may include a random access memory (Random Access Memory, RAM) or a read-only memory (Read-Only Memory, ROM). The memory 120 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 120 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, and an image playing function), instructions for implementing the foregoing method embodiments, and the like. The data storage area may also store data created by the computer device 100 in use (e.g., a phonebook, audio and video data, and chat log data).
Referring to fig. 10, a block diagram of a computer readable storage medium according to an embodiment of the present application is shown. The computer readable medium 800 has stored therein program code which can be invoked by a processor to perform the methods described in the method embodiments described above.
The computer readable storage medium 800 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. Optionally, the computer readable storage medium 800 comprises a non-volatile computer readable medium (non-transitory computer-readable storage medium). The computer readable storage medium 800 has storage space for program code 810 that performs any of the method steps described above. The program code can be read from or written to one or more computer program products. Program code 810 may be compressed, for example, in a suitable form.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will appreciate that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.
Claims (15)
1. A method of audio processing, the method comprising:
dividing audio to be processed into a plurality of audio fragments;
performing key frame identification on the audio to be processed to obtain key frames in the audio to be processed;
compressing the audio to be processed based on the key frame and the plurality of audio fragments;
generating first input information for inputting a large language model based on first prompt information and the audio to be processed after the compression processing, wherein the first prompt information is used for prompting the large language model to perform audio understanding;
and inputting the first input information into the large language model to obtain an audio understanding result corresponding to the audio to be processed.
2. The method of claim 1, wherein the compressing the audio to be processed based on the key frame and the plurality of audio segments comprises:
and performing at least one of audio segment filtering and downsampling on the plurality of audio segments based on the key frame, wherein the audio segment filtering comprises filtering at least one audio segment of the plurality of audio segments.
3. The method of claim 2, wherein the performing at least one of audio segment filtering and downsampling on the plurality of audio segments based on the key frame comprises:
determining a first audio segment of the plurality of audio segments that does not include the key frame;
the first audio segment is filtered from the plurality of audio segments.
4. The method of claim 2, wherein the performing at least one of audio segment filtering and downsampling on the plurality of audio segments based on the key frame comprises:
determining a second audio segment of the plurality of audio segments that includes at least one of the key frames;
downsampling the second audio segment.
5. The method of claim 4, wherein downsampling the second audio segment comprises:
dividing the second audio segment based on the key frame in the second audio segment to obtain a plurality of audio sub-segments corresponding to the second audio segment;
and downsampling each audio sub-segment in the plurality of audio sub-segments, wherein the sampling rate corresponding to the audio sub-segment in which the key frame is positioned is higher than the sampling rate corresponding to other audio sub-segments, and the other audio sub-segments are audio sub-segments in the plurality of audio sub-segments except the audio sub-segment in which the key frame is positioned.
6. The method of claim 1, wherein the dividing the audio to be processed into a plurality of audio segments comprises:
performing voice activity detection on the audio to be processed to obtain a detection result;
and dividing the audio to be processed into a plurality of audio fragments according to the detection result.
7. The method of claim 6, wherein dividing the audio to be processed into a plurality of audio segments according to the detection result comprises:
and marking a plurality of target time stamps on the audio to be processed according to the detection result, wherein the audio between two adjacent target time stamps forms an audio fragment, and the target time stamps are positioned at the starting audio frame and the ending audio frame of the audio fragment with the voice signal in the detection result.
8. The method according to any one of claims 1 to 7, wherein the generating first input information for inputting a large language model based on first prompt information and the audio to be processed after the compression processing comprises:
extracting characteristics of the audio to be processed after the compression processing to obtain characteristics to be input;
and splicing the feature to be input with the first prompt information to obtain first input information for inputting a large language model.
9. The method according to claim 8, wherein the performing feature extraction on the audio to be processed after the compression processing to obtain the feature to be input comprises:
extracting features of a spectrogram corresponding to each audio fragment in the audio to be processed after the compression processing to obtain global features of the spectrogram corresponding to each audio fragment and block features corresponding to a plurality of image blocks;
splicing the global features corresponding to the audio to be processed according to the sequence of the audio fragments in the audio to be processed to obtain first features;
carrying out average pooling on the block features corresponding to the audio to be processed to obtain second features;
and splicing the first feature and the second feature to obtain the feature to be input.
10. The method of any of claims 1-7, wherein after the inputting the first input information into the large language model to obtain the audio understanding result corresponding to the audio to be processed, the method further comprises:
and inputting second prompt information and the audio understanding result into a generative model to obtain generated target data of a target type, wherein the second prompt information is used for prompting the generative model to generate data of the target type.
11. The method of any of claims 1-7, wherein prior to the dividing the audio to be processed into the plurality of audio segments, the method further comprises:
acquiring the audio time length corresponding to the audio to be processed;
and if the audio time length is longer than the target time length, executing the step of dividing the audio to be processed into a plurality of audio fragments.
12. The method of claim 11, wherein after the obtaining the audio duration corresponding to the audio to be processed, the method further comprises:
if the audio time length is smaller than or equal to the target time length, generating third input information for inputting a large language model based on the audio to be processed and the first prompt information;
and inputting the third input information into the large language model to obtain an audio understanding result corresponding to the audio to be processed.
13. An audio processing apparatus, the apparatus comprising: an audio segmentation module, a key frame identification module, an audio compression module, an information generation module and an audio understanding module, wherein,
the audio segmentation module is used for dividing the audio to be processed into a plurality of audio fragments;
the key frame identification module is used for carrying out key frame identification on the audio to be processed to obtain key frames in the audio to be processed;
the audio compression module is used for compressing the audio to be processed based on the key frame and the plurality of audio fragments;
the information generation module is used for generating first input information for inputting a large language model based on first prompt information and the audio to be processed after the compression processing, and the first prompt information is used for prompting the large language model to understand the audio;
the audio understanding module is used for inputting the first input information into the large language model to obtain an audio understanding result corresponding to the audio to be processed.
14. A computer device, comprising:
one or more processors;
a memory;
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the method of any of claims 1-12.
15. A computer readable storage medium having stored therein program code which is callable by a processor to perform the method according to any one of claims 1-12.