CN117037774A - Model processing method, device, equipment and storage medium - Google Patents

Model processing method, device, equipment and storage medium

Info

Publication number
CN117037774A
CN117037774A (application CN202311069869.7A)
Authority
CN
China
Prior art keywords
feature representation
content
representation
module
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311069869.7A
Other languages
Chinese (zh)
Inventor
陈献钊
唐昌礼
于文一
孙广智
谭天
张超
李伟
卢璐
马泽君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Tsinghua University
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University, Beijing Youzhuju Network Technology Co Ltd filed Critical Tsinghua University
Priority to CN202311069869.7A
Publication of CN117037774A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/005: Language recognition
    • G10L15/08: Speech classification or search
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L15/26: Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present disclosure relate to a method, apparatus, device, and storage medium for model processing. The method includes: acquiring audio content and a prompt item; determining a first intermediate feature representation based on an audio feature representation of the audio content and a text feature representation of the prompt item; converting, using a converter module of a target model and based on a preset set of query feature representations, the first intermediate feature representation into a second intermediate feature representation, wherein the second intermediate feature representation corresponds to a portion of the query feature representations in the set of query feature representations; and generating content for responding to the prompt item based at least on the second intermediate feature representation. In this way, embodiments of the present disclosure can achieve cross fusion of information from different modalities and make the output of the model richer and more accurate.

Description

Model processing method, device, equipment and storage medium
Technical Field
Example embodiments of the present disclosure relate generally to the field of computers and, more particularly, to a method, apparatus, device, and storage medium for model processing.
Background
With the development of computer technology, machine learning and other technologies have been widely applied to various aspects of people's lives. One can utilize various types of models implemented based on machine learning to accomplish various types of tasks. In a typical application scenario, a person may direct a multimodal language model to generate desired content, for example, by providing information of multiple modalities (e.g., information of a text modality and information of an audio modality) to the multimodal language model. It is desirable to better train the multimodal language model to enhance the capabilities of the multimodal language model.
Disclosure of Invention
In a first aspect of the present disclosure, a model processing method is provided. The method includes: acquiring audio content and a prompt item; determining a first intermediate feature representation based on an audio feature representation of the audio content and a text feature representation of the prompt item; converting, using a converter module of a target model and based on a preset set of query feature representations, the first intermediate feature representation into a second intermediate feature representation, wherein the second intermediate feature representation corresponds to a portion of the query feature representations in the set of query feature representations; and generating content for responding to the prompt item based at least on the second intermediate feature representation.
In a second aspect of the present disclosure, an apparatus for model processing is provided. The apparatus comprises: an acquisition module configured to acquire audio content and a prompt item; a determining module configured to determine a first intermediate feature representation based on an audio feature representation of the audio content and a text feature representation of the prompt item; a conversion module configured to convert the first intermediate feature representation into a second intermediate feature representation using a converter module of a target model and based on a preset set of query feature representations, wherein the second intermediate feature representation corresponds to a portion of the query feature representations in the set of query feature representations; and a generation module configured to generate content for responding to the prompt item based at least on the second intermediate feature representation.
In a third aspect of the present disclosure, an electronic device is provided. The electronic device includes: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the electronic device to perform the method of the first aspect.
In a fourth aspect of the present disclosure, a computer-readable storage medium is provided. The computer readable storage medium has stored thereon a computer program executable by a processor to implement the method of the first aspect.
It should be understood that what is described in this section of the disclosure is not intended to limit key features or essential features of the embodiments of the disclosure, nor is it intended to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, wherein like or similar reference numerals denote like or similar elements, in which:
FIG. 1 illustrates a schematic diagram of an example environment in which embodiments of the present disclosure may be implemented;
FIG. 2 illustrates an example interactive interface, according to some embodiments of the present disclosure;
FIG. 3 illustrates a schematic diagram of an example structure of a target model, according to some embodiments of the present disclosure;
FIG. 4 illustrates a flow chart of a process for model processing according to some embodiments of the present disclosure;
FIG. 5 illustrates a block diagram of an apparatus for model processing, according to some embodiments of the present disclosure; and
fig. 6 illustrates a block diagram of an apparatus capable of implementing various embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure have been illustrated in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather, these embodiments are provided so that this disclosure will be more thorough and complete. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
In describing embodiments of the present disclosure, the term "comprising" and its variants should be taken to be open-ended, i.e., "including, but not limited to". The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The term "some embodiments" should be understood as "at least some embodiments". Other explicit and implicit definitions may also be included below.
In this context, unless explicitly stated otherwise, performing a step "in response to a" does not mean that the step is performed immediately after "a", but may include one or more intermediate steps.
It will be appreciated that the data (including but not limited to the data itself, the acquisition, use, storage or deletion of the data) involved in the present technical solution should comply with the corresponding legal regulations and the requirements of the relevant regulations.
It will be appreciated that prior to using the technical solutions disclosed in the embodiments of the present disclosure, the relevant users, which may include any type of rights subjects, such as individuals, enterprises, groups, etc., should be informed and authorized by appropriate means of the types of information, usage ranges, usage scenarios, etc. involved in the present disclosure according to relevant legal regulations.
For example, in response to receiving an active request from a user, prompt information is sent to the relevant user to explicitly prompt the relevant user that the operation requested to be performed will need to obtain and use information to the relevant user, so that the relevant user may autonomously select whether to provide information to software or hardware such as an electronic device, an application program, a server, or a storage medium that performs the operation of the technical solution of the present disclosure according to the prompt information.
As an alternative but non-limiting implementation manner, in response to receiving an active request from a relevant user, the prompt information may be sent to the relevant user, for example, in a popup window, where the prompt information may be presented in a text manner. In addition, a selection control for the user to select to provide information to the electronic device in a 'consent' or 'disagreement' manner can be carried in the popup window.
It will be appreciated that the above-described notification and user authorization process is merely illustrative and not limiting of the implementations of the present disclosure, and that other ways of satisfying relevant legal regulations may be applied to the implementations of the present disclosure.
As used herein, a "model" may learn associations between inputs and outputs from training data, so that, after training is completed, a corresponding output can be generated for a given input. The generation of the model may be based on machine learning techniques. Deep learning is a machine learning algorithm that processes inputs and provides corresponding outputs through the use of multiple layers of processing units. A neural network model is one example of a deep learning-based model. The "model" may also be referred to herein as a "machine learning model," "machine learning network," or "learning network," and these terms are used interchangeably herein.
As mentioned previously, it is desirable to better train the multimodal language model to enhance its capabilities. Conventionally, when training a multimodal language model, the key is to align features of other modalities to the text modality so as to better exploit the capabilities of the language model. However, during training, due to problems such as data labeling, the model is prone to overfitting, and capabilities beyond the training set are difficult to emerge. This manifests, for example, as the language model failing to answer according to the instruction, the output of the language model not being rich enough, and so on.
Embodiments of the present disclosure provide an improved scheme for model processing. According to various embodiments of the present disclosure, audio content and a prompt item are obtained, and a first intermediate feature representation is determined based on an audio feature representation of the audio content and a text feature representation of the prompt item. Further, the first intermediate feature representation is converted into a second intermediate feature representation using a converter module of a target model and based on a preset set of query feature representations. Content for responding to the prompt item is then generated based at least on the second intermediate feature representation. In this way, embodiments of the present disclosure can achieve cross fusion of information from different modalities, apply regularization to the model, and make the output of the model richer and more accurate.
Some example embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Example Environment
FIG. 1 illustrates a schematic diagram of an example environment 100 in which embodiments of the present disclosure may be implemented. As shown in fig. 1, environment 100 may include a terminal device 115, which terminal device 115 may be any suitable electronic device associated with a user.
The terminal device 115 may obtain the audio content 105 and the prompt item 110. The audio content 105 may include an audio file uploaded by a user or audio content recorded by the user through an audio acquisition device (e.g., a microphone). Accordingly, the prompt item 110 (also referred to as a prompt word or guide word) may be text content entered by the user into the terminal device 115. Alternatively, the user may also enter voice content, which the terminal device 115 converts into a text-form prompt item.
Further, the terminal device 115 can interact with the electronic device 120 to process the audio content 105 and the prompt item 110 using a target model 125 deployed in the electronic device 120. In some embodiments, such a target model 125 may be implemented based on machine learning techniques. Specific implementation details of the target model 125 will be described below with reference to FIG. 3 and are not repeated here. Further, the electronic device 120 may include a single electronic device or multiple separately or centrally deployed electronic devices for deploying the target model 125. The present disclosure is not intended to limit the form of the electronic device 120.
Further, the target model 125 can generate content 130 based on the audio content 105 and the prompt item 110 and transmit the content 130 to the terminal device 115. Accordingly, the terminal device 115 may provide the content 130 to the user, for example, through an interactive interface. In some embodiments, the content 130 may include, for example, text content. Alternatively or additionally, the content 130 may also include suitable types of media content, such as audio content.
In some embodiments, the terminal device 115 may be any type of mobile terminal, fixed terminal, or portable terminal, including a mobile handset, desktop computer, laptop computer, notebook computer, netbook computer, tablet computer, media computer, multimedia tablet, Personal Communication System (PCS) device, personal navigation device, Personal Digital Assistant (PDA), audio/video player, digital camera/video camera, television receiver, radio broadcast receiver, electronic book device, gaming device, or any combination of the preceding, including accessories and peripherals for these devices, or any combination thereof. In some embodiments, the terminal device 115 is also capable of supporting any type of user interface (such as "wearable" circuitry). The electronic device 120 may be any type of computing system/server capable of providing computing capabilities, including but not limited to mainframes, edge computing nodes, computing devices in a cloud environment, and so forth.
It should be understood that the structure and function of the various elements in environment 100 are described for illustrative purposes only and are not meant to suggest any limitation as to the scope of the disclosure.
Example interface
As discussed with reference to FIG. 1, the terminal device 115 may utilize an interactive interface, for example, to obtain the audio content 105 and the prompt item 110 and, accordingly, to provide the content 130 generated by the target model 125. FIG. 2 illustrates an example interface 200, which may be provided by, for example, the terminal device 115, in accordance with some embodiments of the present disclosure.
As shown in fig. 2, interface 200 may include controls 220 and 230 for inputting audio content. Illustratively, a user may invoke the audio collection device of the terminal device 115 by triggering the control 220 and may obtain audio content 210 in real-time for a period of time. As another example, the user may also re-enter a new piece of audio content, for example, by triggering control 230.
Alternatively or additionally, in addition to utilizing the audio collection device to obtain real-time audio, the user may upload the audio file, for example, through appropriate manipulation (e.g., dragging, etc.).
In some embodiments, the user may be allowed to enter only a piece of audio content or upload a piece of audio file, for example. Alternatively or additionally, the user may be allowed to upload pieces of audio content, for example. In some embodiments, the total duration of the input audio content may be limited.
In addition, as shown in FIG. 2, the interface 200 may also include an input control 250 for receiving a prompt item entered by the user. Illustratively, the user may enter a prompt item by typing in the input control 250. Alternatively, the user may also enter voice, which the terminal device converts into a text prompt item.
In some embodiments, the interface 200 may also include a parameter adjustment control 240 for adjusting the generation behavior of the target model. Such a parameter adjustment control may be used, for example, to control content generation by the target model. For example, such parameters may include a "temperature" parameter that controls the randomness and creativity of the content generated by the target model.
Illustratively, when the user completes entering the prompt item via the input control 250 and has entered the corresponding audio content 210, the terminal device 115 can utilize the target model 125 to generate the corresponding content.
Taking FIG. 2 as an example, the entered prompt item 261 may be presented in the content area 260. The content area 260 may present content 262 generated by the target model 125 from the prompt item 261 and the audio content 210.
In some embodiments, such content 262 may include text content that may be presented in the content region 260 in a dialog-like fashion.
In some embodiments, such content 262 may include appropriate media content that may be presented or played in the content area 260. For example, where the content 262 is audio content, the terminal device 115 may play the generated audio content automatically or in response to a user trigger.
Based on the example interface discussed above, embodiments of the present disclosure can allow a user to input multimodal information for interacting with the target model, thereby improving the information analysis and understanding capabilities of the target model and enhancing the quality of content generation.
Example model architecture
An example model architecture according to some embodiments of the present disclosure is described below with reference to FIG. 3. FIG. 3 illustrates an example architecture 300 of the target model 125 according to some embodiments of the disclosure. As shown in FIG. 3, the architecture 300 of the target model 125 includes an audio encoding module 320, a text encoding module 350, a converter module 370, a language processing module 391, and a fine-tuning module 392.
The audio content 310 entered by the user is provided to the audio encoding module 320. The audio encoding module 320 is configured to encode the audio content 310 to output an audio feature representation 330 corresponding to the audio content 310. It should be appreciated that where multiple pieces of audio content 310 are included, they may be spliced and then encoded, or encoded separately and then spliced.
Similar to the audio encoding module 320, the prompt item 340 entered by the user is provided to the text encoding module 350. The text encoding module 350 is configured to encode the prompt item 340 to output a text feature representation 360 corresponding to the prompt item 340. It should be appreciated that any suitable encoding model and/or encoder may be employed to implement the audio encoding module 320 and the text encoding module 350, and this disclosure is not intended to be limiting.
Because the audio feature representation 330 and the text feature representation 360 correspond to information of two different modalities (i.e., a text modality and an audio modality), they often have different dimensions. In some embodiments, the audio feature representation 330 output by the audio encoding module 320, or the text feature representation 360 output by the text encoding module 350, may be further processed so that the audio feature representation 330 and the text feature representation 360 have the same dimension for ease of subsequent processing. In particular, the text feature representation 360 may be projected to a feature dimension corresponding to the audio feature representation 330, so that the projected text feature representation 360 has the same feature dimension as the audio feature representation 330.
Further, a first intermediate feature representation to be input to the converter module 370 may be determined based on the audio feature representation 330 and the projected text feature representation 360. For example, the first intermediate feature representation may be determined by concatenating the audio feature representation 330 and the projected text feature representation 360. The first intermediate feature representation is then provided to the converter module 370. In some embodiments, a set of preset query feature representations 371 is provided to the converter module 370 together with the first intermediate feature representation. Such a set of preset query feature representations 371 may be determined based on the training process described below. In some scenarios, the set of preset query feature representations 371 may include a predetermined number of query feature representations.
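For illustration only, the following sketch (in PyTorch, with all module names, dimensions, and hyperparameters being assumptions rather than values from this disclosure) shows one plausible way to project the text feature representation into the audio feature dimension and concatenate the two into a first intermediate feature representation:

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Builds the first intermediate feature representation (illustrative sketch)."""

    def __init__(self, text_dim: int = 768, audio_dim: int = 1024):
        super().__init__()
        # Linear projection that maps text features into the audio feature dimension.
        self.text_proj = nn.Linear(text_dim, audio_dim)

    def forward(self, audio_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, T_audio, audio_dim); text_feats: (batch, T_text, text_dim).
        projected_text = self.text_proj(text_feats)              # (batch, T_text, audio_dim)
        # Concatenate along the sequence dimension to form the first intermediate representation.
        return torch.cat([audio_feats, projected_text], dim=1)   # (batch, T_audio + T_text, audio_dim)
```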
The converter module 370 is configured to convert the first intermediate feature representation into a second intermediate feature representation 380 using the set of preset query feature representations 371. It should be noted that the second intermediate feature representation 380 produced during the training process of the model differs from the second intermediate feature representation produced during the application process of the model, as described in more detail below. It should be appreciated that the converter module 370 may be implemented with an appropriate converter model, examples of which may include, but are not limited to, a Q-Former, a linear Transformer, and the like.
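As a rough illustration of such a converter, the sketch below implements a single Q-Former-style cross-attention block in which a fixed number of learnable query feature representations attend to the first intermediate feature representation; the layer count, dimensions, and initialization are assumptions and not the disclosure's exact design:

```python
import torch
import torch.nn as nn

class QueryConverter(nn.Module):
    """Q-Former-style converter: learnable queries cross-attend to the fused features (sketch)."""

    def __init__(self, dim: int = 1024, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        # The preset set of query feature representations, learned during training.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # fused: (batch, T, dim), the first intermediate feature representation.
        q = self.queries.unsqueeze(0).expand(fused.size(0), -1, -1)
        attended, _ = self.cross_attn(q, fused, fused)    # each query attends to the fused sequence
        return attended + self.ffn(attended)              # (batch, num_queries, dim)
```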
In some embodiments, the second intermediate feature representation 380 output by the converter module 370 may be provided to the language processing module 391. The language processing module 391 is configured to generate target content for the audio content 310 and the prompt item 340 based at least on the second intermediate feature representation 380. Illustratively, the language processing module 391 may include any suitable language model, which the present disclosure is not intended to limit.
In some embodiments, the target model may generate a first input for input to the language processing module 391 based on the second intermediate feature representation 380 and the prompt item 340. Illustratively, one or more tokens corresponding to the prompt item 340 may be determined, and the second intermediate feature representation may be combined with the one or more tokens to generate the first input for the language processing module 391. The language processing module 391 may generate the target content based at least on the first input.
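A minimal sketch of assembling such a first input is given below; it assumes a Hugging Face-style language model exposing an input-embedding table, and assumes the query outputs have already been projected to the language model's hidden size:

```python
import torch

def build_first_input(second_feats: torch.Tensor, prompt_ids: torch.Tensor, llm) -> torch.Tensor:
    """Combine the query outputs with prompt-token embeddings to form the first input (sketch).

    second_feats: (batch, num_queries, llm_dim) second intermediate feature representation,
                  assumed already projected to the language model's hidden size.
    prompt_ids:   (batch, T_prompt) token ids of the prompt item.
    llm:          a Hugging Face-style causal language model with an input-embedding table.
    """
    prompt_embeds = llm.get_input_embeddings()(prompt_ids)   # (batch, T_prompt, llm_dim)
    # Prepend the multimodal features so the prompt tokens can attend to them.
    return torch.cat([second_feats, prompt_embeds], dim=1)
```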
Alternatively or additionally, in some embodiments, the input information provided to the language processing module 391 may also include a second input. The second input here may include, for example, a set of fine-tuning parameters and a weight parameter associated with the fine-tuning module 392 of the target model 125. The fine-tuning module 392 may be configured to reduce the training cost of the model and to accelerate the convergence of the model. Illustratively, the fine-tuning module 392 may include any suitable fine-tuning model such as LoRA, P-tuning, and the like, which the present disclosure is not intended to limit.
The set of fine-tuning parameters here is the set of parameters obtained after the target model 125 is trained. The weight parameter may be used to control the degree to which the set of fine-tuning parameters affects the language processing module 391. Taking a LoRA model as an example of the fine-tuning module, such a weight parameter may include, for example, lora_alpha.
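For reference, in a typical LoRA-style layer the weight parameter enters as a scaling factor on the low-rank update, so lora_alpha directly controls how strongly the adapter perturbs the frozen language model. The following is a generic sketch of that mechanism, not the disclosure's specific implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a low-rank update scaled by lora_alpha (generic sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 8, lora_alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # the base weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)        # the low-rank update starts at zero
        self.scaling = lora_alpha / rank          # the weight parameter controls the adapter's influence

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))
```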
Further, the language processing module 391 may generate the target content based on the first input and the second input.
The training process of the target model 125 will be described below. In some embodiments, the audio encoding module 320 and the language processing module 391 in the target model 125 may be pre-trained models whose parameters remain fixed during training. In some embodiments, training of the target model 125 may be accomplished through a two-round training process.
During the first round of training, an initial set of query feature representations may be determined, and the target model 125 may be trained with a first set of training data while parameters other than the query feature representations 371 and the converter module 370 are kept fixed. In some embodiments, the first set of training data may include a first set of audio samples corresponding to a speech recognition task.
Illustratively, during the first round of training, the prompt item 340 may not be provided; instead, only the audio samples are input into the target model 125 to perform the speech recognition task, and the parameters of the converter module and the set of query feature representations are adjusted via a loss function such as cross entropy, based on comparing the output text of the language processing module 391 with the labeled data for the set of audio samples. After the first round of training is completed, the adjusted query feature representations may be referred to as a set of intermediate query feature representations, and the adjusted parameters of the converter module 370 as a first set of parameters.
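The first round can be pictured as follows; the sketch assumes a composite model object with hypothetical audio_encoder, converter, and language_model attributes (the latter a Hugging Face-style causal language model), and it omits batching, alignment, and optimizer details:

```python
def setup_first_round(model):
    """Round one: freeze everything except the converter module and its query representations."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.converter.parameters():          # assumed to include the learnable query set
        p.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]


def first_round_loss(model, batch):
    """Cross-entropy speech-recognition loss; no prompt item is used in this round (sketch)."""
    audio_feats = model.audio_encoder(batch["audio"])       # frozen audio encoding module
    second_feats = model.converter(audio_feats)             # queries attend to audio only
    # Assumes a Hugging Face-style causal LM accepting inputs_embeds and labels,
    # with labels padded/aligned to the input length and -100 on ignored positions.
    out = model.language_model(inputs_embeds=second_feats, labels=batch["labels"])
    return out.loss
```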
Further, a second round of training may be performed using a second set of training data. The second set of training data here may include a second set of audio samples and a set of training prompt items corresponding to speech processing tasks. That is, unlike the first round, the second round of training provides training prompt items to the target model 125 for full-pipeline training. In particular, the set of intermediate query feature representations, the first set of parameters of the converter module 370, and a second set of parameters of the text encoding module 350 may be updated with the second set of training data. Where the fine-tuning module 392 is utilized, the second round of training may also jointly update the parameters of the fine-tuning module 392. It should be noted that during the training phase of the model, the second intermediate feature representation 380 output by the converter module 370 is a feature representation corresponding to all query feature representations in the set of query feature representations, and the weight parameter in the fine-tuning module 392 may be a settable hyperparameter, i.e., it does not vary over the training iterations.
Similar to the first round of training, the parameters of the target model that are not fixed may be adjusted via a loss function such as cross entropy, based on comparing the output text of the language processing module 391 with the labeled data for the set of audio samples.
In some embodiments, the speech processing tasks in the second round of training may also be richer than in the first round. Illustratively, the speech processing tasks in the second round of training may include, but are not limited to: speech recognition tasks, speech translation tasks, speech question-answering tasks, phoneme recognition tasks, audio captioning tasks, and the like.
Therefore, the main parameters of the target model can first be fixed through the speech recognition task, and all parameters of the target model can then be jointly determined through a fine-tuning process with richer task types, which can improve the efficiency and accuracy of model training.
Furthermore, because the parameters of language processing module 391 during the training process remain fixed, embodiments of the present disclosure are also able to take advantage of the divergent capabilities of language processing module 391 to support certain types of tasks that are not trained. For example, while the second training process may not include training data corresponding to a story generation scenario, processing of such tasks may be supported based on the divergent capabilities of language processing module 391.
However, despite the somewhat divergent capabilities of language processing module 391, it may still suffer from model overfitting, which may make it difficult for the target model to respond effectively to task types outside of the training dataset.
In some embodiments, during an application phase (also referred to as a testing phase) of the target model, the trained target model may utilize the audio encoding module 320 and the text encoding module 350 to determine feature representations of the input audio content 310 and the prompt item 340, respectively, and generate the corresponding first intermediate feature representation. The trained model may utilize the converter module 370 to output the second intermediate feature representation 380 based on the first intermediate feature representation and the set of preset query feature representations 371. The trained model may in turn utilize the language processing module 391 to process the second intermediate feature representation 380 to generate the target content.
In some embodiments, to better exercise the capability of the trained model, regularization may be introduced during the testing of the target model 125. In some embodiments, regularization may be introduced by having the second intermediate feature representation 380 correspond to only a portion of the query feature representations in the set of preset query feature representations 371.
In some embodiments, unlike the training phase of the target model, during the application phase the converter module 370 may output a second intermediate feature representation 380 corresponding to only a portion of the query feature representations in the set of preset query feature representations 371, rather than to all of the query feature representations as in the training phase. In this way, the output of the model can be made richer and more accurate through strong regularization.
In particular, the converter module 370 may, for example, determine a set of similarities between the first intermediate feature representation and the set of query feature representations. The similarity here may also be understood as a set of attention values (also referred to as cross attention) between the first intermediate feature representation and the set of preset query feature representations 371. The converter module 370 may determine the set of similarities, for example, through an attention mechanism. It is to be appreciated that the converter module 370 can separately determine a similarity between the first intermediate feature representation and each query feature representation in the set of preset query feature representations 371. That is, the similarities in the resulting set are in one-to-one correspondence with the query feature representations in the set of preset query feature representations 371.
The converter module 370 can determine a target query feature representation from a plurality of query feature representations in response to determining that the set of preset query feature representations 371 includes a plurality of query feature representations whose similarity is greater than a threshold. The threshold here may be predetermined by the user or may be determined by the target model 125 and/or the electronic device 120 itself. In some embodiments, to avoid information redundancy, the converter module 370 may determine only one target query feature representation from among the plurality of query feature representations whose similarity is greater than the threshold, regardless of how many such query feature representations there are in the set of preset query feature representations 371.
For example, if only one query feature representation with a similarity greater than the threshold exists in the set of preset query feature representations 371, the converter module 370 may directly determine that query feature representation as the target query feature representation. If there are multiple query feature representations with a similarity greater than the threshold in the set of preset query feature representations 371, the converter module 370 may determine one of them as the target query feature representation in a predetermined manner. The determination manner here may include a random manner (i.e., randomly determining the target query feature representation from the plurality of query feature representations), an ordered manner (e.g., sorting in descending order of similarity and preferentially selecting the query feature representation corresponding to the largest similarity), and so forth. The present disclosure is not intended to limit the specific strategy by which the target query feature representation is determined.
The converter module 370 in turn converts the first intermediate feature representation into the second intermediate feature representation 380 based on the target query feature representation. The second intermediate feature representation 380 thus obtained may include a portion corresponding to the target query feature representation and exclude portions corresponding to the other query feature representations, among the plurality of query feature representations, whose similarity is greater than the threshold. That is, the portions corresponding to those other query feature representations are discarded accordingly.
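The selection logic at inference time might look like the sketch below, which computes similarities between a pooled first intermediate feature representation and each preset query, then keeps a single query among those above the threshold; the pooling choice, threshold value, and tie-breaking strategy are assumptions:

```python
import torch
import torch.nn.functional as F

def select_target_query(fused: torch.Tensor, queries: torch.Tensor, threshold: float = 0.5):
    """Pick one target query representation among those whose similarity exceeds the threshold."""
    # fused: (T, dim) first intermediate feature representation; queries: (num_queries, dim).
    pooled = fused.mean(dim=0)                                          # crude summary of the fused features
    sims = F.cosine_similarity(queries, pooled.unsqueeze(0), dim=-1)   # one similarity per query
    above = (sims > threshold).nonzero(as_tuple=True)[0]
    if above.numel() == 0:
        return None                                                     # no query exceeds the threshold
    # Keep only one of the queries above the threshold; the highest-scoring one is chosen here,
    # but a random choice would be an equally valid strategy per the description above.
    return above[sims[above].argmax()].item()
```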
In other embodiments, the converter module 370 can also convert the first intermediate feature representation into a third intermediate feature representation based on the set of preset query feature representations 371. Specifically, the converter module 370 may process the first intermediate feature representation directly based on the set of preset query feature representations 371 to obtain the converted third intermediate feature representation. The converter module 370 may then process the third intermediate feature representation based on a preset mask to determine the second intermediate feature representation 380.
For example, if the third intermediate feature representation includes 32 feature representations, the converter module 370 may mask the 32 feature representations (e.g., mask 16 of the 32 feature representations) based on a preset mask to obtain a second intermediate feature representation 380 that includes the mask. The second intermediate feature representation 380 thus obtained may preserve portions of the third intermediate feature representation that correspond to portions of the query feature representation.
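A minimal sketch of this masking variant is shown below; the mask pattern (keeping the first half of the query outputs) is an assumed example, and zeroing rather than removing the masked positions is one of several plausible readings:

```python
import torch

def mask_query_outputs(third_feats: torch.Tensor, keep: int = 16) -> torch.Tensor:
    """Retain only part of the query outputs of the third intermediate representation (sketch)."""
    # third_feats: (batch, num_queries, dim), e.g. num_queries = 32.
    num_queries = third_feats.size(1)
    mask = torch.zeros(num_queries, device=third_feats.device)
    mask[:keep] = 1.0                      # preset mask: keep the first `keep` query outputs
    # Masked positions are suppressed; slicing them out entirely would be an alternative reading.
    return third_feats * mask.view(1, -1, 1)
```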
In this way, embodiments of the present disclosure may improve the divergent ability of a language processing model by introducing regularization in generating target feature representations corresponding to multimodal data, which may in turn alleviate the problem of model overfitting.
Alternatively or additionally, in some embodiments, strong regularization may also be introduced into the target model 125 by adjusting the weight parameter of the fine-tuning module 392. In this case, the weight parameter in the second input of the application phase (which may be referred to as a first weight parameter) differs from the weight parameter in the second input of the training phase (which may be referred to as a second weight parameter). Illustratively, the influence of the set of fine-tuning parameters from the fine-tuning module 392 on the language processing module 391 may be reduced by making the value of the first weight parameter smaller than the value of the second weight parameter.
In some embodiments, the architecture 300 may also include a weight prediction module (not shown). In the testing phase of the target model, the weight prediction module may be configured to determine the first weight parameter corresponding to the prompt item 340 based on the prompt item 340. The weight prediction module is trained based on a set of training prompt items and corresponding training weight parameters, so that it can output a first weight parameter that matches the prompt item, thereby alleviating the overfitting problem of the target model.
In still other embodiments, the language processing module 391 may also generate a corresponding set of candidate content based on a set of candidate weight parameters. Taking lora_alpha as an example, the set of candidate weight parameters here may be, for example, -4, 0, 4, ..., 48. Further, evaluation information for the set of candidate content may be determined based at least on a correlation between the set of candidate content and the prompt item 340. It should be noted that the evaluation information may be determined by the language processing module 391 or by another language processing module or language model.
Such evaluation information may be used to indicate the correlation between the set of candidate content and the prompt item 340. For example, the following prompt may be constructed for a first candidate content in the set of candidate content and the prompt item 340: please determine the correlation between this content and the prompt item, where the correlation indicates whether the content correctly follows the prompt item and whether the content is rich. Such a prompt may be provided, for example, to the language processing module 391 to obtain the evaluation information for the set of candidate content.
Further, the content for responding to the prompt item may be determined from the set of candidate content based on the evaluation information of the set of candidate content. For example, the target candidate content with the highest correlation may be selected from the set of candidate content as the response content. Correspondingly, the candidate weight parameter corresponding to the target candidate content is the first weight parameter.
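Putting the weight sweep together, the following sketch generates one candidate per lora_alpha value, scores each candidate's relevance to the prompt item with a language model, and returns the best-rated one; the helper methods set_lora_alpha, generate, and language_model_rate, as well as the rating scale, are hypothetical:

```python
def generate_with_weight_sweep(model, audio, prompt, candidate_alphas=range(-4, 49, 4)):
    """Pick the response whose candidate weight parameter yields the best-rated content (sketch)."""
    candidates = []
    for alpha in candidate_alphas:                 # e.g. -4, 0, 4, ..., 48
        model.set_lora_alpha(alpha)                # hypothetical helper on the target model
        candidates.append((alpha, model.generate(audio, prompt)))

    def rate(text: str) -> float:
        # Hypothetical call: ask a language model to score how well the text follows the
        # prompt item and how rich it is (e.g. on a 0-10 scale).
        rating_prompt = (
            f"Prompt: {prompt}\nResponse: {text}\n"
            "Rate from 0 to 10 how well the response follows the prompt and how rich it is."
        )
        return float(model.language_model_rate(rating_prompt))

    best_alpha, best_text = max(candidates, key=lambda pair: rate(pair[1]))
    return best_alpha, best_text
```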
In this way, embodiments of the present disclosure can improve the divergent capability of the language processing module by adjusting the weight of the fine-tuning model, so as to support the target model in processing tasks different from those in the training data.
Based on the manner discussed above, the target model 125 can support performing, during the testing phase, target tasks that differ from the training tasks of the training phase. Illustratively, during the training phase of the model, the target model 125 is trained based on a set of training data associated with speech processing tasks. As introduced above, such a set of training data may correspond to a speech recognition task, a speech translation task, a speech question-answering task, a phoneme recognition task, an audio captioning task, and so forth. Further, during the testing phase of the model, the target model 125 may, for example, perform an audio-based story generation task, which is not included in the task types corresponding to the set of training data.
In this way, embodiments of the present disclosure can achieve cross fusion of information from different modalities and improve the processing capability of the model, thereby improving the quality of content generation.
Example procedure
Fig. 4 illustrates a flow chart of a process 400 for model processing according to some embodiments of the present disclosure. The process 400 may be implemented at the electronic device 120. The process 400 is described below with reference to fig. 4.
At block 410, the electronic device 120 obtains audio content and a prompt item.
At block 420, the electronic device 120 determines a first intermediate feature representation based on an audio feature representation of the audio content and a text feature representation of the prompt item.
At block 430, the electronic device 120 converts the first intermediate feature representation into a second intermediate feature representation using a converter module of the target model and based on a preset set of query feature representations, wherein the second intermediate feature representation corresponds to a portion of the query feature representations in the set of query feature representations.
At block 440, the electronic device 120 generates content for responding to the prompt item based at least on the second intermediate feature representation.
In some embodiments, the prompt item is used to instruct a target model to perform a target task based on the audio content, the target model being trained based on a set of training data, and the target task being different from a set of training tasks corresponding to the set of training data.
In some embodiments, converting the first intermediate feature representation to the second intermediate feature representation includes: determining a set of similarities between the first intermediate feature representation and a set of query feature representations; responsive to determining that the set of query feature representations includes a plurality of query feature representations having a similarity greater than a threshold, determining a target query feature representation from the plurality of query feature representations; and converting the first intermediate feature representation into a second intermediate feature representation such that the second intermediate feature representation includes portions corresponding to the target query feature representation and does not include portions corresponding to other query feature representations of the plurality of query feature representations.
In some embodiments, converting the first intermediate feature representation to the second intermediate feature representation includes: converting the first intermediate representation into a third intermediate representation based on a set of query feature representations; and processing the third intermediate feature representation based on a preset mask to determine the second intermediate feature representation to preserve portions of the third intermediate feature representation corresponding to the partial query feature representation.
In some embodiments, generating content for responding to the prompt item based at least on the second intermediate feature representation includes: generating a first input for input to a language processing module based on the second intermediate feature representation and the prompt item; and generating, by the language processing module, content for responding to the prompt item based at least on the first input.
In some embodiments, generating, by the language processing module, content for responding to the prompt item based at least on the first input includes: obtaining a second input for input to the language processing module, the second input comprising a set of fine-tuning parameters and a first weight parameter associated with a fine-tuning module of the target model; and generating, by the language processing module, content for responding to the prompt item based on the first input and the second input.
In some embodiments, the first weight parameter is different from a second weight parameter corresponding to a training process of the target model.
In some embodiments, generating, by the language processing module, content for responding to the prompt item based on the first input and the second input includes: generating a corresponding set of candidate content based on a set of candidate weight parameters; determining evaluation information for the set of candidate content based at least on a correlation between the set of candidate content and the prompt item; and determining the content for responding to the prompt item from the set of candidate content based on the evaluation information of the set of candidate content.
In some embodiments, obtaining a second input for input to the language processing module comprises: determining, using a weight prediction module, a first weight parameter corresponding to the prompt item based on the prompt item, wherein the weight prediction module is trained based on a set of training prompt items and corresponding training weight parameters.
In some embodiments, determining the first intermediate feature representation based on the audio feature representation of the audio content and the text feature representation of the prompt item comprises: projecting the text feature representation to a feature dimension corresponding to the audio feature representation; and determining the first intermediate feature representation based on the audio feature representation and the projected text feature representation.
In some embodiments, the target model is trained based at least on the following process: determining an initial set of query feature representations; and training the target model with a first set of training data, with parameters of the target model other than the converter module and the query feature representations fixed, to determine a first set of parameters of the converter module and a set of intermediate query feature representations.
In some embodiments, the first set of training data includes a first set of audio samples corresponding to a speech recognition task.
In some embodiments, the target model further comprises a text encoding module for generating the text feature representation of the prompt item, and the target model is further trained based on: training the target model using a second set of training data to update at least the first set of parameters of the converter module, the set of intermediate query feature representations, and a second set of parameters of the text encoding module.
In some embodiments, the second set of training data includes a second set of audio samples and a set of training prompt items corresponding to speech processing tasks.
In some embodiments, the speech processing tasks include at least one of: speech recognition task, speech translation task, speech question-answer task, phoneme recognition task, audio subtitle task.
Example apparatus and apparatus
Fig. 5 illustrates a schematic block diagram of an apparatus 500 for model processing according to some embodiments of the present disclosure. The apparatus 500 may be implemented as or included in the electronic device 120. The various modules/components in apparatus 500 may be implemented in hardware, software, firmware, or any combination thereof.
As shown in FIG. 5, the apparatus 500 includes an acquisition module 510 configured to acquire audio content and a prompt item. The apparatus 500 further comprises a determination module 520 configured to determine a first intermediate feature representation based on an audio feature representation of the audio content and a text feature representation of the prompt item. The apparatus 500 further comprises a conversion module 530 configured to convert the first intermediate feature representation into a second intermediate feature representation using a converter module of a target model and based on a preset set of query feature representations, wherein the second intermediate feature representation corresponds to a portion of the query feature representations in the set of query feature representations. The apparatus 500 further comprises a generation module 540 configured to generate content for responding to the prompt item based at least on the second intermediate feature representation.
In some embodiments, the prompt item is used to instruct the target model to perform a target task based on the audio content, the target model being trained based on a set of training data, and the target task being different from a set of training tasks corresponding to the set of training data.
In some embodiments, the conversion module 530 includes: a similarity determination module configured to determine a set of similarities between the first intermediate feature representation and a set of query feature representations; a target feature representation determination module configured to determine a target query feature representation from a plurality of query feature representations in response to determining that the set of query feature representations includes a plurality of query feature representations having a similarity greater than a threshold; and a first conversion module configured to convert the first intermediate feature representation into a second intermediate feature representation such that the second intermediate feature representation includes portions corresponding to the target query feature representation and does not include portions corresponding to other query feature representations of the plurality of query feature representations.
In some embodiments, the conversion module 530 includes: a second conversion module configured to convert the first intermediate representation into a third intermediate representation based on a set of query feature representations; and a mask processing module configured to process the third intermediate feature representation based on a preset mask to determine the second intermediate feature representation to preserve portions of the third intermediate feature representation corresponding to the partial query feature representation.
In some embodiments, the generation module 540 includes: a first input generation module configured to generate a first input for input to a language processing module based on the second intermediate feature representation and the prompt item; and a first generation module configured to generate, by the language processing module, content for responding to the prompt item based at least on the first input.
In some embodiments, the first generation module comprises: a second input acquisition module configured to acquire a second input for input to the language processing module, the second input comprising a set of fine-tuning parameters and a first weight parameter associated with a fine-tuning module of the target model; and a second generation module configured to generate, by the language processing module, content for responding to the prompt item based on the first input and the second input.
In some embodiments, the first weight parameter is different from a second weight parameter corresponding to a training process of the target model.
In some embodiments, the second generation module comprises: a candidate content generation module configured to generate a corresponding set of candidate content based on a set of candidate weight parameters; an evaluation information determination module configured to determine evaluation information for the set of candidate content based at least on a correlation between the set of candidate content and the prompt item; and a content determination module configured to determine the content for responding to the prompt item from the set of candidate content based on the evaluation information of the set of candidate content.
In some embodiments, the second input acquisition module comprises: a first parameter determination module configured to determine, using a weight prediction module, the first weight parameter corresponding to the prompt item based on the prompt item, wherein the weight prediction module is trained based on a set of training prompt items and corresponding training weight parameters.
In some embodiments, the determination module 520 includes: a projection module configured to project the text feature representation to a feature dimension corresponding to the audio feature representation; and a first feature determination module configured to determine a first intermediate feature representation based on the audio feature representation and the projected text feature representation.
In some embodiments, the apparatus 500 further comprises a model training module configured to: determine an initial set of query feature representations; and train the target model with a first set of training data, with parameters of the target model other than the converter module and the query feature representations fixed, to determine a first set of parameters of the converter module and a set of intermediate query feature representations.
In some embodiments, the first set of training data includes a first set of audio samples corresponding to a speech recognition task.
In some embodiments, the target model further comprises a text encoding module for generating the text feature representation of the prompt item, and the model training module is further configured to: train the target model using a second set of training data to update at least the first set of parameters of the converter module, the set of intermediate query feature representations, and a second set of parameters of the text encoding module.
In some embodiments, the second set of training data includes a second set of audio samples and a set of training prompt items that correspond to a speech processing task.
In some embodiments, the speech processing task includes at least one of: a speech recognition task, a speech translation task, a speech question-answering task, a phoneme recognition task, or an audio captioning task.
Fig. 6 illustrates a block diagram of an electronic device 600 in which one or more embodiments of the disclosure may be implemented. It should be understood that the electronic device 600 illustrated in fig. 6 is merely exemplary and should not be construed as limiting the functionality and scope of the embodiments described herein. The electronic device 600 illustrated in fig. 6 may be used to implement the terminal device 115 and/or the electronic device 120 of fig. 1, and/or the apparatus 500 illustrated in fig. 5.
As shown in fig. 6, the electronic device 600 is in the form of a general purpose computing device. The components of electronic device 600 may include, but are not limited to, one or more processors or processing units 610, memory 620, storage 630, one or more communication units 640, one or more input devices 650, and one or more output devices 660. The processing unit 610 may be an actual or virtual processor and is capable of performing various processes according to programs stored in the memory 620. In a multiprocessor system, multiple processing units execute computer-executable instructions in parallel to increase the parallel processing capabilities of electronic device 600.
The electronic device 600 typically includes a number of computer storage media. Such media may be any available media that are accessible by the electronic device 600, including, but not limited to, volatile and non-volatile media, and removable and non-removable media. The memory 620 may be volatile memory (e.g., registers, cache, random access memory (RAM)), non-volatile memory (e.g., read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof. The storage device 630 may be a removable or non-removable medium and may include machine-readable media such as flash drives, magnetic disks, or any other medium capable of storing information and/or data and accessible within the electronic device 600.
The electronic device 600 may further include additional removable/non-removable, volatile/nonvolatile storage media. Although not shown in fig. 6, a magnetic disk drive for reading from or writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk may be provided. In these cases, each drive may be connected to a bus (not shown) by one or more data medium interfaces. Memory 620 may include a computer program product 625 having one or more program modules configured to perform the various methods or acts of the various embodiments of the disclosure.
The communication unit 640 enables communication with other electronic devices through a communication medium. Additionally, the functionality of the components of the electronic device 600 may be implemented in a single computing cluster or in multiple computing machines capable of communicating over a communication connection. Thus, the electronic device 600 may operate in a networked environment using logical connections to one or more other servers, a network Personal Computer (PC), or another network node.
The input device 650 may be one or more input devices such as a mouse, keyboard, or trackball. The output device 660 may be one or more output devices such as a display, speakers, or printer. The electronic device 600 may also communicate, as desired via the communication unit 640, with one or more external devices (not shown) such as storage devices and display devices, with one or more devices that enable a user to interact with the electronic device 600, or with any device (e.g., a network card or modem) that enables the electronic device 600 to communicate with one or more other electronic devices. Such communication may be performed via an input/output (I/O) interface (not shown).
According to an exemplary embodiment of the present disclosure, there is provided a computer-readable storage medium having computer-executable instructions stored thereon, wherein the computer-executable instructions are executed by a processor to implement the method described above. According to an exemplary embodiment of the present disclosure, there is also provided a computer program product tangibly stored on a non-transitory computer-readable medium and comprising computer-executable instructions that are executed by a processor to implement the method described above.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus, devices, and computer program products implemented according to the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various implementations of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of implementations of the present disclosure has been provided for illustrative purposes; it is not exhaustive and is not limited to the implementations disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described implementations. The terminology used herein was chosen to best explain the principles of each implementation, the practical application, or the technical improvement over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the implementations disclosed herein.

Claims (18)

1. A model processing method, comprising:
acquiring audio content and a prompt item;
determining a first intermediate feature representation based on an audio feature representation of the audio content and a text feature representation of the prompt item;
converting, using a converter module of a target model and based on a preset set of query feature representations, the first intermediate feature representation into a second intermediate feature representation, wherein the second intermediate feature representation corresponds to a portion of the query feature representations in the set of query feature representations; and
generating content for responding to the prompt item based at least on the second intermediate feature representation.
2. The method of claim 1, wherein the prompt item is used to instruct the target model to perform a target task based on the audio content, the target model being trained based on a set of training data, and the target task being different from a set of training tasks corresponding to the set of training data.
3. The method of claim 1, wherein converting the first intermediate feature representation to a second intermediate feature representation comprises:
determining a set of similarities between the first intermediate feature representation and the set of query feature representations;
responsive to determining that the set of query feature representations includes a plurality of query feature representations having a similarity greater than a threshold, determining a target query feature representation from the plurality of query feature representations; and
converting the first intermediate feature representation into the second intermediate feature representation, such that the second intermediate feature representation includes a portion corresponding to the target query feature representation and does not include portions corresponding to other query feature representations of the plurality of query feature representations.
4. The method of claim 1, wherein converting the first intermediate feature representation to a second intermediate feature representation comprises:
converting the first intermediate feature representation into a third intermediate feature representation based on the set of query feature representations; and
processing the third intermediate feature representation based on a preset mask to determine the second intermediate feature representation, so as to preserve the portion of the third intermediate feature representation that corresponds to the partial query feature representations.
5. The method of claim 1, wherein generating content for responding to the prompt item based at least on the second intermediate feature representation comprises:
generating a first input for input to a language processing module based on the second intermediate feature representation and the prompt item; and
generating, by the language processing module, content for responding to the prompt item based at least on the first input.
6. The method of claim 5, wherein generating, by the language processing module, content for responding to the prompt item based at least on the first input comprises:
obtaining a second input for input to the language processing module, the second input comprising a set of tuning parameters and a first weight parameter associated with a tuning module of the target model; and
generating, by the language processing module, content for responding to the prompt item based on the first input and the second input.
7. The method of claim 6, wherein the first weight parameter is different from a second weight parameter corresponding to a training process of the target model.
8. The method of claim 6, wherein generating, by the language processing module, content for responding to the prompt item based on the first input and the second input comprises:
generating a corresponding set of candidate content based on the set of candidate weight parameters;
determining rating information of the set of candidate content based at least on a correlation between the set of candidate content and the prompt item; and
determining, from the set of candidate content, the content for responding to the prompt item based on the rating information of the set of candidate content.
9. The method of claim 8, wherein obtaining a second input for input to the language processing module comprises:
determining, with a weight prediction module, the first weight parameter corresponding to the prompt item based on the prompt item, wherein the weight prediction module is trained based on a set of training prompt items and corresponding training weight parameters.
10. The method of claim 1, wherein determining a first intermediate feature representation based on an audio feature representation of the audio content and a text feature representation of the prompt item comprises:
projecting the text feature representation to a feature dimension corresponding to the audio feature representation; and
determining the first intermediate feature representation based on the audio feature representation and the projected text feature representation.
11. The method of claim 1, wherein the target model is trained based at least on:
determining an initial set of query feature representations; and
training the target model using a first set of training data to determine a first set of parameters of the converter module and a set of intermediate query feature representations, with other parameters of the target model than the converter module and the query feature representations fixed.
12. The method of claim 11, wherein the first set of training data comprises a first set of audio samples corresponding to a speech recognition task.
13. The method of claim 11, wherein the target model further comprises a text encoding module for generating the text feature representation of the prompt item,
and the target model is further trained based on the following process: training the target model with a second set of training data to update at least the first set of parameters of the converter module, the set of intermediate query feature representations, and a second set of parameters of the text encoding module.
14. The method of claim 13, wherein the second set of training data comprises a second set of audio samples and a set of training prompt items that correspond to a speech processing task.
15. The method of claim 14, wherein the speech processing task comprises at least one of:
a speech recognition task, a speech translation task, a speech question-answering task, a phoneme recognition task, or an audio captioning task.
16. An apparatus for model processing, comprising:
an acquisition module configured to acquire audio content and a prompt item;
a determining module configured to determine a first intermediate feature representation based on an audio feature representation of the audio content and a text feature representation of the prompt item;
a conversion module configured to convert the first intermediate feature representation into a second intermediate feature representation using a converter module of a target model and based on a preset set of query feature representations, wherein the second intermediate feature representation corresponds to a portion of the query feature representations in the set of query feature representations; and
a generation module configured to generate content for responding to the prompt item based at least on the second intermediate feature representation.
17. An electronic device, comprising:
at least one processing unit; and
at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, which when executed by the at least one processing unit, cause the electronic device to perform the method of any one of claims 1 to 15.
18. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method according to any of claims 1 to 15.
CN202311069869.7A 2023-08-23 2023-08-23 Model processing method, device, equipment and storage medium Pending CN117037774A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311069869.7A CN117037774A (en) 2023-08-23 2023-08-23 Model processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311069869.7A CN117037774A (en) 2023-08-23 2023-08-23 Model processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117037774A true CN117037774A (en) 2023-11-10

Family

ID=88637195

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311069869.7A Pending CN117037774A (en) 2023-08-23 2023-08-23 Model processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117037774A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117577120A (en) * 2024-01-17 2024-02-20 清华大学 Deep synthesis audio detection method, system and product combining large language model
CN117577120B (en) * 2024-01-17 2024-04-05 清华大学 Deep synthesis audio detection method, system and product combining large language model

Similar Documents

Publication Publication Date Title
JP7098698B2 (en) Text summarization systems, methods, computer programs and storage media
US10997503B2 (en) Computationally efficient neural network architecture search
US10380996B2 (en) Method and apparatus for correcting speech recognition result, device and computer-readable storage medium
US11593556B2 (en) Methods and systems for generating domain-specific text summarizations
US9256269B2 (en) Speech recognition system for performing analysis to a non-tactile inputs and generating confidence scores and based on the confidence scores transitioning the system from a first power state to a second power state
CN109635253B (en) Text style conversion method and device, storage medium and computer equipment
CN111324743A (en) Text relation extraction method and device, computer equipment and storage medium
JP2020518861A (en) Speech recognition method, apparatus, device, and storage medium
CN110234018B (en) Multimedia content description generation method, training method, device, equipment and medium
US20120166196A1 (en) Word-Dependent Language Model
US11238858B2 (en) Speech interactive method and device
CN109961041B (en) Video identification method and device and storage medium
WO2020151690A1 (en) Statement generation method, device and equipment and storage medium
CN110209803B (en) Story generation method, apparatus, computer device and storage medium
CN117037774A (en) Model processing method, device, equipment and storage medium
CN112825249A (en) Voice processing method and device
CN112395396A (en) Question-answer matching and searching method, device, system and storage medium
CN113392265A (en) Multimedia processing method, device and equipment
CN117076635A (en) Information processing method, apparatus, device and storage medium
US11640823B1 (en) Natural language processing routing
CN113849623A (en) Text visual question answering method and device
CN116741149B (en) Cross-language voice conversion method, training method and related device
CN115512692B (en) Voice recognition method, device, equipment and storage medium
CN116469374A (en) Speech synthesis method, device, equipment and storage medium based on emotion space
WO2021082570A1 (en) Artificial intelligence-based semantic identification method, device, and semantic identification apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination