CN114420156A - Audio processing method, device and storage medium - Google Patents


Info

Publication number
CN114420156A
CN114420156A (Application CN202111493776.8A)
Authority
CN
China
Prior art keywords
audio
output
attribute information
strategy
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111493776.8A
Other languages
Chinese (zh)
Inventor
杨扬 (Yang Yang)
Current Assignee
Zebred Network Technology Co Ltd
Original Assignee
Zebred Network Technology Co Ltd
Application filed by Zebred Network Technology Co Ltd filed Critical Zebred Network Technology Co Ltd
Priority: CN202111493776.8A
Publication: CN114420156A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 — characterised by the analysis technique
    • G10L25/48 — specially adapted for particular use
    • G10L25/51 — specially adapted for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application relates to an audio processing method, an audio processing device, and a storage medium. The method comprises the following steps: acquiring attribute information of a first audio currently being output; when an output request for outputting a second audio is detected, acquiring attribute information of the second audio; determining a first output strategy according to the attribute information of the first audio and the attribute information of the second audio; and outputting audio according to the first output strategy. Because the first output strategy is determined jointly from the attribute information of both audios, and the output of the first and second audio is controlled according to that strategy, the output strategies for the currently output audio and the audio to be output can be adjusted dynamically each time an output request is detected. This ensures the accuracy and flexibility of audio processing and improves the user experience.

Description

Audio processing method, device and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to an audio processing method, an audio processing apparatus, and a storage medium.
Background
With the development of computer technology, daily life has gradually entered the intelligent era, a trend embodied in particular by voice interaction. An electronic device can interact with the user through voice, which simplifies operation and improves the user experience.
Taking a vehicle-mounted terminal as an example, voice interaction is embodied in the vehicle-mounted voice interaction system. The scenarios in which voice interaction is applied in such a system keep increasing, for example: music playing, synthesized navigation announcements, voice calls, speech recognition, various push notifications, voice prompts, and the like. These application scenarios often produce audio output conflicts. For example, when a user is listening to music and a call comes in, the system pauses the music and rings; if the user answers, the music remains paused until the call ends. In the related art, such conflicts are usually resolved in a fixed manner: when a broadcast request for the next audio is received, the system directly pauses the audio being output or adjusts its volume. This makes the voice broadcast inflexible and degrades the user experience.
Disclosure of Invention
To overcome the problems in the related art, the present application provides an audio processing method, apparatus and storage medium.
According to a first aspect of embodiments of the present application, there is provided an audio processing method, including:
acquiring attribute information of a currently output first audio;
when an output request for outputting a second audio is detected, acquiring attribute information of the second audio;
determining a first output strategy according to the attribute information of the first audio and the attribute information of the second audio;
and outputting audio according to the first output strategy.
In some embodiments, the determining a first output policy based on the attribute information of the first audio and the attribute information of the second audio comprises:
determining the first output strategy according to the attribute information of the first audio obtained from multiple dimensions and the attribute information of the second audio obtained from multiple dimensions;
wherein the plurality of dimensions includes at least two of: audio type, audio content, and source of the output request.
In some embodiments, the determining a first output policy based on the attribute information of the first audio and the attribute information of the second audio comprises:
inputting the attribute information of the first audio and the attribute information of the second audio into an arbitration model to obtain an arbitration result;
and determining a preset output strategy corresponding to the arbitration result as the first output strategy under the condition that the preset output strategy corresponding to the arbitration result exists in a preset strategy set.
In some embodiments, the method further comprises:
outputting prompt information under the condition that a preset output strategy corresponding to the arbitration result does not exist in the preset strategy set;
and the prompt information is used for prompting a user to determine an output strategy.
In some embodiments, the method further comprises:
if operation information input aiming at the prompt information is acquired within a preset time length, determining a second output strategy according to the operation information;
and outputting audio according to the second output strategy.
In some embodiments, the method further comprises:
after the second output strategy is determined according to the operation information, updating the configuration parameters of the arbitration model according to the operation information and the second output strategy corresponding to the operation information; and/or
And updating the priority of the audio output strategy corresponding to each audio according to the operation information and a second output strategy corresponding to the operation information.
In some embodiments, the method further comprises:
and if the operation information input aiming at the prompt information is not acquired within the preset time length, outputting audio according to a preset third output strategy.
According to a second aspect of embodiments of the present application, there is provided an audio processing apparatus comprising:
the first acquisition module is configured to acquire attribute information of a first audio currently output;
a second acquisition module configured to acquire attribute information of a second audio when an output request for outputting the second audio is detected;
a first determining module configured to determine a first output policy according to the attribute information of the first audio and the attribute information of the second audio;
and the first output module is configured to output audio according to the first output strategy.
In some embodiments, the first determination module is configured to:
determining the first output strategy according to the attribute information of the first audio obtained from multiple dimensions and the attribute information of the second audio obtained from multiple dimensions;
wherein the plurality of dimensions includes at least two of: audio type, audio content, and source of the output request.
In some embodiments, the first determination module is configured to:
inputting the attribute information of the first audio and the attribute information of the second audio into an arbitration model to obtain an arbitration result;
and determining a preset output strategy corresponding to the arbitration result as the first output strategy under the condition that the preset output strategy corresponding to the arbitration result exists in a preset strategy set.
In some embodiments, the apparatus further comprises:
the second output module is configured to output prompt information under the condition that a preset output strategy corresponding to the arbitration result does not exist in the preset strategy set;
and the prompt information is used for prompting a user to determine an output strategy.
In some embodiments, the apparatus further comprises:
the second determining module is configured to determine a second output strategy according to the operation information if the operation information input aiming at the prompt information is acquired within a preset time length;
and the third output module is configured to output audio according to the second output strategy.
In some embodiments, the apparatus further comprises:
a first updating module configured to update the configuration parameters of the arbitration model according to the operation information and a second output policy corresponding to the operation information after determining the second output policy according to the operation information; and/or
And the second updating module is configured to update the priority of the audio output strategy corresponding to each audio according to the operation information and a second output strategy corresponding to the operation information.
In some embodiments, the apparatus further comprises:
and the fourth output module is configured to output audio according to a preset third output strategy if the operation information input aiming at the prompt information is not acquired within a preset time length.
According to a third aspect of embodiments of the present application, there is provided an audio processing apparatus comprising:
a processor;
a memory configured to store processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the steps of any one of the audio processing methods of the first aspect.
According to a fourth aspect of embodiments herein, there is provided a non-transitory computer readable storage medium, wherein instructions, when executed by a processor of an audio processing apparatus, enable the apparatus to perform the steps of any one of the audio processing methods of the first aspect.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
according to the method and the device, the attribute information of the second audio is acquired when the output request for outputting the second audio is detected by acquiring the attribute information of the currently output first audio, then the first output strategy is determined jointly according to the attribute information of the first audio and the attribute information of the second audio, and audio output is carried out according to the first output strategy.
In the related art, the problem of audio conflict is solved by selecting a fixed output strategy from a preset strategy configuration file. In contrast, according to the method and the device, the first output strategy is determined jointly from the attribute information of the first audio and the attribute information of the second audio, and the audio output of both is controlled according to that strategy. Because the attribute information is acquired in real time, the determined first output strategy matches the current application scenario, and different output strategies can be obtained when the attribute information of the first audio and/or the second audio changes. This ensures the accuracy and flexibility of audio processing and further improves the user experience.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a flow chart illustrating an audio processing method according to an exemplary embodiment of the present application.
FIG. 2A is a comparison diagram illustrating an attribute dimension according to an exemplary embodiment of the present application.
Fig. 2B is a schematic diagram illustrating an audio collision resolution system architecture according to an exemplary embodiment of the present application.
Fig. 3 is a block diagram illustrating an audio processing device according to an exemplary embodiment of the present application.
Fig. 4 is a block diagram illustrating a hardware configuration of an audio processing apparatus according to an exemplary embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatuses and methods consistent with certain aspects of the present application, as recited in the appended claims.
Fig. 1 is a flowchart illustrating an audio processing method according to an exemplary embodiment, as shown in fig. 1, which mainly includes the following steps:
in step 101, acquiring attribute information of a currently output first audio;
in step 102, when an output request for outputting a second audio is detected, acquiring attribute information of the second audio;
in step 103, determining a first output policy according to the attribute information of the first audio and the attribute information of the second audio;
in step 104, audio output is performed according to the first output strategy.
In some embodiments, the audio processing method of the present application may be applied to an electronic device, which may be a terminal device such as a mobile terminal, a fixed terminal, or a vehicle-mounted terminal. The mobile terminal may include a mobile phone, a tablet computer, a notebook computer, a wearable device, or a smart home device such as a smart speaker. The fixed terminal may include a desktop computer, a smart television, or the like. The vehicle-mounted terminal may include the front-end device of a vehicle monitoring and management system, also referred to as a vehicle scheduling and monitoring unit (TCU) terminal. The vehicle-mounted terminal may integrate technologies such as the Global Positioning System (GPS), mileage positioning, and an automotive black box, and can be used for modern vehicle management, including: traffic safety monitoring, operation management, service quality management, intelligent centralized scheduling, electronic stop board control, and the like.
In the embodiments of the application, audio may also be referred to as an audio signal or sound, with a frequency range of 20 Hz to 20 kHz; audio may also refer to the frequencies of human speech, whose range may be 300 Hz to 3400 Hz.
The attribute information refers to information characterizing the audio, and may include at least one of: the audio type (e.g., music or call voice), the audio content (e.g., the song "Above the Moon"), the name of the application that initiated the audio (e.g., a cloud music application), the state information of the audio modules in the electronic device (e.g., the second speaker module is idle), the current ambient loudness (e.g., 15 dB), the energy concentration region, formant frequency, formant intensity, and bandwidth of the timbre corresponding to the audio, and information representing the audio's prosodic characteristics such as duration, fundamental frequency, and average speech power.
In one possible embodiment, after determining that the first audio needs to be output, the electronic device may generate an output request for outputting the first audio and, upon detecting that request, acquire the attribute information of the currently output first audio. Taking an in-vehicle device as an example: the user may select a piece of music from the playlist of the music playing application and confirm it by clicking the "play" button. In response to the user operation, the music playing application may generate an output request for outputting that music and transmit it to the control center of the in-vehicle device; when the control center detects the output request, it may acquire the attribute information of the music being output.
In a possible embodiment, the output request may carry the attribute information of the audio, and after detecting the output request the electronic device may parse the attribute information from it. For example: by parsing the header structure of the output request, the electronic device determines attribute information such as that the audio type of the first audio is music and that the application that initiated it is a music playing application.
In other embodiments, the electronic device may also perform audio analysis on the audio via an audio processing algorithm to determine its attribute information. For example: by analyzing the audio, the electronic device determines attribute information such as the audio content (e.g., the song "Above the Moon").
In other embodiments, the electronic device may store the attribute information of all audios in its memory in advance; when it needs the attribute information, it may read it from memory by an active request. For example, after determining an output request for outputting an audio, the electronic device may parse the attribute identifier corresponding to the audio from the output request, and then query the memory for the attribute information according to that identifier. In the embodiments of the application, the attribute information of the first audio and that of the second audio may each be determined in any of the above manners.
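The two acquisition paths above (attributes carried in the request, or pre-stored and looked up by identifier) can be sketched as follows. This is an illustrative sketch, not code from the patent; the field names (`attributes`, `attr_id`) and store contents are assumptions.

```python
# Hypothetical pre-stored attribute table, keyed by an attribute identifier.
ATTRIBUTE_STORE = {
    "music-001": {"audio_type": "music", "content": "song"},
    "call-001": {"audio_type": "call", "content": "incoming call"},
}

def get_attribute_info(output_request: dict) -> dict:
    """Resolve attribute information for an output request."""
    # Case 1: the request carries the attributes directly (e.g., in its header).
    if "attributes" in output_request:
        return output_request["attributes"]
    # Case 2: the request carries only an identifier; query the stored table.
    attr_id = output_request.get("attr_id")
    if attr_id in ATTRIBUTE_STORE:
        return ATTRIBUTE_STORE[attr_id]
    # Case 3: nothing available; attributes would have to come from
    # audio analysis, which is not modeled in this sketch.
    return {}

print(get_attribute_info({"attr_id": "music-001"}))
```

Either audio (first or second) could be resolved through the same helper, matching the text's note that both may use any of the described manners.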
While outputting the first audio, the electronic device may acquire the attribute information of the second audio upon detecting an output request for outputting the second audio. Here, to distinguish it from the first audio, the second audio may refer to any audio of the electronic device other than the currently output audio.
In some embodiments, there may be one or more second audios. For example: if the first audio currently output by the electronic device is music, and during playback the electronic device receives a call notification and a system message reminder, it may treat both the call voice and the reminder as second audios. The electronic device may obtain the attribute information of the second audio in the same manner as, or a different manner from, that of the first audio; the present application is not particularly limited in this respect. Likewise, the types and the number of items included in the attribute information of the second audio may be the same as or different from those of the first audio. For example: the attribute information of the first audio may include the audio type and audio content, while that of the second audio may include the audio type, audio content, and the source of the output request.
After determining the attribute information of the first audio and the attribute information of the second audio, the electronic device may jointly determine a target output strategy from the two. An output strategy can be understood as the manner in which the electronic device controls the output of the first audio and the second audio, preventing an audio conflict between them and improving the user experience. To distinguish it from other possible output strategies, the first output strategy refers to the strategy first determined for the first audio and the second audio. For example: if the priority of the first audio is higher than that of the second audio, the electronic device may adopt an output strategy of continuing to play the first audio and playing the second audio after the first audio finishes; if the audio types of the two audios are the same but the priority of the first audio's content is lower than that of the second audio's content, the electronic device may adopt an output strategy of playing both simultaneously with the volume of the first audio lower than that of the second audio, and so on.
In some embodiments, the electronic device may preset a correspondence between pairs of (first-audio attribute information, second-audio attribute information) and output strategies, and store the correspondence in its memory. The electronic device may then determine the first output strategy based on the attribute information of the first audio, the attribute information of the second audio, and this correspondence.
For example: the electronic device may preset that attribute information a of the first audio together with attribute information a of the second audio corresponds to output strategy A, and attribute information b of the first audio together with attribute information b of the second audio corresponds to output strategy B. If the electronic device determines that the attribute information of the first audio is a and that of the second audio is a, it may determine output strategy A as the first output strategy by querying the memory. In one possible embodiment, the electronic device may determine the first output strategy using a recognition model, which may be a trained neural network model that takes the attribute information of multiple audios as input and outputs a specific output strategy. For example: the electronic device may input the attribute information of the first audio and the second audio into the trained recognition model and obtain the first output strategy as output.
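The preset correspondence described above can be sketched as a simple lookup table keyed by the pair of attribute sets. The keys and policy names below are illustrative assumptions, not values from the patent; a neural-network recognition model would replace the table lookup without changing the interface.

```python
from typing import Optional

# Hypothetical preset correspondence: (first-audio type, second-audio type)
# pairs mapped to output strategies.
POLICY_TABLE = {
    ("music", "navigation"): "duck_first_audio",   # lower the music volume
    ("music", "call"): "pause_first_audio",        # pause music for the call
    ("notification", "notification"): "queue_second_audio",
}

def determine_first_output_policy(attr1: dict, attr2: dict) -> Optional[str]:
    """Look up the preset strategy for this pair of audio attributes."""
    key = (attr1.get("audio_type"), attr2.get("audio_type"))
    return POLICY_TABLE.get(key)  # None when no preset strategy exists

print(determine_first_output_policy({"audio_type": "music"},
                                    {"audio_type": "call"}))
```

A `None` result corresponds to the case, discussed later in the text, where no preset output strategy exists and the device falls back to prompting the user.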
After determining the first output strategy, the electronic device can output audio according to it. For example: the electronic device determines that the first output strategy is to pause the first audio, output the second audio normally, and resume broadcasting the first audio after the second audio has been broadcast.
According to the method and the device, attribute information of the currently output first audio is acquired; when an output request for outputting a second audio is detected, attribute information of the second audio is acquired; a first output strategy is then determined jointly according to the attribute information of the first audio and the attribute information of the second audio, and audio is output according to the first output strategy.
In the related art, the problem of audio conflict is solved by selecting a fixed output strategy from a preset strategy configuration file. In contrast, according to the method and the device, the first output strategy is determined jointly from the attribute information of the first audio and the attribute information of the second audio, and the audio output of both is controlled according to that strategy. Because the attribute information is acquired in real time, the determined first output strategy matches the current application scenario, and different output strategies can be obtained when the attribute information of the first audio and/or the second audio changes. This ensures the accuracy and flexibility of audio processing and further improves the user experience.
In some embodiments, the determining a first output policy based on the attribute information of the first audio and the attribute information of the second audio comprises:
determining the first output strategy according to the attribute information of the first audio obtained from multiple dimensions and the attribute information of the second audio obtained from multiple dimensions;
wherein the plurality of dimensions includes at least two of: audio type, audio content, and source of the output request.
In the embodiments of the present application, the attribute information of an audio may be regarded as an abstract set containing contents of multiple dimensions; the more dimensions, the more detailed the description of the audio. For example: one-dimensional attribute information may include the audio type, while two-dimensional attribute information may include the audio type and the audio content. The multiple dimensions include at least two of: the audio type, the audio content, and the source of the output request.
The audio types may include at least: notification voice, music, ring tone, alert voice, call voice, navigation voice, and the like. The audio content may include at least: urgent content, non-urgent content, content least expected to be interrupted, generally liked content, general notification content, and the like. The sources of output requests may include at least: the native system, a mainstream application, a trusted application, a commonly used application, and the like. The electronic device may represent the audio content in the form of a content threshold, which represents the priority of the audio output (e.g., urgency or preference). For example: a content threshold of 100 may indicate that the audio should not be interrupted, and a content threshold of 0 may indicate that the audio may be interrupted or paused. In one possible embodiment, the electronic device may also take user behavior as a dimension of the attribute information. For example: if the user marks one or more types of audio as liked while using the electronic device, the electronic device may include information such as "liked" in the attribute information of those audio types.
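The multi-dimensional attribute information above can be sketched as a small record type. This is an illustrative sketch; the field names and threshold values are assumptions, with the content threshold following the 0-100 priority convention described in the text.

```python
from dataclasses import dataclass

@dataclass
class AudioAttributes:
    audio_type: str         # e.g., "music", "ring", "navigation"
    content_threshold: int  # 0..100; higher means higher output priority
    source: str             # e.g., "native_system", "mainstream_app"
    liked: bool = False     # optional user-behavior dimension

music = AudioAttributes("music", content_threshold=20, source="mainstream_app")
alert = AudioAttributes("alert", content_threshold=100, source="native_system")

# A threshold of 100 marks audio that should not be interrupted,
# so the alert outranks the music here.
print(alert.content_threshold > music.content_threshold)
```

Representing each dimension as an explicit field makes it straightforward to compare two audios dimension by dimension, as the comparison scheme of FIG. 2A requires.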
The electronic device may determine the first output strategy based on the attribute information of the first audio obtained from multiple dimensions and that of the second audio obtained from multiple dimensions. The dimension types, and the number of dimensions, of the two sets of attribute information may be the same or different; the present application is not particularly limited in this respect. For example: the electronic device may determine the attribute information of the first audio from two dimensions, audio type and audio content, and that of the second audio from three dimensions, audio type, audio content, and the source of the output request.
In one possible embodiment, the electronic device may determine the requested content of the first audio from the output request of the first audio, and then determine the audio type of the first audio from that requested content; the audio type of the second audio may be determined from its output request in the same way. For example: the electronic device parses an output request and determines that the requested content is broadcasting an SMS notification, so it may determine that the audio type is notification voice. The electronic device may also determine the source of an output request (which may also be called the application source) from the address tag carried by the request. For example: the electronic device determines that the address tag carried by the output request is 12345; through the preset correspondence between address tags and applications, it may determine the program name of the application (e.g., a shopping application), and then, according to a predetermined classification of applications and the program name, determine that the application is a mainstream application (i.e., the source of the output request is a mainstream application).
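The request-parsing steps above can be sketched as follows: infer the audio type from the requested content, and map the address tag to an application and then to a request source. The tag values, mapping tables, and keyword check are hypothetical illustrations, not details from the patent.

```python
# Hypothetical mappings: address tag -> application -> request source.
ADDRESS_TAG_TO_APP = {"12345": "shopping_app"}
APP_TO_SOURCE = {"shopping_app": "mainstream_application"}

def parse_output_request(request: dict) -> dict:
    """Derive the audio type and request source from an output request."""
    content = request.get("content", "")
    # Infer the audio type from the requested content (keyword check is
    # a stand-in for whatever classification the device actually uses).
    audio_type = "notification_voice" if "SMS" in content else "unknown"
    # Resolve the address tag to an application, then to a source category.
    app = ADDRESS_TAG_TO_APP.get(request.get("address_tag", ""), "unknown")
    source = APP_TO_SOURCE.get(app, "unknown")
    return {"audio_type": audio_type, "source": source}

print(parse_output_request({"content": "broadcast SMS notification",
                            "address_tag": "12345"}))
```

The two-step tag lookup mirrors the text: tag to program name first, then program name to its preset classification.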
For another example, the electronic device may analyze the output request of the first audio to obtain a playing content identifier of the first audio, analyze the output request of the second audio to obtain a playing content identifier of the second audio, and then determine the content threshold corresponding to each playing content identifier, where a content threshold is used to represent the priority of an audio. For instance, a first content threshold may be determined for the first audio and a second content threshold for the second audio; if the first content threshold is greater than the second content threshold, the priority of the first audio is higher than that of the second audio. For example, the electronic device determines that the playing content identifier carried in the output request of the first audio is 1, from which it may determine that the first content threshold is 20 and that the first audio is a common notification with a lower priority; it determines that the playing content identifier carried in the output request of the second audio is 2, from which it may determine that the second content threshold is 80 and that the second audio is an emergency notification with a higher priority.
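The identifier-to-threshold comparison above can be condensed into a minimal sketch. The identifier/threshold table (id 1 → 20, id 2 → 80) mirrors the example in the text; the function names and the fallback value of 0 are assumptions.

```python
# Illustrative sketch of the content-threshold comparison: each playing
# content identifier maps to a threshold representing the audio's priority.
# The table values come from the example above; everything else is assumed.

CONTENT_THRESHOLDS = {
    1: 20,   # common notification, lower priority
    2: 80,   # emergency notification, higher priority
}

def compare_priority(first_id: int, second_id: int) -> str:
    """Compare the priorities of two audios by their content thresholds."""
    first = CONTENT_THRESHOLDS.get(first_id, 0)
    second = CONTENT_THRESHOLDS.get(second_id, 0)
    if first > second:
        return "first_higher"
    if first < second:
        return "second_higher"
    return "equal"
```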
In one possible embodiment, the electronic device may compare the attribute information of the first audio obtained from multiple dimensions with the attribute information of the second audio obtained from multiple dimensions, and determine the output policy according to the comparison result of each dimension. As shown in FIG. 2A, FIG. 2A may represent a schematic diagram of attribute dimension comparison: the first row may represent attribute information of multiple dimensions of the first audio, such as its audio type, audio content, source of the output request, and operation information, and the first column may represent the corresponding attribute information of multiple dimensions of the second audio. In some embodiments, the electronic device may predefine an output policy corresponding to each comparison result, store the predefined output policies in a preset policy set, and determine the output policy after obtaining the comparison result by comparing the attribute information of each dimension. If the electronic device can determine a corresponding preset output policy in the preset policy set according to the comparison result, it may use that preset output policy as the first output policy. If no corresponding preset output policy can be determined in the preset policy set (for example, because no output policy was predefined for that comparison result), the electronic device may determine that an optimal output policy cannot be obtained from the preset policy set, and may determine the output policy in another manner (for example, according to operation information of the user).
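The preset-policy-set lookup with a fallback path can be sketched as follows. The dimension keys and policy names are purely illustrative assumptions; the patent does not specify a concrete key structure.

```python
# Minimal sketch of looking up a predefined output policy from a
# comparison of the two audios' attribute dimensions. Keys and policy
# names are assumed for illustration; a None result signals that the
# policy must be determined another way (e.g., from user operation info).

PRESET_POLICIES = {
    ("notification", "music"): "pause_first_play_second",
    ("music", "notification"): "duck_first_mix_second",
}

def lookup_policy(first_type: str, second_type: str):
    """Return the preset output policy for this audio-type pair, if any."""
    return PRESET_POLICIES.get((first_type, second_type))
```

When `lookup_policy` returns `None`, no preset output policy exists for that comparison result, matching the fallback case described above.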
In an embodiment of the present application, a first output policy is determined according to attribute information of a first audio obtained from a plurality of dimensions and attribute information of a second audio obtained from the plurality of dimensions, where the plurality of dimensions include at least two of: the audio type, the audio content, and the source of the output request. Acquiring attribute information from multiple dimensions enriches its content, so that the first output policy can be determined more accurately and effectively and can better conform to the usage habits of users.
In some embodiments, the determining a first output policy based on the attribute information of the first audio and the attribute information of the second audio comprises:
inputting the attribute information of the first audio and the attribute information of the second audio into an arbitration model to obtain an arbitration result;
and determining a preset output strategy corresponding to the arbitration result as the first output strategy under the condition that the preset output strategy corresponding to the arbitration result exists in a preset strategy set.
In this embodiment of the application, the arbitration model (which may also be referred to as an arbitration module, an arbitration algorithm, or the like) may refer to a neural network model used in the electronic device to process the attribute information of the first audio and the attribute information of the second audio. The electronic device may input the attribute information of the first audio and the attribute information of the second audio into the arbitration model, which then performs the relevant processing to obtain an arbitration result. The arbitration result refers to the output of the arbitration model and may be expressed in the form of a numerical value, a conclusion, or the like. In the embodiment of the application, the electronic device may determine the output policy in a dynamic manner, which can be understood as follows: the electronic device collects historical attribute information of the first audio and of the second audio together with their historical output policies, and trains and analyzes an initial arbitration model on this historical data to obtain a trained arbitration model or arbitration formula. The electronic device can then dynamically calculate an arbitration result (which may also be referred to as an audio playing threshold) for the first audio and the second audio through the trained arbitration model, and determine the output policy according to that arbitration result.
In an alternative embodiment, the arbitration result may be represented in numerical form; for example, the numerical range to which the arbitration result belongs may be determined, and a preset output policy matching that range may then be selected. If the electronic device determines that the arbitration result belongs to the range 0-20, the corresponding preset output policy may be to continue playing the first audio and play the second audio after the first audio finishes; when the arbitration result belongs to the range 21-40, the corresponding preset output policy may be to play the first audio and the second audio simultaneously with the volume of the first audio greater than that of the second audio; when the arbitration result belongs to the range 41-60, the corresponding preset output policy may be to play both simultaneously with the volume of the first audio smaller than that of the second audio; and when the arbitration result belongs to the range 61-80, the corresponding preset output policy may be to pause the first audio, play the second audio normally, and resume the first audio after the second audio finishes.
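The range-to-policy mapping described above could be implemented as a simple table scan. The range boundaries follow the example in the text; the policy names are assumed labels, and a score outside every range yields `None` (no preset policy exists), which connects to the prompt-information fallback described later.

```python
# Sketch of mapping a numeric arbitration result to a preset output
# policy using the ranges from the example above (0-20, 21-40, 41-60,
# 61-80). Policy names are illustrative assumptions.

RANGES = [
    (0, 20, "continue_first_then_second"),
    (21, 40, "mix_first_louder"),
    (41, 60, "mix_second_louder"),
    (61, 80, "pause_first_play_second_then_resume"),
]

def policy_for_score(score: float):
    """Return the preset policy whose range contains the score, else None."""
    for low, high, policy in RANGES:
        if low <= score <= high:
            return policy
    return None  # no preset output policy: fall back to another mechanism
```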
In an optional embodiment, the arbitration result is represented in conclusion form. When the arbitration result is that the priority of the first audio (which may be determined according to the urgency of the audio or the user's preference) is higher than the priority of the second audio, the corresponding preset output policy may be to continue playing the first audio and play the second audio after the first audio finishes; when the arbitration result is that the priority of the first audio is lower than that of the second audio, the corresponding preset output policy may be to pause the first audio, play the second audio normally, and resume the first audio after the second audio finishes.
In some embodiments, the electronic device may preset a correspondence between arbitration results and preset output policies, combine all the preset output policies into a preset policy set, and store the preset policy set in a memory of the electronic device for subsequent query and reading. When a preset output policy corresponding to the arbitration result exists in the preset policy set, the electronic device may determine that preset output policy as the first output policy. The term "first output policy" may be understood as a policy determined from the preset policy set, so as to distinguish it from policies determined in other ways. For example, the electronic device may preset a first arbitration result corresponding to a first preset output policy, a second arbitration result corresponding to a second preset output policy, and so on; if the electronic device determines that the arbitration result is the first arbitration result, it may query the preset policy set stored in the memory and obtain the first preset output policy as the first output policy.
In the embodiment of the application, the attribute information of the first audio and the attribute information of the second audio are input into the arbitration model to obtain the arbitration result, and when a preset output policy corresponding to the arbitration result exists in the preset policy set, that preset output policy is determined as the first output policy. In this way the first output policy can be obtained accurately and quickly, improving the operating efficiency of the electronic device.
In some embodiments, the method further comprises:
outputting prompt information under the condition that a preset output strategy corresponding to the arbitration result does not exist in the preset strategy set;
and the prompt information is used for prompting a user to determine an output strategy.
In the embodiment of the application, after the electronic device determines the arbitration result, it may determine whether a preset output policy corresponding to the arbitration result exists in the preset policy set; if not, prompt information may be output, where the prompt information is used to prompt the user to determine the output policy. For example, when the arbitration result set by the electronic device is represented in numerical form with a preset range of 0-100, a preset output policy can be queried in the preset policy set whenever the arbitration result falls within 0-100. If the arbitration result determined by the electronic device does not belong to the preset range (e.g., the arbitration result is 200), the electronic device may determine that no preset output policy corresponding to the arbitration result exists in the preset policy set.
When the arbitration result set by the electronic device is expressed in conclusion form, the arbitration result only includes either the priority of the first audio being higher than that of the second audio, or the priority of the first audio being lower than that of the second audio. If the arbitration result determined by the electronic device is that the two priorities are equal, or no arbitration result can be obtained due to factors such as an arbitration model fault, the electronic device may determine that no preset output policy corresponding to the arbitration result exists in the preset policy set.
The prompt information may refer to information that prompts the user to select an output policy. It may include the arbitration result, the determination result of whether a preset output policy corresponding to the arbitration result exists in the preset policy set, and the like. The electronic device may output the prompt information in interface form, such as through a dialog box, or in voice form, such as through a prompt voice. By outputting the prompt information, the electronic device interacts with the user and determines the output policy through the user's active selection.
In the embodiment of the application, the prompt information is output when no preset output policy corresponding to the arbitration result exists in the preset policy set, where the prompt information is used to prompt the user to determine the output policy. This allows timely interaction with the user, so that the output policy is determined accurately and the stability with which the electronic device determines the output policy is improved.
In some embodiments, the method further comprises:
if operation information input aiming at the prompt information is acquired within a preset time length, determining a second output strategy according to the operation information;
and outputting audio according to the second output strategy.
In the embodiment of the application, after the electronic device outputs the prompt information, if operation information input for the prompt information is acquired within the preset duration, the second output policy is determined according to the operation information. The operation information may refer to the user's operation on the electronic device in response to the prompt information, and may include instruction information received through a keyboard, mouse, or touch display screen, voice information received through a microphone, image information received through a camera, or the like.
The manner of outputting the prompt information by the electronic device may include displaying a policy selection list, where the list may be generated from the preset output policies in the preset policy set and/or other preset output policies, and the user may select a specific output policy from the list to generate operation information. The electronic device may also preset a default output policy and display a dialog box prompting the user whether to accept it, so that the user may select options such as "yes" or "no" to generate operation information; the default output policy may be any preset output policy in the preset policy set, or another policy that does not belong to the preset policy set.
In some embodiments, the operation information may also include a specific instruction from the user, for example, a specific instruction to pause the output of the first audio and play the second audio. The electronic device may preset a correspondence between shortcut operations and specific instructions; for example, it may treat a detected double-click on its touch display screen, within the preset duration after the prompt information is output, as a specific instruction to play the first audio and the second audio simultaneously.
After the electronic device obtains the operation information within the preset duration, the second output policy may be determined according to the operation information. For example, if the user selects a specific policy from the policy selection list and the electronic device detects this operation information within 500 milliseconds, it may take that policy as the second output policy; if the user generates operation information by selecting "yes" or "no" and the electronic device determines within 500 milliseconds that the user selected "yes", it may take the default output policy as the second output policy; if the user directly inputs a specific instruction and the electronic device receives it within 500 milliseconds, it may take the policy corresponding to that instruction as the second output policy.
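The prompt-and-timeout behavior described in this and the surrounding embodiments could be sketched as a polling loop: use the user's choice if it arrives within the preset duration, otherwise fall back to a preset default policy. The function names, the polling approach, and the default policy label are assumptions for illustration.

```python
import time

# Hedged sketch of determining the output policy from user operation info
# within a preset duration (500 ms in the example above), falling back to
# a preset default (third) output policy on timeout. Names are assumed.

def resolve_policy(get_operation, timeout_ms: int = 500,
                   default_policy: str = "play_both"):
    """get_operation() returns the user's selected policy, or None if no
    operation information has been input yet."""
    deadline = time.monotonic() + timeout_ms / 1000.0
    while time.monotonic() < deadline:
        choice = get_operation()
        if choice is not None:
            return choice          # second output policy chosen by the user
        time.sleep(0.01)           # poll until the preset duration elapses
    return default_policy          # preset third output policy
```

A production system would more likely use an event callback than polling; the loop here only makes the timeout logic explicit.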
In the embodiment of the application, if operation information input for the prompt information is acquired within the preset duration, the second output policy is determined according to the operation information and audio output is performed according to the second output policy, so that the electronic device can interact with the user in time, determine the output policy accurately, and ensure that each audio is output in a timely manner.
In some embodiments, the method further comprises:
after the second output strategy is determined according to the operation information, updating the configuration parameters of the arbitration model according to the operation information and the second output strategy corresponding to the operation information; and/or
And updating the priority of the audio output strategy corresponding to each audio according to the operation information and a second output strategy corresponding to the operation information.
In this embodiment, after determining the second output policy according to the operation information, the electronic device may update the configuration parameters of the arbitration model according to the operation information and the second output policy corresponding to the operation information, so as to obtain an updated arbitration model. When the electronic device subsequently detects an output request to output a third audio, it may determine the target output policy based on the updated arbitration model. That is, each time the electronic device determines an output policy, it can update the arbitration model with the relevant information, and the next output policy is then determined with the updated model; the arbitration model is thus dynamically updated during audio processing, which helps determine each output policy accurately and in time and keeps the output policy consistent with the user's current application scenario.
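One way to picture this dynamic update is a model whose configuration parameters are nudged toward each user-confirmed outcome. The patent only says the parameters are updated from the operation information and the chosen policy; the tiny linear "model" and its update rule below are illustrative assumptions standing in for the neural network mentioned earlier.

```python
# Minimal sketch of dynamically updating an arbitration model's
# configuration parameters with a user feedback sample. The linear model
# and gradient-style update rule are assumptions for illustration only.

class ArbitrationModel:
    def __init__(self, lr: float = 0.1):
        # One weight per attribute dimension (assumed dimension names).
        self.weights = {"type": 1.0, "content": 1.0, "source": 1.0}
        self.lr = lr

    def score(self, features: dict) -> float:
        """Arbitration result as a weighted sum of dimension features."""
        return sum(self.weights[k] * features.get(k, 0.0) for k in self.weights)

    def update(self, features: dict, target: float) -> None:
        """Nudge parameters toward the result implied by the user's choice."""
        error = target - self.score(features)
        for k in self.weights:
            self.weights[k] += self.lr * error * features.get(k, 0.0)
```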
In this embodiment of the application, after determining the second output policy according to the operation information, the electronic device may update the priority of the audio output policy corresponding to each audio according to the operation information and the second output policy corresponding to the operation information. For example, the electronic device determines, according to the operation information, that the second output policy is to pause the first audio, play the second audio preferentially, and resume the first audio after the second audio finishes.
In subsequent audio processing, if the same first audio and second audio are detected in the same scenario, the second output policy is adopted directly for audio output, without judging whether a preset output policy corresponding to the arbitration result exists in the preset policy set; the second output policy serves as the output policy with the highest priority. This conforms to the user's preference, simplifies the judgment process of the electronic device, and improves operating efficiency.
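Recording a user-chosen policy as the highest-priority choice for a recurring audio pair, bypassing the preset policy set, might look like the following. The override key structure and function names are assumptions for illustration.

```python
# Sketch of remembering a user-chosen second output policy for a
# recurring audio pair so it takes precedence over the preset policy set
# on later occurrences. The keying scheme is an illustrative assumption.

user_overrides = {}

def remember_override(first_key: str, second_key: str, policy: str) -> None:
    """Record the user's chosen policy as highest priority for this pair."""
    user_overrides[(first_key, second_key)] = policy

def choose_policy(first_key: str, second_key: str, preset_lookup):
    """User overrides win; otherwise consult the preset policy set."""
    override = user_overrides.get((first_key, second_key))
    if override is not None:
        return override
    return preset_lookup(first_key, second_key)
```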
In one possible embodiment, after determining the second output policy according to the operation information, the electronic device may adjust the attribute information of the second audio according to the operation information; for example, it may adjust the priority of the user behavior dimension in the attribute information. For instance, the electronic device may increase the priority of the user behavior dimension, so that when subsequently determining the output policy for the second audio and a third audio, it uses the new attribute information of the second audio together with the attribute information of the third audio to determine a new output policy.
In the embodiment of the application, after the second output policy is determined according to the operation information, the configuration parameters of the arbitration model are updated according to the operation information and the second output policy corresponding to the operation information, and/or the priority of the audio output policy corresponding to each audio is updated according to the same information. The arbitration model and the priorities of the output policies can thus be adjusted dynamically, which improves the flexibility of audio processing and allows the audio processing method to better match the user's personalized application scenarios.
In some embodiments, the method further comprises:
and if the operation information input aiming at the prompt information is not acquired within the preset time length, outputting audio according to a preset third output strategy.
In the embodiment of the application, after the electronic device outputs the prompt information, if no operation information input for the prompt information is acquired within the preset duration, audio output is performed according to a preset third output policy. In some embodiments, the third output policy may be a preset default output policy, which may be any preset output policy in the preset policy set or another policy that does not belong to the preset policy set. For example, if the electronic device receives no operation information within 500 milliseconds, it may use a preset default output policy of playing the first audio and the second audio simultaneously as the third output policy.
In the embodiment of the application, if no operation information input for the prompt information is acquired within the preset duration, audio output is performed according to the preset third output policy, so that the output policy can be determined accurately and in time and the stability with which the electronic device determines the output policy is improved.
In a possible embodiment, the audio processing method in the present application may be applied to an audio conflict resolution system; as shown in FIG. 2B, FIG. 2B may represent an architecture diagram of such a system. The dotted-line box may represent the attribute information of the first audio (i.e., the currently played audio), which may include the audio type, the source of the output request (which may also be referred to as the application source), the audio content (which may also be referred to as the content threshold), and the like. The source of the output request may include a native or local system source, a shopping application source, and so on. The audio content may include favorite content, emergency content, and so on.
The solid-line box may represent the attribute information of the second audio (i.e., the audio to be played next), which may likewise include the audio type, the source of the output request (which may also be referred to as the application source), the audio content (which may also be referred to as the content threshold), and the like, with the same possible request sources and content categories as for the first audio.
The arbitration model (or conflict arbitration module, etc.) can determine the audio output policy according to the attribute information of the first audio and the attribute information of the second audio, and put the arbitration of the first audio and the second audio into effect so as to resolve the audio conflict. The arbitration model can also determine the output policy according to operation information selected by the user, and that operation information can in turn be used to adjust the audio content, refine the arbitration model, and the like. In this embodiment of the present application, audio conflict scenarios may include: while the system is outputting audio, a system or application notification arrives, background music is played, favorite music is being listened to, or a phone call from a family member is received. When the system determines that an audio conflict occurs, it may, for the sake of user experience, adopt a certain policy to resolve the conflict and determine the optimal audio output mode.
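The overall conflict-resolution flow described in this architecture can be sketched end to end: arbitrate on the two audios' attribute information, fall back to prompting the user, and finally to a preset default. All function names, parameters, and policy labels below are illustrative assumptions, not the patent's implementation.

```python
# End-to-end sketch of the audio conflict resolution flow: arbitration
# model -> preset policy set -> user prompt -> preset default policy.
# Every name here is an assumption for illustration.

def resolve_conflict(first_attrs, second_attrs, arbitrate, preset_policies,
                     ask_user, default_policy="play_both"):
    result = arbitrate(first_attrs, second_attrs)   # arbitration result
    policy = preset_policies.get(result)            # preset policy set lookup
    if policy is not None:
        return policy                               # first output policy
    choice = ask_user()                             # prompt information path
    if choice is not None:
        return choice                               # second output policy
    return default_policy                           # third output policy
```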
According to the technical scheme of the embodiments of the application, the attribute information of the currently output first audio is acquired; when an output request for outputting a second audio is detected, the attribute information of the second audio is acquired; the first output policy is then determined jointly from the attribute information of the first audio and the attribute information of the second audio, and audio output is performed according to the first output policy.
In the prior art, the problem of audio conflict is solved by determining a fixed output policy from a preset policy configuration file. In the present application, by contrast, the first output policy is determined jointly from the attribute information of the first audio and the attribute information of the second audio, and the audio output of the first audio, the second audio, and other audios is controlled according to the first output policy. Because the attribute information used is real-time, the determined first output policy can match the current application scenario, and the resulting output policies can differ when the attribute information of the first audio and/or the second audio changes. This ensures the accuracy and flexibility of audio processing and further improves the user experience.
Fig. 3 is a block diagram illustrating an audio processing device according to an example embodiment. As shown in fig. 3, the audio processing apparatus 300 mainly includes:
a first obtaining module 301 configured to obtain attribute information of a currently output first audio;
a second obtaining module 302 configured to obtain attribute information of a second audio when an output request for outputting the second audio is detected;
a first determining module 303, configured to determine a first output policy according to the attribute information of the first audio and the attribute information of the second audio;
a first output module 304 configured to output audio according to the first output policy.
In some embodiments, the first determining module 303 is configured to:
determining the first output strategy according to the attribute information of the first audio obtained from multiple dimensions and the attribute information of the second audio obtained from multiple dimensions;
wherein the plurality of dimensions includes at least two of: audio type, audio content, and source of the output request.
In some embodiments, the first determining module 303 is configured to:
inputting the attribute information of the first audio and the attribute information of the second audio into an arbitration model to obtain an arbitration result;
and determining a preset output strategy corresponding to the arbitration result as the first output strategy under the condition that the preset output strategy corresponding to the arbitration result exists in a preset strategy set.
In some embodiments, the apparatus 300 further comprises:
the second output module is configured to output prompt information under the condition that a preset output strategy corresponding to the arbitration result does not exist in the preset strategy set;
and the prompt information is used for prompting a user to determine an output strategy.
In some embodiments, the apparatus 300 further comprises:
the second determining module is configured to determine a second output strategy according to the operation information if the operation information input aiming at the prompt information is acquired within a preset time length;
and the third output module is configured to output audio according to the second output strategy.
In some embodiments, the apparatus 300 further comprises:
a first updating module configured to update the configuration parameters of the arbitration model according to the operation information and a second output policy corresponding to the operation information after determining the second output policy according to the operation information; and/or
And the second updating module is configured to update the priority of the audio output strategy corresponding to each audio according to the operation information and a second output strategy corresponding to the operation information.
In some embodiments, the apparatus 300 further comprises:
and the fourth output module is configured to output audio according to a preset third output strategy if the operation information input aiming at the prompt information is not acquired within a preset time length.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 4 is a block diagram illustrating a hardware configuration of an audio processing apparatus according to an exemplary embodiment. For example, the apparatus 400 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 4, the apparatus 400 may include one or more of the following components: processing component 402, memory 404, power component 406, multimedia component 408, audio module 410, input/output (I/O) interface 412, sensor component 414, and communication component 416.
The processing component 402 generally controls overall operation of the apparatus 400, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 402 may include one or more processors 420 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 402 can include one or more modules that facilitate interaction between the processing component 402 and other components. For example, the processing component 402 can include a multimedia module to facilitate interaction between the multimedia component 408 and the processing component 402.
The memory 404 is configured to store various types of data to support operations at the apparatus 400. Examples of such data include instructions for any application or method operating on the device 400, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 404 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power supply components 406 provide power to the various components of device 400. The power components 406 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 400.
The multimedia component 408 includes a screen that provides an output interface between the apparatus 400 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 408 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the apparatus 400 is in an operation mode, such as a photographing mode or a video mode. Each of the front camera and the rear camera may be a fixed optical lens system or have focusing and optical zoom capability.
The audio module 410 is configured to output and/or input audio signals. For example, the audio module 410 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 400 is in an operation mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signals may further be stored in the memory 404 or transmitted via the communication component 416. In some embodiments, the audio module 410 further comprises a speaker for outputting audio signals.
The I/O interface 412 provides an interface between the processing component 402 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 414 includes one or more sensors for providing various aspects of status assessment for the apparatus 400. For example, the sensor component 414 may detect an open/closed state of the apparatus 400 and the relative positioning of components, such as a display and a keypad of the apparatus 400. The sensor component 414 may also detect a change in the position of the apparatus 400 or a component of the apparatus 400, the presence or absence of user contact with the apparatus 400, the orientation or acceleration/deceleration of the apparatus 400, and a change in the temperature of the apparatus 400. The sensor component 414 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor component 414 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 414 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 416 is configured to facilitate wired or wireless communication between the apparatus 400 and other devices. The apparatus 400 may access a wireless network based on a communication standard, such as Wi-Fi, 4G, or 5G, or a combination thereof. In an exemplary embodiment, the communication component 416 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 416 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 400 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 404 comprising instructions, executable by the processor 420 of the apparatus 400 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Also provided is a non-transitory computer-readable storage medium whose instructions, when executed by a processor of an audio processing apparatus, enable the audio processing apparatus to perform an audio processing method comprising:
acquiring attribute information of a currently output first audio;
when an output request for outputting a second audio is detected, acquiring attribute information of the second audio;
determining a first output strategy according to the attribute information of the first audio and the attribute information of the second audio;
and outputting audio according to the first output strategy.
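The flow described above, together with the preset-strategy lookup and fallbacks elaborated in claims 3 to 7, could be sketched as follows. All names here (`AudioAttributes`, `PRESET_STRATEGIES`, the strategy labels) are hypothetical, and the rule-table arbitration model is an assumption for illustration; the patent does not specify the arbitration model's internals.

```python
# Illustrative sketch of the claimed audio-arbitration flow.
# The rule-table "arbitration model" and all names are assumptions;
# the patent leaves the model's implementation open.
from dataclasses import dataclass

@dataclass
class AudioAttributes:
    audio_type: str      # e.g. "navigation", "music", "call"
    content: str         # audio content category
    request_source: str  # source of the output request

# Hypothetical preset strategy set: maps an arbitration result
# (here, the pair of audio types) to a preset output strategy.
PRESET_STRATEGIES = {
    ("music", "navigation"): "duck_first_play_second",
    ("music", "call"): "pause_first_play_second",
    ("navigation", "music"): "queue_second",
}

def arbitrate(first: AudioAttributes, second: AudioAttributes):
    """Stand-in arbitration model: keys on the two audio types."""
    return (first.audio_type, second.audio_type)

def determine_output_strategy(first: AudioAttributes,
                              second: AudioAttributes,
                              user_choice=None):
    """Choose the output strategy when `second` interrupts `first`."""
    result = arbitrate(first, second)
    preset = PRESET_STRATEGIES.get(result)
    if preset is not None:
        return preset                    # preset strategy exists for the result
    if user_choice is not None:
        return user_choice               # user responded to the prompt in time
    return "pause_first_play_second"     # preset fallback (third) strategy

music = AudioAttributes("music", "song", "media_app")
nav = AudioAttributes("navigation", "turn_prompt", "nav_app")
print(determine_output_strategy(music, nav))  # -> duck_first_play_second
```

In this sketch a missing table entry falls through to the user prompt and then to a default, mirroring the claim structure in which user feedback can later update the model's configuration parameters or the per-audio strategy priorities.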
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A method of audio processing, the method comprising:
acquiring attribute information of a currently output first audio;
when an output request for outputting a second audio is detected, acquiring attribute information of the second audio;
determining a first output strategy according to the attribute information of the first audio and the attribute information of the second audio;
and outputting audio according to the first output strategy.
2. The method of claim 1, wherein determining a first output strategy according to the attribute information of the first audio and the attribute information of the second audio comprises:
determining the first output strategy according to the attribute information of the first audio obtained from multiple dimensions and the attribute information of the second audio obtained from multiple dimensions;
wherein the plurality of dimensions includes at least two of: audio type, audio content, and source of the output request.
3. The method of claim 1, wherein determining a first output strategy according to the attribute information of the first audio and the attribute information of the second audio comprises:
inputting the attribute information of the first audio and the attribute information of the second audio into an arbitration model to obtain an arbitration result;
and determining a preset output strategy corresponding to the arbitration result as the first output strategy under the condition that the preset output strategy corresponding to the arbitration result exists in a preset strategy set.
4. The method of claim 3, further comprising:
outputting prompt information under the condition that a preset output strategy corresponding to the arbitration result does not exist in the preset strategy set;
and the prompt information is used for prompting a user to determine an output strategy.
5. The method of claim 4, further comprising:
if operation information input aiming at the prompt information is acquired within a preset time length, determining a second output strategy according to the operation information;
and outputting audio according to the second output strategy.
6. The method of claim 5, further comprising:
after the second output strategy is determined according to the operation information, updating the configuration parameters of the arbitration model according to the operation information and the second output strategy corresponding to the operation information; and/or
updating the priority of the audio output strategy corresponding to each audio according to the operation information and the second output strategy corresponding to the operation information.
7. The method of claim 4, further comprising:
and if the operation information input aiming at the prompt information is not acquired within the preset time length, outputting audio according to a preset third output strategy.
8. An audio processing apparatus, comprising:
the first acquisition module is configured to acquire attribute information of a first audio currently output;
a second acquisition module configured to acquire attribute information of a second audio when an output request for outputting the second audio is detected;
a first determining module configured to determine a first output policy according to the attribute information of the first audio and the attribute information of the second audio;
and the first output module is configured to output audio according to the first output strategy.
9. An audio processing apparatus, comprising:
a processor;
a memory configured to store processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the steps of the audio processing method of any one of claims 1 to 7.
10. A non-transitory computer readable storage medium having instructions which, when executed by a processor of an audio processing apparatus, enable the apparatus to perform the steps of any of the audio processing methods of claims 1 to 7.
CN202111493776.8A 2021-12-08 2021-12-08 Audio processing method, device and storage medium Pending CN114420156A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111493776.8A CN114420156A (en) 2021-12-08 2021-12-08 Audio processing method, device and storage medium


Publications (1)

Publication Number Publication Date
CN114420156A true CN114420156A (en) 2022-04-29

Family

ID=81265632




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination