CN117995208A - Processing method, device and equipment

Processing method, device and equipment

Info

Publication number
CN117995208A
Authority
CN
China
Prior art keywords
audio data
audio
target
engine
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410139544.XA
Other languages
Chinese (zh)
Inventor
张振龙
胡永华
王余良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd filed Critical Lenovo Beijing Ltd
Priority to CN202410139544.XA
Publication of CN117995208A

Landscapes

  • Circuit For Audible Band Transducer (AREA)

Abstract

The disclosure provides a processing method, a processing device, and processing equipment, which can be applied to the technical field of data processing. The processing method includes: obtaining audio data, where the audio data is an expression of a real environment; processing the audio data based on an audio engine to obtain first audio data and second audio data different from the first audio data; obtaining a target parameter based on the first audio data, where the target parameter acts at least on the audio engine to process the second audio data; and processing, by the audio engine, the audio data into target audio data based on the target parameter, where the audio effect of the target audio data is different from the audio effect of the audio data.

Description

Processing method, device and equipment
Technical Field
The disclosure relates to the technical field of data processing, and in particular relates to a processing method, a processing device and processing equipment.
Background
In practical applications, when processing audio, a specific sound is usually extracted from a piece of audio and then synthesized with an existing background sound to obtain new audio. However, existing approaches cannot adjust the audio during processing according to the content of audio from the real environment, so the resulting new audio sounds unrealistic.
Disclosure of Invention
The disclosure provides a processing method, a processing device and processing equipment.
According to a first aspect of the present disclosure, there is provided a processing method comprising: obtaining audio data, wherein the audio data is an expression of a real environment; processing the audio data based on an audio engine to obtain first audio data and second audio data different from the first audio data; obtaining a target parameter based on the first audio data, wherein the target parameter acts at least on the audio engine to process the second audio data; and processing, by the audio engine, the audio data into target audio data based on the target parameter, the audio effect of the target audio data being different from the audio effect of the audio data.
According to an embodiment of the present disclosure, the audio effect of the target audio data varies with changes in the audio data.
According to an embodiment of the present disclosure, obtaining audio data includes: collecting audio data if a communication channel exists between one communication identifier and another communication identifier, wherein the audio data is used for representing the sound of the real environment. The method further comprises: transmitting the target audio data over the communication channel.
According to an embodiment of the present disclosure, processing audio data into target audio data by an audio engine based on target parameters includes: the audio engine processes the second audio data based on the processing model corresponding to the target parameter to obtain second target audio data; and the audio engine fuses the first audio data and the second target audio data to generate target audio data.
According to an embodiment of the present disclosure, processing audio data into target audio data by an audio engine based on target parameters includes: the audio engine processes the second audio data based on the first model to obtain first target audio data; the audio engine fuses the first audio data and the first target audio data to generate target audio data; switching from the first model to the second model by the audio engine in response to the target parameter; the audio engine processes the second audio data based on the second model to obtain second target audio data; the audio engine fuses the first audio data and the second target audio data to generate target audio data; the first model is a general processing model, the second model is a special processing model, and the audio effect of the second target audio data is better than that of the first target audio data.
According to an embodiment of the present disclosure, processing audio data into target audio data by an audio engine based on target parameters includes: the audio engine processes the second audio data based on a first model of the first configuration parameters to obtain first target audio data; the audio engine fuses the first audio data and the first target audio data to generate target audio data; the audio engine switches from the first configuration parameter to the second configuration parameter in response to the target parameter; the audio engine processes the second audio data based on the first model of the second configuration parameter to obtain second target audio data; the audio engine fuses the first audio data and the second target audio data to generate target audio data; the processing effect of the first model of the second configuration parameter is better than that of the first model of the first configuration parameter.
According to an embodiment of the present disclosure, obtaining a target parameter based on first audio data includes: analyzing sound attributes of the first audio data at different moments; and if the sound attribute of the first audio data at different moments changes, determining the target parameter.
According to an embodiment of the present disclosure, obtaining the target parameter based on the first audio data further includes:
Obtaining target parameters based on the first audio data and auxiliary parameters, wherein the auxiliary parameters are obtained synchronously in the process of obtaining the audio data; the target parameters are used to instruct the audio engine to process the audio data into target audio data, the audio data being used to characterize a first scene, and the target audio data being used to characterize a second scene different from the first scene.
A second aspect of the present disclosure provides a processing apparatus comprising: the acquisition module is used for acquiring audio data, wherein the audio data is the expression of a real environment; a first processing module for processing the audio data based on the audio engine to obtain first audio data and second audio data different from the first audio data; the second processing module is used for obtaining target parameters based on the first audio data, wherein the target parameters at least act on the audio engine to process the second audio data; and a third processing module for processing the audio data into target audio data by the audio engine based on the target parameter, the audio effect of the target audio data being different from the audio effect of the audio data.
A third aspect of the present disclosure provides a processing apparatus comprising: an audio acquisition device and a processor. The audio acquisition device is used for collecting audio data, the audio data being used for representing the sound of the real environment where the electronic equipment is located. The processor is used for obtaining the audio data, where the audio data is an expression of the real environment; processing the audio data based on an audio engine to obtain first audio data and second audio data different from the first audio data; obtaining a target parameter based on the first audio data, where the target parameter acts at least on the audio engine to process the second audio data; and processing, by the audio engine, the audio data into target audio data based on the target parameter, the audio effect of the target audio data being different from the audio effect of the audio data.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 schematically illustrates an application scenario diagram of a processing method according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a processing method according to an embodiment of the disclosure;
FIG. 3A schematically illustrates a flow chart of a method of processing audio data based on target parameters to obtain target audio data in accordance with an embodiment of the present disclosure;
FIG. 3B schematically illustrates a flow chart of a processing method according to an embodiment of the present disclosure;
FIG. 3C schematically illustrates a schematic diagram of a method of processing audio data to obtain target audio data based on target parameters according to another embodiment of the present disclosure;
FIG. 3D schematically illustrates a flow chart of a method of processing audio data based on target parameters to obtain target audio data in accordance with yet another embodiment of the present disclosure;
FIG. 4A schematically illustrates a schematic diagram of a processing method according to an embodiment of the present disclosure;
FIG. 4B schematically illustrates a flow chart for generating background sounds using a model in accordance with an embodiment of the present disclosure;
FIG. 5 schematically illustrates a block diagram of a processing device according to an embodiment of the disclosure; and
Fig. 6 schematically illustrates a block diagram of an electronic device adapted to implement a processing method according to an embodiment of the disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where a convention analogous to "at least one of A, B, and C, etc." is used, such a convention should generally be interpreted as one of skill in the art would understand it (e.g., "a system having at least one of A, B, and C" would include, but not be limited to, systems having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.).
The embodiment of the disclosure provides a processing method, a processing device and processing equipment, and prior to introducing the technical scheme provided by the embodiment of the disclosure, related technologies related to the disclosure are described.
In the related art, a specific sound is generally extracted from a piece of audio when processing the audio, and the sound is then synthesized with an existing background sound to obtain new audio. However, existing processing cannot adjust the audio according to the content of audio from the real environment.
Illustratively, when user A consciously raises his own speaking volume while talking with user B, the audio engine cannot autonomously enhance user A's subsequent speech; instead, user A must keep speaking at the higher volume whenever louder speech is required. Existing audio processing apparatuses therefore cannot intelligently adjust audio using audio data from the real environment.
The embodiment of the disclosure provides a processing method, which comprises the following steps: obtaining audio data, wherein the audio data is an expression of a real environment; processing the audio data based on an audio engine to obtain first audio data and second audio data different from the first audio data; obtaining a target parameter based on the first audio data, wherein the target parameter acts at least on the audio engine to process the second audio data; and processing, by the audio engine, the audio data into target audio data based on the target parameter, the audio effect of the target audio data being different from the audio effect of the audio data. The method uses audio data from a real environment, separates an instruction part and a resource part from the audio data, determines a target parameter for audio adjustment from the instruction part, and finally adjusts the audio of the resource part according to the target parameter, thereby realizing intelligent adjustment of the audio and obtaining ideal audio data that meets the user's requirements.
Fig. 1 schematically illustrates an application scenario diagram of a processing method according to an embodiment of the present disclosure.
As shown in fig. 1, an application scenario 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only) may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for websites browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that the processing method provided in the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the processing device provided by the embodiments of the present disclosure may be generally disposed in the server 105. The processing method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the processing apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The processing method of the disclosed embodiment will be described in detail below with reference to fig. 2 to 4 based on the scenario described in fig. 1.
Fig. 2 schematically illustrates a flow chart of a processing method according to an embodiment of the present disclosure.
As shown in fig. 2, the processing method of this embodiment includes executing steps S210 to S240.
In step S210, audio data, which is an expression of a real environment, is obtained.
In particular, audio data may be understood as sound in a real environment that is continuously captured by an audio collection device. The audio data may include current audio data collected in real time by the audio collection device; for example, while user A is talking, the audio data may be user A's voice and the sound of user A's environment, collected by the mobile phone microphone. The audio data may also include historical audio data collected by the audio collection device; for example, audio data whose collection the device has already completed, or completed audio data that the device obtains from other storage devices, apparatuses, or networks through a network.
In step S220, the audio data is processed based on the audio engine, and the first audio data and the second audio data different from the first audio data are obtained.
In some embodiments, the audio engine extracts the first audio data (an instruction portion) and the second audio data (a resource portion) from the audio data. The first audio data may be an explicit voice instruction; for example, during a call, user A says: "amplify the volume of what I say next", and the second audio data is the content user A says next. The first audio data may also be a change in a sound attribute at different moments; for example, during a call, user A starts speaking louder at a certain moment, and the second audio data is the content of user A's subsequent speech.
In step S230, a target parameter is obtained based on the first audio data, the target parameter acting on at least the audio engine to process the second audio data.
In some embodiments, the target parameter may be understood as the basis for adjusting the audio content of the second audio data, which may be adjusting the volume of the main audio, making a change to the main audio, adjusting the volume of the background sound, or making a scene change to the background sound. Correspondingly, the second audio data can be the background sound in the audio data, or audio formed by the main audio together with the background sound. For example, the target parameter may be obtained by analyzing a change in a sound attribute of the first audio data: analysis of a call may find that user A's voice suddenly becomes louder, and the target parameter then turns up the volume of user A's next speech.
In step S240, the audio data is processed by the audio engine as target audio data based on the target parameter, the audio effect of the target audio data being different from the audio effect of the audio data.
In some embodiments, the audio engine may process audio data in real time or process existing audio data. Compared with the original audio data, the target audio data may differ in its background sound (transformed, adjusted in volume, or removed) or in the volume of its main audio.
In the embodiment of the disclosure, for the audio data collected by the audio collection device, the audio engine can extract the first audio data in the audio data according to the content of the audio data, and analyze the first audio data to obtain the intention (target parameter) of the user for audio adjustment. According to the intention of the user for adjusting the audio, adjusting the corresponding audio (second audio data) to obtain target audio data meeting the requirement of the user.
Based on the embodiment of the disclosure, the method uses audio data collected in a real environment, separates an instruction part and a resource part from the audio data, analyzes the instruction part to obtain a target parameter for audio adjustment, and finally adjusts the audio of the resource part according to the target parameter, thereby realizing intelligent adjustment of the audio and obtaining ideal audio data that meets the user's requirements. The audio engine can identify the user's adjustment requirements for different audio data, or for the same audio data in different time periods, and adjust the audio data with different effects accordingly.
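As an illustration only, the minimal Python sketch below models the flow of steps S210 to S240 under stated assumptions; the names AudioData and AudioEngine, and the keyword rules, are hypothetical and are not part of the disclosure.

```python
from dataclasses import dataclass

@dataclass
class AudioData:
    samples: list             # PCM samples captured from the real environment
    transcript: str = ""      # speech content, e.g. from a (stub) ASR front end

class AudioEngine:
    def split(self, audio: AudioData):
        """S220: separate the instruction part (first audio data) from the
        resource part (second audio data)."""
        first = AudioData(samples=[], transcript=audio.transcript)
        second = AudioData(samples=list(audio.samples))
        return first, second

    def derive_target_parameter(self, first: AudioData) -> str:
        """S230: analyze the instruction part to infer the adjustment intent."""
        text = first.transcript.lower()
        if "louder" in text:
            return "raise_main_volume"
        if "background" in text:
            return "switch_background"
        return "no_op"

    def process(self, second: AudioData, target: str) -> AudioData:
        """S240: adjust the resource part according to the target parameter."""
        if target == "raise_main_volume":
            return AudioData(samples=[s * 2.0 for s in second.samples])
        return second

engine = AudioEngine()
audio = AudioData(samples=[0.1, -0.2, 0.3], transcript="please make me louder")
first, second = engine.split(audio)                 # S220
target = engine.derive_target_parameter(first)      # S230
target_audio = engine.process(second, target)       # S240
```

In a real system, split() would rely on speech recognition and voiceprint analysis rather than a pre-filled transcript.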
In some embodiments, the audio effect of the target audio data varies with the audio data.
Specifically, the audio data is sound in a real environment collected in real time, and the target audio data is continuously adjusted according to the audio data collected in real time and the target parameters.
The audio engine first extracts first audio data at a first time from the audio data and analyzes it to obtain target parameters for adjusting the audio (the user's intention to adjust the audio). Then, according to the target parameters, second audio data at a second time is adjusted in real time to obtain the target audio data, wherein the second time is later than the first time.
Illustratively, in the time dimension of the audio data, the background sound of the second audio data may be adjusted according to the time at which the second audio data is collected.
The concrete implementation is as follows: the user A is early in the house to communicate with the user B, and the mobile phone microphone collects the voice of the user A and the voice of the environment where the user A is located as audio data. In the call, user a speaks: "switch me background of the next speech to the vegetable market". The audio engine obtains the first audio data "switch the context of i'm next speaking to home" (i.e., instruction portion). Then, the audio engine switches the background of the next call (second audio data) of the user a to the sound of the breakfast market. If the user communicates with the user B in the afternoon, the user switches to the afternoon market sound when the background is adjusted. While the morning market sounds more intense than the afternoon market sounds.
In the content dimension of the audio data, the background sound of the second audio data may be adjusted according to the content of the collected second audio data.
The concrete implementation is as follows: the user A communicates with the user B in the vegetable market, and the mobile phone microphone collects the voice of the user A and the voice in the environment where the user A is located as audio data. In the call, user a speaks: "switch me background of the next speech to home". The audio engine obtains the first audio data "switch the context of i'm next speaking to home" (i.e., instruction portion). Then, the audio engine switches the background of the call (second audio data) next by the user a to home. If intermittent sounds of the puppy appear in the background sound of the second audio data. The audio engine may retain intermittent screaming of the puppy and the intermittent times are consistent when performing the background sound conversion. The sound properties of the puppy call may also be adjusted, for example, the call to gold hair in the vegetable market may be adjusted to the call to tidy raised in the user a's home.
It can be understood that when the audio engine processes the audio data, it can make adjustments according to the time and content of the second audio data collected in real time, in combination with the target parameters, to obtain target audio data that better meets the user's requirements and is more realistic.
In some embodiments, obtaining audio data includes:
if a communication channel exists between one communication identifier and another communication identifier, collecting audio data, wherein the audio data is used for representing the sound of the real environment;
the method further comprises the steps of:
The target audio data is transmitted over the communication channel.
Specifically, the audio data is data acquired in real time based on the current environment. Such as voice calls, audio or video recordings, etc. Therefore, in the embodiment disclosed in the present application, the first audio data may be an explicit control instruction, a voice of calling, or a voice of instant messaging.
Illustratively, when user A is talking with user B, after the mobile phone microphone collects user A's audio data (user A's speech and the sound of user A's environment), the audio engine extracts the first audio data (for example, an explicit voice instruction spoken by user A), obtains the target parameter from it, processes the second audio data accordingly, and transmits the resulting target audio data over the communication channel.
It can be understood that the audio data are collected in real time, processed promptly (noise reduction, main-audio volume adjustment, or background sound switching) according to their content, and transmitted in real time, which better meets the user's audio processing requirements.
In other embodiments, when user A is in a call with user B, user B may say to user A: "Your voice is too quiet; I can't hear you clearly", at which point user A raises the volume of his own speech. After the mobile phone microphone collects user A's audio data (user A's speech and the sound of user A's environment), the audio engine extracts the first audio data: the volume of user A's speech suddenly increases at a certain moment. Analyzing this segment of audio data yields the target parameter (amplify user A's subsequent speech); the second audio data (user A's subsequent speech in the call) is processed according to the target parameter to obtain the target audio data, which is transmitted to user B in real time, so user B hears user A's subsequent speech at the amplified volume.
It can be understood that the audio engine can analyze real-time changes in the user's audio content within the audio data, infer the user's requirements for the audio data, and process the audio in time.
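A rough sketch of this real-time flow is given below, reusing the AudioEngine sketch above; the Channel class is a hypothetical stand-in for a real telephony stack, not an actual API.

```python
import queue

class Channel:
    """Hypothetical stand-in for a communication channel between two
    communication identifiers."""
    def __init__(self, chunks):
        self._rx = queue.Queue()
        for c in chunks:
            self._rx.put(c)
        self.sent = []

    def is_open(self) -> bool:
        return not self._rx.empty()

    def capture(self):
        return self._rx.get()        # audio collected from the real environment

    def send(self, audio):
        self.sent.append(audio)      # transmit target audio data to the peer

def call_loop(channel, engine):
    while channel.is_open():                       # a communication channel exists
        chunk = channel.capture()                  # collect audio data
        first, second = engine.split(chunk)
        target = engine.derive_target_parameter(first)
        channel.send(engine.process(second, target))   # transmit target audio

channel = Channel([AudioData(samples=[0.1, 0.2], transcript="louder please")])
call_loop(channel, AudioEngine())
```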
Fig. 3A schematically illustrates a flowchart of a method of processing audio data based on target parameters to obtain target audio data according to an embodiment of the present disclosure. Fig. 3B schematically illustrates a flow chart of a processing method according to an embodiment of the present disclosure.
In some embodiments, as shown in fig. 3A, performing step S240, processing the audio data into target audio data by the audio engine based on the target parameters includes performing steps S241 a-S242 a.
In step S241a, the audio engine processes the second audio data based on the processing model corresponding to the target parameter to obtain second target audio data.
In step S242a, the audio engine fuses the first audio data and the second target audio data to generate the target audio data.
It should be appreciated that in the related art, a single fixed network model is used when audio is adjusted by a trained network model. For example, a network model in the related art only enhances the volume of the main audio in the audio data. It cannot recognize the user's different intents for audio adjustment across different audio data.
In one implementation, after the audio data is collected, the first audio data is extracted and analyzed to obtain a target parameter, and a processing model is selected according to the target parameter to process the second audio data into the target audio data. Each target parameter corresponds to one processing model. The corresponding processing model can be applied to both current audio data and historical audio data.
For example, as shown in fig. 3B, for the audio data of user A during a call, a large language model monitors the call; when the model detects a scene-change instruction in the call audio, the corresponding scene-change processing model is matched according to the target scene, and a new background sound is generated by that model. The audio engine then separates user A's voice from the real background sound of the call, synthesizes user A's voice with the new background sound into target audio data, and outputs it.
For example, when the target parameter indicates turning down the background sound, a model that turns down the background sound in the audio may be selected; when the target parameter indicates switching the background sound from the vegetable market to home, a model that switches vegetable-market background sound to home background sound may be selected; when the target parameter indicates switching the background sound from home to the vegetable market, a model that switches home background sound to vegetable-market background sound may be selected. According to the user's different requirements, the processing model corresponding to the requirement is selected.
In another implementation, for audio data collected in real time (e.g., a voice call), the target audio data can be output in real time after the audio data is processed using the corresponding processing model.
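The following sketch illustrates steps S241a and S242a under the assumption that each target parameter maps to exactly one processing model; the model functions are toy placeholders, and fusion is shown as simple concatenation.

```python
def lower_background(samples):
    """Toy model: attenuate everything, standing in for background suppression."""
    return [s * 0.3 for s in samples]

def market_to_home(samples):
    """Placeholder for a generative background-switching model."""
    return samples

# One processing model per target parameter (S241a).
PROCESSING_MODELS = {
    "lower_background": lower_background,
    "market_to_home": market_to_home,
}

def process_with_model(first_samples, second_samples, target):
    model = PROCESSING_MODELS[target]          # pick the model matching the target
    second_target = model(second_samples)      # S241a: obtain second target audio
    # S242a: fuse first audio data with second target audio data; concatenation
    # stands in for the stream mixing/alignment a real engine would perform.
    return first_samples + second_target
```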
Fig. 3C schematically illustrates a schematic diagram of a method of processing audio data based on target parameters to obtain target audio data according to another embodiment of the present disclosure.
In other embodiments, as shown in fig. 3C, performing step S240, processing the audio data into target audio data by the audio engine based on the target parameters includes performing steps S241 b-S245 b.
In execution of step S241b, the audio engine processes the second audio data based on the first model to obtain first target audio data.
In step S242b, the audio engine fuses the first audio data with the first target audio data to generate target audio data.
In executing step S243b, the audio engine switches from the first model to the second model in response to the target parameter.
In execution of step S244b, the audio engine processes the second audio data based on the second model to obtain second target audio data.
In step S245b, the audio engine fuses the first audio data and the second target audio data to generate the target audio data.
The first model is a general processing model, the second model is a special processing model, and the audio effect of the second target audio data is better than that of the first target audio data.
In particular, for audio processing, different models may be used to process the audio data, yielding audio with different processing effects. The concrete implementation is as follows: after the audio data are collected, at a first moment, one piece of first audio data is extracted and analyzed to obtain a first target parameter, and the first model is selected according to the first target parameter to process the second audio data into the target audio data. At a second moment, another piece of first audio data is extracted and analyzed to obtain a second target parameter, and the second model is selected according to the second target parameter; that is, the first model used to process the second audio data is switched to the second model to obtain the target audio data. The second moment is later than the first moment.
Illustratively, when user A is in a voice call, the first model (a general noise reduction model) reduces the noise of user A's audio to a certain extent so that user B can hear the content of user A's speech. If during the call user A says: "reduce the noise further", the audio engine extracts "reduce the noise further" as first audio data, analyzes it, and obtains the target parameter. According to the target parameter, the first model is switched to the second model (a special noise reduction model), which fully denoises user A's next speech.
For audio data collected in real time (e.g., a voice call), different models are selected at different moments according to the user's processing requirements to produce different processing effects, and the processed target audio data is output in real time.
It should be appreciated that since the audio data is continuously acquired sound data, the user's requirement (target parameter) can also change over time. Thus, both current audio data and historical audio data can be processed by switching models at different moments.
It can be appreciated that during audio processing, a more suitable model can be selected according to the change of the target parameter, so as to improve the audio processing effect.
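A hedged sketch of the general-to-special model switch (steps S241b to S245b) follows; both "models" are toy amplitude gates standing in for real noise reduction networks.

```python
def general_denoise(samples):
    """Toy stand-in for a general-purpose noise reduction model (first model)."""
    return [s for s in samples if abs(s) > 0.01]

def special_denoise(samples):
    """Toy stand-in for a specialized noise reduction model (second model)."""
    return [s for s in samples if abs(s) > 0.05]

class SwitchingEngine:
    def __init__(self):
        self.model = general_denoise                 # start with the general model

    def on_target_parameter(self, target):
        if target == "further_noise_reduction":      # S243b: switch on the parameter
            self.model = special_denoise

    def process(self, first_samples, second_samples):
        second_target = self.model(second_samples)   # S241b / S244b
        return first_samples + second_target         # S242b / S245b: fuse
```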
In other embodiments, the first model and the second model may be two different types of process models.
Illustratively, after the audio data is collected, at a first moment, one piece of first audio data is extracted and analyzed to obtain a first target parameter, and the first model is selected according to the first target parameter to process the second audio data into the target audio data. At a second moment, another piece of first audio data is extracted and analyzed to obtain a second target parameter, and the second model is selected according to the second target parameter; that is, the first model used to process the second audio data is switched to the second model to obtain the target audio data. The second moment is later than the first moment.
For example, the first model may be a noise reduction model and the second model may be a background sound switching model.
Fig. 3D schematically illustrates a flowchart of a method of processing audio data based on target parameters to obtain target audio data according to yet another embodiment of the present disclosure.
In still other embodiments, as shown in fig. 3D, performing step S240, processing the audio data into target audio data by the audio engine based on the target parameters includes performing steps S241 c-S245 c.
In execution of step S241c, the audio engine processes the second audio data based on the first model of the first configuration parameters to obtain first target audio data.
In step S242c, the audio engine fuses the first audio data and the first target audio data to generate target audio data.
In executing step S243c, the audio engine switches from the first configuration parameter to the second configuration parameter in response to the target parameter.
In execution of step S244c, the audio engine processes the second audio data based on the first model of the second configuration parameters to obtain second target audio data.
In step S245c, the audio engine fuses the first audio data and the second target audio data to generate the target audio data.
The processing effect of the first model of the second configuration parameter is better than that of the first model of the first configuration parameter.
Specifically, the same model can be used for audio processing, but when different audio effects are required, the configuration parameters of the model can be changed; processing the audio data with the model under different configuration parameters yields audio with different processing effects. The concrete implementation is as follows: after the audio data are collected, at a first moment, one piece of first audio data is extracted and analyzed to obtain a first target parameter, and the first model with the first configuration parameter is selected according to the first target parameter to process the second audio data into the target audio data. At a second moment, another piece of first audio data is extracted and analyzed to obtain a second target parameter, and the first model with the second configuration parameter is selected according to the second target parameter; that is, the first configuration parameter of the first model is switched to the second configuration parameter to obtain the target audio data. The second moment is later than the first moment.
For example, in a piece of audio data, three people speak during a first period and their voices need to be changed; the first model with the first configuration parameter (configured to change three voices) completes the voice change. During a second period five people speak, and the first model with the first configuration parameter cannot change five voices. Therefore, when five different voices are detected, the first configuration parameter of the first model is switched to the second configuration parameter (configured to change five voices), and the five voices can then be changed.
For audio data collected in real time (e.g., a voice call), models with different configuration parameters are selected at different moments according to the number of speakers, the different speakers' voices are changed accordingly, and the voice-changed target audio data is output in real time.
It should be appreciated that since the audio data is continuously acquired sound data, the user's requirement (target parameter) can also change over time. Therefore, both current audio data and historical audio data can be processed by switching model configuration parameters at different moments.
It can be appreciated that during audio processing, more applicable configuration parameters can be selected according to the change of the target parameters, so as to improve the audio processing effect.
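The sketch below illustrates configuration-parameter switching (steps S241c to S245c) under the assumption that the configuration parameter is the number of voices the model can convert; the voice_change function is a toy placeholder.

```python
def voice_change(samples, num_speakers):
    """Toy stand-in for a voice-conversion model; num_speakers plays the role
    of a configuration parameter setting how many voices it can handle."""
    return [s * (1.0 + 0.1 * num_speakers) for s in samples]

class ConfigurableEngine:
    def __init__(self):
        self.config = {"num_speakers": 3}            # first configuration parameter

    def on_target_parameter(self, detected_speakers):
        if detected_speakers > self.config["num_speakers"]:
            # S243c: switch to the second configuration parameter
            self.config = {"num_speakers": detected_speakers}

    def process(self, first_samples, second_samples):
        second_target = voice_change(second_samples, **self.config)  # S241c/S244c
        return first_samples + second_target         # S242c / S245c: fuse
```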
In some embodiments, performing step S230, obtaining the target parameter based on the first audio data includes:
the sound properties of the first audio data at different moments are analyzed.
And if the sound attribute of the first audio data at different moments changes, determining the target parameter.
Sound attributes include pitch, timbre, and volume, among others.
In one implementation, the target parameter may be determined by analyzing the change in pitch, timbre, and volume of the sound at different times in the first audio data.
For example, when user A and user B are talking, user A walks from the house to the street; on the street, user A naturally raises his volume because the environment is noisy. The volume of user A's speech at the second moment is greater than at the first moment, from which the target parameter of turning up the volume of user A's speech can be obtained. A model for adjusting the volume of the main audio can thus be selected to process the audio data.
For example, when user A and user B are talking, user B says: "Speak up a little; I can't hear you", and user A naturally raises the volume of his next speech. The volume of user A's speech at the second moment is greater than at the first moment, from which the target parameter of turning up the volume of user A's speech can be obtained. A model for adjusting the volume of the main audio can thus be selected to process the audio data.
For example, to change the voices of different people in the audio data: timbre analysis finds three people speaking in the first audio data at a first moment, and the first model with the first configuration parameter (configured to change three voices) completes the voice change. Timbre analysis finds five people speaking in the first audio data at a second moment, and the first configuration parameter of the first model is switched to the second configuration parameter (configured to change five voices).
It can be understood that the intention of the user to adjust the audio data can be obtained according to the change analysis of the sound attribute at different moments, so that the intelligent adjustment of the audio data is realized.
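As a minimal sketch of this attribute analysis, the following assumes volume is measured as RMS level over a frame and that a fixed (hypothetical) ratio marks a significant change between moments.

```python
import math

def rms(frame):
    """Root-mean-square level of one frame, used as a simple volume measure."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def target_from_volume_change(earlier_frame, later_frame, ratio=1.5):
    """Emit a turn-up-main-audio target parameter when the volume at the later
    moment exceeds the earlier moment by the (hypothetical) threshold ratio."""
    if rms(later_frame) > ratio * rms(earlier_frame):
        return "raise_main_volume"
    return None
```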
In other embodiments, at least one keyword is obtained from the first audio data, and the target parameter is determined according to the at least one keyword.
For example, a keyword that determines the target parameter may be extracted from the content of user A's speech. For example, user A says: "change my voice". The keywords "my voice" and "change" are extracted, and the target parameter may be determined as changing the sound of user A's next speech.
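A minimal keyword-matching sketch is shown below; the keyword vocabulary and target labels are illustrative assumptions.

```python
# Hypothetical keyword-to-target-parameter rules.
KEYWORD_RULES = {
    ("voice", "change"): "voice_change",
    ("background", "switch"): "switch_background",
}

def target_from_keywords(transcript):
    text = transcript.lower()
    for keywords, target in KEYWORD_RULES.items():
        if all(k in text for k in keywords):       # all keywords must appear
            return target
    return None

assert target_from_keywords("please change my voice") == "voice_change"
```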
In some embodiments, performing step S230, obtaining the target parameter based on the first audio data further includes:
Obtaining target parameters based on the first audio data and auxiliary parameters, wherein the auxiliary parameters are obtained synchronously in the process of obtaining the audio data; the target parameters are used to instruct the audio engine to process the audio data into target audio data, the audio data being used to characterize a first scene, and the target audio data being used to characterize a second scene different from the first scene.
The auxiliary parameter may include at least one of a location of the audio acquisition device, a system time, weather information, or a background sound type of the first audio data.
Specifically, as described above, user A talks with user B at home in the morning, and the mobile phone microphone collects user A's voice and the sound of user A's environment as audio data. During the call, user A says: "switch the background of my next speech to the vegetable market". The audio engine obtains this first audio data (i.e., the instruction portion). At this point, when selecting a model for switching the background sound, the audio engine does not know which model to select, for example, whether to switch from a bar to the vegetable market or from home to the vegetable market. Thus, the specific model processing type must be determined by means of the auxiliary parameters. For example, the audio engine may analyze the background sound in the first audio data to find that user A is at home; the audio engine may also find that user A is at home by analyzing the location of the audio collection device. Using the auxiliary parameters derived from this analysis, a model that switches from home to the vegetable market (which can be understood here as the first model) is selected to switch the background sound of user A's next speech.
Further, to better fit reality, the audio engine can also determine from the system time of the audio collection device that the call occurs at 8 o'clock, when the clamor of voices in the vegetable market is greater than at other times. Thus, the model that switches home to the vegetable market may be changed to a model that switches home to the morning vegetable market (which can be understood here as the second model). Finally, the audio engine switches the background of user A's next speech (the second audio data) from home to the sound of a morning vegetable market.
It should be noted that the background sound may also be switched by switching model configuration parameters: the model that switches home to the vegetable market corresponds to the first configuration parameter of the model, and the model that switches home to the morning vegetable market corresponds to the second configuration parameter.
It should be noted that the auxiliary parameters, as data that assist audio processing, may be obtained by analyzing the first audio data or directly from an explicit instruction; for example, user A says: "switch the background of my next speech from home to the morning vegetable market", and the audio engine directly matches the background sound model that switches home to the morning vegetable market.
It can be appreciated that the auxiliary parameters can help the audio engine to better select models or configuration parameters of audio processing, thereby enabling more accurate processing of audio data.
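The sketch below shows how auxiliary parameters (here, location and system time) might refine model selection; the returned model labels are illustrative and do not name a real model catalogue.

```python
from datetime import datetime

def select_background_model(target_scene, location, system_time):
    """Combine the instruction's target scene with auxiliary parameters
    (current location, system time) to pick a specific model variant."""
    period = "morning" if system_time.hour < 12 else "afternoon"
    return f"{location}_to_{target_scene}_{period}"

# A call placed from home at 8 o'clock selects the morning-market variant:
model = select_background_model("vegetable_market", "home",
                                datetime(2024, 1, 1, 8, 0))
# -> "home_to_vegetable_market_morning"
```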
To facilitate an understanding of the processing methods of the embodiments of the present disclosure, a description will now be given with reference to fig. 4A and 4B.
FIG. 4A schematically illustrates a schematic diagram of a processing method according to an embodiment of the present disclosure; fig. 4B schematically illustrates a flow chart for generating background sounds using a model according to an embodiment of the disclosure.
As shown in fig. 4A and 4B, the audio is the user's speech, and the target parameter corresponds to the application scenario of switching the background sound of the user's speech. Audio data (containing an instruction and content to be processed) is input to the audio engine (i.e., an AI agent). The audio engine recognizes the first audio data (instruction portion) and the second audio data (resource portion) in the audio data, and analyzes the first audio data to obtain an instruction (target parameter). According to the instruction, the audio engine performs voiceprint analysis on the first audio data/second audio data. The audio engine finds the sound characteristics of the target object according to a volume threshold/timbre; the sound of the target object may be a person's voice, an animal's sound, or another sound. The sound of the target object is then separated from the second audio data according to its sound characteristics. For scenarios with higher real-time requirements, the sound of the target object can be separated by adjusting a threshold value or increasing noise. The target background sound is generated by an AI model according to the instruction: the AI model performs semantic analysis on the instruction and acquires auxiliary parameters (time, location, or weather) to generate an explicit instruction containing an explicit description of the background sound, and then generates an accurate background sound from that explicit instruction. The sound of the target object and the target background sound are finally synthesized to obtain the target audio data.
Based on this processing method, the disclosure further provides a processing apparatus, which is described in detail below in connection with fig. 5.
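For illustration, the following sketch strings the fig. 4A/4B stages together, reusing the AudioEngine sketch above; voiceprint-based separation and background generation are stubs standing in for real models.

```python
def fig4_pipeline(audio, engine, generate_background):
    """End-to-end sketch of the fig. 4A/4B flow. 'generate_background' stands
    in for the AI model that turns an explicit instruction (plus auxiliary
    parameters) into a background sound."""
    first, second = engine.split(audio)                  # instruction / resource
    instruction = engine.derive_target_parameter(first)

    # Voiceprint analysis: locate the target object's sound by a volume
    # threshold (stubbed here; timbre analysis would also work).
    target_voice = [s for s in second.samples if abs(s) > 0.02]

    # Generate the target background sound from the instruction.
    background = generate_background(instruction)

    # Synthesize the target object's sound with the new background; simple
    # concatenation stands in for mixing.
    return target_voice + background
```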
Fig. 5 schematically shows a block diagram of a processing apparatus according to an embodiment of the present disclosure.
As shown in fig. 5, the processing apparatus 500 of this embodiment includes an acquisition module 510, a first processing module 520, a second processing module 530, and a third processing module 540.
The obtaining module 510 is configured to obtain audio data, where the audio data is representative of a real environment. In an embodiment, the obtaining module 510 may be configured to perform the foregoing performing step S210, which is not described herein.
The first processing module 520 is configured to process the audio data based on the audio engine to obtain first audio data and second audio data different from the first audio data. In an embodiment, the first processing module 520 may be configured to perform the foregoing performing step S220, which is not described herein.
The second processing module 530 is configured to obtain a target parameter based on the first audio data, where the target parameter acts on at least the audio engine to process the second audio data. In an embodiment, the second processing module 530 may be configured to perform the foregoing performing step S230, which is not described herein.
The third processing module 540 is configured to process the audio data into target audio data by the audio engine based on the target parameter, wherein an audio effect of the target audio data is different from an audio effect of the audio data. In an embodiment, the third processing module 540 may be configured to perform the foregoing performing step S240, which is not described herein.
According to an embodiment of the present disclosure, any of the acquisition module 510, the first processing module 520, the second processing module 530, and the third processing module 540 may be combined and implemented in one module, or any one of them may be split into multiple modules. Alternatively, at least some of the functionality of one or more of these modules may be combined with at least some of the functionality of other modules and implemented in one module. According to embodiments of the present disclosure, at least one of the acquisition module 510, the first processing module 520, the second processing module 530, and the third processing module 540 may be implemented at least in part as hardware circuitry, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system in a package, or an Application Specific Integrated Circuit (ASIC); by any other reasonable manner of integrating or packaging circuitry in hardware or firmware; or by any one of, or a suitable combination of, software, hardware, and firmware. Alternatively, at least one of the acquisition module 510, the first processing module 520, the second processing module 530, and the third processing module 540 may be at least partially implemented as computer program modules which, when executed, perform the corresponding functions.
Fig. 6 schematically illustrates a block diagram of an electronic device adapted to implement a processing method according to an embodiment of the disclosure.
An electronic device 600, comprising: an audio acquisition device 601 and a processor 602.
The audio collection device 601 is used for collecting audio data, and the audio data is used for representing sound of a real environment where the electronic device is located.
In some embodiments, the audio acquisition device 601 may be a microphone or an audio sampling card. The audio acquisition device can be arranged in the electronic equipment in an integrated manner, and can also be independent of the electronic equipment and establish communication connection with the electronic equipment.
The processor 602 is configured to obtain the audio data, where the audio data is an expression of the real environment; process the audio data based on an audio engine to obtain first audio data and second audio data different from the first audio data; obtain a target parameter based on the first audio data, where the target parameter acts at least on the audio engine to process the second audio data; and process, by the audio engine, the audio data into target audio data based on the target parameter, the audio effect of the target audio data being different from the audio effect of the audio data.
The electronic device 600 further comprises a memory in which a computer program is stored which, when executed by the processor 602, enables the implementation of the processing method according to any of the previous embodiments.
The processor 602 can include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 602 may also include on-board memory for caching purposes. The processor 602 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the disclosure.
The memory may be volatile memory, such as random access memory (RAM); or nonvolatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state disk (SSD); or a combination of the above types of memory, and it provides instructions and data to the processor 602.
Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be combined in a variety of ways, even if such combinations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments and/or claims of the present disclosure may be combined without departing from the spirit and teachings of the present disclosure. All such combinations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. These examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (10)

1. A method of processing, comprising:
obtaining audio data, wherein the audio data is an expression of a real environment;
processing the audio data based on an audio engine to obtain first audio data and second audio data different from the first audio data;
obtaining target parameters based on the first audio data, the target parameters acting at least on the audio engine to process the second audio data; and
processing, by the audio engine, the audio data into target audio data based on the target parameters, wherein an audio effect of the target audio data is different from an audio effect of the audio data.
2. The method of claim 1, wherein the audio effect of the target audio data changes as a function of the audio data.
3. The method of claim 2, wherein the obtaining audio data comprises:
if a communication channel exists with a communication identifier, acquiring the audio data, wherein the audio data is used to represent the sound of the real environment;
the method further comprising:
transmitting the target audio data through the communication channel.
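A hedged sketch of the gating in claim 3: audio is acquired only while a communication channel exists, and the target audio data is sent back over that same channel. The Channel class and the capture/process callables are invented for illustration.

```python
from typing import Optional

class Channel:
    """Stand-in for a communication channel bound to a peer identifier."""
    def __init__(self, peer_id: str):
        self.peer_id = peer_id
    def send(self, payload: bytes) -> None:
        print(f"sending {len(payload)} bytes to {self.peer_id}")

def maybe_process_call_audio(channel: Optional[Channel], capture, process) -> None:
    if channel is None:
        return                       # no channel -> no acquisition (claim 3)
    raw = capture()                  # audio representing the real environment
    target = process(raw)            # engine pipeline from claim 1
    channel.send(target)             # transmit target audio via the channel

maybe_process_call_audio(
    Channel("peer-42"),
    capture=lambda: b"\x00" * 320,   # fake 10 ms PCM frame
    process=lambda pcm: pcm,         # identity stand-in for the engine
)
```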
4. The method according to claim 1 or 3, wherein the processing the audio data into target audio data by the audio engine based on the target parameters comprises:
the audio engine processing the second audio data based on a processing model corresponding to the target parameters to obtain second target audio data; and
the audio engine fusing the first audio data and the second target audio data to generate the target audio data.
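The following sketch illustrates claim 4's selection of a processing model corresponding to the target parameters, followed by fusion. The model registry and the toy gain "models" are assumptions, not the engine's actual models.

```python
# Sketch of claim 4: look up a processing model keyed by the target
# parameter, run it on the second audio data, then fuse with the first.
import numpy as np

MODELS = {
    "denoise": lambda x: x * 0.5,                    # toy "models": simple gains
    "enhance": lambda x: np.clip(x * 2.0, -1.0, 1.0),
}

def process(first: np.ndarray, second: np.ndarray, target_param: str) -> np.ndarray:
    model = MODELS[target_param]      # model corresponding to the target parameter
    second_target = model(second)     # second target audio data
    return first + second_target      # fusion -> target audio data

out = process(np.zeros(8), np.ones(8), "denoise")
```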
5. The method according to claim 1 or 3, wherein the processing the audio data into target audio data by the audio engine based on the target parameters comprises:
the audio engine processing the second audio data based on a first model to obtain first target audio data;
the audio engine fusing the first audio data and the first target audio data to generate the target audio data;
the audio engine switching from the first model to a second model in response to the target parameters;
the audio engine processing the second audio data based on the second model to obtain second target audio data; and
the audio engine fusing the first audio data and the second target audio data to generate the target audio data;
wherein the first model is a general-purpose processing model, the second model is a dedicated processing model, and the audio effect of the second target audio data is better than the audio effect of the first target audio data.
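A minimal sketch of the model switch in claim 5, assuming the general-purpose and dedicated models can be represented as interchangeable callables; the gains used here are placeholders.

```python
# Sketch of claim 5: start with a general-purpose model, switch to a
# dedicated one when a target parameter arrives. Both "models" are toys.
import numpy as np

general_model   = lambda x: x * 0.8          # first model: generic processing
dedicated_model = lambda x: x * 0.8 + 0.1    # second model: scenario-specific

class Engine:
    def __init__(self):
        self.model = general_model           # begin with the general model

    def step(self, first, second, target_param=None):
        if target_param is not None:
            self.model = dedicated_model     # switch in response to the parameter
        return first + self.model(second)    # fuse into target audio data

engine = Engine()
frame_a = engine.step(np.zeros(4), np.ones(4))               # general model
frame_b = engine.step(np.zeros(4), np.ones(4), "rainy_day")  # dedicated model
```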
6. The method according to claim 1 or 3, wherein the processing the audio data into target audio data by the audio engine based on the target parameters comprises:
the audio engine processing the second audio data based on a first model under a first configuration parameter to obtain first target audio data;
the audio engine fusing the first audio data and the first target audio data to generate the target audio data;
the audio engine switching from the first configuration parameter to a second configuration parameter in response to the target parameters;
the audio engine processing the second audio data based on the first model under the second configuration parameter to obtain second target audio data; and
the audio engine fusing the first audio data and the second target audio data to generate the target audio data;
wherein the processing effect of the first model under the second configuration parameter is better than the processing effect of the first model under the first configuration parameter.
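The configuration switch of claim 6 differs from claim 5 in that the model stays fixed and only its configuration changes, as the following sketch (with invented Config fields) suggests.

```python
# Sketch of claim 6: one model, two configurations; the target parameter
# only swaps the configuration. The field names are illustrative.
from dataclasses import dataclass
import numpy as np

@dataclass
class Config:
    gain: float
    floor: float

FIRST_CONFIG  = Config(gain=1.0, floor=0.00)   # cheaper, coarser processing
SECOND_CONFIG = Config(gain=1.5, floor=0.05)   # costlier, better effect

def model(second: np.ndarray, cfg: Config) -> np.ndarray:
    # Same model throughout; behavior depends only on the configuration.
    return np.maximum(second * cfg.gain, cfg.floor)

cfg = FIRST_CONFIG
out1 = model(np.ones(4), cfg)
cfg = SECOND_CONFIG                            # switch triggered by target parameter
out2 = model(np.ones(4), cfg)
```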
7. The method according to claim 1 or 3, wherein the obtaining target parameters based on the first audio data comprises:
analyzing sound attributes of the first audio data at different moments; and
determining the target parameters if the sound attributes of the first audio data change between different moments.
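One plausible reading of claim 7, sketched below with short-term energy standing in for the unspecified sound attribute; the window size and change threshold are made-up values.

```python
# Sketch of claim 7: compare a sound attribute (here, short-term energy)
# of the first audio data across moments and emit a target parameter
# only when it changes.
import numpy as np

def target_param_if_changed(first: np.ndarray, win: int = 160, thresh: float = 0.5):
    frames = first[: len(first) // win * win].reshape(-1, win)
    energy = (frames ** 2).mean(axis=1)              # attribute per moment
    if energy.size > 1 and np.abs(np.diff(energy)).max() > thresh:
        return {"reason": "attribute_change"}        # target parameter determined
    return None                                      # attribute stable -> no switch

quiet_then_loud = np.concatenate([np.zeros(320), np.ones(320) * 2.0])
print(target_param_if_changed(quiet_then_loud))
```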
8. The method of claim 1, wherein the obtaining target parameters based on the first audio data further comprises:
obtaining the target parameters based on the first audio data and an auxiliary parameter, wherein the auxiliary parameter is obtained synchronously in the process of obtaining the audio data; the target parameters are used to instruct the audio engine to process the audio data into the target audio data, the audio data being used to characterize a first scene, and the target audio data being used to characterize a second scene different from the first scene.
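A sketch of claim 8's use of an auxiliary parameter, assuming location metadata captured alongside the audio; the scene names and the mapping rule are purely illustrative.

```python
# Sketch of claim 8: combine the first audio data with an auxiliary
# parameter (here, a location tag) to choose a target scene.
import numpy as np

def target_params(first: np.ndarray, aux: dict) -> dict:
    loudness = float(np.abs(first).mean())
    # First scene: what the audio actually captured; second scene: what
    # the engine should render instead.
    if aux.get("location") == "office" and loudness > 0.1:
        return {"first_scene": "noisy_office", "second_scene": "quiet_room"}
    return {"first_scene": "unknown", "second_scene": "unchanged"}

print(target_params(np.ones(100) * 0.3, {"location": "office"}))
```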
9. A processing apparatus, comprising:
an acquisition module configured to obtain audio data, wherein the audio data is a representation of a real environment;
a first processing module configured to process the audio data based on an audio engine to obtain first audio data and second audio data different from the first audio data;
a second processing module configured to obtain target parameters based on the first audio data, the target parameters acting on at least the audio engine to process the second audio data; and
a third processing module configured to process the audio data into target audio data through the audio engine based on the target parameters, wherein an audio effect of the target audio data is different from an audio effect of the audio data.
10. An electronic device, comprising:
an audio acquisition device configured to acquire audio data, the audio data being used to represent the sound of the real environment in which the electronic device is located; and
a processor configured to obtain the audio data, wherein the audio data is a representation of the real environment; process the audio data based on an audio engine to obtain first audio data and second audio data different from the first audio data; obtain target parameters based on the first audio data, the target parameters acting on at least the audio engine to process the second audio data; and process the audio data into target audio data by the audio engine based on the target parameters, wherein an audio effect of the target audio data is different from an audio effect of the audio data.