CN118042042A - Audio data processing method and related device

Info

Publication number
CN118042042A
CN118042042A
Authority
CN
China
Prior art keywords
scene
audio data
event
processed
audio
Prior art date
Legal status
Pending
Application number
CN202211410820.9A
Other languages
Chinese (zh)
Inventor
许集润
Current Assignee
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Priority date
Filing date
Publication date
Application filed by Honor Device Co Ltd filed Critical Honor Device Co Ltd
Priority to CN202211410820.9A
Publication of CN118042042A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 1/00 Substation equipment, e.g. for use by subscribers
    • H04M 1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M 1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M 1/72448 User interfaces specially adapted for cordless or mobile telephones with means for adapting the functionality of the device according to specific conditions
    • H04M 1/72454 User interfaces specially adapted for cordless or mobile telephones with means for adapting the functionality of the device according to specific conditions according to context-related or environment-related conditions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 1/00 Substation equipment, e.g. for use by subscribers
    • H04M 1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M 1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M 1/72475 User interfaces specially adapted for cordless or mobile telephones specially adapted for disabled users
    • H04M 1/72478 User interfaces specially adapted for cordless or mobile telephones specially adapted for disabled users for hearing-impaired users


Abstract

Embodiments of this application provide an audio data processing method and a related device. The method includes: acquiring audio data to be processed and identifying a first scene, where the first scene is the scene in which the sound source generating the audio data to be processed is located; inputting the audio data to be processed into an event recognition model for event recognition to obtain a plurality of candidate events; and taking the candidate event, among the plurality of candidate events, that belongs to the first scene as the event recognition result of the audio data to be processed. The application can improve the accuracy of identifying events from audio data.

Description

Audio data processing method and related device
Technical Field
The present application relates to the field of computer technologies, and in particular, to an audio data processing method and a related device.
Background
With the continuous development of computer science and technology, electronic devices such as mobile phones play an increasingly important role in daily life. Meanwhile, hearing-impaired users cannot hear, or can barely hear, sounds such as ambient sounds and speech.
The electronic device can collect audio data in the environment, input the audio data into an event recognition model for event recognition, and then feed back the obtained event recognition result to the hearing-impaired user in a non-voice manner, making it convenient for the hearing-impaired user to learn about events occurring in the environment and improving their quality of life.
However, when event recognition is performed on audio data in the above manner, the false recognition rate is high.
Disclosure of Invention
Embodiments of this application provide an audio data processing method and a related device, which can improve the accuracy of identifying events from audio data.
In a first aspect, an embodiment of the present application provides an audio data processing method, including:
Acquiring audio data to be processed and identifying a first scene, wherein the first scene is a scene where a sound source for generating the audio data to be processed is located;
Inputting the audio data to be processed into an event recognition model to perform event recognition, so as to obtain a plurality of candidate events;
And taking the candidate event belonging to the first scene in the plurality of candidate events as an event identification result of the audio data to be processed.
In the embodiment of the present application, the audio data to be processed may be understood as audio data from which an event needs to be identified. In other schemes, after the electronic device obtains the audio data to be processed, it directly inputs the audio data into an event recognition model for event recognition to obtain the corresponding event. In this scheme, however, the electronic device determines the event corresponding to the audio data to be processed based on a first scene, that is, it identifies the first scene before determining the event corresponding to the audio data to be processed.
In the embodiment of the present application, the first scene may be understood as the key scene in the following embodiments, that is, the scene, identified by the electronic device, in which the sound source generating the audio data to be processed is located. The sound source can be understood as an object that produces sound, such as a television playing a movie, a notebook computer playing music, or a code-scanning gun beeping after a successful scan. In this embodiment, the sound source generating the audio data to be processed is a sound source whose sound, once processed, becomes the audio data to be processed.
For ease of understanding, the first scene may be, by way of example, a home scene (also called a household scene), an office scene, a subway scene, or the like. Likewise, the audio data to be processed may be audio data obtained by processing at least one of: the sound of a washing machine operating in a home scene, the alert tone of an air conditioner being turned on or off, and the alert tone output by a microwave oven after heating food.
In this embodiment, the manner of acquiring the audio data to be processed may refer to the description of step 501 in the following embodiment, and after the electronic device collects the sound signal by itself, the processing procedure of processing the sound signal may refer to the related description of fig. 6, which is not repeated herein. In this embodiment, the audio data to be processed may be understood as one or more frames of audio data, or may be understood as a piece of audio data.
In this embodiment, the order of acquiring the audio data to be processed and identifying the first scene is not limited, and for example, the electronic device may acquire the audio data to be processed first and then identify the first scene; the electronic device may also identify the first scene first and then acquire the audio data to be processed. In the second mode, the electronic device may periodically identify a scene, and take a latest scene identification result before a time of acquiring the audio data to be processed as the first scene.
It will be appreciated that in a real-world situation, the occurrence of an event may be sudden, such as a sudden opening of an air conditioner, a sudden ringing of a door bell, etc. However, the scene in which the event occurs is generally not abrupt, because the scene itself has a certain geographical range, and the movement of the user does not cause the scene to change frequently in a short time. Therefore, whether the audio data to be processed is acquired first and then the scene is identified, or whether the scene is identified first and then the audio data to be processed is acquired, the identified scene can be considered as the scene where the sound source generating the audio data to be processed is located, that is, the first scene.
In the embodiment of the present application, the event recognition model may be a neural network model, for example a convolutional neural network model, a deep neural network model, a recurrent neural network model, or the like, which is not limited in the present application. It can be understood that, after the audio data to be processed is input into the event recognition model, the model may output a plurality of candidate events and a recognition probability for each candidate event. For example, the recognition result may indicate that the event corresponding to the audio data to be processed is event A with a probability of 10%, event B with a probability of 15%, and event C with a probability of 3%.
Although the event recognition model may obtain a plurality of candidate events, in the embodiment of the present application the candidate event that belongs to the first scene is used as the event recognition result of the audio data to be processed. By restricting the range of events to the scene, events that cannot occur, or occur only with small probability, in the first scene are eliminated, thereby improving the accuracy of identifying events from audio data.
Alternatively, the candidate event may be understood as a reference event in the later embodiment, for example, the embodiment shown in fig. 8.
It will be appreciated that, in some special cases, many events may occur in one scene and different types of events may produce similar sounds; for example, in a home environment, the alert tone for turning on the air conditioner may resemble the alert tone for turning on the television. Several of the candidate events may therefore belong to the first scene; optionally, the candidate event with the highest recognition probability among them may be used as the event recognition result of the audio data to be processed.
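As an illustrative sketch only (the event names, probabilities, and scene-to-event mapping below are assumptions, not part of the disclosure), the selection described above could be expressed in Python as follows:

```python
# Sketch: among the model's candidate events, keep only those that belong to the
# identified first scene, then pick the most probable of them.
from typing import Dict, List, Optional, Set, Tuple

def pick_event(candidates: List[Tuple[str, float]],
               first_scene: str,
               scene_events: Dict[str, Set[str]]) -> Optional[str]:
    """candidates: (event, recognition probability) pairs output by the event recognition model."""
    allowed = scene_events.get(first_scene, set())
    in_scene = [(event, prob) for event, prob in candidates if event in allowed]
    if not in_scene:
        return None  # no candidate event belongs to the first scene
    # Several candidates may belong to the scene; keep the one with the highest probability.
    return max(in_scene, key=lambda item: item[1])[0]

candidates = [("code_scanning_gun", 0.35), ("door_access", 0.20), ("microwave_oven", 0.15)]
scene_events = {"home": {"microwave_oven", "washing_machine", "doorbell"}}
print(pick_event(candidates, "home", scene_events))  # -> microwave_oven
```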
With reference to the first aspect, in one possible implementation manner, the determining, as an event recognition result of the audio data to be processed, a candidate event belonging to the first scene from the plurality of candidate events includes:
Acquiring a first probability of occurrence of each candidate event in the first scene, wherein the first probability is obtained by counting the number of occurrence of each candidate event in the first scene;
And taking, as the event recognition result of the audio data to be processed, the candidate event among the plurality of candidate events for which the operation result of the first probability and the second probability is the largest, where the second probability is the recognition probability of each candidate event obtained by the event recognition model performing event recognition on the audio data to be processed.
In this embodiment, the first probability and the second probability are different from each other, where the first probability is obtained by counting the number of times each of the candidate events occurs in the first scene, and the second probability is a recognition probability of each of the candidate events obtained by performing event recognition on the audio data to be processed by the event recognition model. That is, the first probability is a probability obtained by counting a large number of scenes and events occurring in the scenes, and the second probability is a probability that the event recognition model considers the audio data to be processed as a certain event.
For example, the first probability may be obtained by counting the occurrence of events in different regions. Taking a mall scene as an example, one can count whether malls in region A have escalators, microwave ovens, code-scanning guns, and so on, thereby obtaining the probability of escalator sounds, microwave oven sounds, and code-scanning gun sounds occurring in the mall scene. Illustratively, counting the occurrence of certain events in various scenes may yield a probability distribution such as that shown in fig. 10.
Alternatively, the above-described first probability may be understood as a second reference probability in the embodiment shown in fig. 8 hereinafter, and the above-described second probability may be understood as a first reference probability in the embodiment shown in fig. 8 hereinafter.
In the embodiment of the present application, the operation between the first probability and the second probability may be determined according to the actual situation, for example, may be direct multiplication, or may be normalization after multiplication, for example, the product result of the first probability and the second probability may be understood as the third reference probability in the embodiment shown in fig. 8 hereinafter.
In this embodiment, the plurality of candidate events are reselected according to the first probability of each candidate event occurring in the first scenario, so that, among the plurality of candidate events, the candidate event belonging to the first scenario becomes an event recognition result of the audio data to be processed, thereby improving the accuracy of event recognition.
For further details of this embodiment, reference may be made to the descriptions of fig. 8, fig. 9, fig. 10 and fig. 11 below, which are not repeated here.
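A minimal sketch of the product-and-argmax operation described above, with all probability values assumed for illustration:

```python
# Sketch: multiply the model's recognition probability (second probability) by the
# counted scene prior (first probability), optionally normalize, and take the argmax.
def rerank(second_prob: dict, first_prob: dict, normalize: bool = True) -> str:
    """second_prob: recognition probability of each candidate event from the event recognition model.
    first_prob: counted probability of each candidate event occurring in the first scene."""
    scores = {event: second_prob[event] * first_prob.get(event, 0.0) for event in second_prob}
    total = sum(scores.values())
    if normalize and total > 0:          # optional normalization after multiplication
        scores = {event: s / total for event, s in scores.items()}
    return max(scores, key=scores.get)   # candidate event with the largest operation result

second_prob = {"code_scanning_gun": 0.35, "door_access": 0.20, "microwave_oven": 0.15}
first_prob = {"code_scanning_gun": 0.05, "door_access": 0.10, "microwave_oven": 0.60}
print(rerank(second_prob, first_prob))   # -> microwave_oven
```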
With reference to the first aspect, in one possible implementation manner, the event recognition model is a model obtained by acquiring audio sample data in the first scene and training the event recognition model to be trained; the step of using the candidate event belonging to the first scene as the event recognition result of the audio data to be processed, among the plurality of candidate events, includes:
And taking, as the event recognition result of the audio data to be processed, the candidate event with the largest recognition probability among the recognition probabilities obtained by the event recognition model performing event recognition on the audio data to be processed.
In this embodiment, the event recognition model for performing event recognition on the audio data to be processed is a model obtained by acquiring and training audio sample data in the first scene. Alternatively, in this embodiment, the event recognition model may be referred to as an event recognition model corresponding to the first scenario, or may be understood as an event recognition model corresponding to a key scenario in the embodiment shown in fig. 5 hereinafter.
It can be understood that after the training of the event recognition model to be trained by collecting the audio sample data from the first scene, the event recognition model obtained after training changes in model parameters relative to the event recognition model before training, so that the accuracy of event recognition of the audio data in the first scene by the event recognition model after training can be improved, that is, the accuracy of event recognition from the audio data to be processed is improved.
With reference to the first aspect, in one possible implementation manner, a time interval between a time when the audio data to be processed is collected and a time when the first scene is identified is less than or equal to a first threshold value.
It will be appreciated that the audio data to be processed may be collected by the execution body itself, or may be obtained from other devices. In the embodiment of the present application, the first threshold may be determined according to practical situations, which is not limited by the present application. Illustratively, in the case where the audio data to be processed is audio data 5 seconds long, the first threshold may be any non-zero value less than 10 seconds.
In this embodiment, the time interval between the time of collecting the audio data to be processed and the time of identifying the first scene is less than or equal to the first threshold, which prevents a mismatch between the first scene and the scene in which the sound source generating the audio data to be processed is located. It will be appreciated that if this time interval is too long, the first scene recognized by the electronic device may no longer be the scene in which the sound source is located.
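A possible sketch of such a freshness check, with the threshold value and data structure assumed for illustration:

```python
# Sketch: reuse the cached scene result only if it is recent enough relative to
# the time at which the audio data to be processed was captured.
import time
from typing import Optional

class SceneCache:
    """Keeps the most recent scene recognition result and its timestamp."""
    def __init__(self, first_threshold: float = 8.0):
        self.first_threshold = first_threshold   # seconds; assumed value below 10 s
        self.scene: Optional[str] = None
        self.timestamp: float = 0.0

    def update(self, scene: str) -> None:
        self.scene, self.timestamp = scene, time.time()

    def scene_for(self, audio_capture_time: float) -> Optional[str]:
        # Use the cached scene only if it was identified close enough to the capture time.
        if self.scene and abs(audio_capture_time - self.timestamp) <= self.first_threshold:
            return self.scene
        return None  # result too old: scene and sound source may no longer match

cache = SceneCache()
cache.update("home")
print(cache.scene_for(time.time()))  # -> home
```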
With reference to the first aspect, in one possible implementation manner, the identifying the first scene includes:
Under the condition that at least one image is acquired through a camera, inputting the at least one image into a trained scene recognition model to obtain the first scene; the trained scene recognition model is obtained by training sample images in various scenes, and the various scenes comprise the first scene.
In one possible implementation, the electronic device may perform scene recognition through at least one image captured by the camera. By way of example, the type of scene recognition model after training may be a neural network model, such as a convolutional neural network model, a deep neural network model, a recurrent neural network model, and the like, which is not limited by the present application.
It can be appreciated that in the process of training the scene recognition model to obtain the trained scene recognition model, sample images in various scenes can be collected for training. The various scenes can be daily scenes in life, such as a family scene, an office scene, a bus scene, a subway scene, a high-speed rail scene, an airport scene, a market scene, a coffee shop scene, a library scene and the like.
It can be appreciated that, in the case where at least one image is acquired by the camera, the accuracy of scene recognition based on the scene recognition model is high.
With reference to the first aspect, in one possible implementation manner, inputting the at least one image into a trained scene recognition model to obtain the first scene includes:
Inputting the at least one image into the trained scene recognition model to obtain a plurality of candidate scenes;
And taking, as the first scene, a scene among the plurality of candidate scenes that matches at least one of the positioning information of the electronic device, the network connection object of the electronic device, and the moving speed of the electronic device.
It can be appreciated that, similarly to the event recognition model, inputting the at least one image into the trained scene recognition model for scene recognition can yield a plurality of candidate scenes and their corresponding recognition probabilities. Illustratively, after the at least one image is input into the trained scene recognition model, the recognition result may give a 30% probability that the scene corresponding to the at least one image is scene A, an 18% probability that it is scene B, and an 8% probability that it is scene C.
In this embodiment, in the case where the above-described plurality of candidate scenes are obtained, the electronic device may determine the scene in combination with other data. For example, when the electronic device obtains the location information as the home address and obtains that the network connection object is home wifi, the electronic device may combine the two data, and may consider that the first scene is a home scene even if the probability of identifying the home scene in the plurality of candidate scenes is low.
For another example, suppose that after scene recognition by the trained scene recognition model, the recognition probability of the subway scene differs little from that of the high-speed rail scene. Before the train departs, the positioning information can be used to decide whether the scene is a high-speed rail scene or a subway scene. After the train starts, the moving speed of the electronic device can be obtained through the acceleration sensor; because a high-speed train runs faster than a subway train, whether the first scene is a high-speed rail scene or a subway scene can then be determined from the speed of the electronic device.
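The combination of candidate scenes with positioning, network and speed data could, for example, be sketched as a simple score-boosting rule; all thresholds and identifiers below are assumptions:

```python
# Sketch: adjust the image-based candidate-scene scores using auxiliary signals
# (location, connected wifi, movement speed) and keep the best-scoring scene.
from typing import Dict, List, Optional, Tuple

def resolve_scene(candidates: List[Tuple[str, float]],
                  location: Optional[str] = None,
                  wifi_ssid: Optional[str] = None,
                  speed_kmh: Optional[float] = None) -> str:
    scores: Dict[str, float] = dict(candidates)
    if location == "home_address" or wifi_ssid == "home_wifi":
        scores["home"] = scores.get("home", 0.0) + 0.5   # boost the home scene
    if speed_kmh is not None and {"subway", "high_speed_rail"} & scores.keys():
        # a high-speed train runs much faster than a subway train
        faster = "high_speed_rail" if speed_kmh > 120 else "subway"
        scores[faster] = scores.get(faster, 0.0) + 0.5
    return max(scores, key=scores.get)

print(resolve_scene([("office", 0.30), ("home", 0.18), ("cafe", 0.08)],
                    wifi_ssid="home_wifi"))  # -> home
```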
With reference to the first aspect, in one possible implementation manner, the method further includes:
and under the condition that the at least one image is not acquired through the camera, determining the first scene according to at least one data of the positioning information of the electronic equipment, the network connection object of the electronic equipment and the moving speed of the electronic equipment.
It will be appreciated that the camera may be blocked, so that the electronic device cannot capture images through it. Therefore, when the at least one image cannot be acquired through the camera, the first scene is determined from at least one of the positioning information of the electronic device, the network connection object of the electronic device, and the moving speed of the electronic device.
Illustratively, whether the user has arrived at a subway station, at the office, at home, at a mall, or the like can be determined from the acquired positioning information, and the first scene determined accordingly. Also, for example, if the electronic device is connected to the home wifi, the first scene may be considered a home scene; if it is connected to the company wifi, the first scene may be considered an office scene.
In a second aspect, an embodiment of the present application provides an audio data processing apparatus, including:
the acquisition unit is used for acquiring the audio data to be processed;
the identification unit is used for identifying a first scene, and the audio data to be processed are audio data generated in the first scene;
the identification unit is further used for inputting the audio data to be processed into an event identification model to carry out event identification, so as to obtain a plurality of candidate events;
And the determining unit is used for taking the candidate event belonging to the first scene from the plurality of candidate events as an event identification result of the audio data to be processed.
Optionally, in an embodiment of the present application, the step performed by the acquiring unit may be performed by a microphone or a communication module, where the communication module may be a mobile communication module or a wireless communication module; the steps performed by the identification unit and the determination unit may be performed by a processor.
In a third aspect, an embodiment of the present application provides an electronic device, where the electronic device includes a processor and a memory; the memory is coupled to the processor, the memory is for storing computer program code, the computer program code comprising computer instructions, the processor invoking the computer instructions to cause the method of the first aspect or any possible implementation of the first aspect to be performed.
In a fourth aspect, an embodiment of the present application provides a chip, including a logic circuit and an interface, where the logic circuit and the interface are coupled; the interface is for inputting and/or outputting code instructions and the logic circuitry is for executing the code instructions to cause the method of the first aspect or any possible implementation of the first aspect to be performed.
In a fifth aspect, embodiments of the present application disclose a computer program product comprising program instructions which, when executed by a processor, cause the method of the first aspect or any of the possible implementations of the first aspect to be performed.
In a sixth aspect, embodiments of the present application provide a computer readable storage medium having a computer program stored therein, which when run on a processor causes the method of the first aspect or any of the possible implementations of the first aspect to be performed.
Drawings
In order to more clearly illustrate the technical solution of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described. It is evident that the drawings in the following description are only some embodiments of the present application and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a schematic diagram of the same type of sound appearing in multiple scenes according to an embodiment of the present application;
fig. 2 is a schematic diagram of a home scenario provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of event recognition based on audio signals according to an embodiment of the present application;
FIG. 4 is a schematic diagram of event recognition for an audio signal based on a scene according to an embodiment of the present application;
fig. 5 is a flowchart of an audio data processing method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of audio feature extraction provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of event recognition in combination with audio and image provided by an embodiment of the present application;
FIG. 8 is a flowchart of another audio data processing method according to an embodiment of the present application;
FIG. 9 is a schematic diagram of the positions of a first audio data and a second audio data according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a probability distribution provided by an embodiment of the present application;
FIG. 11 is a schematic diagram of event recognition based on probability distribution according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of an electronic device 100 according to an embodiment of the present application;
Fig. 13 is a block diagram of a software architecture of an electronic device 100 according to an embodiment of the present application.
Detailed Description
The terminology used in the following embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in this disclosure refers to and encompasses any and all possible combinations of one or more of the listed items.
The terms first and second and the like in the description, in the claims and in the drawings are used for distinguishing between different objects and not for describing a particular sequential order.
A hearing-impaired user may be understood as a user with binaural hearing difficulty who cannot hear, or can barely hear, sounds such as ambient sounds and speech. It can be appreciated that hearing-impaired users often cannot effectively perceive events occurring in the environment through sound. For example, after a washing machine finishes washing clothes, it generally outputs a prompt tone to remind the user that the washing task is complete and the clothes should be dried promptly; a hearing-impaired user, however, cannot respond to the prompt tone in time and cannot promptly learn that the clothes have been washed.
In addition to hearing-impaired users being unable to effectively perceive events in the environment, in real life a user may, for a certain period of time, be unable to effectively perceive events in the environment through sound due to non-physiological factors. Illustratively, while a user wearing headphones (in particular noise-cancelling headphones) listens to music or watches video, sounds in the external environment are almost completely blocked, so that, while the headphones are worn, the user cannot effectively perceive events occurring in the environment through sound.
In the embodiments of the application, users who cannot effectively perceive events occurring in the environment through sound, whether due to physiological or non-physiological factors, may be collectively referred to as sound-impaired users. To help a sound-impaired user learn about events occurring in the external environment, the electronic device can acquire an audio signal in the environment, input the audio signal into an event recognition model for event recognition, and then feed back the obtained event recognition result to the sound-impaired user in a non-voice manner. In this way, sound-impaired users, and hearing-impaired users in particular, can conveniently learn about events occurring in the environment. Identifying the corresponding event from audio data may also be referred to as sound event detection.
In the embodiment of the present application, the event recognition model may be a convolutional neural network (convolutional neural networks, CNN), a deep neural network (deep neural networks, DNN), a recurrent neural network (recurrent neural network, RNN), or the like, which is not limited in the present application.
For example, the non-voice mode may be outputting text prompt information through popup windows, floating windows, and the like. Optionally, vibration can be matched to increase the prompting effect.
It will be appreciated that although the event recognition model can recognize the corresponding event from the input audio signal, many sounds in real life are similar, which leads to low accuracy when the electronic device recognizes events from the acquired audio signal.
For ease of understanding, referring to fig. 1, fig. 1 is a schematic diagram illustrating the appearance of the same type of sound in multiple scenes according to an embodiment of the present application.
Illustratively, assume that the electronic device receives an audio signal that is a "ding" sound. In daily life, a "ding" sound may appear in many scenes. For example, as in (a) of fig. 1, the "ding" sound may come from swiping an access card against a door access reader; as in (b) of fig. 1, it may come from a code-scanning gun scanning a courier waybill; and as in (c) of fig. 1, it may be the alert tone output after a microwave oven completes its task.
For convenience of description, the sound of swiping the access card against the door access reader is simply called the door access sound; the sound of the code-scanning gun scanning a code is simply called the code-scanning gun sound; and the sound output by the microwave oven after its task is completed is simply called the microwave oven sound.
Precisely because the same type of sound may appear in various scenes in daily life, the false recognition rate of the electronic device when recognizing events from audio signals is high. For example, the electronic device is likely to misrecognize a "ding" sound that actually came from a microwave oven as the "ding" of an access card swipe, or as the "ding" of a code-scanning gun scanning a code.
Based on the above-mentioned problems, the embodiments of the present application provide an audio data processing method and a related apparatus, where the method provided in the embodiments of the present application may be executed by an electronic device, and the electronic device may be any electronic device capable of executing the technical solution disclosed in the embodiments of the method of the present application. Alternatively, the electronic device may be any device capable of processing audio data, such as a mobile phone, a tablet computer, a wearable smart device, etc. It should be understood that the method embodiments of the present application may also be implemented by means of a processor executing computer program code. According to the application, the accuracy of the electronic equipment for identifying the event according to the audio signal can be improved.
For ease of understanding, please refer to fig. 2, 3 and 4 for exemplary purposes, wherein fig. 2 is a schematic diagram of a home scene provided by an embodiment of the present application, fig. 3 is a schematic diagram of event recognition based on an audio signal provided by an embodiment of the present application, and fig. 4 is a schematic diagram of event recognition based on a scene provided by an embodiment of the present application.
The schematic diagram shown in fig. 2 may be understood as a home scene, specifically a kitchen in a home. Illustratively, a microwave oven is placed in the kitchen, and after the microwave oven finishes heating food, it outputs a "ding" alert tone to notify the user that the current food-heating task is complete, as shown at 201 in fig. 2.
It will be appreciated that, after receiving an audio signal including a "ding" sound (hereinafter referred to as the audio signal "ding"), the electronic device 202 in fig. 2 performs event recognition on the audio signal "ding", that is, recognizes the event corresponding to it. The scheme shown in fig. 3 may be understood as another scheme, and the scheme shown in fig. 4 as the scheme provided by the embodiment of the present application.
For example, as shown in fig. 3, in other schemes, after receiving the audio signal "ding", the electronic device inputs it into the event recognition model for event recognition to obtain an event recognition result set: the electronic device considers the audio signal 35% likely to be a code-scanning gun sound, 20% likely to be a door access sound, and 15% likely to be a microwave oven sound, among others. Because the code-scanning gun sound has the highest likelihood, in other schemes the electronic device ultimately treats the audio signal "ding" as a code-scanning gun sound.
In the scheme provided by the application, as shown in fig. 4, after receiving the audio signal "ding", the electronic device inputs it into the event recognition model corresponding to the home scene to perform event recognition in the home scene, obtaining an event recognition result set. It can be understood that, before performing event recognition in the home scene, the electronic device performs scene recognition, that is, it first determines that it is located in a home scene and then performs event recognition on the audio signal using the event recognition model corresponding to the home scene.
Illustratively, in this scheme, the electronic device considers the audio signal "ding" 40% likely to be a microwave oven sound, 10% likely to be an access card sound, and 5% likely to be a code-scanning gun sound. Since the microwave oven sound has the highest likelihood, in this scheme the electronic device ultimately treats the audio signal "ding" as a microwave oven sound.
Finally, in this embodiment, as shown in fig. 2, after identifying the microwave oven sound, the electronic device 202 may output the text "the current task of the microwave oven is completed" in the form of a popup window, instead of misidentifying the audio signal as the sound of a code-scanning gun completing a scan, as in other schemes.
The above general description of the present application is provided, and the following description of the specific flow of the method provided by the embodiment of the present application is provided. Referring to fig. 5, fig. 5 is a flowchart illustrating an audio data processing method according to an embodiment of the application. As shown in fig. 5, the method includes:
501: and acquiring audio data to be processed and identifying key scenes.
In the embodiment of the present application, the audio data to be processed may be understood as audio data from which an event needs to be identified. In one possible implementation, an electronic device may include a sound collection module. And acquiring sound through a sound acquisition module to obtain the audio data to be processed. For example, the sound collection module may be one or more microphones, and in the case where the number of microphones is plural, the plural microphones may be referred to as a microphone array or array microphone.
It can be understood that the sound collected by the microphone is an audio analog signal, and the audio data to be processed can be obtained through sampling, quantization and encoding.
In another possible implementation, the electronic device may be communicatively connected to other devices, and audio data obtained from the other devices through the communication connection is used as audio data to be processed.
In this step, the key scene may be understood as a scene corresponding to the audio data to be processed identified by the electronic device. In the embodiment of the application, the electronic equipment can identify the scene in various modes. The electronic device may recognize a scene by data acquired by at least one sensor (sensor), wherein the data may be audio data acquired by a sound sensor, image data acquired by an image sensor, position data acquired by a position sensor, motion data acquired by an acceleration sensor, and the like.
In one possible implementation, the electronic device may identify the scene from the sensor data in real time, and when the audio data to be processed is acquired at time A, the latest scene determined before time A is used as the key scene. It will be appreciated that, since a scene corresponds to a certain geographical range in real-world situations, the scene is generally not prone to abrupt changes; that is, a scene determined by the electronic device remains valid for a period of time. For example, when the user goes home from work, the current scene determined by the electronic device is a home scene, and even if the user later goes out again, the home scene remains valid for a period of time. Therefore, although the moment of determining the key scene is not completely synchronized with the moment of acquiring the audio data to be processed, the audio data to be processed acquired at time A can be considered audio data obtained from sound collected in the key scene.
In another possible implementation, the electronic device may first obtain the audio data to be processed and then retrieve the sensor data to identify the scene. It will be appreciated that, similar to the foregoing description, the scene is generally not prone to abrupt changes, and thus, the scene determined after the electronic device acquires the audio data to be processed may be understood as the above-mentioned key scene.
That is, although the timing of determining the key scene is not completely synchronized with the timing of acquiring the audio data to be processed, the above-described audio data to be processed may be considered as audio data obtained from the sound acquired from the key scene.
502: Inputting the audio data to be processed into an event recognition model corresponding to the key scene to perform event recognition, so as to obtain an event recognition result of the audio data to be processed; the event recognition model corresponding to the key scene is obtained by training by collecting audio sample data in the key scene.
In the embodiment of the application, the event recognition model itself can be a neural network model, for example, can be CNN, DNN, RNN, etc., and the event recognition model corresponding to the key scene can be obtained by collecting audio sample data from the key scene and training any event recognition model.
It can be understood that after the event recognition model is trained by using the audio sample data collected in the key scene, the event recognition model obtained after training changes in model parameters relative to the event recognition model before training, so that the accuracy rate of event recognition of the event recognition model corresponding to the key scene on the audio data to be processed in the key scene can be improved.
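As a rough sketch (the per-scene model objects and their predict interface are assumptions, not a disclosed API), selecting the event recognition model that corresponds to the key scene might look like this:

```python
# Sketch: keep one event recognition model per pre-defined scene and route the
# extracted audio features to the model matching the identified key scene.
import numpy as np

class SceneEventRecognizer:
    """Routes audio features to the event recognition model trained for the key scene."""
    def __init__(self, models_by_scene: dict, default_model):
        self.models_by_scene = models_by_scene   # e.g. {"home": home_model, "mall": mall_model}
        self.default_model = default_model       # fallback when the scene has no dedicated model

    def recognize(self, key_scene: str, audio_features: np.ndarray) -> str:
        model = self.models_by_scene.get(key_scene, self.default_model)
        probs = model.predict(audio_features)    # assumed to return {event: probability}
        return max(probs, key=probs.get)
```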
For ease of understanding, the examples of fig. 3 and fig. 4 are reused here, assuming that the audio data to be processed corresponds to the "ding" sound output when the microwave oven task is completed. If the electronic device performs event recognition on the audio data to be processed using other schemes, the probability of the microwave oven sound in the obtained event recognition result is 10%, that is, the electronic device considers with only 10% probability that the audio data to be processed corresponds to the microwave oven sound.
In this scheme, the electronic device determines that the key scene is the home scene and performs event recognition based on the event recognition model corresponding to the home scene; in the obtained event recognition result, the probability of the microwave oven sound rises from 10% to 40% because the scene narrows the range of possible events. That is, in this scheme the electronic device considers with 40% probability that the audio data to be processed corresponds to the microwave oven sound. As can be seen by comparing fig. 3 and fig. 4, this scheme can improve the accuracy with which the electronic device identifies events from audio data.
In the embodiment of the present application, for example, the electronic device may first extract the audio features before acquiring the audio data to be processed and inputting the audio data to be processed into the event recognition model corresponding to the key scene for event recognition. The audio feature extraction may be understood as extracting an identifiable component from the audio signal, so as to facilitate the event recognition by a subsequent event recognition model.
For ease of understanding, referring to fig. 6, fig. 6 is a schematic diagram illustrating audio feature extraction according to an embodiment of the present application.
The electronic device collects an audio signal to be processed through a microphone, and the audio signal to be processed is an audio analog signal. Then, the audio analog signal is converted into an electric signal, and the electric signal is sampled, quantized and encoded to obtain an audio digital signal.
As shown in fig. 6, the audio digital signal is first framed to obtain an audio digital signal in units of frames. Illustratively, each frame is 85ms long, and assuming a sampling rate of 16kHz, there are 1360 (16000×0.085) sample points in a frame of the audio digital signal.
Each frame of the audio digital signal is then windowed, i.e., multiplied by a window function, forming a windowed audio digital signal. The window function may be a Hamming window, a Hanning window, a Blackman window, or the like; windowing allows the time-domain signal to better satisfy the periodicity assumption of fast Fourier transform (FFT) processing, thereby reducing spectral leakage.
Finally, mel-frequency cepstral coefficient (Mel frequency cepstrum coefficient, MFCC) features are extracted from the windowed audio digital signal, outputting an audio feature vector. Illustratively, the windowed audio digital signal may be subjected to FFT to obtain a corresponding spectrum, and then the spectrum is passed through a Mel filter bank to obtain a Mel spectrum; and finally, carrying out cepstrum analysis on the Mel frequency spectrum to obtain Mel Frequency Cepstrum Coefficient (MFCC), wherein the MFCC can be understood as audio characteristics.
Alternatively, the electronic device may perform event recognition with one or more frames as processing units, and may exemplarily use 10 frames as processing units, that is, recognize one event every 10 frames of audio data.
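A sketch of this feature-extraction pipeline using librosa (the disclosure does not mandate any particular library; the 85 ms frame length at 16 kHz matches the 1360-sample example above):

```python
# Sketch: framing, Hamming windowing, FFT, Mel filter bank and cepstral analysis
# are all handled by librosa's MFCC routine.
import librosa

def extract_mfcc(path: str, sr: int = 16000, frame_ms: int = 85, n_mfcc: int = 13):
    y, _ = librosa.load(path, sr=sr, mono=True)
    frame_len = int(sr * frame_ms / 1000)                  # 85 ms at 16 kHz -> 1360 samples
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_len, hop_length=frame_len,
                                window="hamming")          # framing, windowing, FFT, Mel filter bank, cepstrum
    return mfcc.T                                          # shape: (num_frames, n_mfcc)

features = extract_mfcc("sample.wav")
# Group every 10 frames into one processing unit, i.e. one event decision per 10 frames.
units = [features[i:i + 10] for i in range(0, len(features) - 9, 10)]
```

Setting the hop length equal to the frame length gives non-overlapping frames, consistent with the frame-by-frame description above; overlapping frames would also be possible.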
The processing shown in fig. 6 may be performed when the electronic device inputs the audio data to be processed into the event recognition model corresponding to the key scene for event recognition, or when audio sample data is acquired to train the event recognition model and obtain the trained event recognition model.
For example, a general scene in daily life, such as a home scene, an office scene, a bus scene, a subway scene, a mall scene, a parking lot scene, and the like, may be planned in advance. Then, respectively acquiring audio sample data from each scene, respectively inputting the audio sample data into an event recognition model for training, and then obtaining the event recognition model corresponding to each scene. Wherein a plurality of audio sample data may be collected in each scene.
It will be appreciated that the processing of the audio sample data may be performed as shown in fig. 6 before the audio sample data is input to the event recognition model for training, that is, framing, windowing, extracting MFCC features, and inputting the obtained audio feature vector to the event recognition model for training are sequentially performed before the audio sample data is input to the event recognition model for training.
It can be understood that different users live in different scenes. For example, office workers mainly move between the office and home, may visit entertainment venues such as malls in their leisure time, and may commute by subway, bus, private car, and the like, while hearing-impaired users may appear less often in entertainment venues such as malls. Therefore, after obtaining the user's authorization, the electronic device can further train the initial event recognition model with audio sample data collected in the scenes that appear in the user's daily life, optimizing the model parameters to further improve the accuracy of event recognition. It should be appreciated that, each time audio sample data is collected to train the model, the data may be preprocessed as shown in fig. 6.
The content related to the event recognition model in the embodiment of the present application is described above, and then the scene recognition in the embodiment of the present application is described below.
In one possible implementation, the electronic device may acquire at least one image, and image-identify the at least one image, thereby identifying the scene. Alternatively, the above approach may also be referred to as image scene recognition.
For example, image scene recognition may be performed by means of deep learning techniques. For example, the at least one image may be input into a deep learning model to implement scene recognition, where the deep learning model may be a Places-CNN, a DeCAF network model, a multi-resolution CNN (multi resolution CNN), and the application is not limited thereto. For ease of understanding, the above model for image scene recognition is referred to as an image scene recognition model in the embodiment of the present application.
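A sketch of such an image scene recognition model, here using a generic ResNet-18 backbone as a stand-in for Places-CNN or similar (the backbone choice and scene list are assumptions):

```python
# Sketch: a CNN backbone with a classification head over the pre-defined scenes.
import torch
import torch.nn as nn
from torchvision import models

SCENES = ["home", "office", "bus", "subway", "high_speed_rail",
          "airport", "mall", "coffee_shop", "library"]

backbone = models.resnet18(weights=None)                      # a trained checkpoint would be loaded in practice
backbone.fc = nn.Linear(backbone.fc.in_features, len(SCENES))
backbone.eval()

def recognize_scene(image_batch: torch.Tensor) -> dict:
    """image_batch: (N, 3, H, W) preprocessed frames; returns per-scene probabilities."""
    with torch.no_grad():
        probs = torch.softmax(backbone(image_batch), dim=1)
    return dict(zip(SCENES, probs.mean(dim=0).tolist()))
```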
Thus, in some embodiments, as shown in fig. 7, fig. 7 is a schematic diagram of event recognition by combining an image with audio according to an embodiment of the present application.
For example, the electronic device may include a camera through which at least one image is acquired. Or the electronic device may establish a communication connection with other devices, and use the image acquired through the communication connection as the at least one image. For example, if a home camera is installed in a user's home, then the user's electronic device may establish a communication connection with the home camera to receive images captured by the home camera through the communication connection if the user remains at home. Thus, alternatively, the at least one image may be at least one image in a video stream.
In this embodiment, the electronic device may input the at least one image into the image scene recognition model to perform scene recognition, so as to obtain a scene corresponding to the at least one image. For convenience of description and understanding, a scene corresponding to the at least one image is referred to as a key scene. Thus, the above-described process may be referred to as critical scene recognition.
After the key scene is obtained, as shown in fig. 7, the obtained audio data to be processed is input into an event recognition model corresponding to the key scene to perform event recognition, so as to obtain an event corresponding to the audio data to be processed. The audio data to be processed may be obtained through a microphone in the electronic device. The event recognition model corresponding to the key scene may be understood as an event recognition model obtained by training according to the audio sample data in the key scene, and specifically, reference may be made to the related description of fig. 6, which is not repeated herein.
It can be understood that the embodiment of the present application does not limit the order of acquiring the at least one image and the audio data to be processed shown in fig. 7. That is, the at least one image may be acquired first for key scene recognition, and the audio data to be processed may then be acquired for event recognition; alternatively, the audio data to be processed may be acquired first, the at least one image then acquired for key scene recognition, and event recognition performed on the audio data to be processed after the key scene is obtained.
It can be appreciated that during the process of using the electronic device, the user may not be able to acquire the at least one image for performing the key scene recognition, for example, the user is in an outdoor activity, and the camera of the electronic device is blocked. In the embodiment of the application, besides scene recognition according to the image, other sensor data can be used for recognizing the scene.
For example, the location information of the electronic device may be acquired by a location sensor, and the scene may be identified by the location information. For example, the user can reach a subway station, or reach a company, or reach home, or reach a certain market, etc., and the corresponding scene can be determined through the position information acquired by the position sensor.
It can be understood that the electronic device may periodically acquire the location information to perform scene recognition, or may acquire the location information to perform scene recognition after acquiring the audio data to be processed.
Also by way of example, scene recognition may be performed through a network connection of the wireless network sensor, wherein the network connection may be a wifi connection. For example, wifi in the home connected to the electronic device may consider the current scene as a home scene, and wifi in the company connected to the electronic device may consider the current scene as a company scene.
Scene recognition can also be performed based on the speed obtained by the acceleration sensor. For example, the speed of a user on a high-speed train differs markedly from the running speeds of daily vehicles such as buses, private cars, and subways, so when the speed obtained from the acceleration sensor falls within the typical running-speed range of a high-speed train, the current scene can be considered a high-speed rail scene.
Also by way of example, scene recognition may be performed by audio data collected by a sound sensor. For example, audio sample data under different scenes can be acquired, then scene labels are attached to the audio sample data, and then the audio sample data is input into a neural network model for training. The neural network model may be DNN, CNN, RNN, etc., which is not limited in the present application; the scene tag may be, for example, a home scene, a public transportation scene, a subway scene, a market scene, or the like. It will be appreciated that the trained neural network model may be used to scene classify the input audio data, i.e. to scene identify based on the input audio data.
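A minimal training sketch for such an audio-based scene classifier; the architecture, feature size and label set are illustrative assumptions:

```python
# Sketch: a small feed-forward network trained with scene labels on MFCC features.
import torch
import torch.nn as nn

NUM_SCENES = 9        # e.g. home, office, bus, subway, mall, ...
FEATURE_DIM = 13 * 10 # 13 MFCCs per frame x 10 frames per processing unit

scene_classifier = nn.Sequential(
    nn.Linear(FEATURE_DIM, 128), nn.ReLU(),
    nn.Linear(128, NUM_SCENES),
)
optimizer = torch.optim.Adam(scene_classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(features: torch.Tensor, scene_labels: torch.Tensor) -> float:
    """features: (batch, FEATURE_DIM); scene_labels: (batch,) integer scene ids."""
    optimizer.zero_grad()
    loss = loss_fn(scene_classifier(features), scene_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```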
In the above embodiment, the scene is first identified, and then the event identification is performed on the audio data to be processed based on the event identification model corresponding to the scene. In addition, event recognition can be performed through probability distribution between the event and the scene.
It will be appreciated that daily life contains a great variety of scenes and events: multiple events may occur in one scene, and the same event occurs with different probabilities in different scenes. By way of example, escalator sounds are highly likely in subway, mall, and airport scenes, but unlikely in home scenes (without excluding escalators installed in some residential communities). Likewise, microwave oven sounds are highly likely in a home scene or a mall scene, but unlikely in a subway scene (without excluding cases where a subway station has a convenience store, etc.).
Therefore, a probability distribution model can be obtained by counting the probabilities of different events occurring in different scenes, and event identification can be performed based on the probability distribution model. Referring to fig. 8, fig. 8 is a flowchart of another audio data processing method according to an embodiment of the present application; the method shown in fig. 8 may be performed by an electronic device and includes:
801: scenes and events are defined, one of the scenes including at least one of the events.
In this step, the various scenes and events occurring in daily life can be counted. Illustratively, the scene may be a home scene, an office scene, a bus scene, a subway scene, a high-speed rail scene, an airport scene, a mall scene, a library scene, etc., and the event may be a microwave oven sound, a washing machine sound, an induction cooker sound, an escalator sound, a code scanning gun sound, a shielding door sound, an access card sound, a whistle, etc.
It will be appreciated that an event must occur in some scene; therefore, in embodiments of the present application, one scene includes at least one event. For example, there may be a microwave oven sound and/or a washing machine sound in a home scene, an escalator sound and/or a shielding door sound in a mall scene, etc.
802: First audio sample data for classifying the scene is collected, and second audio sample data for classifying the event is collected.
803: Inputting the first audio sample data and the second audio sample data into a model to be trained for training, and obtaining a trained model; wherein the first audio sample data includes a scene tag and the second audio sample data includes an event tag.
In the embodiment of the present application, the first audio sample data may be understood as sample data which, after model training, is used for classifying scenes, that is, identifying scenes; the second audio sample data may be understood as sample data which, after model training, is used for classifying events, that is, identifying events.
It can be understood that, after the model to be trained has been trained with the first audio sample data and the second audio sample data, the trained model can be used both to identify the scene corresponding to audio data and to identify the event corresponding to audio data. For example, the audio data used for identifying the scene may be different from the audio data used for identifying the event.
In this step, the model to be trained may be a neural network model, for example, RNN, DNN or CNN, which is not limited in the present application. Optionally, before the first audio sample data and the second audio sample data are input into the model to be trained for training, the audio sample data may be processed based on the processing flow shown in fig. 7, which is not described herein.
In this step, the scene tag may be understood as a tag indicating that the audio sample data is used for scene classification, and the event tag may be understood as a tag indicating that the audio sample data is used for event classification; the present application does not limit the specific form of the tags.
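One possible realization of a single trained model that both classifies scenes and classifies events is a shared backbone with two output heads, one trained on the scene-labeled first audio sample data and one on the event-labeled second audio sample data. The sketch below is only an illustration under that assumption, not the structure required by the application.

```python
import torch
import torch.nn as nn

class SceneEventModel(nn.Module):
    """Shared audio backbone with one head for scene labels and one for event labels."""
    def __init__(self, feature_dim=64, num_scenes=3, num_events=3):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(feature_dim, 128), nn.ReLU())
        self.scene_head = nn.Linear(128, num_scenes)  # supervised by first audio sample data
        self.event_head = nn.Linear(128, num_events)  # supervised by second audio sample data

    def forward(self, x):
        h = self.backbone(x)
        return self.scene_head(h), self.event_head(h)

model = SceneEventModel()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Placeholder batches: scene-labeled clips and event-labeled clips (random features here).
scene_x, scene_y = torch.randn(32, 64), torch.randint(0, 3, (32,))
event_x, event_y = torch.randn(32, 64), torch.randint(0, 3, (32,))

for _ in range(10):
    optimizer.zero_grad()
    scene_logits, _ = model(scene_x)   # scene-labeled samples train the scene head
    _, event_logits = model(event_x)   # event-labeled samples train the event head
    loss = loss_fn(scene_logits, scene_y) + loss_fn(event_logits, event_y)
    loss.backward()
    optimizer.step()
```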
804: And acquiring audio data to be processed.
In this step, the audio data to be processed may be understood as a piece of audio data collected in a real scene, from which an event needs to be identified. The piece of audio data may include multiple frames of audio data; for other descriptions of this step, reference may be made to the foregoing step 501, which is not repeated here.
805: Identifying, based on the trained model, a key scene from first audio data in the audio data to be processed, and identifying a plurality of reference events and a first reference probability corresponding to each reference event from second audio data in the audio data to be processed; the first audio data and the second audio data are audio data at different time positions in the audio data to be processed, the time interval between the first audio data and the second audio data is smaller than a threshold A, and the first reference probability of a reference event is the probability that the trained model identifies the second audio data as that reference event.
It should be understood that the first audio data and the second audio data are audio data at different time positions in the audio data to be processed. For example, when a frame is taken as the processing unit, the first audio data may be the 5th frame of the audio data to be processed, and the second audio data may be the 6th frame of the audio data to be processed. The duration of each frame of audio data may be 5 seconds, 10 seconds, etc., which is not limited in the present application.
In this step, the order of the first audio data and the second audio data in the audio data to be processed is not limited; that is, the first audio data may precede the second audio data, or the first audio data may follow the second audio data. Illustratively, when a frame is taken as the processing unit, the first audio data may be the 8th frame of the audio data to be processed, and the second audio data may be the 7th frame of the audio data to be processed.
In this step, the time interval between the first audio data and the second audio data is smaller than a threshold A, where the threshold A may be set according to the actual situation of the device, for example, a duration shorter than 3 frame lengths. Illustratively, if the frame length of each frame is 5 seconds, the threshold A may be 10 seconds, that is, the time interval between the first audio data and the second audio data is less than 10 seconds.
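A trivial sketch of this interval check, assuming the 5-second frame length and the 10-second threshold A used in the example above:

```python
FRAME_LEN_S = 5.0     # assumed duration of one frame of audio data
THRESHOLD_A_S = 10.0  # assumed threshold A: maximum allowed gap between the two frames

def within_threshold_a(first_frame_idx: int, second_frame_idx: int) -> bool:
    """True if the time interval between the two frames is smaller than threshold A."""
    interval_s = abs(first_frame_idx - second_frame_idx) * FRAME_LEN_S
    return interval_s < THRESHOLD_A_S

print(within_threshold_a(5, 6))  # adjacent frames, 5 s apart -> True
print(within_threshold_a(8, 5))  # three frames apart, 15 s   -> False
```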
For ease of understanding, referring to fig. 9, fig. 9 is a schematic diagram illustrating positions of first audio data and second audio data according to an embodiment of the present application.
As shown in (a) of fig. 9, the first audio data and the second audio data may be two adjacent frames of audio data, where the first audio data is located before the second audio data; the first audio data is used for scene recognition and the second audio data is used for event recognition.
As shown in (b) and (c) of fig. 9, the first audio data and the second audio data may also be spaced apart by one frame length. In (b) of fig. 9, the first audio data is located before the second audio data; in (c) of fig. 9, the first audio data is located after the second audio data. Likewise, the first audio data is used for scene recognition and the second audio data is used for event recognition.
It will be appreciated that, since the time interval between the first audio data and the second audio data is small and scenes do not change abruptly in reality, the event identified from the second audio data can be considered to be an event in the scene identified from the first audio data. It should also be understood that the embodiment of the present application does not limit the time order of the first audio data and the second audio data; even if the event is identified from the second audio data before the scene is identified from the first audio data, the event identified from the second audio data can still be considered to be an event in the scene identified from the first audio data, because the time interval between them is small.
It can be understood that, when the first audio data is input into the trained model for scene recognition, the recognition result may be a plurality of candidate scenes together with the recognition probability of each candidate scene. For example, in the recognition result, the probability that the first audio data corresponds to scene A is 15%, the probability that it corresponds to scene B is 8%, and the probability that it corresponds to scene C is 16%; the above 15%, 8% and 16% can be understood as the recognition probabilities. In this step, the candidate scene with the highest recognition probability may be taken as the key scene.
Likewise, when the second audio data is input into the trained model for event recognition, the result may be a plurality of reference events and a first reference probability for each reference event. For example, in the recognition result, the probability that the second audio data corresponds to event A is 30%, the probability that it corresponds to event B is 10%, and the probability that it corresponds to event C is 18%; the above 30%, 10% and 18% can be understood as the first reference probabilities.
806: Multiplying the first reference probability of each reference event in the plurality of reference events by a second reference probability to obtain a third reference probability corresponding to each reference event in the plurality of reference events; the second reference probability is the statistically obtained probability that the reference event appears in the key scene.
It will be appreciated that the first reference probability is obtained by the trained model performing event recognition on the second audio data, whereas the second reference probability is obtained by statistics. The second reference probability can be obtained by counting the occurrence of events in different areas; for example, by counting whether the malls in area A have escalators, microwave ovens, code scanning guns, and so on, the probabilities of the escalator sound, the microwave oven sound, and the code scanning gun sound occurring in a mall scene can be obtained.
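For instance, the statistics could be as simple as the fraction of surveyed venues in which each sound source is present; the survey data in the sketch below is purely hypothetical.

```python
from collections import Counter

# Hypothetical survey: the sound sources found in each surveyed mall in one area.
surveyed_malls = [
    {"escalator sound", "code scanning gun sound"},
    {"escalator sound", "microwave oven sound"},
    {"escalator sound"},
    {"escalator sound", "code scanning gun sound", "microwave oven sound"},
]

counts = Counter()
for mall in surveyed_malls:
    counts.update(mall)

# Second reference probability P(event | mall scene): fraction of surveyed malls
# in which the sound source of the event was present.
p_event_given_mall = {event: n / len(surveyed_malls) for event, n in counts.items()}
print(p_event_given_mall["escalator sound"])       # 1.0
print(p_event_given_mall["microwave oven sound"])  # 0.5
```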
For each reference event, the result of multiplying the first reference probability corresponding to that reference event by the second reference probability corresponding to that reference event is taken as the final probability of that reference event, namely the third reference probability.
807: Taking a first reference event among the plurality of reference events as the event identification result of the second audio data, where the first reference event is the reference event with the largest corresponding third reference probability among the plurality of reference events.
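Steps 806 and 807 amount to re-weighting the model's event probabilities by the statistical probabilities for the identified key scene and taking the maximum; a minimal sketch with made-up numbers follows.

```python
def pick_event(first_ref_probs, second_ref_probs):
    """first_ref_probs: model probability per reference event (step 805).
    second_ref_probs: statistical probability of each event in the key scene.
    Returns the reference event with the largest third reference probability."""
    third_ref_probs = {
        event: p_model * second_ref_probs.get(event, 0.0)
        for event, p_model in first_ref_probs.items()
    }
    best = max(third_ref_probs, key=third_ref_probs.get)
    return best, third_ref_probs

# Illustrative numbers only.
first_ref = {"event A": 0.30, "event B": 0.10, "event C": 0.18}
second_ref = {"event A": 0.05, "event B": 0.90, "event C": 0.40}
print(pick_event(first_ref, second_ref))  # event B wins: about 0.09 > 0.072 > 0.015
```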
In order to facilitate understanding of the method shown in fig. 8, an example is described in which the scenes are a home scene, a subway scene and a bus scene; the events are a microwave oven sound, a shielding door sound and a card swiping machine sound; and the model is a DNN. The method specifically includes the following steps:
(1) Defining scenes and events
The scenes are home scenes, subway scenes and bus scenes. Can be expressed by way of example as:
scene= { home scene, subway scene, bus scene }.
The events are microwave oven sound, shielding door sound, and card swiping machine sound. Can be expressed by way of example as:
Event = { microwave oven sound, shield door sound, swipe card reader sound }.
(2) Collecting audio sample data
Collecting audio sample data for classifying home scenes, audio sample data for classifying subway scenes, and audio sample data for classifying bus scenes; and collecting audio sample data of the microwave oven sound in the home scene, audio sample data of the shielding door sound in the subway scene, and audio sample data of the card swiping machine sound in the bus scene.
(3) Model training
Labeling the audio sample data from step (2): attaching scene labels to the audio sample data for classifying home scenes, the audio sample data for classifying subway scenes, and the audio sample data for classifying bus scenes, respectively; attaching event labels to the audio sample data of the microwave oven sound in the home scene, the audio sample data of the shielding door sound in the subway scene, and the audio sample data of the card swiping machine sound in the bus scene, respectively; and then inputting the labeled data into the DNN for training to obtain the trained DNN.
(4) Establishing a probability distribution model
For ease of understanding, referring to fig. 10, fig. 10 is a schematic diagram of a probability distribution provided by an embodiment of the present application.
The horizontal direction in fig. 10 represents events and the vertical direction represents scenes; the probability of an event occurring in a scene can be represented by a transition probability, i.e., P(event|scene). Illustratively, the probability 0.7 in the second row and second column of fig. 10 indicates that the probability of the microwave oven sound occurring in the home scene is 0.7, which may be expressed as P(microwave oven sound|home scene) = 0.7. Also by way of example, the probability 1 in the third row and third column of fig. 10 indicates that the probability of the shielding door sound occurring in the subway scene is 1, which may be expressed as P(shielding door sound|subway scene) = 1.
(5) Event identification
In this embodiment, the probability of an event is P(event) = P(event|scene) × P(scene) × Pm, where P(scene) is 1, and Pm is the probability with which the model identifies the event, i.e., the first reference probability.
Thus, the final probability of an event is P(event) = P(event|scene) × Pm.
Referring to fig. 11, fig. 11 is a schematic diagram illustrating event recognition based on probability distribution according to an embodiment of the present application.
As shown in fig. 11, after the audio data to be processed is obtained, the audio data segment A in the audio data to be processed is input into the trained DNN and the recognized scene is a home scene; the audio data segment B in the audio data to be processed is input into the trained DNN and the recognized events are a shielding door sound, a microwave oven sound and a card swiping machine sound, with recognition probabilities of 20%, 10% and 15%, respectively.
It will be appreciated that, in other schemes, the audio data segment B would be considered to correspond to the shielding door sound, because the shielding door sound has the highest recognition probability from the trained DNN.
In this scheme, based on the probability distribution in fig. 10, namely that the probability of the microwave oven sound occurring in the home scene is 0.7, the probability of the shielding door sound occurring in the home scene is 0, and the probability of the card swiping machine sound occurring in the home scene is 0.1, the probability P(shielding door sound) of the audio data segment B corresponding to the shielding door sound is 0.2 × 0 = 0, the probability P(microwave oven sound) of the audio data segment B corresponding to the microwave oven sound is 0.1 × 0.7 = 0.07, and the probability P(card swiping machine sound) of the audio data segment B corresponding to the card swiping machine sound is 0.15 × 0.1 = 0.015. As shown in fig. 11, since P(microwave oven sound) is the largest, the present solution finally considers that the audio data segment B corresponds to the microwave oven sound, rather than the shielding door sound given by the trained DNN model.
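These numbers can be reproduced in a few lines; the dictionary below only contains the conditional probabilities stated in this example, and nothing else from fig. 10 is assumed.

```python
# Recognition probabilities of audio data segment B from the trained DNN (first reference probabilities).
first_ref = {
    "shielding door sound": 0.20,
    "microwave oven sound": 0.10,
    "card swiping machine sound": 0.15,
}

# P(event | home scene) as stated in this example (second reference probabilities).
p_given_home = {
    "shielding door sound": 0.0,
    "microwave oven sound": 0.7,
    "card swiping machine sound": 0.1,
}

# Third reference probabilities, rounded for display.
final = {event: round(p * p_given_home[event], 3) for event, p in first_ref.items()}
print(final)                      # {'shielding door sound': 0.0, 'microwave oven sound': 0.07, 'card swiping machine sound': 0.015}
print(max(final, key=final.get))  # microwave oven sound
```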
It should be noted that, in the embodiment of the present application, the numbers before the steps should be understood as identifiers of the steps; they facilitate describing the scheme and improve readability for the reader, and should not be understood as limiting the execution order of the steps.
The method provided by the embodiment of the application is introduced above, and the electronic equipment related to the embodiment of the application is introduced next.
Referring to fig. 12, fig. 12 is a schematic structural diagram of an electronic device 100 according to an embodiment of the application.
The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (universal serial bus, USB) interface 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a camera 193, a display 194, and a subscriber identity module (subscriber identification module, SIM) card interface 195, etc. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an acceleration sensor 180C, a fingerprint sensor 180D, a temperature sensor 180E, a touch sensor 180F, an ambient light sensor 180G, and the like.
It should be understood that the structure illustrated in the embodiments of the present application does not constitute a specific limitation on the electronic device 100. In other embodiments of the application, electronic device 100 may include more or fewer components than shown in FIG. 12, or may combine certain components, or split certain components, or a different arrangement of components. The components shown in fig. 12 may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units, such as: the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a memory, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural network processor (NPU), etc. Wherein the different processing units may be separate devices or may be integrated in one or more processors.
The controller may be a neural hub and a command center of the electronic device 100, among others. The controller can generate operation control signals according to the instruction operation codes and the time sequence signals to finish the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that the processor 110 has just used or recycled. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby improving the efficiency of the system.
In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an integrated circuit (inter-integrated circuit, I2C) interface, an integrated circuit built-in audio (inter-integrated circuit sound, I2S) interface, a pulse code modulation (pulse code modulation, PCM) interface, a universal asynchronous receiver transmitter (universal asynchronous receiver/transmitter, UART) interface, a mobile industry processor interface (mobile industry processor interface, MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (subscriber identity module, SIM) interface, and/or a universal serial bus (universal serial bus, USB) interface, among others.
The I2C interface is a bi-directional synchronous serial bus comprising a serial data line (serial data line, SDA) and a serial clock line (serial clock line, SCL). In some embodiments, the processor 110 may contain multiple sets of I2C buses. The processor 110 may be coupled to the touch sensor 180F, the charger, the flash, the camera 193, etc., respectively, through different I2C bus interfaces. For example: the processor 110 may be coupled to the touch sensor 180F through an I2C interface, such that the processor 110 communicates with the touch sensor 180F through an I2C bus interface to implement the touch function of the electronic device 100.
The I2S interface may be used for audio communication. In some embodiments, the processor 110 may contain multiple sets of I2S buses. The processor 110 may be coupled to the audio module 170 via an I2S bus to enable communication between the processor 110 and the audio module 170. In some embodiments, the audio module 170 may transmit an audio signal to the wireless communication module 160 through the I2S interface, to implement a function of answering a call through the bluetooth headset.
PCM interfaces may also be used for audio communication to sample, quantize and encode analog signals. In some embodiments, the audio module 170 and the wireless communication module 160 may be coupled through a PCM bus interface. In some embodiments, the audio module 170 may also transmit audio signals to the wireless communication module 160 through the PCM interface to implement a function of answering a call through the bluetooth headset. Both the I2S interface and the PCM interface may be used for audio communication.
The UART interface is a universal serial data bus for asynchronous communications. The bus may be a bi-directional communication bus. It converts the data to be transmitted between serial communication and parallel communication. In some embodiments, a UART interface is typically used to connect the processor 110 with the wireless communication module 160. For example: the processor 110 communicates with a bluetooth module in the wireless communication module 160 through a UART interface to implement a bluetooth function. In some embodiments, the audio module 170 may transmit an audio signal to the wireless communication module 160 through a UART interface, to implement a function of playing music through a bluetooth headset.
The MIPI interface may be used to connect the processor 110 to peripheral devices such as a display 194, a camera 193, and the like. The MIPI interfaces include camera serial interfaces (camera serial interface, CSI), display serial interfaces (display serial interface, DSI), and the like. In some embodiments, processor 110 and camera 193 communicate through a CSI interface to implement the photographing functions of electronic device 100. The processor 110 and the display 194 communicate via a DSI interface to implement the display functionality of the electronic device 100.
The GPIO interface may be configured by software. The GPIO interface may be configured as a control signal or as a data signal. In some embodiments, a GPIO interface may be used to connect the processor 110 with the camera 193, the display 194, the wireless communication module 160, the audio module 170, the sensor module 180, and the like. The GPIO interface may also be configured as an I2C interface, an I2S interface, a UART interface, an MIPI interface, etc.
The USB interface 130 is an interface conforming to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, or the like. The USB interface 130 may be used to connect a charger to charge the electronic device 100, and may also be used to transfer data between the electronic device 100 and a peripheral device. And can also be used for connecting with a headset, and playing audio through the headset. The interface may also be used to connect other electronic devices, such as AR devices, etc.
It should be understood that the interfacing relationship between the modules illustrated in the embodiments of the present application is only illustrative, and is not meant to limit the structure of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also employ different interfacing manners in the above embodiments, or a combination of multiple interfacing manners.
The charge management module 140 is configured to receive a charge input from a charger. The charger can be a wireless charger or a wired charger. In some wired charging embodiments, the charge management module 140 may receive a charging input of a wired charger through the USB interface 130. In some wireless charging embodiments, the charge management module 140 may receive wireless charging input through a wireless charging coil of the electronic device 100. The charging management module 140 may also supply power to the electronic device through the power management module 141 while charging the battery 142.
The power management module 141 is used for connecting the battery 142, and the charge management module 140 and the processor 110. The power management module 141 receives input from the battery 142 and/or the charge management module 140 and provides power to the processor 110, the internal memory 121, the external memory, the display 194, the camera 193, the wireless communication module 160, and the like. The power management module 141 may also be configured to monitor battery capacity, battery cycle number, battery health (leakage, impedance) and other parameters. In other embodiments, the power management module 141 may also be provided in the processor 110. In other embodiments, the power management module 141 and the charge management module 140 may be disposed in the same device.
The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the electronic device 100 may be used to cover a single or multiple communication bands. Different antennas may also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed into a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
The mobile communication module 150 may provide a solution for wireless communication including 2G/3G/4G/5G, etc., applied to the electronic device 100. The mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (low noise amplifier, LNA), etc. The mobile communication module 150 may receive electromagnetic waves from the antenna 1, perform processes such as filtering, amplifying, and the like on the received electromagnetic waves, and transmit the processed electromagnetic waves to the modem processor for demodulation. The mobile communication module 150 can amplify the signal modulated by the modem processor, and convert the signal into electromagnetic waves through the antenna 1 to radiate. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be disposed in the processor 110. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be provided in the same device as at least some of the modules of the processor 110.
The modem processor may include a modulator and a demodulator. The modulator is used for modulating the low-frequency baseband signal to be transmitted into a medium-high frequency signal. The demodulator is used for demodulating the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then transmits the demodulated low frequency baseband signal to the baseband processor for processing. The low frequency baseband signal is processed by the baseband processor and then transferred to the application processor. The application processor outputs sound signals through an audio device (not limited to the speaker 170A, the receiver 170B, etc.), or displays pictures or video through the display screen 194. In some embodiments, the modem processor may be a stand-alone device. In other embodiments, the modem processor may be provided in the same device as the mobile communication module 150 or other functional module, independent of the processor 110.
The wireless communication module 160 may provide solutions for wireless communication including wireless local area network (wireless local area networks, WLAN) (e.g., wireless fidelity (wireless fidelity, Wi-Fi) network), bluetooth (BT), global navigation satellite system (global navigation satellite system, GNSS), frequency modulation (frequency modulation, FM), near field communication (near field communication, NFC), infrared (IR), etc., applied to the electronic device 100. The wireless communication module 160 may be one or more devices that integrate at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, modulates the electromagnetic wave signals, filters the electromagnetic wave signals, and transmits the processed signals to the processor 110. The wireless communication module 160 may also receive a signal to be transmitted from the processor 110, frequency modulate it, amplify it, and convert it to electromagnetic waves for radiation via the antenna 2.
In some embodiments, antenna 1 and mobile communication module 150 of electronic device 100 are coupled, and antenna 2 and wireless communication module 160 are coupled, such that electronic device 100 may communicate with a network and other devices through wireless communication techniques. The wireless communication techniques can include the global system for mobile communications (global system for mobile communications, GSM), general packet radio service (general packet radio service, GPRS), code division multiple access (code division multiple access, CDMA), wideband code division multiple access (wideband code division multiple access, WCDMA), time division code division multiple access (time-division code division multiple access, TD-SCDMA), long term evolution (long term evolution, LTE), BT, GNSS, WLAN, NFC, FM, and/or IR techniques, among others. The GNSS may include a global satellite positioning system (global positioning system, GPS), a global navigation satellite system (global navigation satellite system, GLONASS), a beidou satellite navigation system (beidou navigation satellite system, BDS), a quasi zenith satellite system (quasi-zenith satellite system, QZSS) and/or a satellite based augmentation system (satellite based augmentation systems, SBAS).
The electronic device 100 implements display functions through a GPU, a display screen 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 194 is used to display pictures, videos, and the like. The display 194 includes a display panel. The display panel may employ a liquid crystal display (liquid crystal display, LCD), an organic light-emitting diode (organic light-emitting diode, OLED), an active-matrix organic light-emitting diode (active-matrix organic light emitting diode, AMOLED), a flexible light-emitting diode (flexible light-emitting diode, FLED), a Mini LED, a Micro LED, a Micro OLED, a quantum dot light-emitting diode (quantum dot light emitting diode, QLED), or the like. In some embodiments, the electronic device 100 may include 1 or N display screens 194, N being a positive integer greater than 1.
In some embodiments, the display 194 may display the event recognition result of the audio data to be processed, for example, may be a text prompt output in the form of a box.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to enable expansion of the memory capabilities of the electronic device 100. The external memory card communicates with the processor 110 through an external memory interface 120 to implement data storage functions. For example, files such as music, video, etc. are stored in an external memory card.
The internal memory 121 may be used to store computer executable program code including instructions. The processor 110 executes various functional applications of the electronic device 100 and data processing by executing instructions stored in the internal memory 121. The internal memory 121 may include a storage program area and a storage data area. The storage program area may store an application program (such as a sound playing function, a picture or video playing function, etc.) required for at least one function of the operating system. The storage data area may store data created during use of the electronic device 100 (e.g., audio data, phonebook, etc.), and so on. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (universal flash storage, UFS), and the like.
In the embodiment of the present application, the internal memory 121 may include the first storage unit and the second storage unit, and the first storage unit may be referred to as a buffer.
The electronic device 100 may implement audio functions through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, an application processor, and the like. Such as music playing, recording, etc.
The audio module 170 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. In some embodiments, the audio module 170 may be disposed in the processor 110, or a portion of the functional modules of the audio module 170 may be disposed in the processor 110.
The speaker 170A, also called a horn, is used to convert an audio electrical signal into a sound signal.
A receiver 170B, also called an earpiece, is used to convert the audio electrical signal into a sound signal. When electronic device 100 is answering a telephone call or voice message, voice may be received by placing receiver 170B in close proximity to the human ear.
The microphone 170C, also called a "mike" or a "mic", is used to convert sound signals into electrical signals. When making a call or sending voice information, the user can speak with the mouth close to the microphone 170C to input a sound signal into the microphone 170C. The electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device 100 may also be provided with three, four, or more microphones 170C to implement sound signal collection, noise reduction, sound source identification, directional recording functions, etc.
In some embodiments, the sound sensor may be the microphone 170C, and the electronic device may collect the audio data to be processed through the microphone 170C, for example.
The earphone interface 170D is used to connect a wired earphone. The earphone interface 170D may be a USB interface 130 or a 3.5mm open mobile electronic device platform (open mobile terminal platform, OMTP) standard interface.
The pressure sensor 180A is used to sense a pressure signal and may convert the pressure signal into an electrical signal. In some embodiments, the pressure sensor 180A may be disposed on the display screen 194. There are many types of pressure sensors 180A, such as resistive pressure sensors, inductive pressure sensors, and capacitive pressure sensors. A capacitive pressure sensor may comprise at least two parallel plates with conductive material; the capacitance between the electrodes changes when a force is applied to the pressure sensor 180A.
The electronic device 100 may determine the strength of the pressure based on the change in capacitance. Illustratively, when a touch operation is applied to the display 194, the electronic apparatus 100 detects the touch operation intensity from the pressure sensor 180A. Also for example, the electronic device 100 may also calculate the position of the touch according to the detection signal of the pressure sensor 180A. In some embodiments, touch operations that act on the same touch location, but at different touch operation strengths, may correspond to different operation instructions. For example: and executing an instruction for checking the short message when the touch operation with the touch operation intensity smaller than the first pressure threshold acts on the short message application icon. And executing an instruction for newly creating the short message when the touch operation with the touch operation intensity being greater than or equal to the first pressure threshold acts on the short message application icon.
The gyro sensor 180B may be used to determine a motion gesture of the electronic device 100. In some embodiments, the angular velocity of electronic device 100 about three axes (i.e., x, y, and z axes) may be determined by gyro sensor 180B. The gyro sensor 180B may be used for photographing anti-shake. For example, when the shutter is pressed, the gyro sensor 180B detects the shake angle of the electronic device 100, calculates the distance to be compensated by the lens module according to the angle, and makes the lens counteract the shake of the electronic device 100 through the reverse motion, so as to realize anti-shake. The gyro sensor 180B may also be used for navigating, somatosensory game scenes.
The acceleration sensor 180C may detect the magnitude of acceleration of the electronic device 100 in various directions (typically three axes). The magnitude and direction of gravity may be detected when the electronic device 100 is stationary. The acceleration sensor can also be used to recognize the posture of the electronic device, and is applied to applications such as switching between landscape and portrait screens and pedometers.
The ambient light sensor 180G is used to sense the ambient light level. The electronic device 100 may adaptively adjust the brightness of the display 194 based on the perceived ambient light level. The ambient light sensor 180G may also be used to automatically adjust the white balance when taking a photograph, and may cooperate with a proximity light sensor to detect whether the electronic device 100 is in a pocket to prevent false touches.
The fingerprint sensor 180D is used to collect a fingerprint. The electronic device 100 may utilize the collected fingerprint feature to unlock the fingerprint, access the application lock, photograph the fingerprint, answer the incoming call, etc.
The temperature sensor 180E is used to detect temperature. In some embodiments, the electronic device 100 performs a temperature processing strategy using the temperature detected by the temperature sensor 180E. For example, when the temperature reported by temperature sensor 180E exceeds a threshold, electronic device 100 performs a reduction in the performance of a processor located in the vicinity of temperature sensor 180E in order to reduce power consumption to implement thermal protection. In other embodiments, when the temperature is below another threshold, the electronic device 100 heats the battery 142 to avoid the low temperature causing the electronic device 100 to be abnormally shut down. In other embodiments, when the temperature is below a further threshold, the electronic device 100 performs boosting of the output voltage of the battery 142 to avoid abnormal shutdown caused by low temperatures.
The touch sensor 180F is also referred to as a "touch panel". The touch sensor 180F may be disposed on the display 194, and the touch sensor 180F and the display 194 form a touch screen, which is also referred to as a "touch screen". The touch sensor 180F is used to detect a touch operation acting thereon or thereabout. The touch sensor 180F may communicate the detected touch operation to the application processor to determine the touch event type. Visual output related to touch operations may be provided through the display 194. In other embodiments, the touch sensor 180F may also be disposed on the surface of the electronic device 100 at a different location than the display 194.
The SIM card interface 195 is used to connect a SIM card. The SIM card may be inserted into the SIM card interface 195, or removed from the SIM card interface 195, to achieve contact with and separation from the electronic device 100. The electronic device 100 may support 1 or N SIM card interfaces, N being a positive integer greater than 1. The SIM card interface 195 may support Nano SIM cards, Micro SIM cards, and the like. The same SIM card interface 195 may be used to insert multiple cards simultaneously; the types of the multiple cards may be the same or different. The SIM card interface 195 may also be compatible with different types of SIM cards, and may also be compatible with external memory cards. The electronic device 100 interacts with the network through the SIM card to realize functions such as calls and data communication. In some embodiments, the electronic device 100 employs an eSIM, i.e., an embedded SIM card. The eSIM card can be embedded in the electronic device 100 and cannot be separated from the electronic device 100.
In some embodiments, the mobile communication module 150 or the wireless communication module 160 may receive audio data sent by other electronic devices, and the processor 110 may invoke the computer instructions stored in the internal memory 121 to use the audio data sent by the other electronic devices as the audio data to be processed.
In other embodiments, the processor 110 may invoke computer instructions stored in the internal memory 121 to implement the audio data processing method provided in the embodiments of the present application.
Illustratively, the processor 110 may invoke computer instructions stored in the internal memory 121 to acquire at least one image captured by the camera 193, acquire acceleration data detected by the acceleration sensor, acquire the network connection condition (such as the wifi connection condition) of the wireless communication module 160, acquire positioning information determined based on the mobile communication module 150, and so on, perform scene recognition, and obtain a scene recognition result.
Also for example, the processor 110 may invoke computer instructions stored in the internal memory 121, acquire audio data to be processed through the microphone 170C, or the mobile communication module 150, or the wireless communication module 160, and then obtain event recognition results based on the scene recognition results.
It is appreciated that the software system of the electronic device 100 may employ a layered architecture, an event driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. The embodiment of the application takes an android system with a layered architecture as an example, and illustrates a software structure of the electronic device 100.
Referring to fig. 13, fig. 13 is a block diagram illustrating a software structure of an electronic device 100 according to an embodiment of the application.
The layered architecture divides the software into several layers, each with a clear role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system may be divided into four layers, which are, from top to bottom, an application layer, an application framework layer, a system runtime layer, and a kernel layer. These layers are described as follows:
First, the application layer may include a series of application packages. By way of example, the application packages at the application layer may include applications such as camera, gallery, calendar, call, map, navigation, browser, Bluetooth, music, video, and short messages.
Second, the application framework layer may provide an application programming interface (application programming interface, API) and programming framework for applications in the application layer. The application framework layer may include some predefined functions.
Illustratively, the application framework layers may include an activity manager (activity manager), a window manager (window manager), a content provider (content provider), a view system (view system), a phone manager (telephony manager), a resource manager (resource manager), a notification manager (notification manager), and so on. Wherein:
The activity manager may be used to manage the lifecycle of each application, as well as the usual navigation rollback function.
The window manager may be used to manage window programs. Illustratively, the window manager may obtain the display screen size of the electronic device 100, lock the screen, intercept the screen, determine if a status bar exists, and so forth.
The content provider may be used to store and retrieve data and make the data accessible to applications so that data may be accessed or shared between different applications. By way of example, the data may include video, images, audio, calls made and received, browsing history and bookmarks, and phonebooks, etc.
The view system includes visual controls, such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, a display interface including a text message notification icon may include a view displaying text and a view displaying a picture.
The phone manager is used to provide communication functions of the electronic device 100, such as management of call status (including making a call, hanging up a phone, etc.).
The resource manager provides various resources for the application program, such as localization strings, icons, pictures, layout files, video files, and the like.
The notification manager allows the application to display notification information in a status bar, can be used to communicate notification type messages, can automatically disappear after a short dwell, and does not require user interaction. Illustratively, a notification manager may be used to inform that the download is complete, a message reminder, and so forth. The notification manager may also be a notification in the form of a chart or scroll bar text that appears on the system top status bar, such as a notification of a background running application, or a notification that appears on the screen in the form of a dialog window. For example, a text message is prompted in a status bar, a prompt tone is emitted, the electronic device vibrates, and an indicator light blinks, etc.
Further, the system runtime layer may include a system library and an android runtime (Android runtime). Wherein:
The android runtime includes a core library and virtual machines, and is responsible for scheduling and managing the android system. The core library comprises two parts: one part is the functions that the java language needs to call, and the other part is the core library of android. The application layer and the application framework layer run in the virtual machine. The virtual machine executes the java files of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, security and exception management, and garbage collection.
A system library can be understood as the support of the application framework; it is an important bridge connecting the application framework layer and the kernel layer. The system library may include a plurality of functional modules, for example, a surface manager (surface manager), a media library (media library), a three-dimensional graphics processing library (e.g., OpenGL ES), a two-dimensional graphics engine (e.g., SGL), and the like. Wherein:
The surface manager may be used to manage the display subsystem; for example, when the electronic device 100 executes multiple applications, it is responsible for managing the interaction between display and access operations. The surface manager may also be used to provide fusion of two-dimensional and three-dimensional layers for multiple applications.
The media library may support playback and recording of a variety of commonly used audio and video formats, as well as still image files, and the like. The media library may support a variety of audio and video encoding formats, such as MPEG-4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
The three-dimensional graphic processing library is used for realizing three-dimensional graphic drawing, image rendering, synthesis, layer processing and the like.
A two-dimensional graphics engine may be understood as a drawing engine for two-dimensional drawing.
Finally, the kernel layer may be understood as an abstraction layer between hardware and software. The kernel layer may include security, memory management, process management, power management, network protocol management, and drive management. Illustratively, the kernel layer may include a display driver, a camera driver, an audio driver, a sensor driver, and the like.
In some embodiments, the application layer may further include an audio data processing module, where the audio data processing module is configured to implement the audio data processing method provided by the embodiment of the present application.
It will be appreciated that in other embodiments, the audio data processing module may also be at other levels of the hierarchical architecture, such as a system level, etc., which is not limited herein.
The present application also provides a computer readable storage medium having computer code stored therein, which when run on a computer causes the computer to perform the method of the above embodiments.
The application also provides a computer program product comprising computer code or a computer program which, when run on a computer, causes the method in the above embodiments to be performed.
As used in the above embodiments, the term "when …" may be interpreted to mean "if …" or "after …" or "in response to determination …" or "in response to detection …" depending on the context. Similarly, the phrase "at the time of determination …" or "if detected (a stated condition or event)" may be interpreted to mean "if determined …" or "in response to determination …" or "at the time of detection (a stated condition or event)" or "in response to detection (a stated condition or event)" depending on the context.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, produces a flow or function in accordance with embodiments of the present application, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital subscriber line), or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk), etc.
Those of ordinary skill in the art will appreciate that all or part of the flows of the above-described method embodiments may be implemented by a computer program instructing related hardware. The program may be stored in a computer-readable storage medium, and when executed, may include the flows of the above-described method embodiments. The aforementioned storage medium includes: a ROM, a random access memory (RAM), a magnetic disk, an optical disk, or the like.
It should be further understood that the foregoing describes only specific embodiments of the present application, and the protection scope of the present application is not limited thereto. Any person skilled in the art can readily conceive of changes or substitutions within the technical scope disclosed by the present application, and such changes or substitutions shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of audio data processing, the method comprising:
Acquiring audio data to be processed and identifying a first scene, wherein the first scene is a scene where a sound source for generating the audio data to be processed is located;
Inputting the audio data to be processed into an event recognition model for event recognition to obtain a plurality of candidate events;
And taking the candidate event belonging to the first scene in the plurality of candidate events as an event identification result of the audio data to be processed.
2. The method according to claim 1, wherein said taking the candidate event belonging to the first scene among the plurality of candidate events as the event identification result of the audio data to be processed includes:
Acquiring a first probability of each candidate event occurring in the first scene, wherein the first probability is obtained by counting the number of times each candidate event occurs in the first scene;
and taking the candidate event with the largest operation result of the first probability and the second probability of the plurality of candidate events as an event recognition result of the audio data to be processed, wherein the second probability is the recognition probability of each candidate event obtained by carrying out event recognition on the audio data to be processed by the event recognition model.
3. The method according to claim 1, wherein the event recognition model is a model obtained by training an event recognition model to be trained by collecting audio sample data in the first scene; the step of using the candidate event belonging to the first scene as the event identification result of the audio data to be processed in the plurality of candidate events includes:
and taking the candidate event with the largest recognition probability of each candidate event, which is obtained by carrying out event recognition on the audio data to be processed by the event recognition model, as an event recognition result of the audio data to be processed.
4. A method according to any of claims 1-3, characterized in that the time interval between the instant at which the audio data to be processed is acquired and the instant at which the first scene is identified is less than or equal to a first threshold value.
5. The method of any of claims 1-4, wherein the identifying the first scene comprises:
Under the condition that at least one image is acquired through a camera, inputting the at least one image into a trained scene recognition model to obtain the first scene; the trained scene recognition model is obtained by training sample images in various scenes, and the various scenes comprise the first scene.
6. The method of claim 5, wherein inputting the at least one image into a trained scene recognition model to obtain the first scene comprises:
inputting the at least one image into the trained scene recognition model to obtain a plurality of candidate scenes;
and taking a scene which is matched with at least one data of the positioning information of the electronic equipment, the network connection object of the electronic equipment and the moving speed of the electronic equipment as the first scene.
7. The method according to claim 5 or 6, characterized in that the method further comprises:
And under the condition that the at least one image is not acquired through the camera, determining the first scene according to at least one data of the positioning information of the electronic equipment, the network connection object of the electronic equipment and the moving speed of the electronic equipment.
8. An electronic device, comprising a processor and a memory, wherein the memory is configured to store a computer program comprising program instructions, and the processor is configured to invoke the program instructions to cause the method according to any one of claims 1-7 to be performed.
9. A chip, comprising logic circuitry and an interface coupled to the logic circuitry, wherein the interface is configured to input and/or output code instructions, and the logic circuitry is configured to execute the code instructions to cause the method according to any one of claims 1-7 to be performed.
10. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the method according to any one of claims 1-7 to be performed.
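The scene-conditioned selection in claims 1-3 can be pictured with a short sketch. This is a minimal illustration under stated assumptions, not the patent's implementation: the model object, its predict_proba() call, the scene_event_counts table, and the choice of multiplying the two probabilities are all hypothetical.

```python
# Minimal sketch of the scene-conditioned selection in claims 1-3.
# All names (event_model, predict_proba, scene_event_counts) are hypothetical,
# not taken from the patent text.
from typing import Dict, List, Tuple

import numpy as np


def recognize_event(
    audio_features: np.ndarray,
    event_model,                                     # any classifier exposing predict_proba()
    event_labels: List[str],
    first_scene: str,
    scene_event_counts: Dict[str, Dict[str, int]],   # how often each event was seen per scene
) -> Tuple[str, float]:
    # Second probability: the recognition probability of each candidate event
    # produced by the event recognition model for the audio data to be processed.
    second_prob = np.asarray(event_model.predict_proba(audio_features)).ravel()

    # First probability: the empirical frequency of each candidate event in the
    # first scene, derived by counting past occurrences of the event in that scene.
    counts = scene_event_counts.get(first_scene, {})
    total = sum(counts.values()) or 1
    first_prob = np.array([counts.get(label, 0) / total for label in event_labels])

    # The claim only requires an operation on the two probabilities; a product is
    # one plausible choice. The candidate event with the largest result wins.
    combined = first_prob * second_prob
    best = int(np.argmax(combined))
    return event_labels[best], float(combined[best])
```

A candidate event that the model scores highly but that has never been observed in the first scene gets a first probability of zero and is suppressed, which is the filtering effect described in claim 1.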
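Claims 5-7 describe how the first scene itself could be identified: from camera images when they are available, cross-checked against device signals, and from those signals alone otherwise. The sketch below is likewise an assumption-laden illustration; the scene model, the signal fields, and the matching rules are invented placeholders.

```python
# Minimal sketch of the scene identification in claims 5-7. The scene model,
# the device signals, and the matching rules are invented placeholders.
from dataclasses import dataclass
from typing import List, Optional

import numpy as np


@dataclass
class DeviceSignals:
    location_hint: Optional[str] = None    # e.g. a place type derived from positioning information
    network_hint: Optional[str] = None     # e.g. the name of the connected Wi-Fi network
    speed_mps: Optional[float] = None      # moving speed of the electronic device


def identify_first_scene(
    camera_frames: Optional[List[np.ndarray]],
    scene_model,                            # image classifier returning ranked scene labels
    signals: DeviceSignals,
) -> Optional[str]:
    def matches(scene: str) -> bool:
        # Hypothetical rules for matching a candidate scene against device signals.
        if signals.location_hint and signals.location_hint == scene:
            return True
        if signals.network_hint and scene == "office" and "corp" in signals.network_hint.lower():
            return True
        if signals.speed_mps is not None and signals.speed_mps > 5.0 and scene in ("car", "bus", "subway"):
            return True
        return False

    if camera_frames:
        # Claims 5 and 6: run the trained scene recognition model on the images,
        # then keep the candidate scene that also matches at least one device signal.
        candidates = list(scene_model.predict(camera_frames))
        for scene in candidates:
            if matches(scene):
                return scene
        return candidates[0] if candidates else None

    # Claim 7: no images were acquired, so fall back to the device signals alone.
    for scene in ("car", "bus", "subway", "office"):
        if matches(scene):
            return scene
    return signals.location_hint
```

Claim 4's constraint sits outside this function: the images or signals should be captured within a first threshold of the audio capture, so that the identified scene actually describes where the sound source was when the audio data to be processed was produced.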
CN202211410820.9A 2022-11-11 2022-11-11 Audio data processing method and related device Pending CN118042042A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211410820.9A (published as CN118042042A) 2022-11-11 2022-11-11 Audio data processing method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211410820.9A (published as CN118042042A) 2022-11-11 2022-11-11 Audio data processing method and related device

Publications (1)

Publication Number Publication Date
CN118042042A (en) 2024-05-14

Family

ID=90997332

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211410820.9A (published as CN118042042A, status pending) 2022-11-11 2022-11-11 Audio data processing method and related device

Country Status (1)

Country Link
CN (1) CN118042042A (en)

Similar Documents

Publication Publication Date Title
CN113163470B (en) Method for identifying specific position on specific route and electronic equipment
CN110138959B (en) Method for displaying prompt of human-computer interaction instruction and electronic equipment
CN113794801B (en) Method and device for processing geo-fence
WO2021088393A1 (en) Pose determination method, apparatus and system
CN111222836B (en) Arrival reminding method and related device
CN111881315A (en) Image information input method, electronic device, and computer-readable storage medium
CN115016869A (en) Frame rate adjusting method, terminal equipment and frame rate adjusting system
CN113472861B (en) File transmission method and electronic equipment
CN116052648B (en) Training method, using method and training system of voice recognition model
CN114822543A (en) Lip language identification method, sample labeling method, model training method, device, equipment and storage medium
CN113837984A (en) Playback abnormality detection method, electronic device, and computer-readable storage medium
CN115914461B (en) Position relation identification method and electronic equipment
CN115641867B (en) Voice processing method and terminal equipment
CN116048831B (en) Target signal processing method and electronic equipment
WO2022007757A1 (en) Cross-device voiceprint registration method, electronic device and storage medium
CN114828098B (en) Data transmission method and electronic equipment
CN116527266A (en) Data aggregation method and related equipment
CN118042042A (en) Audio data processing method and related device
CN117133311B (en) Audio scene recognition method and electronic equipment
CN115695636B (en) Intelligent voice interaction method and electronic equipment
WO2023221895A1 (en) Target information processing method and apparatus, and electronic device
WO2023124829A1 (en) Collaborative voice input method, electronic device, and computer-readable storage medium
CN117369621A (en) Voice assistant display method and related equipment
CN118550657A (en) Method and device for solving perceived service conflict
CN115695636A (en) Intelligent voice interaction method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination