CN117334207A - Sound processing method and electronic equipment - Google Patents

Sound processing method and electronic equipment

Info

Publication number
CN117334207A
Authority
CN
China
Prior art keywords
target
audio data
speakers
gain
speaker
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210727150.7A
Other languages
Chinese (zh)
Inventor
徐波
张超
马晓慧
余平
张丽梅
冯素梅
陈鹏
周秀敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202210727150.7A
Priority to PCT/CN2023/099912 (WO2023246563A1)
Publication of CN117334207A
Legal status: Pending


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003 Changing voice quality, e.g. pitch or formants
    • G10L 21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 5/00 Stereophonic arrangements
    • H04R 5/04 Circuit arrangements, e.g. for selective connection of amplifier inputs/outputs to loudspeakers, for loudspeaker detection, or for adaptation of settings to personal preferences or hearing impairments
    • H04S STEREOPHONIC SYSTEMS
    • H04S 3/00 Systems employing more than two channels, e.g. quadraphonic
    • H04S 3/002 Non-adaptive circuits, e.g. manually adjustable or static, for enhancing the sound image or the spatial distribution
    • H04S 7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S 7/30 Control circuits for electronic adaptation of the sound field
    • H04S 7/302 Electronic adaptation of stereophonic sound system to listener position or orientation

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Stereophonic System (AREA)

Abstract

A sound processing method comprising: acquiring target parameters, where the target parameters include environment information associated with a target device and/or state information of a user; processing original audio data according to the target parameters to obtain target audio data, where the target audio data match the environment information and/or the state information; and outputting the target audio data. Because the original audio data are processed according to the target parameters, audio data to be played that are adapted to the current environment or the current state of the user can be constructed, so that the played audio blends with that environment or state and the user experience is improved.

Description

Sound processing method and electronic equipment
Technical Field
The present disclosure relates to the field of terminal technologies, and in particular, to a sound processing method and an electronic device.
Background
At present, electronic devices with an audio playback function, such as mobile phones and smart speakers, have become part of everyday life. With this type of device, a user can play the audio data he or she wants anytime and anywhere. For example, a user may play favorite music at home through a speaker, may navigate or play music in a vehicle using a mobile phone, or may do so using an in-vehicle terminal installed in the vehicle. At present, however, an electronic device can only play audio data exactly as it is, without adapting it to the situation, so the user experience is poor.
Disclosure of Invention
The present application provides a sound processing method, an electronic device, a computer storage medium, and a computer program product, which can construct audio data to be played that are adapted to the current environment or the current state of the user, so that the played audio blends with that environment or state and the user experience is improved.
In a first aspect, the present application provides a sound processing method, which may include: acquiring environment information associated with a target device, where the environment information includes environment data of the area in which the target device is located; determining, according to the environment data, N sound objects associated with the environment data, where N ≥ 1; acquiring white noise corresponding to each sound object to obtain N pieces of audio data, where each piece of audio data is associated with one sound object; synthesizing the N pieces of audio data to obtain target audio data, where the target audio data match the environment information; and outputting the target audio data. Because the N sound objects are associated with the environment data of the area in which the target device is located, the target audio data obtained from the white noise corresponding to the N sound objects also match that environment data, so that when listening to the target audio data the user feels immersed in the physical environment, which improves the user experience.
In some embodiments, the method may be applied in the scenario described below in fig. 1. In this case, the target device may be a vehicle or an electronic device in the vehicle. The target device may be a device integrated in the vehicle, such as an in-vehicle terminal, or may be a device separate from the vehicle, such as a driver's mobile phone, for example. In addition, the environmental data may include one or more of environmental images, environmental sounds, weather information, or season information, etc.
In some embodiments, the N sound objects may be sound objects identified based on the environmental data, or may be sound objects obtained by filtering sound objects identified based on the environmental data, for example, sound objects obtained by removing some of the sound objects, or sound objects obtained by adding some new sound objects, etc.
In one possible implementation manner, acquiring the white noise corresponding to each sound object to obtain N pieces of audio data specifically includes: querying an atom database based on the N sound objects to obtain the N pieces of audio data, where the atom database is configured with audio data of each single object over a specific period of time. By way of example, audio data of a plurality of objects in the atom database may be combined randomly or according to a preset rule to obtain audio data of a certain duration. Illustratively, the atom database may include audio data of flowing water, audio data of cicada sounds, audio data of vegetation, and the like. For example, the white-noise audio data in the atom database may be configured in the vehicle in advance, or acquired from a server in real time.
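As an illustration of this kind of atom-database lookup, the sketch below assumes a simple in-memory mapping from sound objects to short clips and loops each clip to the requested duration; the names (ATOM_DB, build_ambient_track) and the random-noise stand-in clips are assumptions for the example, not details from the patent.

```python
import numpy as np

SAMPLE_RATE = 48_000

# Hypothetical atom database: each entry is a short mono clip for one single
# object over a specific period of time (random noise stands in for real clips).
ATOM_DB = {
    "water_flow": 0.10 * np.random.randn(5 * SAMPLE_RATE),
    "cicada":     0.05 * np.random.randn(5 * SAMPLE_RATE),
    "vegetation": 0.08 * np.random.randn(5 * SAMPLE_RATE),
}

def build_ambient_track(sound_objects, duration_s):
    """Query the atom database for each sound object and mix the clips,
    looping each clip until the requested duration is reached."""
    total = int(duration_s * SAMPLE_RATE)
    mix = np.zeros(total)
    for obj in sound_objects:
        clip = ATOM_DB.get(obj)
        if clip is None:
            continue                         # no stored white noise for this object
        reps = int(np.ceil(total / len(clip)))
        mix += np.tile(clip, reps)[:total]
    return mix

target_audio = build_ambient_track(["water_flow", "cicada"], duration_s=10)
```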
In one possible implementation, the environment data include environmental sounds. Acquiring the white noise corresponding to each sound object to obtain N pieces of audio data specifically includes: extracting audio data of M sound objects from the environmental sounds to obtain M pieces of audio data, where 0 ≤ M ≤ N; and when M < N, querying an atom database based on the remaining sound objects among the N sound objects to obtain (N−M) pieces of audio data, where the atom database is configured with audio data of each single object over a specific period of time. For example, when the audio data of a sound object extracted from the environmental sounds does not meet the requirements, that audio data may be discarded and the audio data corresponding to that sound object obtained from the atom database instead, thereby improving the quality of the target audio data obtained later. Some policies may also be preset, such as excluding all environmental sounds, excluding part of the environmental sounds, excluding no environmental sounds, or retaining the extracted audio data of a sound object only when its amplitude is greater than a preset value. When all environmental sounds are excluded, M = 0; when part of the environmental sounds are excluded, 0 < M ≤ N; when no environmental sound is excluded, M = N.
In one possible implementation, after the M pieces of audio data are obtained, the method further includes: adjusting the gain of the channel included in each of the M pieces of audio data to a target value. This improves the loudness of the audio data and the like, so that the environmental sounds are restored more faithfully and the user experience is improved.
In one possible implementation, each audio data expresses the same emotion as the environment data expresses. Therefore, the target audio data is further matched with the environment information, and the user experience is improved.
In a second aspect, the present application provides a sound processing method, which may include: acquiring environment information associated with a target device, where the environment information includes first audio data and second audio data that need to be played simultaneously in the environment in which the target device is located and that are played through the same device, the first audio data being audio data played continuously during a first time period and the second audio data being audio data played sporadically during the first time period; acquiring the second audio data to be played; extracting, according to the second audio data, third audio data to be played from the first audio data, and performing target processing on the third audio data to obtain fourth audio data, where the playing time periods corresponding to the second audio data and the fourth audio data are the same and the target processing includes human-voice elimination or human-voice reduction; determining, according to the second audio data, a first gain by which the second audio data need to be adjusted, and adjusting the gain of each channel in the second audio data based on the first gain to obtain fifth audio data; determining, according to the fourth audio data or the fifth audio data, a second gain by which the fourth audio data need to be adjusted, and adjusting the gain of each channel in the fourth audio data based on the second gain to obtain sixth audio data; obtaining target audio data based on the fifth audio data and the sixth audio data, where the target audio data match the environment information; and outputting the target audio data.
In this way, by performing human-voice elimination or reduction on the continuously played audio data and then broadcasting the sporadically played audio data together with the processed continuous audio data, the user can clearly perceive the information contained in the sporadically played audio data while still clearly perceiving the melody, background sounds, and the like of the other audio data, which better satisfies the user's listening needs and improves the user experience. For example, the continuously played audio data (i.e., the first audio data) may be some type of music, and the sporadically played audio data (i.e., the second audio data) may be navigation audio data that needs to be announced during navigation. By way of example, human-voice elimination may be understood as removing the human voice from audio data, and human-voice reduction may be understood as attenuating the human voice in audio data.
In some embodiments, the method may be applied in the scenario described below in fig. 4. In this case, the target device may be a vehicle or an electronic device in the vehicle. The target device may be a device integrated in the vehicle, such as an in-vehicle terminal, or may be a device separate from the vehicle, such as a driver's mobile phone, for example.
In some embodiments, the method may be, but is not limited to being applied to a first device, which may be a device that plays first audio data and second audio data.
In one possible implementation, the second audio data are the first data, or the fourth audio data are the first data; determining, according to the first data, the gain by which the first data need to be adjusted specifically includes: acquiring audio features of the first data, where the audio features include one or more of time-domain features, frequency-domain features, or music-theory features; and determining, according to the audio features, the gain by which the first data need to be adjusted. For example, the audio features may be processed based on a preset gain calculation formula to obtain the gain to be adjusted.
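Since the preset gain calculation formula itself is not spelled out here, the sketch below only illustrates the general idea with an assumed target-level rule based on a time-domain loudness feature (RMS level); the function names and the -20 dBFS target are hypothetical.

```python
import numpy as np

def loudness_db(x):
    """Approximate loudness of a signal as its RMS level in dBFS."""
    rms = np.sqrt(np.mean(np.square(x)) + 1e-12)
    return 20.0 * np.log10(rms)

def gain_to_adjust(first_data, target_db=-20.0):
    """Gain (in dB) that would bring the first data's loudness to target_db."""
    return target_db - loudness_db(first_data)

signal = 0.05 * np.random.randn(48_000)   # stand-in for the first data
print(gain_to_adjust(signal))
```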
In some embodiments, when the first data is second audio data, the audio features may be, but are not limited to, temporal features such as loudness, envelope energy, or short-time energy, etc. The loudness may be the loudness at various times in the second audio data, or the maximum loudness, etc.
In some embodiments, when the first data is fourth audio data, the audio features may be, but are not limited to, time domain features (such as loudness, envelope energy, or short-time energy, etc.), frequency domain features (such as spectral energy of multiple frequency bands, etc.), music theory features (such as beat, mode, chord, pitch, timbre, melody, emotion, etc.).
In one possible implementation manner, determining the second gain to be adjusted for the fourth audio data according to the fifth audio data specifically includes: obtaining a maximum loudness value of the fifth audio data; and determining the second gain according to the maximum loudness value of the fifth audio data and a first proportion, wherein the first proportion is the proportion between the maximum loudness value of the second audio data and the maximum loudness value of the fourth audio data.
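One plausible reading of this rule, used purely for illustration, is sketched below: the "first proportion" is the ratio of the second audio data's maximum loudness to the fourth audio data's, and the second gain scales the fourth audio data so that its peak follows the fifth audio data's peak according to that proportion. The exact formula is not given in the text, so the linear scaling here is an assumption.

```python
import numpy as np

def max_loudness(x):
    """Peak absolute amplitude used as a simple maximum-loudness value."""
    return float(np.max(np.abs(x))) + 1e-12

def second_gain(second, fourth, fifth):
    first_proportion = max_loudness(second) / max_loudness(fourth)
    # Assumed rule: scale the fourth audio data so that its peak tracks the
    # fifth audio data's peak according to the original proportion.
    target_peak = max_loudness(fifth) * first_proportion
    return target_peak / max_loudness(fourth)

second = 0.8 * np.random.randn(48_000)   # announcement (second audio data)
fourth = 0.3 * np.random.randn(48_000)   # processed background (fourth audio data)
fifth = 0.5 * second                     # announcement after the first gain
print(second_gain(second, fourth, fifth))
```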
In one possible implementation, after determining the second gain, the method further includes: modifying the second gain based on the first gain, so that the sound produced when the fifth audio data are subsequently played is easier to perceive. Illustratively, the second gain is modified based on a preset linear relationship between the first gain and the second gain.
In one possible implementation, after determining the second gain, the method further includes: determining that the second gain is greater than a preset gain value; and updating the second gain to the preset gain value. For example, when the second gain is greater than the preset gain value, this indicates that the sound produced by playing the fourth audio data is already quiet and has little influence on the sound produced by playing the fifth audio data obtained later, so the determined second gain can be updated to the preset gain value.
In one possible implementation, adjusting the gain of each channel in the fourth audio data based on the second gain specifically includes: after playback of the fourth audio data starts, within a first duration lasting a first preset time from the moment playback starts, gradually adjusting the gain of each channel in the fourth audio data to the second gain in first preset steps; and within a second duration lasting a second preset time from the moment playback of the fourth audio data ends, gradually adjusting the gain of each channel in the fourth audio data from the second gain back to a preset gain value in second preset steps. This avoids abrupt volume changes, so the volume perceived by the user changes gradually and the user experience is improved.
In one possible implementation, adjusting the gain of each channel in the fourth audio data based on the second gain specifically includes: before playback of the fourth audio data starts, within a first duration lasting a first preset time up to the moment playback starts, gradually adjusting the gain of each channel in the fourth audio data to the second gain in first preset steps; and within a second duration lasting a second preset time from the moment playback of the fourth audio data ends, gradually adjusting the gain of each channel in the fourth audio data from the second gain back to a preset gain value in second preset steps. This avoids abrupt volume changes, so the volume perceived by the user changes gradually and the user experience is improved.
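A minimal sketch of such a stepped gain ramp follows; the frame counts, step shape, and the example -12 dB ducking value are illustrative assumptions.

```python
import numpy as np

def gain_ramp(n_frames, second_gain_db, preset_db=0.0,
              ramp_in_frames=50, ramp_out_frames=50):
    """Per-frame gain curve (in dB): ramp to the second gain over the first
    duration, hold it, then ramp back to the preset gain value over the
    second duration at the end of playback."""
    g = np.full(n_frames, second_gain_db)
    g[:ramp_in_frames] = np.linspace(preset_db, second_gain_db, ramp_in_frames)
    g[-ramp_out_frames:] = np.linspace(second_gain_db, preset_db, ramp_out_frames)
    return g

curve = gain_ramp(500, second_gain_db=-12.0)   # e.g. duck the background by 12 dB
```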
In a third aspect, the present application provides a sound processing method, which may include: a first device acquires a first message sent by a second device, where the first message is sent when the second device needs to broadcast audio data; in response to the first message, the first device performs target processing on the audio data it is about to play and plays the audio data after the target processing, where the target processing is used to eliminate or reduce a target sound in the audio data; the first device acquires a second message sent by the second device, where the second message is sent when the second device finishes broadcasting its audio data; and in response to the second message, the first device stops the target processing of the audio data it is about to play and plays audio data that have not undergone the target processing.
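A toy sketch of the first device's handling of these two messages follows; the message names, class, and the vocal-processing placeholder are hypothetical and only illustrate the control flow.

```python
class FirstDevice:
    """Continuously playing device that reacts to the second device's messages."""

    def __init__(self):
        self.target_processing_enabled = False

    def on_message(self, msg):
        if msg == "BROADCAST_START":        # first message: announcement begins
            self.target_processing_enabled = True
        elif msg == "BROADCAST_END":        # second message: announcement finished
            self.target_processing_enabled = False

    def next_frame(self, frame):
        # Eliminate or reduce the target sound only while the second device is
        # broadcasting; otherwise play the frame unchanged.
        return reduce_target_sound(frame) if self.target_processing_enabled else frame

def reduce_target_sound(frame):
    # Placeholder for an actual human-voice elimination/reduction algorithm.
    return frame

device = FirstDevice()
device.on_message("BROADCAST_START")
processed = device.next_frame([0.0, 0.1, -0.1])
```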
In this way, while the device that sporadically broadcasts audio data is broadcasting, interference from the audio data played by the device that continuously plays audio data can be reduced, so that the user can clearly perceive the sporadically broadcast audio data. For example, the sporadically broadcast audio data may be call audio, and the continuously played audio data may be some type of music.
In some embodiments, the method may be applied in a home scene, where the second device may be a mobile phone, and the first device may be a smart speaker, a smart television, or the like. In this scenario, the first device may be playing music, a television play, a movie, or the like, and the audio data to be broadcasted by the second device may be audio data to be broadcasted by the second device when the user uses the second device to make a call. In addition, the method can also be applied to driving scenes, and at the moment, the second equipment can be a mobile phone, and the first equipment can be a vehicle-mounted terminal. In this scenario, the first device may be playing music, and the audio data to be played by the second device may be audio data to be played by the second device when the user uses the second device to navigate or talk.
In one possible implementation, the target process includes a human voice cancellation process or a human voice reduction process.
In a fourth aspect, the present application provides a sound processing method, which may include: acquiring environment information associated with a target device, where the environment information includes a target position of the target device in a target space in which at least one speaker is configured; determining the distances between the target device and N speakers to obtain N first distances, where N is a positive integer and the N speakers are in the same space as the target device; constructing a target virtual speaker group according to the N first distances and the N speakers, where the target virtual speaker group consists of M target virtual speakers located on a circle centered on the position of the target device with a target distance among the N first distances as its radius, M equals the number of speakers required to construct spatial surround sound, the arrangement of the M target virtual speakers is the same as the arrangement of the speakers required to construct spatial surround sound, and each target virtual speaker is obtained by adjusting the gain of the audio signal corresponding to at least one of the N speakers; adjusting the gain of each channel in the original audio data according to the gains by which the audio signals of the speakers, among the N speakers, associated with the target virtual speakers need to be adjusted, to obtain target audio data, where the target audio data match the environment information; and outputting the target audio data. In this way, by positioning the target electronic device in the space and adjusting the gains of the audio signals output by the speakers in the space, the user can enjoy spatial surround sound anytime and anywhere. The arrangement of speakers required to construct spatial surround sound may be, for example, the arrangement required by a 5.1.X or 7.1.X layout. In some embodiments, the method may be applied in the scenarios described below in fig. 9 or 10. The target device may be the electronic device 100 in fig. 10.
In some embodiments, the audio data may include, but are not limited to, the audio signals that the corresponding speakers are required to play. For example, each audio signal included in one piece of audio data may correspond to one channel. In one possible implementation, the target distance is the smallest of the N first distances. In this way, the speakers can be virtualized onto the region closest to the target device, improving the spatial surround sound effect.
In one possible implementation manner, the constructing a target virtual speaker group according to the N first distances and the N speakers specifically includes: determining gains to be adjusted for audio signals corresponding to all speakers except the target speaker in the N speakers by taking the target distance as a reference to construct a first virtual speaker group, wherein the first virtual speaker group is a combination of speakers obtained by virtualizing the N speakers to a circle with the target device as a center and with the target distance as a radius, and the target speaker is a speaker corresponding to the target distance; and determining a target virtual speaker group according to the first virtual speaker group and the arrangement mode of speakers required for constructing the space surround sound, wherein a center speaker in the target virtual speaker group is positioned in a preset angle range in the current direction of target equipment.
For example, the target distance and the distances between the speakers except the target speaker and the target device may be processed based on a preset gain calculation model with reference to the target distance, so as to obtain gains to be adjusted for the audio signals corresponding to the speakers except the target speaker, thereby constructing the first virtual speaker group. Next, a target virtual speaker group may be determined from the first virtual speaker group based on the arrangement of speakers required to construct the spatial surround sound. When a certain virtual speaker in the target virtual speaker group is not in the first virtual speaker group, the virtual speaker in the first virtual speaker group can be processed through the VBAP algorithm to construct a virtual speaker in the target virtual speaker group. The manner of determining the target virtual speaker group may be described below with reference to fig. 11.
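As a concrete illustration of the VBAP step mentioned above, the sketch below solves the standard two-speaker panning equations in 2-D: a virtual speaker in direction p is reproduced with gains g1, g2 such that p = g1·l1 + g2·l2, where l1 and l2 are the unit direction vectors of two existing (real or virtual) speakers. The speaker angles are illustrative.

```python
import numpy as np

def vbap_2d(target_deg, spk1_deg, spk2_deg):
    """Gains for two speakers that place a virtual source at target_deg."""
    p = np.array([np.cos(np.radians(target_deg)), np.sin(np.radians(target_deg))])
    # Columns of L are the unit direction vectors of the two speakers.
    L = np.column_stack([
        [np.cos(np.radians(spk1_deg)), np.sin(np.radians(spk1_deg))],
        [np.cos(np.radians(spk2_deg)), np.sin(np.radians(spk2_deg))],
    ])
    g = np.linalg.solve(L, p)          # solve p = g1*l1 + g2*l2
    return g / np.linalg.norm(g)       # normalise to keep loudness roughly constant

print(vbap_2d(15, -30, 30))            # virtual speaker at 15° from a ±30° pair
```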
In one possible implementation, constructing the target virtual speaker group according to the N first distances and the N speakers specifically includes: constructing a first virtual speaker group according to the N speakers, the N first distances, the arrangement of speakers required to construct spatial surround sound, the orientation of the target device, and the position of the target device, where the first virtual speaker group includes M first virtual speakers and each first virtual speaker is obtained by adjusting the gain of the audio signal corresponding to at least one of the N speakers; determining the second distances between the target device and each of the first virtual speakers to obtain M second distances; and virtualizing all M first virtual speakers onto a circle centered on the position of the target device with one of the second distances as its radius, to obtain the target virtual speaker group. That is, a certain number of virtual speakers (i.e., the number of speakers required to construct spatial surround sound) may be determined first, and these virtual speakers may then be virtualized onto the same circle to obtain the target virtual speaker group. The manner of determining the target virtual speaker group may be as described below with reference to fig. 17.
In one possible implementation, before determining the distances between the target device and the N speakers, the method further includes: according to the speakers configured in the space where the target device is located, the orientation of the target device, the position where the target device is located, and the arrangement mode of the speakers required for constructing the space surround sound, N speakers are screened out from the speakers configured in the space where the target device is located, and are used for constructing the space surround sound. That is, the real speakers required to construct the spatial surround sound may be screened out first, and then the virtual speakers required may be constructed from the real speakers. The manner of determining the target virtual speaker group may be described below with reference to fig. 19.
In one possible implementation, the method further includes: determining a distance between the target device and each speaker in the target space; determining delay time of each loudspeaker in the target space when playing the audio data according to the distance between the target device and each loudspeaker in the target space; each speaker in the target space is controlled to play the audio data according to the corresponding delay time. Therefore, synchronous playing of all the loudspeakers is controlled, and user experience is improved.
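A short sketch of this delay alignment, assuming sound travels at roughly 343 m/s: speakers nearer to the target device are delayed so that all wavefronts arrive together with the sound from the farthest speaker. The helper name and example distances are illustrative.

```python
SPEED_OF_SOUND = 343.0   # m/s, approximate

def playback_delays(distances_m):
    """Delay (in seconds) for each speaker so that playback arrives aligned."""
    farthest = max(distances_m)
    return [(farthest - d) / SPEED_OF_SOUND for d in distances_m]

print(playback_delays([1.2, 2.5, 3.4]))   # nearer speakers wait longer
```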
In a fifth aspect, the present application provides a sound processing method, which may include: acquiring environment information associated with target equipment, wherein the environment information comprises a target position of a picture generated by the target equipment in a target space, and at least one loudspeaker is configured in the target space; constructing a virtual space matched with the target space according to the target position, wherein the volume of the virtual space is smaller than that of the target space; constructing a target virtual speaker group in a virtual space according to the positions of the speakers in the target space, wherein the target virtual speaker group comprises at least one target virtual speaker, and each target virtual speaker is obtained by adjusting the gain of an audio signal corresponding to one speaker in the target space; according to the gain required to be adjusted of the audio signal corresponding to the loudspeaker associated with the target virtual loudspeaker in the target space, the gain of each channel in the original audio data is adjusted to obtain target audio data, wherein the target audio data is matched with the environment information; outputting the target audio data.
In this way, a virtual speaker group is constructed at the target position in combination with the target position of the picture generated by the target equipment in space, and the audio data in the target equipment is controlled to be played by the virtual speaker group, so that the picture played by the target equipment is synchronous with the audio data, and the listening experience and the viewing consistency experience of the user are improved. In some embodiments, the method may be applied in the scenario described below in fig. 20. The target device may be the electronic device 100 in fig. 20. At this time, the original audio data may be audio data played by the user using the target device.
In one possible implementation manner, a target virtual speaker group is constructed in a virtual space according to the positions of the speakers in the target space, and specifically includes: determining the position of each target virtual speaker in the target virtual speaker group in the virtual space according to the proportion between the virtual space and the target space; according to the distances between each target virtual speaker and the target speakers corresponding to each target virtual speaker, determining the gains required to be adjusted for the audio signals corresponding to each target speaker so as to obtain a target virtual speaker group, wherein the target speakers are speakers in a target space.
In one possible implementation, the method further includes: determining the distance between a picture generated by the target device and each loudspeaker in the target space; determining delay time of each loudspeaker in the target space when playing the audio data according to the distance between the picture generated by the target device and each loudspeaker in the target space; each speaker in the target space is controlled to play the audio data according to the corresponding delay time. Therefore, synchronous playing of all the loudspeakers is controlled, and user experience is improved.
Further, the method may further include: selecting one of the determined distances between the picture generated by the target device and each speaker in the target space as a reference distance; and determining, according to the reference distance, the moment at which the picture generated by the target device appears. This improves sound-picture synchronization. The reference distance may be, for example, the largest of the determined distances between the picture generated by the target device and the speakers in the target space. For example, the delay of the picture relative to the sound produced by the speaker corresponding to the reference distance can be determined based on the reference distance and the propagation speed of sound; the target device is then controlled to display the corresponding picture once that delay has elapsed after the speaker corresponding to the reference distance starts playing the corresponding audio data. For example, if the determined delay is 3 s and the speaker corresponding to the reference distance plays the corresponding audio data at time t, the picture generated by the target device appears at time (t + 3).
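A tiny sketch of the picture-timing computation in the example above, assuming sound propagates at about 343 m/s; the function name is hypothetical.

```python
SPEED_OF_SOUND = 343.0   # m/s, approximate

def picture_display_time(speaker_start_time_s, reference_distance_m):
    """Moment at which the picture should appear, given when the reference
    speaker starts playing and how far away it is."""
    return speaker_start_time_s + reference_distance_m / SPEED_OF_SOUND

print(picture_display_time(0.0, 3.4))   # picture delayed by about 10 ms
```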
In a sixth aspect, the present application provides a sound processing method, which may include: acquiring state information of a user associated with target equipment, wherein the state information of the user comprises a target distance between the target equipment and a head of the target user, and a target position of the head of the target user in a target space, wherein at least one loudspeaker is configured in the target space; constructing a target virtual speaker group according to the target distance, the target position and the positions of the speakers in the target space, wherein the target virtual speaker group comprises at least one target virtual speaker, each target virtual speaker is obtained by adjusting the gain of an audio signal corresponding to one speaker in the target space, and each target virtual speaker is positioned on a circle taking the target position as a circle center and taking the target distance as a radius; according to the gain required to be adjusted of the audio signal corresponding to the loudspeaker associated with the target virtual loudspeaker in the target space, the gain of each channel in the original audio data is adjusted to obtain target audio data, wherein the target audio data is matched with the state of the user; outputting the target audio data. In this way, a virtual speaker group is constructed around the target user by combining the target distance between the target device and the head of the target user, the target position of the head of the target user in the target space and the like, and the audio data in the target device is controlled to be played by the virtual speaker group, so that the picture played by the target device is synchronous with the audio data, and the listening experience and the viewing consistency experience of the user are improved. In some embodiments, the method may be applied in the scenario described below in fig. 24. The target device may be the electronic device 100 in fig. 24. At this time, the original audio data may be audio data played by the user using the target device.
In one possible implementation, after constructing the target virtual speaker group according to the target distance, the target position, and the positions of the speakers in the target space, the method further includes: according to the target virtual speaker group, a first virtual speaker group is constructed, the first virtual speaker group is composed of M virtual speakers, the M virtual speakers are located on a circle with a target position as a center and a target distance as a radius, the value of M is equal to the number of speakers required for constructing space surround sound, the arrangement mode of the M virtual speakers is the same as the arrangement mode of the speakers required for constructing space surround sound, and each virtual speaker in the M virtual speakers is obtained by adjusting the gain of an audio signal corresponding to at least one speaker in the target space.
At this time, according to the gain to be adjusted for the audio signal corresponding to the speaker associated with the target virtual speaker in the target space, the gain of each channel in the original audio data is adjusted to obtain the target audio data, which specifically includes: and adjusting the gain of each channel in the original audio data according to the gain required to be adjusted in the target space and corresponding to the audio signals of the speakers associated with the M virtual speakers, so as to obtain target audio data. Therefore, virtual speakers required by playing the space surround sound are constructed, and target audio data can be played through the virtual speakers, so that a user can hear the space surround sound, and user experience is improved.
In one possible implementation manner, the target virtual speaker group includes S virtual speakers, where the S virtual speakers are speakers required for building space surround sound, and each virtual speaker in the S virtual speakers is obtained by adjusting a gain of an audio signal corresponding to at least one speaker in the N speakers; determining the distance between the target position and each virtual speaker in the S virtual speakers to obtain S distances; and (3) virtualizing the S virtual speakers to a circle with the target position as the center and one of the S distances as the radius to obtain a required virtual speaker group, and adjusting the original audio data based on gains required to be adjusted for audio signals corresponding to each real speaker determined in the process of constructing the required virtual speaker group to obtain target audio data. That is, a certain number of virtual speakers (i.e., the number of speakers required for constructing a space surround sound) may be determined first, and then the virtual speakers are virtualized to the same circle to obtain a required virtual speaker group; finally, the original audio data can be adjusted based on the gains to be adjusted for the audio signals corresponding to the real speakers determined in the process of constructing the required virtual speaker group, so as to obtain the target audio data.
In one possible implementation, the method may further include: according to the target distance, the target position, the positions of the speakers in the target space and the arrangement mode of the speakers required by building space surround sound, N speakers are screened out from speakers configured in the space where the target equipment is located, and the N speakers are used for building the space surround sound; and according to the N loudspeakers, determining a required virtual loudspeaker group, and adjusting the original audio data based on gains required to be adjusted for audio signals corresponding to each real loudspeaker determined in the process of constructing the required virtual loudspeaker group so as to obtain target audio data. That is, the real speakers required for constructing the spatial surround sound can be screened out first, and then the virtual speakers required can be constructed from the real speakers; finally, the original audio data can be adjusted based on gains to be adjusted for the audio signals corresponding to the N real speakers determined in the process of constructing the required virtual speaker group, so as to obtain the target audio data.
In a seventh aspect, the present application provides a sound processing method, which may include: acquiring environment information associated with a target device, where the target device is located in a vehicle and the environment information includes one or more of the travel speed, the rotational speed, and the accelerator-pedal opening of the vehicle; determining first audio data from the original audio data according to at least one of the travel speed, the rotational speed, and the accelerator-pedal opening, where the first audio data are obtained by stretching or compressing target audio grains in the original audio data based on the travel speed; determining the acceleration of the vehicle according to the travel speed, adjusting the gain of each channel in the first audio data according to the acceleration to obtain second audio data, and determining a target speed at which the sound field in the vehicle moves in a target direction; determining the virtual position of the sound source of the target audio data according to the target speed; determining, according to the virtual position, the target gains by which the audio signals corresponding to a plurality of speakers in the vehicle need to be adjusted, to obtain F target gains, where F ≥ 2; adjusting the gain of each channel in the second audio data according to the F target gains to obtain target audio data, where the target audio data match the environment information; and outputting the target audio data. In this way, the sound heard by the driver in the vehicle is linked to the travel speed of the vehicle, which makes it sound more realistic and improves the user experience.
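A hedged sketch of the grain-stretching and acceleration-gain idea in this aspect: audio grains are resampled as a function of vehicle speed, and the channel gain grows with acceleration up to a cap. The mapping constants are illustrative assumptions, not values from the patent.

```python
import numpy as np

def stretch_grain(grain, speed_kmh, ref_speed_kmh=60.0):
    """Resample a grain so that higher speed yields a shorter, higher-pitched grain."""
    factor = max(speed_kmh, 1.0) / ref_speed_kmh
    idx = np.arange(0.0, len(grain), factor)
    return np.interp(idx, np.arange(len(grain)), grain)

def acceleration_gain_db(accel_ms2, db_per_ms2=1.5, max_boost_db=6.0):
    """Channel gain boost that grows with acceleration, clamped to a maximum."""
    return min(max(accel_ms2, 0.0) * db_per_ms2, max_boost_db)

grain = np.sin(2 * np.pi * 120 * np.arange(4800) / 48_000)   # stand-in engine grain
faster = stretch_grain(grain, speed_kmh=120)
print(len(faster), acceleration_gain_db(2.0))
```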
In some embodiments, the method may be applied to the scenario of "controlling acceleration of a new-energy vehicle" described below. In this case, while the user drives the vehicle, the movement of the sound field in the vehicle is controlled through the speakers in the vehicle, so that the played sound varies spatially and a Doppler-like effect is produced inside the vehicle; the sound played by the vehicle thus matches the real driving state, sounds more realistic, and improves the user experience. The target device may be the vehicle or an electronic device in the vehicle; it may be a device integrated in the vehicle, such as an in-vehicle terminal, or a device separate from the vehicle, such as the driver's mobile phone.
In one possible implementation, before adjusting the gain of each channel in the first audio data according to the travel speed, the method further includes: determining that the change in the travel speed exceeds a preset speed threshold; and/or determining that the adjustment value corresponding to the gain of each channel in the first audio data is less than or equal to a preset adjustment value, where when the target adjustment value corresponding to the gain of a target channel in the first audio data is greater than the preset adjustment value, the target adjustment value is updated to the preset adjustment value. This prevents the sound heard by the user from fluctuating or changing abruptly, improving the user experience.
In one possible implementation, the target parameters further include the acceleration duration of the vehicle, and the method further includes: controlling the ambient lighting in the vehicle according to the acceleration duration, thereby providing a visual experience for the user. In addition, the speed at which the ambient lighting changes color can be controlled to equal the target speed at which the sound field moves in the vehicle, so that the spatial hearing and the spatial vision in the vehicle correspond to each other, improving the user experience.
In an eighth aspect, the present application provides a sound processing method, which may include: acquiring state information of a user associated with the target device, where the state information includes the user's fatigue level; determining a target adjustment value of a first characteristic parameter according to the fatigue level, where the first characteristic parameter is a characteristic parameter of the original audio data currently to be played and includes pitch and/or loudness; processing the original audio data according to the target adjustment value to obtain target audio data, where the value of the characteristic parameter of the target audio data is higher than that of the first characteristic parameter and the target audio data match the state information of the user; and outputting the target audio data. In this way, when user fatigue is detected, the characteristic parameters (such as pitch and loudness) of the original audio data can be changed according to the user's fatigue level, so that the played audio data have an audible impact on the user and raise the user's attention. In some embodiments, the method may be applied in the scenario described below in fig. 35. In this scenario, the target device may be the vehicle or an electronic device in the vehicle; it may be a device integrated in the vehicle, such as an in-vehicle terminal, or a device separate from the vehicle, such as the driver's mobile phone. In addition, in this scenario, the original audio data may be the audio data of the navigation voice to be played.
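A minimal sketch of mapping the detected fatigue level to a target adjustment value for the first characteristic parameters; the table of per-level loudness and pitch steps is an illustrative assumption.

```python
# level: (loudness boost in dB, pitch shift in semitones) — assumed values
FATIGUE_ADJUSTMENTS = {
    1: (2.0, 0.0),
    2: (4.0, 1.0),
    3: (6.0, 2.0),
}

def target_adjustment(fatigue_level):
    """Return the (loudness, pitch) adjustment for the detected fatigue level."""
    return FATIGUE_ADJUSTMENTS.get(fatigue_level, (0.0, 0.0))

print(target_adjustment(3))   # strongest boost for the highest fatigue level
```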
In one possible implementation, outputting the target audio data specifically includes: determining a first target prompt voice according to the fatigue level; and outputting the target audio data and the first target prompt voice in a preset broadcast order. This further creates an audible impact on the user and makes the broadcast style and wording more vivid and humanized, improving the user experience. For example, the first target prompt voice may be one of the prompt voices shown in "Table 2" below.
In one possible implementation, the method further includes: determining a second target prompt voice according to the fatigue level and map information; and outputting the second target prompt voice, thereby further creating an audible impact on the user and raising the user's attention. The second target prompt voice may be, for example, "Attention! Attention! The driver is extremely fatigued and can stop and rest at the xxx intersection/supermarket/transfer station xxx meters away."
In one possible implementation, the target device is located in a vehicle. In that case, before outputting the target audio data, the method further includes: determining that the vehicle is in an automatic driving state, that the road condition of the road section where the vehicle is located is below a preset road-condition threshold, and/or that the road section where the vehicle is located is a preset road section. This raises the user's attention under specific conditions.
In one possible implementation, the method further includes: and determining the flicker frequency and/or the color of the warning lamp according to the fatigue grade, and controlling the warning lamp to work according to the determined flicker frequency and/or the determined color. Thereby giving visual impact to the user and further improving the user's attention.
In a ninth aspect, the present application provides a sound processing method, which may include: acquiring state information of a user associated with the target device, where the state information includes first audio data and second audio data selected by the user; determining first audio features of the first audio data, where the first audio features include the loudness at each moment and/or the position points of the individual beats; adjusting second audio features of the second audio data according to the first audio features to obtain third audio data, where the second audio features include at least one of loudness, pitch, and tempo; obtaining target audio data according to the first audio data and the third audio data, where the target audio data match the state information of the user; and outputting the target audio data. In this way, one piece of audio data can be processed based on another piece of audio data selected by the user, so that the two can blend together naturally and provide a better listening experience. In some embodiments, the method may be applied to the scenario of "superimposed playback of multiple audio data selected by the user" described below. In this scenario, the first audio data may be a background sound and the second audio data may be white noise.
In one possible implementation, the first audio features include the loudness of the first audio data at each moment, and the second audio features include loudness. Adjusting the second audio features of the second audio data according to the first audio features specifically includes: determining, according to the loudness of the first audio data at each moment and a preset loudness proportion, the target loudness corresponding to each moment of the second audio data; and adjusting the loudness of the second audio data at each moment to the corresponding target loudness. In this way, the loudness of the two pieces of audio data at each moment matches the preset loudness proportion, so that they can blend together naturally.
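An illustrative sketch of this loudness-matching rule, processed frame by frame: the second audio data are scaled so that each frame's loudness equals the first audio data's loudness multiplied by a preset proportion. The frame length and the 0.3 proportion are assumptions.

```python
import numpy as np

def match_loudness(first, second, frame=1024, proportion=0.3):
    """Scale the second audio data frame by frame to follow the first."""
    out = second.copy()
    for start in range(0, len(second), frame):
        a = first[start:start + frame]
        b = second[start:start + frame]
        if len(a) == 0 or len(b) == 0:
            break
        a_rms = np.sqrt(np.mean(a ** 2) + 1e-12)
        b_rms = np.sqrt(np.mean(b ** 2) + 1e-12)
        out[start:start + frame] = b * (proportion * a_rms / b_rms)
    return out

background = 0.2 * np.random.randn(48_000)    # first audio data (background sound)
white_noise = 0.05 * np.random.randn(48_000)  # second audio data (white noise)
mixed = background + match_loudness(background, white_noise)
```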
In one possible implementation, the first audio features include the position points of the individual beats, and the second audio features include pitch and/or tempo. Adjusting the second audio features of the second audio data according to the first audio features specifically includes: for any two adjacent beats in the first audio data, determining the target rhythm corresponding to those two adjacent beats; determining, according to the target rhythm, a target adjustment value for the second audio features of the second audio data at the position points corresponding to those two adjacent beats; and adjusting the second audio features of the second audio data at those position points according to the target adjustment value. In this way, the audio features of the second audio data can match the rhythm of the first audio data, so that the second audio data and the first audio data blend together naturally.
In a tenth aspect, the present application provides a sound processing method, which may include: acquiring state information of a user associated with the target device, where the state information of the user includes one or more of the pictures, videos, or audio data added by the user for a target object; determining N pictures, where N ≥ 2; determining the target objects contained in each of the N pictures to obtain M target objects, where M ≥ 1; determining the spatial position of each target object in each of the N pictures, and determining the duration of each target object in a target video to obtain M first durations, where the target video is obtained based on the N pictures; determining the moving speed of each target object between adjacent pictures according to the spatial position of each target object and the moments at which adjacent pictures among the N pictures appear in the target video; obtaining Q pieces of first audio data according to the M target objects, where 1 ≤ Q ≤ M and each piece of first audio data is associated with at least one target object; adjusting the second duration of each piece of first audio data to equal the first duration corresponding to the associated target object to obtain Q pieces of second audio data; processing the second audio data corresponding to each target object according to the spatial position of each target object and its moving speed between adjacent pictures to obtain Q pieces of third audio data; obtaining the target video according to the Q pieces of third audio data and the N pictures, where the target video includes target audio data obtained based on the Q pieces of third audio data and the target audio data match the state information of the user; and outputting the target audio data. In this way, spatial audio is added for the target objects in the data selected by the user, so that the sound of a target object in the produced video moves as the object moves, making the result sound more realistic and improving the viewing experience. In some embodiments, the method may be applied to the scenario of "making a video or animated picture" described below. In some embodiments, the duration of the target video may be calculated from a fixed display time per picture, or may be taken from the duration of a selected piece of audio data.
In one possible implementation, the method further includes: determining, according to the N pictures, fourth audio data matching the N pictures; and taking the position points of at least some of the beats in the fourth audio data as the moments at which at least some of the N pictures appear, and/or taking the position points at which at least some of the bars in the fourth audio data begin or end as the moments at which at least some of the N pictures appear. In this way, the moments at which at least some of the N pictures appear coincide with the position points of certain beats or bars, so that visual changes occur at the key points of the audio, i.e., the user sees a new picture at a key auditory moment, producing a consistent audiovisual impact and further improving the user experience.
In one possible implementation manner, determining the spatial position of each target object in each of the N images specifically includes: and determining a first space position of the kth target object in the ith picture based on a preset three-dimensional coordinate system, wherein the center point of the three-dimensional coordinate system is the center position of the ith picture, the ith picture is any one picture in the N pictures, and the kth target object is any one target object in the ith picture.
In one possible implementation, the method further includes: determining that a kth target object does not exist in the (i+1) th picture; and taking the first position on the first boundary of the (i+1) th picture as the second spatial position of the kth target object in the (i+1) th picture. Thereby avoiding the abrupt disappearance of the sound of the kth target object in the (i+1) th picture.
In one possible implementation, the first boundary is the boundary of the picture that lies in the target orientation direction of the kth target object in the ith picture, and the first position is the intersection, in the (i+1)th picture, of the first boundary and a straight line that starts from the first spatial position and extends in the target orientation direction.
In one possible implementation, the method further includes: determining that the kth target object does not exist in the (i+2)th picture; determining a first moving speed and a first moving direction of the kth target object according to the first spatial position, the second spatial position, and the time interval between the ith picture and the (i+1)th picture; and taking a second position outside the (i+2)th picture as the third spatial position of the kth target object for the (i+2)th picture, wherein the second position is a position point that lies in the first moving direction and is separated from the second spatial position by a first target distance, and the first target distance is obtained according to the first moving speed and the time interval between the (i+1)th picture and the (i+2)th picture. In this way, the sound of the kth target object gradually recedes in the target direction instead of disappearing abruptly, which improves the user experience.
In one possible implementation, the method further includes: determining that a kth target object does not exist in the (i-1) th picture, wherein i is more than or equal to 2; and taking the third position on the second boundary of the (i-1) th picture as the fourth spatial position of the kth target object in the (i-1) th picture. Thereby avoiding the abrupt occurrence of sound of the kth target object in the ith picture.
In one possible implementation, the second boundary is the boundary of the picture that lies in the direction opposite to the target orientation direction of the kth target object in the ith picture, and the third position is the intersection, in the (i-1)th picture, of the second boundary and a straight line that starts from the first spatial position and extends in the direction opposite to the target orientation direction.
In one possible implementation, the method further includes: determining that the kth target object does not exist in the (i-2)th picture, wherein i is more than or equal to 3; determining a second moving speed and a second moving direction of the kth target object according to the first spatial position, the fourth spatial position, and the time interval between the ith picture and the (i-1)th picture; and taking a fourth position outside the (i-2)th picture as the fifth spatial position of the kth target object for the (i-2)th picture, wherein the fourth position is a position point that lies in the direction opposite to the second moving direction and is separated from the fourth spatial position by a second target distance, and the second target distance is obtained according to the second moving speed and the time interval between the (i-1)th picture and the (i-2)th picture. In this way, the sound of the kth target object gradually approaches from the target direction rather than appearing abruptly in the ith picture, which improves the user experience.
In one possible implementation, the method further includes: determining that the kth target object does not exist in the (i+1)th to (i+j)th pictures, wherein j is more than or equal to 2, and that the kth target object exists in the (i+j+1)th picture, wherein (i+j+1) is less than or equal to N; determining, with the ith picture as a reference, the spatial position of the kth target object in each of the (i+1)th to (i+j)th pictures, so as to obtain a first set of spatial positions {P_(i+1), …, P_(i+j)}, wherein P_(i+j) is the spatial position of the kth target object in the (i+j)th picture; determining, with the (i+j+1)th picture as a reference, the spatial position of the kth target object in each of the (i+1)th to (i+j)th pictures, so as to obtain a second set of spatial positions {P'_(i+1), …, P'_(i+j)}, wherein P'_(i+j) is the spatial position of the kth target object in the (i+j)th picture; and determining, according to the first set and the second set, the spatial position of the kth target object in each of the (i+1)th to (i+j)th pictures. This improves the accuracy of the spatial position of the kth target object in each of the (i+1)th to (i+j)th pictures.
In one possible implementation manner, determining, according to the first set and the second set, the spatial position of the kth target object in each of the (i+1)th to (i+j)th pictures specifically includes: determining, according to the first set and the second set, the distance between the two spatial positions of the kth target object in each of the (i+1)th to (i+j)th pictures, so as to obtain j distances; determining, according to the first set and the second set, the spatial position of the kth target object in the (i+c)th picture, wherein the (i+c)th picture is the picture corresponding to one of the j distances, and c is more than or equal to 1 and less than or equal to j; and determining the spatial position of the kth target object in each of the ith to (i+c)th pictures and in each of the (i+c)th to (i+j+1)th pictures according to the spatial position of the kth target object in the ith picture, the spatial position of the kth target object in the (i+j+1)th picture, the spatial position of the kth target object in the (i+c)th picture, and the moment at which each of the ith to (i+j+1)th pictures appears in the target video.
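For illustration only, the combination of the two reference-based position estimates could look roughly like the following sketch, which assumes a simple linear blend between the forward and backward estimates (the names and the blending scheme are assumptions, not the claimed method):

from dataclasses import dataclass

@dataclass
class Pos:
    x: float
    y: float
    z: float

def blend(p: Pos, q: Pos, w: float) -> Pos:
    # Linear blend between two positions, w in [0, 1].
    return Pos(p.x + w * (q.x - p.x), p.y + w * (q.y - p.y), p.z + w * (q.z - p.z))

def fill_missing_positions(forward: list[Pos], backward: list[Pos]) -> list[Pos]:
    # forward[c]  = position in picture i+1+c estimated from picture i,
    # backward[c] = position in picture i+1+c estimated from picture i+j+1.
    # Returns one fused position per missing picture.
    j = len(forward)
    fused = []
    for c in range(j):
        w = (c + 1) / (j + 1)  # how far we are between picture i and picture i+j+1
        fused.append(blend(forward[c], backward[c], w))  # trust the nearer reference more
    return fused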
In an eleventh aspect, the present application provides an electronic device, including: at least one memory for storing a program; at least one processor for executing programs stored in the memory; wherein the processor is configured to perform the method provided in any one of the first to tenth aspects when the program stored in the memory is executed.
In a twelfth aspect, the present application provides a computer readable storage medium storing a computer program which, when run on an electronic device, causes the electronic device to perform the method provided in any one of the first to tenth aspects.
In a thirteenth aspect, the present application provides a computer program product for, when run on an electronic device, causing the electronic device to perform the method provided in any one of the first to tenth aspects.
In a fourteenth aspect, the present application also provides a chip comprising a processor coupled to a memory for reading and executing program instructions stored in the memory to cause the chip to implement the method provided in any one of the above first to tenth aspects. It will be appreciated that the advantages of the eleventh to fourteenth aspects may be found in the relevant description of the first to tenth aspects, and are not described here again.
Drawings
Fig. 1 is a schematic diagram of an application scenario according to an embodiment of the present application;
Fig. 2 is a flowchart of a sound processing method according to an embodiment of the present application;
Fig. 3 is a schematic diagram of a display interface of an electronic device according to an embodiment of the present application;
Fig. 4 is a schematic diagram of an application scenario according to an embodiment of the present application;
Fig. 5 is a flowchart of a sound processing method according to an embodiment of the present application;
Fig. 6 is a schematic diagram of a time-domain waveform and an envelope of audio data according to an embodiment of the present application;
Fig. 7 is a schematic diagram of a spectrogram obtained after performing a short-time Fourier transform on audio data according to an embodiment of the present application;
Fig. 8 is a flowchart of a sound processing method according to an embodiment of the present application;
Fig. 9 is a schematic diagram of an application scenario according to an embodiment of the present application;
Fig. 10 is a schematic diagram of an application scenario according to an embodiment of the present application;
Fig. 11 is a flowchart of a sound processing method according to an embodiment of the present application;
Fig. 12 is a schematic diagram of an electronic device according to an embodiment of the present application;
Fig. 13 is a schematic diagram of a constructed virtual speaker according to an embodiment of the present application;
Fig. 14 is a schematic diagram of a process for constructing a virtual speaker according to an embodiment of the present application;
Fig. 15 is a schematic diagram of a process for constructing a virtual speaker group according to an embodiment of the present application;
Fig. 16 is a schematic diagram of another process for constructing a virtual speaker group according to an embodiment of the present application;
Fig. 17 is a flowchart of a sound processing method according to an embodiment of the present application;
Fig. 18 is a schematic diagram of a process for constructing a virtual speaker according to an embodiment of the present application;
Fig. 19 is a flowchart of yet another sound processing method according to an embodiment of the present application;
Fig. 20 is a schematic diagram of an application scenario according to an embodiment of the present application;
Fig. 21 is a schematic diagram of three-point positioning according to an embodiment of the present application;
Fig. 22 is a flowchart of a sound processing method according to an embodiment of the present application;
Fig. 23 is a schematic diagram of building a virtual space according to an embodiment of the present application;
Fig. 24 is a flowchart of a sound processing method according to an embodiment of the present application;
Fig. 25 is a schematic diagram of constructing a virtual speaker group in a virtual space according to an embodiment of the present application;
Fig. 26 is a flowchart of a sound processing method according to an embodiment of the present application;
Fig. 27 is a flowchart of a sound processing method according to an embodiment of the present application;
Fig. 28 is a flowchart of a sound processing method according to an embodiment of the present application;
Fig. 29 is a flowchart of a sound processing method according to an embodiment of the present application;
Fig. 30 is a schematic diagram of a hardware configuration of a vehicle according to an embodiment of the present application;
Fig. 31 is a flowchart of a sound processing method according to an embodiment of the present application;
Fig. 32 is a schematic diagram of sound field movement according to an embodiment of the present application;
Fig. 33 is a schematic diagram of sound field movement according to an embodiment of the present application;
Fig. 34 is a schematic diagram showing the color of an atmosphere lamp in a vehicle gradually changing as the vehicle accelerates according to an embodiment of the present application;
Fig. 35 is a schematic diagram of an application scenario according to an embodiment of the present application;
Fig. 36 is a flowchart of a sound processing method according to an embodiment of the present application;
Fig. 37 is a schematic diagram of a process for performing variable-speed, non-variable-pitch processing on sound according to an embodiment of the present application;
Fig. 38 is a flowchart of a sound processing method according to an embodiment of the present application;
Fig. 39 is a flowchart of a sound processing method according to an embodiment of the present application;
Fig. 40 is a flowchart of a sound processing method according to an embodiment of the present application;
Fig. 41 is a flowchart of a sound processing method according to an embodiment of the present application;
Fig. 42 is a flowchart of a sound processing method according to an embodiment of the present application;
Fig. 43 is a schematic diagram of adjusting the moment at which a picture appears to the position point of a beat according to an embodiment of the present application;
Fig. 44 is a schematic diagram of determining the spatial position of a target object in a picture according to an embodiment of the present application;
Fig. 45 is a schematic diagram of determining the spatial position of a target object in a picture according to an embodiment of the present application;
Fig. 46 is a flowchart of a sound processing method according to an embodiment of the present application;
Fig. 47 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application;
Fig. 48 is a software block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The term "and/or" herein is an association relationship describing an associated object, and means that there may be three relationships, for example, a and/or B may mean: a exists alone, A and B exist together, and B exists alone. The symbol "/" herein indicates that the associated object is or is a relationship, e.g., A/B indicates A or B.
The terms "first" and "second" and the like in the description and in the claims are used for distinguishing between different objects and not for describing a particular sequential order of objects. For example, the first response message and the second response message, etc. are used to distinguish between different response messages, and are not used to describe a particular order of response messages.
In the embodiments of the present application, words such as "exemplary" or "such as" are used to mean serving as examples, illustrations, or descriptions. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
In the description of the embodiments of the present application, unless otherwise specified, the meaning of "a plurality of" means two or more, for example, a plurality of processing units means two or more processing units and the like; the plurality of elements means two or more elements and the like.
By way of example, the embodiments of the present application provide a sound processing method that can process original audio data according to input external information to construct the audio data to be played. For example, according to the method, audio data to be played that is adapted to the current environment or the current user state can be constructed according to the environment information associated with the electronic device and/or the state information of the user, so that the audio data to be played blends with the current environment or the current user state, which improves the user experience. When constructing the audio data to be played that is adapted to the current environment or the current user state, the required audio data can be obtained by adjusting audio characteristics (such as gain, pitch, or loudness) of the audio data to be played, and/or by mixing in audio data of a target object that is adapted to the current environment. For another example, the method may construct audio data to be played that is adapted to a photographed picture or video according to information related to that picture or video.
In some embodiments, the environmental information associated with the electronic device may include one or more of the following: environmental data (such as environmental images, environmental sounds, weather information or season information) of an area where the electronic device is located, whether different audio data needs to be played simultaneously in the environment where the electronic device is located, the position of the electronic device in space, the position of a picture generated by the electronic device in space, or when the electronic device is located in a vehicle, running parameters (such as running speed) of the vehicle, and the like.
The status information of the user associated with the electronic device may include one or more of: the fatigue level of the user, the distance between the electronic device and the user's head and the position of the user's head in space, audio data selected by the user, or pictures or videos selected by the user, etc.
In the embodiment of the present application, the sound processing method mainly relates to the following several scenarios:
1. the scene of the ambient sound is fused in the vehicle. In this scenario, the audio data of each sound object adapted to the current environment may be determined from a pre-configured white noise atomic database by the electronic device in the vehicle in combination with the environmental data of the area in which the electronic device is located. And synthesizing the audio data of each determined sound object to obtain target audio data, and playing the target audio data. In this way, the driver or other user can hear sounds in the vehicle that match the external environment, thereby enabling the user to have an immersive experience. The atomic database of white noise can be configured with audio data of each single object in a specific period of time, such as audio data of water flow, audio data of cicada sound, audio data of vegetation, and the like. In this scenario, audio data to be played may be constructed according to environmental information associated with the electronic device, where the environmental information associated with the electronic device may be environmental data of an area where the electronic device is located.
2. One type of audio data is continuously played, and a scene of another type of audio data is sporadically played. The scene may include two types of scenes.
The first scenario is that continuously played audio data and sporadically played audio data are played through the same electronic device. In the scene, the electronic equipment can perform voice elimination or voice reduction processing and the like on the audio data which are continuously played, and can broadcast the audio data which are sporadically played and the audio data which are processed and need to be continuously played at the same time. Therefore, the user can clearly perceive the information contained in the sporadically played audio data and simultaneously can clearly perceive the tunes, background sounds and the like of other audio data, so that the hearing of the user is more effectively met, and the user experience is improved. For example, the audio data that is continuously played may be a certain type of music, and the audio data that is sporadically played may be navigation audio data that needs to be broadcasted during navigation.
The second scenario is that the audio data that is continuously played and the audio data that is sporadically played are played through different electronic devices. In this scenario, one electronic device (hereinafter referred to as a "first device") may continuously play one kind of audio data, and another electronic device may sporadically play another kind of audio data. In the scene, when the electronic equipment for sporadically playing the audio data needs to play the audio data, the electronic equipment can instruct the electronic equipment for continuously playing the audio data to execute the voice elimination or voice reduction operation; and after the electronic device broadcasting the audio data sporadically finishes, the electronic device can instruct the electronic device continuously broadcasting the audio data to stop executing the voice eliminating or voice reducing operation. In this way, in the process of broadcasting the audio data by the electronic equipment for sporadically broadcasting the audio data, the interference of the audio data broadcasted by the electronic equipment for continuously broadcasting the audio data can be reduced, so that the user can clearly perceive the audio data broadcasted by the electronic equipment for sporadically broadcasting the audio data. For example, sporadically playing audio data may be audio data at the time of a call, and continuously playing audio data may be some type of music.
In the above two scenarios, the audio data to be played may be constructed according to the environmental information associated with the electronic device, where the environmental information associated with the electronic device may be whether different audio data needs to be played simultaneously in the environment where the electronic device is located.
3. A scene of audio data is played using speakers provided in a space. The scene may include two types of scenes.
The first scenario may be: a plurality of speakers are arranged in space, and at least a part of the speakers are arranged according to certain requirements (e.g., 5.1.X, or 7.1.X, etc.). In addition, in this scenario, the electronic device or other device is playing audio data using a speaker. In the scene, the gains of the audio signals output by the speakers can be adjusted according to the position of the electronic equipment, so that a user can enjoy the space surround sound at any time and any place. In the scene, audio data to be played can be constructed according to environment information associated with the electronic equipment, wherein the environment information associated with the electronic equipment can be the position of the electronic equipment in space.
The second scenario may be: a plurality of speakers are arranged in the space and the electronic device can generate a picture (e.g., a user watching a movie using the electronic device, etc.), and the electronic device plays audio data thereon through the speakers arranged in the space. Under the scene, a virtual loudspeaker set can be constructed around the electronic equipment or a picture generated by the electronic equipment by combining the position of the electronic equipment, so that audio data in the electronic equipment can be played by the virtual loudspeaker set, the picture played by the electronic equipment is synchronous with the audio data, and the hearing and viewing consistency experience of a user is improved. In the scene, audio data to be played can be constructed according to environment information associated with the electronic equipment or state information of a user, wherein the environment information associated with the electronic equipment can be the position of a picture generated by the electronic equipment in space; the status information of the user includes a distance between the electronic device and the user's head, a position of the user's head in space, etc.
4. A scenario of controlling the acceleration of a new energy vehicle. In this scenario, the electronic device in the vehicle can control the movement of the sound field in the vehicle in combination with driving parameters such as the driving speed, so that the engine-wave sound (for example, a simulated fuel-engine sound) varies spatially and a Doppler effect appears in the vehicle. The engine-wave sound played by the vehicle thus matches the real driving state, sounds more realistic, and improves the user experience. It should be understood that in the embodiments of the present application, a new energy vehicle refers to a vehicle that uses an unconventional vehicle fuel as its power source (or uses a conventional vehicle fuel together with a new type of on-board power plant), such as hybrid electric vehicles, pure electric vehicles, fuel cell electric vehicles, and vehicles using other new energy sources (such as super capacitors, flywheels, and other high-efficiency energy storage devices). An unconventional vehicle fuel refers to a fuel other than gasoline and diesel. In this scenario, the audio data to be played may be constructed according to the environment information associated with the electronic device, where the environment information may be a driving parameter of the vehicle.
5. And driving, navigating by using electronic equipment in the vehicle, and enabling a driver to have a driving fatigue scene. Under the scene, when the driver is detected to be tired, the characteristic parameters (such as tone, gain and the like) of the audio data broadcasted by the navigation can be changed according to the fatigue level of the driver, so that the broadcasted audio data can impact the driver in hearing, the attention of the driver is further improved, and safe driving is realized. In this scenario, audio data to be played may be constructed according to state information of a user associated with the electronic device, where the state information of the user associated with the electronic device may be a fatigue level of the user.
6. The user selects a scene in which a plurality of audio data are superimposed and played. In the scene, other audio data selected by the user can be modified based on at least one audio data selected by the user, so that the two audio data can be more naturally fused together, and better hearing experience is brought to the user. By way of example, the audio data selected by the user may include background sounds, white noise, and the like. In this scenario, audio data to be played may be constructed according to state information of a user associated with the electronic device, where the state information of the user associated with the electronic device may be audio data selected by the user.
7. A scene of a video or a moving picture is made. Under the scene, the spatial audio can be added to the target object in the picture or the video shot by the electronic equipment based on the picture or the video shot by the electronic equipment in the process of manufacturing the video or the dynamic picture, so that the sound of the target object in the manufactured video or the dynamic picture can move along with the movement of the target object, the hearing of a user is more real, and the viewing experience is improved. In the scene, audio data to be played can be constructed according to state information of a user associated with the electronic equipment, wherein the audio data is audio data of a target object in the manufactured video or dynamic picture. The status information of the user associated with the electronic device may be a picture, a video, and/or audio data added to the target object selected by the user.
Next, the sound processing method provided in the embodiment of the present application will be described by sequentially dividing the scenes based on the order of the above-described respective scenes.
1. The scene of the ambient sound is fused in the vehicle.
By way of example, fig. 1 illustrates an application scenario in some embodiments of the present application. As shown in fig. 1, driver a is located in vehicle 200. The electronic device 100 and the speaker 230 are configured in the vehicle 200, and the electronic device 100 is in a power-on state. The electronic device 100 may be a device integrated in the vehicle 200, such as an in-vehicle terminal, or may be a device separate from the vehicle 200, such as a mobile phone of the driver a, etc., which is not limited herein.
When the electronic device 100 is integrated in the vehicle 200, the electronic device 100 may directly broadcast audio data that it needs to broadcast using the speaker 230 in the vehicle 200. When the electronic device 100 is disposed apart from the vehicle 200, a connection may be established between the electronic device 100 and the vehicle 200 by, but not limited to, short-range communication (e.g., bluetooth, etc.). When the electronic device 100 is separately disposed from the vehicle 200, the electronic device 100 may transmit the audio data that needs to be broadcasted to the vehicle 200 and broadcast the audio data through the speaker 230 on the vehicle 200, or the electronic device 100 may broadcast the audio data that needs to be broadcasted through the speaker built in the electronic device.
In addition, an image pickup device 210 such as a camera may be provided outside the vehicle 200 to pick up an environmental image outside the vehicle 200. The exterior of the vehicle 200 may also be provided with a pickup 220, such as a microphone or the like, for picking up sounds in the environment.
It is to be understood that the structure illustrated in the embodiments of the present application does not constitute a specific limitation on the vehicle 200. In other embodiments of the present application, vehicle 200 may include more or fewer components than shown, or certain components may be combined, or certain components may be split, or different arrangements of components.
By way of example, fig. 2 shows a sound processing method. In fig. 2, the electronic device 100 may be a device integrated in the vehicle 200, such as an in-vehicle terminal, or a device separate from the vehicle 200, such as the mobile phone of driver A. The method shown in fig. 2 may be applied, but is not limited to, driving scenarios (such as a scene while driving) or outdoor camping scenarios (such as camping in a valley or at a lakeside). Further, a control for initiating execution of the method may be provided on the electronic device 100; for example, the control may be named "camping mode", and when the user chooses to turn on the camping mode, the method shown in fig. 2 may be performed. As shown in fig. 2, the method includes the following steps:
s201, the electronic device 100 obtains environmental data of an area where the vehicle 200 is located, where the environmental data includes: one or more of environmental images, environmental sounds, weather information or season information, etc.
In this embodiment, the image capturing device 210 on the vehicle 200 may capture an environmental image of the area where the vehicle 200 is located in real time or periodically, and transmit the captured data to the electronic device 100. The pickup 220 on the vehicle 200 may collect environmental sounds of the area where the vehicle 200 is located in real time or periodically and transmit the collected data to the electronic device 100. In addition, the electronic device 100 may acquire weather information and/or season information of the area where the vehicle 200 is located through the network in real time or periodically.
S202, the electronic equipment 100 determines each currently required sound object according to the environment data.
In this embodiment, the electronic device 100 may input the environmental data into a pre-trained sound object detection model to output respective sound objects currently required by the sound object detection model. In some embodiments, the sound object detection model may be, but is not limited to being, trained based on convolutional neural networks (convolutional neural network, CNN).
For example, when the vehicle 200 is traveling on a road in a forest on a sunny day, it may be determined from the environment image that the vehicle 200 is in a forest, from the environment sound that there are bird calls in the current environment, and from the weather information that it is daytime and the weather is sunny. In this way, the determined sound objects are: tree, bird call, daytime, and sunny.
In some embodiments, in addition to obtaining the currently required sound object from the sound object detection model, a sound theme adapted to the environment data may be determined according to the environment data. Then, the sound object contained in the sound theme is used as the current required sound object. Each sound theme comprises at least one sound object associated with the sound theme. For example, the sound theme may be "night cicada buzzing", and the sound objects included under the sound theme may be "cicada buzzing", "night and clear", "breeze", "flowing water"; the sound theme may be "summer and night storm", and the sound subjects included in the sound theme may be "storm", "thunder".
S203, the electronic device 100 determines, from the white noise atomic database, audio data of each sound object based on each sound object.
In this embodiment, after acquiring each sound object, the electronic device 100 may query the atomic database of white noise, thereby acquiring audio data of each sound object in a specific period of time. The atomic database of white noise is configured with audio data of each single object in a specific period of time, such as audio data of water flow, audio data of cicada sound, audio data of vegetation and the like. The audio data of a plurality of objects in the atomic database are randomly combined or combined according to a preset rule, and the audio data with a certain duration can be obtained. Illustratively, the white noise audio data in the atomic database may be configured in advance in the vehicle, or obtained from a server in real time, or the like.
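Purely as an illustration, the atomic-database lookup and random combination could be organized as follows; the dictionary layout, clip representation, and function name are assumptions rather than the actual database format used by the electronic device 100:

import random

# Hypothetical white-noise atomic database: each single sound object maps to one
# or more short clips (represented here as lists of float samples).
ATOMIC_DB = {
    "flowing water": [[0.01, 0.02, -0.01], [0.00, 0.03, 0.01]],
    "cicada buzzing": [[0.05, -0.04, 0.02]],
    "bird call": [[0.02, 0.02, -0.03]],
}

def pick_clips(sound_objects, seconds, sample_rate=48000):
    # Pick one clip per requested sound object and tile it to the target duration.
    target_len = int(seconds * sample_rate)
    clips = {}
    for obj in sound_objects:
        if obj not in ATOMIC_DB:
            continue                          # object not present in the database
        clip = random.choice(ATOMIC_DB[obj])  # random choice among the stored atoms
        reps = target_len // len(clip) + 1
        clips[obj] = (clip * reps)[:target_len]
    return clips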
In some embodiments, the atomic database may include audio data for one sound object over different time periods, and the audio data over different time periods may have different emotions. For example, when the sound object is a bird call, an atomic database may include a stretch of cheerful bird calls and a stretch of sad bird calls.
Further, when determining the audio data of each sound object, the audio data of each sound object adapted to the emotion expressed by the current environment data may be determined based on the current environment data. For example, when the weather is clear, the emotion expressed by the current environmental data can be determined to be happy, at this time, the audio data in each sound object currently required can be screened out from the atom database, and the emotion expressed by the audio data is all happy.
S204, the electronic device 100 synthesizes the audio data of each sound object to obtain target audio data, and plays the target audio data.
In this embodiment, the electronic device 100 may synthesize the audio data of each sound object to obtain the target audio data, and play the target audio data. Wherein, the electronic device 100 may play through the speaker of the vehicle 200 when playing the target audio data. In this way, the driver can hear the sound matching the external environment in the vehicle, so that the user can have an immersive experience.
In some embodiments, the audio data of the sound objects may be mixed by a mixing algorithm to obtain the target audio data, and a mixing algorithm matched with the type of the audio data may be selected for the processing. For example, when the audio data is of floating-point (float) type, the audio data may be directly superimposed to obtain the target audio data. When the audio data is not of float type, mixing algorithms such as adaptive weighted mixing or linear superposition with averaging may be used to process the audio data to obtain the target audio data.
In addition, in the mixing process, the number of times of mixing can be selected in the mixing process according to the type of the sound object. For example, the sounds of sound objects such as cicada and bird calls are short, so that audio data of the sound objects can be input for a plurality of times at random time in the mixing process for mixing.
For the sound object of the background noise class, when the duration of the corresponding audio data is long enough, the sound object can be input once in the process of mixing; when the duration of the corresponding audio data is shorter, the audio data can be input for a plurality of times in the process of mixing, and the adjacent two audio data are connected end to end, namely, the playing end time of the first audio data is the playing start time of the second audio data, so that the background noise sound with enough duration is obtained.
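A minimal sketch of the mixing described above, assuming float samples that can be superimposed directly; the limiter, the random insertion of short clips, and all names are illustrative assumptions:

import random

def mix(base_tracks, short_clips, total_len, insert_times=3):
    # base_tracks: long background tracks (lists of float samples) added once;
    # short_clips: short sounds (e.g., bird calls) inserted several times at
    # random positions. Returns one mixed track of length total_len.
    out = [0.0] * total_len
    for track in base_tracks:                       # background noise: input once (or tiled end to end)
        for i in range(min(total_len, len(track))):
            out[i] += track[i]
    for clip in short_clips:                        # short sounds: mixed in multiple times
        for _ in range(insert_times):
            start = random.randint(0, max(0, total_len - len(clip)))
            for i, s in enumerate(clip[:total_len - start]):
                out[start + i] += s
    # simple limiter so the direct superposition does not clip
    peak = max(1.0, max(abs(s) for s in out))
    return [s / peak for s in out]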
In some embodiments, the electronic device 100 may present the user with an identification of the individual sound objects that make up the target audio data, as well as an identification of the sound objects that the user may currently add, while playing the target audio data. In this way, the user can choose to add or delete sound objects according to his own needs. For example, as shown in fig. 3, the electronic device 100 may display a currently playing sound object (i.e., a sound object that makes up the target audio data) at the control 31, and an addable sound object at the control 32. With continued reference to fig. 3, the user may select to delete a sound object at sub-control 33 in control 31 and/or select to add a sound object at sub-control 34 in control 32.
When the user selects to delete one or more sound objects or selects to add one or more sound objects, the electronic device 100 may re-synthesize the sound objects the user has selected to play, so as to obtain the audio data required by the user. For example, with continued reference to fig. 3, when the user deletes "cicada buzzing", "bird call", and "breeze", and selects to add "falling stone" and "gust", the sound objects the user wishes to play are: "sunny day", "rustling leaves", "gust", "flowing water", and "falling stone". After the user makes the selection, the electronic device 100 may synthesize the audio data of the sound objects the user wishes to play (i.e., "sunny day", "rustling leaves", "gust", "flowing water", "falling stone") to obtain new target audio data, and play the new target audio data.
In some embodiments, after the electronic device 100 acquires the environmental sound, it may determine whether to transmit the environmental sound based on the set transmission policy. By way of example, a transparent ambient sound may be understood to be a play ambient sound.
By way of example, the pass-through policy may include: isolating all, part of or none of the ambient sounds. The transparent policy may be selected by the user, and at this time, a mechanical key or a virtual key for selecting the transparent policy may be provided on the electronic device 100, so that the user may select the transparent policy according to his own needs. In addition, the transparent policy may also be determined by the electronic device 100, for example, when the environmental noise is greater than the first noise value, the transparent policy selectable by the electronic device 100 may be to isolate all environmental sounds; when the environmental noise is greater than the second noise value and less than the first noise value, the transparent transmission policy that the electronic device 100 may select may be to isolate a portion of the environmental sound; when the ambient noise is less than the second noise value, the transmission strategy that the electronic device 100 may select may be to not isolate ambient sound.
When the transmission policy is to isolate all of the environmental sounds, the electronic device 100 may discard the environmental sounds, i.e. not play the environmental sounds.
When the transmission policy is to isolate a part of the sound in the environmental sound, the electronic device 100 may input the environmental sound into a pre-trained sound separation model, so as to extract the audio data corresponding to each sound object included in the environmental sound from the sound separation model. After the electronic device 100 obtains the audio data corresponding to each sound object included in the environmental sound, a part of the audio data corresponding to the sound objects may be discarded therefrom, and the remaining audio data corresponding to the sound objects may be synthesized with the determined audio data of each sound object to obtain target audio data, and the target audio data may be played, so that the audio data in the real environment may be fused with the audio data determined from the atomic database, so that the user may more truly feel the external environment.
For example, the electronic device 100 may determine a sound theme adapted to the environmental data based on the environmental data, each sound theme including at least one sound object associated with that theme. When a sound object included in the environmental sound is not included in the sound theme adapted to the environmental data, the electronic device 100 may discard the audio data corresponding to that sound object. When a sound object included in the environmental sound is included in the adapted sound theme, the electronic device 100 may retain the audio data corresponding to that sound object. For example, if the determined sound theme is "cicada buzzing on a summer night", the sound objects included under this theme are "cicada buzzing", "night and clear", "breeze", and "flowing water", and the sound objects included in the environmental sound are "cicada buzzing" and "falling stone", then the electronic device 100 may retain the audio data corresponding to "cicada buzzing" in the environmental sound and discard the audio data corresponding to "falling stone" in the environmental sound.
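As an illustration only, the theme-based pass-through filtering described above might be organized as follows; the theme table, the function, and the example call are assumptions (a sketch, not the claimed implementation):

# Hypothetical mapping from a sound theme to the sound objects it contains.
THEMES = {
    "cicada buzzing on a summer night": {"cicada buzzing", "night and clear", "breeze", "flowing water"},
    "summer night storm": {"storm", "thunder"},
}

def filter_environment_objects(theme, env_objects):
    # Keep only the environmental sound objects that belong to the adapted theme;
    # the audio data of the remaining objects is discarded (not passed through).
    allowed = THEMES.get(theme, set())
    kept = [obj for obj in env_objects if obj in allowed]
    dropped = [obj for obj in env_objects if obj not in allowed]
    return kept, dropped

# Example from the text: with the theme above, the cicada sound in the environment
# is kept while the falling-stone sound is discarded.
kept, dropped = filter_environment_objects(
    "cicada buzzing on a summer night", ["cicada buzzing", "falling stone"]
)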
Further, in order to truly restore the environmental sound, the electronic device 100 may adjust the gain of each channel in the audio data corresponding to the extracted sound object. For example, when the audio data of the extracted sound object is a wind sound, the electronic device 100 may increase the loudness of the wind sound.
In addition, after extracting the audio data of the sound objects from the environmental sound, the electronic device 100 may mark the sound object corresponding to each piece of extracted audio data. Meanwhile, the electronic device 100 may remove, from the previously determined currently required sound objects, any object that is the same as one of the marked sound objects. This avoids subsequently synthesizing similar audio data and improves the quality of the synthesized audio data. For example, when the currently required sound objects determined as described above are tree, bird call, and clear daytime, and the sound object corresponding to the audio data extracted from the environmental sound is bird call, the electronic device 100 may remove "bird call" from the determined currently required sound objects.
As a possible implementation manner, before removing a certain sound object determined above, the electronic device 100 may further determine whether the audio data amplitude value or the like corresponding to the sound object in the environmental sound is satisfactory. When the requirements are met, the determined certain sound object can be removed, otherwise, the audio data corresponding to the determined certain sound object is reserved, the audio data corresponding to the sound object in the environment sound is removed, or the audio data corresponding to the sound object in the environment sound is adjusted to meet the requirements, and the determined certain sound object is removed.
For example, if the determined sound object (i.e., the sound object obtained from the environmental data in S202) is "cicada sound", the audio data corresponding to "cicada sound" may be extracted from the environmental sound. At this time, if the electronic device 100 determines that the amplitude of the audio data corresponding to "cicada sound" extracted from the environmental sound is lower than a preset value, the electronic device 100 may discard that audio data; alternatively, the electronic device 100 may adjust the amplitude of that audio data so that it is higher than the preset value, and discard the audio data corresponding to the determined sound object. If the electronic device 100 determines that the amplitude of the audio data corresponding to "cicada sound" extracted from the environmental sound is higher than the preset value, the electronic device 100 may retain that audio data and discard the audio data corresponding to the determined sound object.
When the transparent policy is not to isolate the environmental sound, the electronic device 100 may synthesize the environmental sound with the audio data of each sound object determined above to obtain the target audio data, and play the target audio data.
2. One type of audio data is continuously played, and a scene of another type of audio data is sporadically played.
2.1, continuously playing audio data and sporadically playing audio data are played through the same electronic device.
By way of example, fig. 4 illustrates an application scenario in some embodiments of the present application. As shown in fig. 4, during the driving of the vehicle 200 by driver A toward a destination, driver A may navigate to the destination using the electronic device 100 located in the vehicle 200 and, at the same time, play music using the electronic device 100. That is, navigation-related software (such as Google Maps) and software related to playing music (such as Apple Music) are installed on the electronic device 100. In fig. 4, the electronic device 100 may be a device integrated in the vehicle 200, such as an in-vehicle terminal, or a device separate from the vehicle 200, such as the mobile phone of driver A, which is not limited herein. When the electronic device 100 is integrated in the vehicle 200, the electronic device 100 may directly broadcast the audio data it needs to broadcast using the speakers in the vehicle 200. When the electronic device 100 is disposed separately from the vehicle 200, a connection may be established between the electronic device 100 and the vehicle 200 by, but not limited to, short-range communication (e.g., Bluetooth); in that case, the electronic device 100 may transmit the audio data to be broadcasted to the vehicle 200 and broadcast it through a speaker on the vehicle 200, or the electronic device 100 may broadcast the audio data through its built-in speaker.
Generally, when the navigation broadcast and the music playback are concurrent, that is, when both sounds need to be played at the same time, the electronic device 100 may reduce the volume of the music and play the navigation sound at normal volume, where normal volume can be understood as the volume of the music before it was reduced. After the navigation broadcast is completed, the electronic device 100 restores the music to normal volume. With this approach of lowering the music, the user hears only the navigation broadcast and barely perceives the music; that is, the user's music experience is greatly sacrificed.
In view of this, the embodiment of the present application provides a sound processing method, when the sound of the navigation broadcasting and the sound of the music broadcasting are concurrent, so that the user can have better listening experience for the sound of the music broadcasting while obtaining the sound of the navigation broadcasting.
By way of example, fig. 5 illustrates a sound processing method in some embodiments of the present application. In fig. 5, the electronic device 100 may be a device integrated in the vehicle 200, such as an in-vehicle terminal, or a device separate from the vehicle 200, such as driver A's mobile phone. In addition, in fig. 5, navigation-related software (e.g., Google Maps) and software related to playing music (e.g., Apple Music) are installed on the electronic device 100, and the user is navigating from one location to another using the electronic device 100 while playing music on it. As shown in fig. 5, the method may include the following steps:
s501, the electronic device 100 obtains second audio data to be played in the process of playing the first audio data.
In this embodiment, the electronic device 100 may obtain another audio data to be played during the process of playing the another audio data. The first audio data may be music data played by the electronic device 100, and the second audio data may be navigation data required to be played by the electronic device 100.
S502, the electronic device 100 extracts third audio data to be played from the first audio data according to the second audio data, wherein playing time periods corresponding to the second audio data and the third audio data are the same.
In this embodiment, the electronic device 100 may extract, from the first audio data, third audio data to be played according to an initial playing time and a data length of the second audio data, where the initial playing time of the third audio data is the same as the initial playing time of the second audio data, and the data length of the third audio data is the same as the data length of the second audio data. That is, the playing time periods corresponding to the second audio data and the third audio data are the same.
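A sketch of this extraction step, under the assumption that both streams are sample arrays at the same sample rate (the names are illustrative):

def extract_aligned_segment(first_audio, nav_start_sample, nav_num_samples):
    # Extract, from the music stream (first audio data), the segment that will be
    # played during the same period as the navigation clip (second audio data):
    # same start position and same length.
    return first_audio[nav_start_sample:nav_start_sample + nav_num_samples]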
S503, the electronic device 100 performs voice cancellation or voice reduction processing on the third audio data to obtain fourth audio data.
In this embodiment, when the voice cancellation processing is required, the electronic device 100 may input the third audio data into the pre-trained voice cancellation model, and perform the voice cancellation processing on the third audio data to obtain the fourth audio data. When the voice reduction processing is required, the electronic device 100 may input the third audio data into the pre-trained voice reduction model, and perform the voice reduction processing on the third audio data to obtain the fourth audio data. As to whether the voice cancellation process or the voice reduction process is selected, it may be, but not limited to, a preset. Wherein, since the fourth audio data is obtained by processing the third audio data, and the playing time periods corresponding to the second audio data and the third audio data are the same, the playing time periods corresponding to the second audio data and the fourth audio data are also the same.
As a possible implementation, the electronic device 100 may also input the third audio data to the high-pass filter to filter out data of a specific frequency when performing the voice cancellation process or the voice reduction process. The electronic device 100 may then mix the channels of data output via the high pass filter to eliminate human voice. Finally, the electronic device 100 may input the data after channel mixing to a low-pass filter to filter out data of a specific frequency, thereby obtaining fourth audio data.
For example, when channel mixing is performed, taking the left and right channels as an example, the proportion of the audio signal of each original channel contained in each new channel may be set, for example: the percentage a1 of the original left channel in the new left channel; the percentage a2 of the original right channel in the new left channel; the percentage b1 of the original left channel in the new right channel; and the percentage b2 of the original right channel in the new right channel. The values of a1, a2, b1, and b2 are between -100 and 100, the new left-channel sample value is NewLeft = a1 × Left + a2 × Right, and the new right-channel sample value is NewRight = b1 × Left + b2 × Right.
When the voice cancellation processing is selected, in order to subtract the left and right channels from each other, the four channel-mixing values are respectively: 100, -100, -100, 100, thus generating a stereo waveform in which the left and right channel waveforms are opposite. When the two original-channel waveforms within one new channel are added, the vocal component common to both channels cancels out, and the elimination of the human voice is completed.
When the voice reduction processing is selected, the four channel-mixing values may be changed according to a preset reduction ratio. For example, when the volume of the human voice is to be reduced by half, the four values may be respectively: 100, -50, -50, 100. In this way, when the two waveforms within one channel are added, half of the common vocal component is cancelled, and the reduction of the human voice is completed.
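A sketch of the channel mixing described above, with the percentages expressed as fractions; the coefficient sets in the comments follow the values given in the text, and everything else is an illustrative assumption:

def mix_channels(left, right, a1, a2, b1, b2):
    # new_left = a1*left + a2*right, new_right = b1*left + b2*right
    # (coefficients given as fractions of the original channels).
    new_left = [a1 * l + a2 * r for l, r in zip(left, right)]
    new_right = [b1 * l + b2 * r for l, r in zip(left, right)]
    return new_left, new_right

# Voice cancellation: subtract the channels so centered vocals cancel out.
# cancel_l, cancel_r = mix_channels(left, right, 1.0, -1.0, -1.0, 1.0)
# Voice reduction by half: only partially subtract the opposite channel.
# reduce_l, reduce_r = mix_channels(left, right, 1.0, -0.5, -0.5, 1.0)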
S504, the electronic device 100 determines a first gain to be adjusted for the second audio data according to the second audio data, and adjusts the gain of each channel in the second audio data based on the first gain to obtain fifth audio data.
In this embodiment, the electronic device 100 may first extract the audio features of the second audio data, such as the time domain features. And then, according to the determined audio characteristics, determining a first gain required to be adjusted by the second audio data. The time domain features may include, among other things, loudness, envelope energy, or short time energy, etc.
When the audio feature of the second audio data is the loudness, the amplitude of the waveform at each moment can be determined from the waveform diagram of the second audio data in the time domain, so as to determine the loudness at each moment. Where one amplitude is the loudness of one moment. In addition, specific loudness, such as maximum loudness, etc., may be selected according to demand.
When the envelope energy needs to be extracted as the audio feature of the second audio data, an envelope corresponding to the second audio data is first constructed based on the waveform diagram of the second audio data in the time domain; the area of the figure enclosed by the envelope is then calculated by integration to obtain the average envelope energy of the second audio data in the time domain, which is the required envelope energy. For example, the amplitudes at adjacent moments on the time-domain waveform diagram may be compared: when the amplitude at the later moment is greater than the amplitude at the earlier moment, the line connecting the amplitude peaks between the two moments is controlled to rise based on the difference between the two amplitudes and a preset control factor; when the amplitude at the later moment is smaller than that at the earlier moment, the connecting line is controlled to fall based on the difference and the preset control factor; the curve finally formed is the envelope corresponding to the second audio data. In some embodiments, the envelope may be understood as the curve traced by the amplitude of the second audio data over time on the time-domain waveform diagram. Fig. 6a illustrates the time-domain waveform diagram of the second audio data, and fig. 6b illustrates the corresponding envelope curve; the area between the envelope curve and the horizontal axis in fig. 6b is the average envelope energy corresponding to the second audio data.
When the audio feature of the second audio data is short-time energy, the waveform diagram of the second audio data in the time domain can be used for determining the amplitude of the waveform at each moment, and square summing is carried out on the amplitudes of the waveforms at each moment to obtain the short-time energy of the second audio data.
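The time-domain features described above (loudness as per-moment amplitude, envelope energy as the area under a peak-following curve, and short-time energy as the sum of squared amplitudes) could be computed roughly as follows; the framing parameters and the simple envelope follower are assumptions:

def frame_signal(samples, frame_len=1024, hop=512):
    # Split the audio into overlapping frames.
    return [samples[i:i + frame_len] for i in range(0, max(1, len(samples) - frame_len + 1), hop)]

def max_loudness(frame):
    # Loudness here is taken as the per-sample amplitude; return the frame maximum.
    return max(abs(s) for s in frame)

def short_time_energy(frame):
    # Sum of squared amplitudes within the frame.
    return sum(s * s for s in frame)

def envelope_energy(frame, attack=0.5, release=0.05):
    # Rough envelope follower: the envelope rises/falls toward the amplitude with a
    # control factor, and the accumulated envelope approximates the envelope energy.
    env, area = 0.0, 0.0
    for s in frame:
        a = abs(s)
        factor = attack if a > env else release
        env += factor * (a - env)
        area += env
    return area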
After the audio feature of the second audio data is acquired, the electronic device 100 may determine the first gain based on the determined audio feature and a preset first gain calculation formula. Illustratively, the first gain calculation formula may be:
g = w_1 × (K_1 − x_1) + w_2 × (K_2 − x_2) + … + w_n × (K_n − x_n)   (Equation 1)
where g is the gain; w_n is the preset nth weight; K_n is the preset nth threshold; and x_n is the maximum value of the nth audio feature, e.g., the maximum loudness.
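Equation 1 maps directly to a small function; the weights and thresholds are preset configuration values, so the numbers in the commented call are placeholders:

def first_gain(features, weights, thresholds):
    # Equation 1: g = sum_n w_n * (K_n - x_n), where x_n is the maximum value of
    # the nth audio feature, K_n its preset threshold and w_n its preset weight.
    return sum(w * (k - x) for w, k, x in zip(weights, thresholds, features))

# Illustrative call with placeholder values: one feature (maximum loudness).
# g = first_gain(features=[0.8], weights=[6.0], thresholds=[0.9])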
In some embodiments, when determining the first gain, the electronic device 100 may further perform frame-splitting processing on the second audio data to obtain at least one audio frame. The electronic device 100 may then obtain the loudness and/or short-term energy, etc., corresponding to each audio frame in the manner previously described.
Further, when the audio feature is loudness, a maximum loudness may be selected from the loudness corresponding to each audio frame, and substituted into the above "formula 1", so as to obtain the first gain.
When the audio feature is envelope energy, a maximum envelope energy may be selected from the envelope energies corresponding to the respective audio frames, and substituted into the above "formula 1", so as to obtain the first gain.
When the audio features are short-time energy, a maximum short-time energy can be selected from short-time energy corresponding to each audio frame, and substituted into the above formula 1, so that the first gain can be obtained.
When the audio features are loudness and envelope energy, a maximum loudness can be selected from the loudness corresponding to each audio frame, and a maximum envelope energy can be selected from the envelope energy corresponding to each audio frame, and the two are substituted into the formula 1, so that the first gain can be obtained.
When the audio features are the loudness and the short-time energy, a maximum loudness can be selected from the loudness values corresponding to the respective audio frames and a maximum short-time energy can be selected from the short-time energies corresponding to the respective audio frames, and both are substituted into the above Formula 1 to obtain the first gain.
When the audio features are the loudness, the envelope energy, and the short-time energy, a maximum loudness can be selected from the loudness values corresponding to the respective audio frames, a maximum envelope energy from the envelope energies corresponding to the respective audio frames, and a maximum short-time energy from the short-time energies corresponding to the respective audio frames, and the three are substituted into the above Formula 1 to obtain the first gain.
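A minimal sketch of Formula 1 is given below, assuming the per-frame maxima of the selected features have already been computed; the weight values, threshold values, and the numbers in the usage example are illustrative assumptions, not values defined by this application.

```python
def first_gain(feature_maxima, thresholds, weights):
    # g = w1*(K1 - x1) + w2*(K2 - x2) + ... + wn*(Kn - xn)
    return sum(w * (k - x) for w, k, x in zip(weights, thresholds, feature_maxima))

# Example: loudness and short-time energy chosen as the two audio features.
g1 = first_gain(feature_maxima=[0.62, 18.0],
                thresholds=[0.9, 25.0],
                weights=[0.5, 0.02])
```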
After determining the first gain to be adjusted for the second audio data, the electronic device 100 may adjust the gain of each channel in the second audio data based on the first gain to obtain fifth audio data.
In some embodiments, when the maximum loudness value corresponding to the second audio data exceeds a certain value, it indicates that the loudness of the second audio data already meets the requirement. In this case, when the first gain is determined from the second audio data and its unit is expressed in decibels, the value of the first gain may be set to 0 to reduce the subsequent amount of computation. The fifth audio data obtained later is then identical to the second audio data.
S505, the electronic device 100 determines a second gain to be adjusted for the fourth audio data according to the fourth audio data, and adjusts the gain of each channel in the fourth audio data based on the second gain to obtain sixth audio data.
In this embodiment, the electronic device 100 may first extract audio features of the fourth audio data, such as time domain features, music theory features, or frequency domain features, and then determine the second gain to be adjusted for the fourth audio data according to the determined audio features. The time domain features may include loudness and/or short-time energy, etc. The music theory features may include beat, key, chord, pitch, timbre, melody, emotion, etc. The frequency domain features may include the spectral energy of a plurality of preset frequency bands, etc.
For determining the time domain features, reference may be made to the description in S504, which is not repeated here.
For determining the music theory feature, the electronic device 100 may input the fourth audio data to a music theory feature determination model trained in advance, to obtain the music theory feature of the fourth audio data. For example, the music theory feature determination model may be obtained by training the audio data for training using a gaussian process model, a neural network model, a support vector machine, or the like. In addition, the mode contained in the fourth audio data can be determined based on a Krumhansl-Schmuckler tonality analysis algorithm. Further, emotion and the like included in the fourth audio data may be determined based on the Thayer emotion model.
For determining the frequency domain features, the electronic device 100 may perform a short-time Fourier transform (STFT) on the fourth audio data to convert it from the time domain to the frequency domain, obtaining a spectrogram corresponding to the fourth audio data, and then obtain the spectral energy corresponding to the fourth audio data from this spectrogram. For example, the fourth audio data may be divided into n frequency bands, each frequency in each band corresponding to one spectral energy value, and the spectral energy corresponding to each band may be obtained by summing or averaging the spectral energies of the frequencies in that band. For example, fig. 7 shows a spectrogram obtained by performing a short-time Fourier transform on the fourth audio data, with frequency on the horizontal axis and spectral energy on the vertical axis; the fourth audio data is divided into 3 frequency bands, each frequency in each band corresponds to one spectral energy value, and the spectral energy corresponding to a given band (such as band 1) can be obtained by summing or averaging these spectral energies.
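The band-energy computation can be illustrated with the following sketch, assuming SciPy is available and that the band edges (in Hz) are preset, illustrative values; this is not the patent's own code.

```python
import numpy as np
from scipy.signal import stft

def band_energies(samples, sample_rate, band_edges=(0, 300, 2000, 8000)):
    f, t, z = stft(samples, fs=sample_rate)        # spectrogram of the audio data
    power = np.abs(z) ** 2                         # spectral energy per (frequency, time) bin
    energies = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        mask = (f >= lo) & (f < hi)
        energies.append(float(power[mask].sum()))  # sum (or mean) over the band
    return energies
```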
After determining the music theory feature and/or the frequency domain feature of the fourth audio data, the second gain to be adjusted for the fourth audio data can be determined based on a preset second gain calculation formula. Illustratively, the second gain calculation formula may be:
g = w1*x1 + w2*x2 + … + wn*xn (Formula 2)
where g is the gain, wn is the preset nth weight value, and xn is the value of the nth audio feature.
After determining the second gain to be adjusted for the fourth audio data, the electronic device 100 may adjust the gain of each channel in the fourth audio data based on the second gain to obtain sixth audio data.
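The following is a minimal sketch of Formula 2 and of applying the resulting gain to every channel, assuming the gain is a linear amplification factor and that the audio is held as a (channels, samples) NumPy array; the weights are preset, illustrative values.

```python
import numpy as np

def second_gain(feature_values, weights):
    # g = w1*x1 + w2*x2 + ... + wn*xn
    return sum(w * x for w, x in zip(weights, feature_values))

def apply_gain(audio: np.ndarray, gain: float) -> np.ndarray:
    # Adjust every channel of the fourth audio data by the same gain
    # to obtain the sixth audio data.
    return audio * gain
```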
In some embodiments, when determining the second gain, the electronic device 100 may further perform frame-splitting processing on the fourth audio data to obtain at least one audio frame. The electronic device 100 may then perform a short-time Fourier transform (STFT) on each audio frame to convert it from the time domain to the frequency domain, obtaining a spectrogram corresponding to each audio frame, and obtain the spectral energy corresponding to each audio frame from its spectrogram. The audio frame with the largest spectral energy can then be selected as the required audio frame, and this audio frame is processed in the manner described above for determining the time domain, music theory, or frequency domain features, so as to obtain the second gain to be adjusted for the fourth audio data.
In some embodiments, after determining the second gain based on the foregoing manner, in order to make the sound generated by playing the fifth audio data later more easily perceived, the second gain may be further modified based on a preset linear relationship between the first gain and the second gain, so as to obtain the required second gain. Illustratively, the linear relationship between the first gain and the second gain may be:
g = g1*K + g2
where g is the corrected second gain, g1 is the first gain, g2 is the second gain before correction, and K is a constant.
S506, the electronic device 100 plays the fifth audio data and the sixth audio data simultaneously.
In this embodiment, after the electronic device 100 acquires the fifth audio data and the sixth audio data, the fifth audio data and the sixth audio data may be played simultaneously. In this way, the user can clearly perceive the information contained in the original second audio data and can also clearly perceive the tune, background sound, and so on of the original first audio data, so that the user's listening needs are met more effectively and the user experience is improved.
In some embodiments, in determining the second gain to be adjusted for the fourth audio data, the second gain may be determined according to the fifth audio data (i.e., the data obtained by adjusting the second audio data based on the first gain) in addition to the manner described in S505.
For example, the second gain may be determined according to the maximum loudness value of the fifth audio data and a ratio between the maximum loudness value of the second audio data and the maximum loudness value of the fourth audio data calculated in real time. Or determining the second gain according to the maximum loudness value of the fifth audio data and the preset ratio between the maximum loudness value of the second audio data and the maximum loudness value of the fourth audio data.
For example, if the ratio between the maximum loudness of the second audio data and the maximum loudness of the fourth audio data is f, the current maximum loudness of the second audio data is A, and the maximum loudness of the fifth audio data is B, then from f and B the maximum loudness of the sixth audio data (i.e., the data obtained by adjusting the fourth audio data based on the second gain) can be determined as fB. From the difference between fB and A, the loudness value by which the fourth audio data needs to be adjusted can be determined, and according to the mapping relationship between loudness values and gains, the second gain to be adjusted for the fourth audio data can be determined.
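A minimal sketch of this ratio-based determination is given below; the `loudness_to_gain` callable stands in for the loudness-to-gain mapping mentioned above and is an assumption made for illustration.

```python
def second_gain_from_ratio(f, max_loudness_fifth, current_max_loudness, loudness_to_gain):
    target = f * max_loudness_fifth            # desired maximum loudness of the sixth audio data (fB)
    delta = target - current_max_loudness      # loudness change still required
    return loudness_to_gain(delta)             # map the loudness change to a gain
```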
In some embodiments, in S505, after determining the second gain to be adjusted for the fourth audio data, the second gain may be compared with a predetermined gain value (e.g., 0, 0.1, etc.). When the second gain is greater than the preset gain value, it indicates that the sound generated by playing the fourth audio data is relatively small and has little influence on the sound generated by playing the fifth audio data obtained in S504, so the determined value of the second gain can be updated to the preset gain value. For example, if the unit of the second gain is a standardized value (such as a magnification factor), the determined value of the second gain is 0.2, and the predetermined gain value is 0.1, the value of the second gain may be adjusted from 0.2 to 0.1.
In some embodiments, in S505, when the gain of each channel in the fourth audio data is adjusted based on the second gain to obtain the sixth audio data, the gain to be adjusted may be ramped gradually, in a certain step size, from a preset value (such as 0, 1, etc.) to the second gain within a preset time period after playback starts, and ramped from the second gain back to a preset value (such as 0, 1, etc.) within a preset time period before playback ends.
Alternatively, the gain to be adjusted may be ramped from the preset value to the second gain within a preset time period before playback starts, and ramped from the second gain back to the preset value within a preset time period after playback ends.
The gain to be adjusted may also be ramped from the preset value to the second gain within a preset time period before playback starts, and ramped from the second gain back to the preset value within a preset time period before playback ends.
The gain to be adjusted may also be ramped from the preset value to the second gain within a preset time period after playback starts, and ramped from the second gain back to the preset value within a preset time period after playback ends.
In each of these cases, when playback transitions from other data in the first audio data to the sixth audio data, or from the sixth audio data back to other data in the first audio data, sudden changes in volume are avoided and the user experience is improved.
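A minimal sketch of such a step-wise gain ramp is given below, assuming linear gains and a (channels, samples) NumPy array; the ramp duration and the start/end value are illustrative parameters.

```python
import numpy as np

def apply_ramped_gain(audio, gain, sample_rate, ramp_seconds=0.5, rest_value=1.0):
    n = audio.shape[-1]
    ramp_len = min(int(ramp_seconds * sample_rate), n // 2)
    if ramp_len == 0:
        return audio * gain
    curve = np.full(n, gain, dtype=float)
    # Gradually step from the preset value up to the second gain at the start ...
    curve[:ramp_len] = np.linspace(rest_value, gain, ramp_len)
    # ... and from the second gain back to the preset value at the end.
    curve[-ramp_len:] = np.linspace(gain, rest_value, ramp_len)
    return audio * curve
```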
2.2, continuously played audio data and sporadically played audio data are played through different electronic devices.
By way of example, fig. 8 illustrates another sound processing method in some embodiments of the present application. In fig. 8, the first device and the second device are separate devices, and the first device and the second device may be connected by, but not limited to, short-range communication such as Bluetooth. In fig. 8, the first device is configured with software capable of continuously playing audio data (e.g., a music or video application), or the first device may be a device capable of continuously playing audio data, such as a smart television or a smart speaker, and the first device plays audio data using its own speakers. The second device is configured with software capable of sporadically playing audio data (e.g., a call or navigation application); the sound generated on the second device is played through its own speaker. The first device may be a smart television, a smart speaker, an in-vehicle terminal, or the like; the second device may be a mobile phone, a tablet computer, etc. As shown in fig. 8, the method may include the following steps:
s801, when the second device needs to broadcast audio data, the second device sends a first message to the first device, wherein the first message is used for indicating the first device to execute voice elimination or voice reduction operation.
In this embodiment, when the second device needs to broadcast audio data, the second device may send a first message to the first device to instruct the first device to perform a voice cancellation or voice reduction operation.
Illustratively, in a home scenario, the second device may be a mobile phone, and the first device may be a smart speaker, a smart television, or the like. In this scenario, the first device may be playing music, a television series, a movie, or the like, and the audio data to be broadcast by the second device may be the audio data to be broadcast when the user uses the second device to make a call. That is, in a home scenario, when the user needs to use the second device for a call (e.g., when the second device receives an incoming call, or when the user places an outgoing call on the second device), the second device may send a message to the first device instructing the first device to perform a voice cancellation or voice reduction operation.
In a driving scenario, the second device may be a mobile phone (e.g., the electronic device 100 shown in fig. 5), and the first device may be an in-vehicle terminal (e.g., the vehicle 200 shown in fig. 4). In this scenario, the first device may be playing music, and the audio data to be played by the second device may be audio data to be played by the second device when the user uses the second device to navigate or talk. That is, in a driving scenario, when the second device needs to play navigation audio data or the user needs to talk using the second device, the second device may send a message to the first device indicating that the first device performs a voice cancellation or voice reduction operation.
S802, the first device responds to the first message to perform voice elimination or voice reduction operation on the audio data to be played.
In this embodiment, when the voice cancellation operation is selected, the first device may perform the voice cancellation operation on the audio data to be played by using the voice cancellation method described above. When the voice reduction operation is selected, the first device can perform the voice reduction operation on the audio data to be played by the first device in the voice reduction mode.
In some embodiments, when the audio data to be broadcast by the second device is navigation audio data, the first message may include the initial playing time and the data length of the navigation audio data. After acquiring the first message, the first device may extract, from the audio data to be played, sub-data whose initial playing time is the same as the initial playing time of the navigation audio data and whose data length is equal to the data length of the navigation audio data, and perform the voice cancellation operation on this sub-data.
S803, the second device broadcasts the audio data, and the first device broadcasts the audio data after the voice is eliminated or reduced.
S804, when the second device finishes broadcasting the audio data, the second device sends a second message to the first device, wherein the second message is used for indicating the first device to stop executing the voice eliminating or voice reducing operation.
In this embodiment, when the second device finishes broadcasting the audio data, the second device sends a second message to the first device, where the second message is used to instruct the first device to stop performing the voice cancellation or voice reduction operation.
For example, when a user makes a call using the second device, the second device may inform the first device of the state in which the user ends the call when the user ends the call (e.g., when the user hangs up), so that the first device may stop performing the voice cancellation or voice reduction operation. When the user uses the second device to navigate, the second device can inform the first device of the state of ending the navigation broadcasting when the second device ends the navigation broadcasting, so that the first device can stop executing the voice eliminating or voice reducing operation.
S805, the first device responds to the second message, stops performing voice elimination or voice reduction operation on the audio data to be played, and plays the audio data without voice elimination or voice reduction.
In this way, in the process of broadcasting the audio data by the second device, the interference of the audio data played by the first device can be reduced, so that the user can clearly perceive the audio data played by the second device.
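The message exchange between the second device and the first device can be illustrated with a minimal sketch; the JSON field names and the `send` transport callable below are assumptions made for illustration, not an API defined by this application.

```python
import json

def notify_start(send, start_time=None, length=None):
    # First message: ask the first device to start the voice cancellation/reduction operation.
    msg = {"type": "first", "action": "cancel_or_reduce_voice"}
    if start_time is not None and length is not None:
        # For navigation audio, include the initial playing time and the data length.
        msg.update(start_time=start_time, length=length)
    send(json.dumps(msg))

def notify_end(send):
    # Second message: ask the first device to stop the operation.
    send(json.dumps({"type": "second", "action": "stop"}))
```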
3. A scene of audio data is played using speakers provided in a space.
3.1 a plurality of loudspeakers are arranged in space, and at least some of the loudspeakers are arranged according to certain requirements (e.g. 5.1.X, or 7.1.X, etc.), and the electronic device or other device is playing audio data using the loudspeakers.
By way of example, fig. 9 (A) shows an application scenario in some embodiments of the present application. As shown in fig. 9 (A), speakers may be disposed at fixed positions in a room according to a 5.1.X requirement so that a user can enjoy theater-level sound. In 5.1.X, 5 represents the number of speakers constructing the spatial surround sound, 1 represents a subwoofer, and X represents the number of speakers that need to be provided at the top of the room. In fig. 9 (A), the speaker 201 is arranged directly in front of the position where user A is located; the speaker 202 is disposed at the front right of user A, for example at a position offset 30 degrees to the right, taking the line between the position of user A and the speaker 201 as the reference line and the position of user A as the center; the speaker 203 is disposed at the rear right of user A, for example at a position offset 120 degrees to the right with respect to the same reference line and center; the speaker 204 is disposed at the rear left of user A, for example at a position offset 120 degrees to the left with respect to the same reference line and center; the speaker 205 is disposed at the front left of user A, for example at a position offset 30 degrees to the left with respect to the same reference line and center. By adjusting the gains of the audio signals output by the speakers 201, 202, 203, 204, and 205, user A can enjoy spatial surround sound at the current location.
Fig. 9 (B) shows another application scenario in some embodiments of the present application. As shown in fig. 9 (B), speakers may be arranged at fixed positions in a room according to a 7.1.X requirement so that a user can enjoy theater-level sound. In fig. 9 (B), the speaker 201 is arranged directly in front of the position where user A is located; the speaker 202 is disposed at the front right of user A, for example at a position offset 30 degrees to the right, taking the line between the position of user A and the speaker 201 as the reference line and the position of user A as the center; the speaker 203 is disposed directly to the right of user A, for example at a position offset 90 degrees to the right with respect to the same reference line and center; the speaker 204 is disposed at the rear right of user A, for example at a position offset 150 degrees to the right with respect to the same reference line and center; the speaker 205 is disposed at the rear left of user A, for example at a position offset 150 degrees to the left with respect to the same reference line and center; the speaker 206 is disposed directly to the left of user A, for example at a position offset 90 degrees to the left with respect to the same reference line and center; the speaker 207 is disposed at the front left of user A, for example at a position offset 30 degrees to the left with respect to the same reference line and center. By adjusting the gains of the audio signals output by the speakers 201, 202, 203, 204, 205, 206, and 207, user A can enjoy spatial surround sound at the current location.
However, in fig. 9, when user A moves away from the current location, user A will no longer be able to enjoy the spatial surround sound at other locations.
In order to enable a user to enjoy space surround sound at any time and any place, the embodiment of the application provides a sound processing method, which can adjust the gain of audio signals output by each loudspeaker based on the distance between the user and each loudspeaker, so that the user can enjoy space surround sound at any time and any place.
By way of example, fig. 10 (a) shows yet another application scenario in some embodiments of the present application. The scene shown in (a) of fig. 10 is mainly different from the scene shown in fig. 9 in that: an image pickup device such as a camera 300 is arranged in the space shown in fig. 10 (a), and/or the user a carries the electronic apparatus 100.
In the scenario shown in fig. 10 (a), the camera 300 may acquire an image of the user a in space to determine the distance between the user a and each speaker from the acquired image.
In some embodiments, the camera 300 may be connected to a controller (not shown) for controlling each speaker through a wired network or a wireless network (such as bluetooth, etc.), so that the camera 300 may transmit the image acquired by the camera to the controller, so that the controller processes the image, such as inputting the image into a pre-trained image processing model, and the controller outputs the distance between the user a and each speaker according to the model. By way of example, the image processing model may be, but is not limited to being, trained based on convolutional neural networks (convolutional neural network, CNN). In other embodiments, the camera 300 may be connected to the electronic device 100 through a wireless network (such as bluetooth, etc.), so that the camera 300 may transmit the image acquired by the camera to the electronic device 100, so that the electronic device 100 may process the image, such as inputting the image into a pre-trained image processing model, and outputting, by the electronic device 100, the distance between the user a and each speaker according to the model.
In some embodiments, the electronic device 100 may establish a connection with the various speakers via a wireless network (e.g., bluetooth, etc.). At this time, in addition to the distance between the user a and each speaker, which can be determined by the image acquired by the camera 300, it can be determined based on the wireless communication signals between the electronic device 100 and each speaker, for example: the distance between the electronic device 100 and the respective speakers may be determined by a ranging method based on the strength indication (received signal strength indication, RSSI) of the received signal. Since the electronic device 100 is carried by the user a, the distance between the electronic device 100 and each speaker, i.e., the distance between the user a and each speaker, is determined. It should be understood that the execution subject for determining the distance between the user a and each speaker may be the electronic device 100, or may be a controller (not shown in the figure) for controlling each speaker, which is not limited herein. For example, when the electronic device 100 determines the distance between the electronic device 100 and a speaker for the execution subject, the distance between the electronic device 100 and a speaker may be determined by the following "formula one", which is:
d = 10^((abs(RSSI) − A) / (10 × n)) (Formula One)
where d is the distance between the electronic device 100 and the speaker; abs is the absolute value function; RSSI is the RSSI corresponding to the message sent by the speaker and acquired by the electronic device 100; A is the RSSI corresponding to the message sent by the speaker and acquired by the electronic device 100 when the electronic device 100 is 1 meter away from the speaker, a value that may be calibrated in advance; and n is an environmental attenuation factor, which may be an empirical value. When the controller for controlling each speaker is the execution subject that determines the distance between the electronic device 100 and the speaker, reference may be made to the manner in which the electronic device 100 determines this distance, which is not repeated here.
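A minimal sketch of this RSSI ranging step follows, assuming A (the RSSI at 1 meter) has been calibrated and n is an empirical attenuation factor; the default values are illustrative only.

```python
def rssi_to_distance(rssi: float, a: float = 45.0, n: float = 3.0) -> float:
    # d = 10 ** ((abs(RSSI) - A) / (10 * n)), with d in meters.
    return 10 ** ((abs(rssi) - a) / (10 * n))
```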
In some embodiments, after determining the distance between the electronic device 100 and the speakers, the distance between the electronic device 100 and at least three speakers may be processed using a three-point positioning method to obtain the location of the electronic device 100. In addition, the movement distance of the electronic device 100 can be acquired by the position of the electronic device 100 at different times. Because the electronic device 100 is carried by the user, the moving distance of the electronic device 100 is the moving distance of the user.
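The three-point positioning step can be sketched as a least-squares trilateration from the distances to at least three speakers with known coordinates (assumed here to be 2-D positions configured in advance); this is an illustration, not the patent's implementation.

```python
import numpy as np

def trilaterate(speaker_positions, distances):
    # Each circle: (x - xi)^2 + (y - yi)^2 = di^2. Subtracting the first circle
    # equation from the others linearizes the system in (x, y).
    (x1, y1), d1 = speaker_positions[0], distances[0]
    rows, rhs = [], []
    for (xi, yi), di in zip(speaker_positions[1:], distances[1:]):
        rows.append([2 * (xi - x1), 2 * (yi - y1)])
        rhs.append(d1**2 - di**2 + xi**2 - x1**2 + yi**2 - y1**2)
    pos, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return pos  # estimated (x, y) of the electronic device 100 / the user
```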
By way of example, fig. 10 (B) shows yet another application scenario in some embodiments of the present application. The scene shown in (B) of fig. 10 is mainly different from the scene shown in (a) of fig. 10 in that: in the space shown in fig. 10 (B), other speakers, such as speakers 208, 209, and the like, are also arranged outside the area surrounded by the speakers 201 to 205. In fig. 10 (B), when the user a moves to an area surrounded by the speakers 202, 208, 209, 210, and 203, the generation of spatial surround sound can be controlled in the area. It should be understood that the speakers disposed outside the area surrounded by the speakers 201 to 205 in the space shown in (B) of fig. 10 may be located in a space adjacent to the space in which the speakers 201 to 205 are located, except that the speakers 201 to 205 may be located in a space, which is not limited herein. In addition, fig. 10 shows a scenario in which speakers are configured according to the requirement of 5.1.X, and for a scenario in which speakers are configured according to other requirements, reference may be made to the description in fig. 10, which is not repeated here.
In some embodiments, in the scenario illustrated in fig. 10, user a may configure the position of the camera and/or the respective speakers in space on electronic device 100, and/or configure the identity of the camera and/or the respective speakers, etc. to facilitate a subsequent determination of the distance between user a and the respective speakers, and to facilitate a subsequent selection of the desired speaker. Illustratively, the electronic device 100 may have installed thereon an Application (APP) for configuring the camera and/or speaker, which APP may be logged in by user a for configuration. In other embodiments, in the scenario illustrated in fig. 10, each speaker may automatically identify its location in space based on distance from the electronic device and display in an APP interface installed by the electronic device 100. User a can also adjust the position of the individual speakers in space in the APP.
Next, a sound processing method provided in the embodiment of the present application will be described in detail based on the above description.
By way of example, fig. 11 illustrates a flow of a sound processing method in some embodiments of the present application. In fig. 11, a connection may be established between the electronic device 100 and the respective speakers, but is not limited to, via bluetooth. In fig. 11, the audio signal played in the speaker may be the audio signal in the electronic device 100, or may be the audio signal in another device, which is not limited herein. In fig. 11, the movement area of the user may be an area surrounded by speakers constructing spatial surround sound, such as: the area surrounded by the speakers 201 to 205 in fig. 10 (a) and the like may be other areas, such as: the area outside the area surrounded by the speakers 201 to 205 in (a) of fig. 10 is not limited here. In addition, the method shown in fig. 11 may be performed in real time, or may be performed when a certain condition is satisfied, for example, when it is detected that the distance moved by the user is greater than a certain threshold, which is not limited herein. As shown in fig. 11, the sound processing method includes the steps of:
S1101, the electronic device 100 determines distances between the electronic device 100 and N speakers to obtain N first distances, where N is a positive integer.
In this embodiment, the electronic device 100 may determine the distances between the image capturing device and each speaker based on the image captured by the image capturing device configured in the space where the user is located, so as to obtain N first distances. In addition, the electronic device 100 may also determine the distances between the electronic device and the respective speakers based on the wireless communication signals between the electronic device and the respective speakers, so as to obtain N first distances. Wherein N is a positive integer. Optionally, N is greater than or equal to 5.
In some embodiments, the N speakers may be speakers configured according to a certain requirement (e.g., 5.1.X or 7.1.X, etc.) to construct spatial surround sound. For example, the N speakers may be speakers 201 to 205 shown in (a) of fig. 10. In other embodiments, the N speakers may be all speakers in space, such as speakers 201 to 205, and speakers 208 to 210 shown in (B) of fig. 10.
S1102, the electronic device 100 screens out a target speaker from the N speakers based on the N first distances, where the distance between the target speaker and the electronic device 100 is the shortest.
In this embodiment, the electronic device 100 may sort the N first distances, for example, from large to small or from small to large, and select the smallest first distance from the N first distances, and use the speaker corresponding to the smallest first distance as the target speaker.
In some embodiments, the target speaker may also be other speakers, for example, a speaker farthest from the electronic device 100, etc., which may be specific to the actual situation, and is not limited herein.
S1103, the electronic device 100 determines gains to be adjusted for audio signals corresponding to the speakers except the target speaker based on the distance between the electronic device 100 and the target speaker, so as to construct a first speaker group, where the first speaker group is a combination of speakers obtained by virtualizing N speakers to a circle centered on the electronic device 100 and having a radius equal to the distance between the electronic device 100 and the target speaker. In some embodiments, an audio signal may be included in the audio data, but is not limited to including audio signals that each corresponding speaker is required to play. For example, each audio signal included in one audio data may correspond to one channel.
In this embodiment, the electronic device 100 may select a distance between the electronic device 100 and the target speaker as a reference, and determine gains to be adjusted for audio signals corresponding to other speakers according to the reference distance and distances between other speakers and the electronic device 100, so as to virtualize the other speakers to a circle with the distance between the electronic device 100 and the target speaker as a radius, thereby constructing the first speaker group.
In some embodiments, if the distance between the electronic device 100 and the target speaker is d1 and the distance between the electronic device 100 and one of the speakers other than the target speaker is d2, the gain to be adjusted for the audio signal corresponding to that speaker is gi = d2/d1. In addition, when determining the gain to be adjusted for the audio signal corresponding to another speaker, the electronic device 100 may also select another linear model, for example gi = Q*(d2/d1) + P, where Q and P are constants; this is not limited here.
In addition, in constructing the first speaker group, the electronic device 100 may record the gain that needs to be adjusted for each real speaker corresponding to the audio signal, so as to obtain the first gain set.
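A minimal sketch of S1103 is given below: it takes the shortest distance as the reference and computes, for every speaker, the gain that virtually moves it onto the circle of that radius, using the simple linear model gi = d2/d1 named above (gains are linear factors here).

```python
def first_gain_set(distances):
    d_ref = min(distances)                 # distance to the target (nearest) speaker
    return [d / d_ref for d in distances]  # 1.0 for the target speaker itself
```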
S1104, the electronic device 100 uses the current direction as a reference, and constructs a virtual speaker group based on the first speaker group, where the virtual speaker group is composed of M virtual speakers, the value of M is equal to the number of speakers required for constructing space surround sound, and the arrangement manner of each virtual speaker in the virtual speaker group is the same as the arrangement manner of speakers required for constructing space surround sound.
In this embodiment, the electronic device 100 may determine a virtual speaker in its direction based on the first speaker group, and then determine the remaining virtual speakers based on a predetermined speaker arrangement (such as 5.1.X or 7.1.X arrangement) required for building space surround sound, so as to build the virtual speaker group. Wherein a virtual speaker may be understood as a virtual speaker.
In some embodiments, when there is one speaker in the first speaker group located in the direction of the electronic device 100 or one speaker within a preset angle range of its direction, the speaker may be determined as a center speaker in the virtual speaker group. A center speaker may be understood as a speaker that is oriented in the electronic device 100 and in a 0 degree direction, such as speaker 201 shown in fig. 10 (a). Illustratively, the orientation of the electronic device 100 may be understood as the direction from the bottom of the electronic device 100 toward the top thereof. For the top and bottom of the electronic device 100, taking the electronic device 100 as a mobile phone as an example, as shown in fig. 12, the position of the earpiece 1201 of the mobile phone may be the top of the mobile phone, the position 1202 on the mobile phone opposite to the earpiece 1201 may be the bottom of the mobile phone, and the direction indicated by the arrow 1203 is the direction of the mobile phone. Alternatively, when the display screen of the electronic device 100 is not parallel to the horizontal plane, the orientation of the electronic device 100 may be determined by the projection of the electronic device 100 on the horizontal plane, and the orientation of the electronic device 100 may be the direction of the projection of the bottom of the electronic device 100 on the horizontal plane toward the projection of the top thereof on the horizontal plane.
When there is no speaker in the first speaker group located in the orientation of the electronic device 100, or no speaker within the preset angle range of that orientation, one speaker may be virtualized from the two speakers in the first speaker group that are adjacent to that orientation on the left and right, and the virtualized speaker may be regarded as the center speaker in the virtual speaker group. When a speaker is virtualized from two speakers in the first speaker group, it can be obtained by adjusting the gains of the audio signals corresponding to those two speakers. For example, with reference to fig. 12, when the preset angle is α, the preset angle range in the orientation of the electronic device 100 is the area subtended by the angle α. For example, a vector base amplitude panning (VBAP) algorithm may be used to virtualize one speaker from two speakers. It should be appreciated that when there is a speaker in the orientation of the electronic device 100, this can also be understood as virtualizing one speaker, except that this virtual speaker is essentially one speaker of the first speaker group.
For example, as shown in fig. 13, if the first speaker group includes speakers SP1 and SP2, the speakers SP1 and SP2 lie on a circle centered on the user U11, and the current orientation of the user U11 is the direction indicated by the vector P. In this case, the sound may be fixed at the position of the virtual speaker VSP1 using the positions of the speakers SP1 and SP2. For example, with the position of the user U11 as the origin O, a two-dimensional coordinate system is defined in which the vertical direction and the horizontal direction are the x-axis direction and the y-axis direction, respectively. In this two-dimensional coordinate system, the position of the virtual speaker VSP1 may be represented by the vector P. Since the vector P is a two-dimensional vector, it can be represented by a linear combination of vectors L1 and L2 extending from the origin O in the direction of the speaker SP1 and the direction of the speaker SP2, respectively, that is, P = g1*L1 + g2*L2. After calculating g1 and g2, the sound can be fixed at the position of the virtual speaker VSP1 by taking the coefficient g1 as the gain of the audio signal corresponding to the speaker SP1 and the coefficient g2 as the gain of the audio signal corresponding to the speaker SP2. In fig. 13, by adjusting the values of g1 and g2, the virtual speaker VSP1 can be located at an arbitrary position on the arc AR11 connecting the speakers SP1 and SP2.
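The panning step of fig. 13 can be illustrated with the following sketch, which solves P = g1*L1 + g2*L2 for the two gains; the speaker angles in the usage example are illustrative assumptions, not positions defined by this application.

```python
import numpy as np

def vbap_gains(l1, l2, p):
    # Columns of the basis matrix are the unit vectors toward the two speakers;
    # solving basis @ [g1, g2] = P gives the gains that place the virtual source at P.
    basis = np.column_stack([l1, l2])
    g1, g2 = np.linalg.solve(basis, np.asarray(p, dtype=float))
    return g1, g2

# Example: speakers 30 degrees to either side of straight ahead, virtual source straight ahead.
l1 = np.array([np.cos(np.radians(60)), np.sin(np.radians(60))])
l2 = np.array([np.cos(np.radians(120)), np.sin(np.radians(120))])
print(vbap_gains(l1, l2, p=[0.0, 1.0]))  # equal positive gains, as expected by symmetry
```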
After determining the virtual center speaker in the virtual speaker group, the remaining virtual speakers may be determined according to a preset speaker arrangement mode required for building space surround sound, thereby building the virtual speaker group. For example, taking the example of constructing a 5.1.X required set of virtual speakers, after determining the center speaker, the virtual speakers of the user U11 right front, right rear, left rear, and left front may be determined. Optionally, as described above, the virtual speaker at the right front of the user U11 is located at a position that is offset to the right by 30 degrees with respect to the line between the position of the user U11 and the speaker VSP1 as the reference line and with respect to the position of the user U11 as the center of the circle; the virtual speaker at the right rear of the user U11 is positioned at a position which is offset to the right by 120 degrees by taking the connecting line between the position of the user U11 and the speaker VSP1 as a datum line and taking the position of the user U11 as the circle center; the loudspeaker at the left rear of the user U11 is positioned at a position which is offset to the left by 120 degrees by taking the connecting line between the position of the user U11 and the loudspeaker VSP1 as a datum line and taking the position of the user U11 as the circle center; the speaker at the left front of the user U11 is located at a position offset to the left by 30 degrees with respect to the line between the position of the user U11 and the speaker VSP1 as a reference line and with respect to the position of the user U11 as a center. When the remaining virtual speakers are determined, when no speakers exist in the specific angle or the specific angle range, virtual speakers are obtained through the left speaker and the right speaker, and the method of determining the center speaker can be specifically referred to and will not be described in detail herein.
In the process of constructing the virtual speaker group according to the first speaker group, the gain required to be adjusted for the audio signal corresponding to each speaker in the first speaker group can be recorded, so as to obtain a second gain set.
In some embodiments, after the virtual speaker group is built, when the number of virtual speakers is obtained to be less than the number of speakers required for building the spatial surround sound, the required speakers may also be virtualized from the obtained virtual speakers. The method for virtually generating the required speaker from the obtained virtual speaker may refer to the method for determining the center speaker, which is not described herein.
S1105, the electronic device 100 controls the virtual speaker group to play audio data.
In this embodiment, after the electronic device 100 constructs the virtual speaker group, the virtual speaker group may be controlled to play audio data. The audio data played by the virtual speaker group can be obtained by adjusting the gain of each channel in the audio data according to the determined first gain set and the determined second gain set.
For example, as shown in fig. 14 (a), two speakers SP1 and SP2 are arranged in space, and the distance between the speaker SP1 and the user U11 (i.e., the electronic apparatus 100) is d1, and the distance between the speaker SP2 and the user U11 is d 2. As shown in fig. 14 (B), when the first speaker group is constructed, the speaker SP2 may be virtual to the circle C1 with d1 as a reference, and the speaker SP2' may be obtained. Next, as shown in fig. 14 (C), one virtual speaker VSP1 can be virtualized from the speakers SP1 and SP2' in constructing the virtual speaker group.
In fig. 14 (B), assuming that the gain to be adjusted for the audio signal corresponding to the speaker SP2 is determined as g1, since d1 is the reference, the gain of the audio signal corresponding to the speaker SP1 may not be adjusted, and in this case, the gain to be adjusted for the audio signal corresponding to the speaker SP1 may be set as g0. Wherein, when g0 is expressed in Decibel (DB), the value thereof may be 0; when the unit of g0 is a standardized value (such as a magnification, etc.), the value may be 1. Therefore, the first set of gains obtained in (B) of fig. 14 is: the gain to be adjusted for the audio signal corresponding to the speaker SP1 is g0, and the gain to be adjusted for the audio signal corresponding to the speaker SP2 is g1.
In fig. 14 (C), it is assumed that the determined gain to be adjusted for the speaker SP1 is g2 and the gain to be adjusted for the speaker SP2' is g3. Therefore, the second set of gains obtained in (C) of fig. 14 is: the gain to be adjusted for the audio signal corresponding to speaker SP1 is g2 and the gain to be adjusted for the channel corresponding to speaker SP2' is g3.
From the first gain set determined in (B) of fig. 14 and the second gain set determined in (C) of fig. 14, it may be determined that the gain of the audio signal corresponding to the speaker SP1 that is ultimately required to be adjusted is g2, the gain of the audio signal corresponding to the speaker SP2 that is ultimately required to be adjusted is gi=g1×g3, or gi=g1+g3, or the like. Wherein, when the unit of the gain to be adjusted is decibel, the addition mode can be adopted, and when the unit of the gain to be adjusted is standardized value, the multiplication mode can be adopted.
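A minimal sketch of combining the two gain sets into the gain finally applied per real speaker follows; it multiplies when the gains are standardized (linear) factors and adds when they are expressed in dB, as described above.

```python
def combine_gains(first_set, second_set, in_db=False):
    if in_db:
        return [a + b for a, b in zip(first_set, second_set)]   # dB gains add
    return [a * b for a, b in zip(first_set, second_set)]       # linear gains multiply
```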
Finally, the electronic device 100 may adjust the gain of the corresponding channel in the audio data based on determining the gain to be adjusted for the audio signal corresponding to each real speaker, so as to obtain the required audio data, and send the signal corresponding to the corresponding channel to the corresponding speaker, so as to make the sound feel as if it were generated by playing through the virtual speaker group. In this way, the sound perceived by the user is approximately generated at his or her side, so that the user can enjoy the spatial surround sound at any time and any place.
In some embodiments, when the distance between the user and a speaker is greater than a preset distance threshold, the time delay corresponding to each speaker may be determined, so that the speakers can play the same audio data synchronously. For example, the largest first distance may be selected as a reference, and the time delays of the other speakers may be determined from this distance. For example, if the determined reference distance is d1 and the distance between the electronic device 100 and one of the speakers is d2, the delay of that speaker is (d1 − d2)/v, where v is the propagation speed of sound in air.
After determining the corresponding time delays for the speakers, the electronic device 100 may control the speakers to play audio data according to the corresponding time delays.
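A minimal sketch of this delay alignment is given below, assuming distances in meters and the speed of sound in air of roughly 343 m/s; the farthest speaker gets zero delay and nearer speakers are delayed so that all signals arrive at the listener together.

```python
def speaker_delays(distances, speed_of_sound=343.0):
    d_ref = max(distances)                                  # farthest speaker as the reference
    return [(d_ref - d) / speed_of_sound for d in distances]  # delay in seconds per speaker
```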
In this way, the user adjusts the gain of each loudspeaker at any time and any place along with the movement of the user in the moving process, so that the user can enjoy the space surround sound at any time and any place.
In order to facilitate understanding of the above-described scheme, the following examples are illustrated.
For example, as illustrated in fig. 15 (A), 5 speakers, i.e., speakers SP1, SP2, SP3, SP4, and SP5, are arranged in the space, and the user, carrying the electronic device 100, moves within the area surrounded by the 5 speakers.
In fig. 15 (B), the position of the electronic device 100 is switched from the position a1 to the position a2, and at this time, the execution of the method in fig. 11 described above is triggered. Wherein it may be assumed that the speaker SP2 is oriented in the electronic device 100.
In fig. 15 (C), since the distance between the electronic device 100 and the speaker SP2 is the shortest, this distance may be selected as the reference distance, and the speakers SP1, SP3, SP4, and SP5 may be each virtual to a circle C1 having the reference distance as a radius and the position a2 as the center. In fig. 15 (C), the virtual speaker corresponding to the speaker SP1 is SP1', the virtual speaker corresponding to the speaker SP3 is SP3', the virtual speaker corresponding to the speaker SP4 is SP4', and the virtual speaker corresponding to the speaker SP5 is SP5'.
In fig. 15 (D), the electronic apparatus 100 can construct a virtual speaker group according to the requirement of 5.1. X. The virtual speaker group is composed of speakers SP2, VSP1, VSP2, SP4', and SP 1'. Wherein speaker VSP1 is virtually derived from speakers SP2 and SP3', and speaker VSP2 is virtually derived from speakers SP2 and SP 1'. It is understood that speakers SP2, SP4 'and SP1' are located at angles or within a range of angles that satisfy the condition.
Finally, the electronic device 100 may control the virtual speaker group to play audio data.
For example, as illustrated in fig. 16 (A), 7 speakers, i.e., speakers SP1, SP2, SP3, SP4, SP5, SP6, and SP8, are arranged in the space, and the electronic device 100 used by the user is located at the position a1.
In fig. 16 (B), the position of the electronic device 100 is switched from the position a1 to the position a2, and at this time, the execution of the method in fig. 11 described above is triggered. Wherein it may be assumed that the speaker SP5 is oriented in the electronic device 100.
In fig. 16 (C), since the distance between the electronic device 100 and the speaker SP5 is the shortest, this distance may be selected as the reference distance, and the speakers SP1, SP2, SP3, SP4, SP6, and SP8 may be each virtual to a circle C1 having the reference distance as a radius and the position a2 as the center. In fig. 16 (C), the virtual speaker corresponding to the speaker SP1 is SP1', the virtual speaker corresponding to the speaker SP2 is SP2', the virtual speaker corresponding to the speaker SP3 is SP3', the virtual speaker corresponding to the speaker SP4 is SP4', the virtual speaker corresponding to the speaker SP6 is SP6', and the virtual speaker corresponding to the speaker SP8 is SP8'.
In fig. 16 (D), the electronic apparatus 100 can construct a virtual speaker group according to the requirement of 5.1. X. The virtual speaker group is composed of speakers SP5, VSP1, SP6', VSP2, and SP 3'. Wherein speaker VSP1 is virtually derived from speakers SP1 'and SP6', and speaker VSP2 is virtually derived from speakers SP8 'and SP 4'. It is understood that speakers SP5, SP6 'and SP3' are located at angles or within an angle range that satisfies the condition.
Finally, the electronic device 100 may control the virtual speaker group to play audio data.
By way of example, fig. 17 illustrates a flow of a sound processing method in some embodiments of the present application. In fig. 17, a connection may be established between the electronic device 100 and the respective speakers, but is not limited to, via bluetooth. In fig. 17, the audio signal played in the speaker may be the audio signal in the electronic device 100 or may be the audio signal in another device, which is not limited herein. In fig. 17, the movement area of the user may be an area surrounded by speakers constructing spatial surround sound, such as: the area surrounded by the speakers 201 to 205 in fig. 10 (a) and the like may be other areas, such as: the area outside the area surrounded by the speakers 201 to 205 in (a) of fig. 10 is not limited here. In addition, the method shown in fig. 17 may be performed in real time, or may be performed when a certain condition is met, for example, when it is detected that the distance moved by the user is greater than a certain threshold, which is not limited herein. As shown in fig. 17, the sound processing method includes the steps of:
S1701, the electronic device 100 determines distances between the electronic device 100 and N speakers to obtain N first distances, where N is a positive integer.
S1702, the electronic device 100 builds a first virtual speaker group based on its orientation, where the first virtual speaker group is composed of M virtual speakers, and the value of M is the same as the number of speakers required to build the spatial surround sound.
In this embodiment, the electronic device 100 may determine one virtual speaker in the direction of the virtual speaker, and then sequentially determine the remaining virtual speakers based on a preset speaker arrangement mode (such as a 5.1.X or 7.1.X arrangement mode) required for building space surround sound, so as to build the first virtual speaker group.
In some embodiments, when there is one speaker in the orientation of the electronic device 100, or one speaker within a preset angular range in its orientation, the speaker may be designated as a center speaker in the virtual speaker group.
When there is no speaker in the direction of the electronic device 100 or there is no speaker in the preset angle range in the direction, one speaker may be virtually extracted from two speakers adjacent to each other in the direction, and the virtually extracted speaker may be used as a center speaker in the virtual speaker group. When a virtual loudspeaker is virtually formed by two real loudspeakers, a virtual loudspeaker can be virtually formed by adjusting the gains of the two real loudspeakers. Details are described in fig. 13, and are not repeated here.
In addition, when there is no speaker in the orientation of the electronic device 100 (or within the preset angle range of that orientation), and the distances between the electronic device 100 and the two speakers adjacent to that orientation are not equal, the two speakers can first be virtualized onto a circle centered on the electronic device 100 by adjusting the gain of the audio signal corresponding to at least one of the two speakers; then, a speaker VSP1 is virtually obtained in the manner shown in fig. 13. For example, with continued reference to fig. 14, as shown in fig. 14 (A), speakers SP1 and SP2 are not simultaneously on a circle centered on user U11 (i.e., the electronic device 100); the distances between speakers SP1 and SP2 and user U11 are d1 and d2, respectively, and d1 < d2. In one possible implementation, d1 may be used as the radius of the desired circle C1. Next, as shown in fig. 14 (B), the gain to be adjusted for the audio signal corresponding to the speaker SP2 may be determined in the manner described in the foregoing S1103, for example gi = d2/d1, so as to virtualize the speaker SP2 onto the circle C1 centered on the user U11 and having d1 as its radius; the speaker SP2' is the speaker virtually obtained from the speaker SP2, and in fig. 14 (B), d2' = d1. Thereafter, in fig. 14 (C), a virtual speaker VSP1 can be virtualized from the speakers SP1 and SP2' in the manner described for fig. 13, with the gain corresponding to the speaker SP1 being g1 and the gain corresponding to the speaker SP2' being g2. In this implementation, the total gain of the speaker SP1 is g1, and the total gain of the speaker SP2 is the product of gi and g2, or the sum of gi and g2: addition is used when the unit of the gain to be adjusted is dB, and multiplication is used when the unit of the gain to be adjusted is a standardized value (such as a magnification factor).
In another possible implementation, d2 may be taken as the radius of the desired circle C1; the gain to be adjusted for the audio signal corresponding to SP1 may then be determined in the manner described in S1103, and the speaker VSP1 virtualized accordingly. In yet another possible implementation, any value between d1 and d2 may be selected as the radius of the desired circle C1; the gains to be adjusted for the audio signals corresponding to speakers SP1 and SP2 may then be determined in a similar manner, and the speaker VSP1 finally virtualized. For the specific implementation, reference may be made to fig. 13 and its description, which are not repeated here.
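To make the gain arithmetic above concrete, the following Python sketch uses hypothetical distances and a hypothetical panning gain g2 (the actual panning step follows fig. 13, which is not reproduced here). It shows the ratio-based gain gi = d2/d1 that moves a speaker onto the reference circle, and how per-step gains applied to the same channel are combined: multiplied for normalized values, added for decibel values.

    import math

    def gain_to_circle(d_speaker: float, r_ref: float) -> float:
        # Linear gain that virtually moves a speaker at distance d_speaker onto a
        # circle of radius r_ref around the listener (gi = d / r_ref), as in the
        # example where SP2 (distance d2) is moved onto circle C1 (radius d1).
        return d_speaker / r_ref

    def combine_gains(gains, unit: str = "linear") -> float:
        # Combine the per-step gains applied to the same channel:
        # multiply normalized (linear) gains, add gains expressed in dB.
        if unit == "dB":
            return sum(gains)
        total = 1.0
        for g in gains:
            total *= g
        return total

    # Hypothetical values: SP2 at d2 = 2.0 m, reference circle radius d1 = 1.5 m.
    d1, d2 = 1.5, 2.0
    gi = gain_to_circle(d2, d1)      # gain applied to SP2's channel: d2 / d1
    g2 = 0.7                         # hypothetical panning gain for SP2' when forming VSP1
    print(combine_gains([gi, g2]))   # total linear gain for SP2's channel
    print(combine_gains([20 * math.log10(gi), 20 * math.log10(g2)], unit="dB"))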
Further, in determining the remaining virtual speakers, reference may be made to a process of determining a center speaker, which is not described herein.
In addition, in constructing the first virtual speaker group, the electronic device 100 may record the gain that needs to be adjusted for the audio signal corresponding to each real speaker, so as to obtain the first gain set.
S1703, the electronic device 100 determines distances between the electronic device 100 and each virtual speaker in the first virtual speaker group, so as to obtain M second distances.
In this embodiment, when the electronic device 100 constructs the first virtual speaker group, the group is constructed with reference to the distance between the electronic device 100 and a certain speaker, and each virtual speaker is formed on a circle centered on the electronic device 100 with that distance as its radius. Accordingly, the distance between a virtual speaker and the electronic device 100 equals the reference distance selected when constructing that virtual speaker. For example, with continued reference to fig. 14, the virtual speaker finally determined there is constructed with reference to the distance d1 between speaker SP1 and user U11 (i.e., the electronic device 100); therefore, the distance between that virtual speaker and user U11 (i.e., the electronic device 100) is d1.
S1704, the electronic device 100 screens out the target speaker from the M virtual speakers based on the M second distances, where the distance between the target speaker and the electronic device 100 is the shortest.
In this embodiment, the electronic device 100 may sort the M second distances, for example, from large to small or from small to large, and select the smallest one of the second distances, and use the virtual speaker corresponding to the smallest second distance as the target speaker.
In some embodiments, the target speaker may also be other virtual speakers, for example, a virtual speaker that is farthest from the electronic device 100, etc., which may be specific to the actual situation, and is not limited herein.
S1705, the electronic device 100 uses the distance between the electronic device 100 and the target speaker as a reference, and constructs a second virtual speaker group based on the first virtual speaker group, wherein the second virtual speaker group is a combination of virtual speakers obtained by virtualizing all M virtual speakers in the first virtual speaker group to a circle with the electronic device 100 as the center and the distance between the electronic device 100 and the target speaker as a radius.
In this embodiment, the electronic device 100 may select the distance between the electronic device 100 and the target speaker as the reference, and determine the gains to be adjusted for the audio signals corresponding to the other virtual speakers according to this reference distance and the distances between the other virtual speakers and the electronic device 100, so as to adjust all the other virtual speakers onto a circle whose radius is the distance between the electronic device 100 and the target speaker, thereby constructing the second virtual speaker group. In some embodiments, if the distance between the electronic device 100 and the target speaker is d1 and the distance between the electronic device 100 and one of the other virtual speakers is d2, the gain to be adjusted for the audio signal corresponding to that virtual speaker is gi = d2/d1. In addition, when determining the gain to be adjusted for the audio signal corresponding to another speaker, the electronic device 100 may also use a different linear model, for example gi = Q × (d1/d2) + P, where Q and P are constants, which is not limited in this application.
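As a minimal sketch of this step, assuming the ratio model gi = d2/d1 from the text and a hypothetical alternative linear model with illustrative constants Q and P, the following Python snippet derives the second gain set from a set of hypothetical first-group speaker distances:

    def second_group_gain(d_target: float, d_other: float,
                          model: str = "ratio", Q: float = 1.0, P: float = 0.0) -> float:
        # Gain to adjust for a virtual speaker at distance d_other so that it is
        # virtualized onto the circle whose radius is the target-speaker distance d_target.
        # "ratio" follows gi = d_other / d_target; "linear" follows gi = Q*(d_target/d_other) + P,
        # where Q and P are illustrative constants.
        if model == "ratio":
            return d_other / d_target
        return Q * (d_target / d_other) + P

    # Hypothetical first-group speaker distances (metres); the closest one is the target speaker.
    distances = {"VSP1": 1.2, "VSP2": 1.8, "SP3": 2.5}
    d_target = min(distances.values())
    second_gain_set = {name: second_group_gain(d_target, d) for name, d in distances.items()}
    print(second_gain_set)   # the target speaker itself gets gain 1.0 (no adjustment)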
In the process of constructing the second virtual speaker group according to the first virtual speaker group, the gain required to be adjusted for the audio signal corresponding to each virtual speaker in the first virtual speaker group can be recorded, so as to obtain a second gain set.
S1706, the electronic device 100 controls the second virtual speaker group to play audio data.
In this embodiment, after the electronic device 100 constructs the second virtual speaker group, the second virtual speaker group may be controlled to play audio data. The audio data played by the second virtual speaker group can be obtained by adjusting the gain of each channel in the audio data according to the determined first gain set and the determined second gain set.
For example, as shown in fig. 18 (A), three speakers SP1, SP2, and SP3 are arranged in space. Speaker SP3 is the one speaker located in the orientation direction of user U11 (i.e., the electronic device 100), and another desired speaker needs to be virtualized from speakers SP1 and SP2. As shown in fig. 18 (B), when the first virtual speaker group is constructed, speaker SP2 can be virtualized onto the circle C1 with d1 as the reference, so as to obtain speaker SP2'. Next, as shown in fig. 18 (C), a virtual speaker VSP1 can be virtualized from speakers SP1 and SP2'; at this time, the two speakers in the first virtual speaker group, namely speaker SP3 and virtual speaker VSP1, have been constructed. Next, as shown in fig. 18 (D), when the second virtual speaker group is constructed, the virtual speaker VSP1 may be virtualized onto the circle C2 with d3 as the reference to obtain the virtual speaker VSP1', thereby constructing the two speakers in the second virtual speaker group, namely speaker SP3 and virtual speaker VSP1'.
In fig. 18 (B), assume that the gain to be adjusted for the audio signal corresponding to speaker SP2 is determined as g1. Since d1 is the reference, the gain of the audio signal corresponding to speaker SP1 does not need to be adjusted; in this case, the gain to be adjusted for the audio signal corresponding to speaker SP1 may be denoted as g0. When g0 is expressed in decibels (dB), its value may be 0; when g0 is expressed as a normalized value (such as an amplification factor), its value may be 1. In fig. 18 (C), assume that the gain to be adjusted for speaker SP1 is determined as g2 and the gain to be adjusted for speaker SP2' is determined as g3. Therefore, in constructing the first virtual speaker group, the first gain set obtained from fig. 18 (B) and (C) is: the gain to be adjusted for the audio signal corresponding to speaker SP1 is (g0 × g2) or (g0 + g2); the gain to be adjusted for the audio signal corresponding to speaker SP2 is gi = g1 × g3 or gi = g1 + g3. The gains are added when expressed in decibels, and multiplied when expressed as normalized values.
In fig. 18 (D), assume that the gain to be adjusted for the audio signal corresponding to the virtual speaker VSP1 is determined as g4. Since d3 is the reference, the gain of the audio signal corresponding to speaker SP3 does not need to be adjusted; in this case, the gain to be adjusted for the audio signal corresponding to speaker SP3 may be denoted as g0. Therefore, in constructing the second virtual speaker group, the second gain set obtained from fig. 18 (D) is: the gain to be adjusted for the audio signal corresponding to speaker SP3 is g0, and the gain to be adjusted for the audio signal corresponding to the virtual speaker VSP1 is g4.
In fig. 18 (D), obtaining the virtual speaker VSP1' is equivalent to first virtualizing speakers SP1 and SP2 onto the circle C2 and then virtualizing VSP1' from those two speakers. Since speakers SP1, SP2', and VSP1 are on the same circle C1, the gains required for their channels when virtualizing them onto the circle C2 are equal. Thus, from the second gain set obtained in fig. 18 (D), the gain to be adjusted for the audio signals corresponding to the real speakers behind the virtual speaker VSP1 when constructing the second virtual speaker group can be determined: the gain to be adjusted for the channels corresponding to the two real speakers (i.e., speakers SP1 and SP2) is also g4.
Further, from the first gain set and the second gain set, it can be determined that the gain ultimately required to be adjusted for the audio signal corresponding to speaker SP1 is (g0 + g2 + g4) or (g0 × g2 × g4), the gain ultimately required for the audio signal corresponding to speaker SP2 is (gi + g4) or (gi × g4), and the gain required for the audio signal corresponding to speaker SP3 is g0. When the unit of the gains is decibels, the final gain is obtained by summing the respective gains; when the unit is a normalized value, the final gain is obtained by multiplying the respective gains.
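The composition of the two gain sets can be illustrated with the following sketch; the numeric gain values are hypothetical, and the speaker names simply mirror the fig. 18 example:

    # Hypothetical first and second gain sets from fig. 18, expressed as linear (normalized) gains.
    g0, g1, g2, g3, g4 = 1.0, 0.8, 0.9, 0.85, 0.75
    first_gain_set  = {"SP1": g0 * g2, "SP2": g1 * g3, "SP3": g0}
    second_gain_set = {"SP1": g4,      "SP2": g4,      "SP3": g0}

    def final_gains(first, second, unit="linear"):
        # Per real speaker, combine the gain from building the first virtual group with
        # the gain from building the second: multiply linear gains, add dB gains.
        if unit == "dB":
            return {sp: first[sp] + second[sp] for sp in first}
        return {sp: first[sp] * second[sp] for sp in first}

    print(final_gains(first_gain_set, second_gain_set))
    # SP1 -> g0*g2*g4, SP2 -> g1*g3*g4, SP3 -> g0, matching the text above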
Finally, the electronic device 100 may adjust the gain of the corresponding channel in the audio data based on determining the gain to be adjusted for the audio signal corresponding to each real speaker, so as to obtain the required audio data, and send the signal corresponding to the corresponding channel to the corresponding speaker, so as to make the sound feel as if it were generated by playing through the second virtual speaker group.
In this way, the sound perceived by the user is approximately generated at his or her side, so that the user can enjoy the spatial surround sound at any time and any place.
By way of example, fig. 19 illustrates a flow of a sound processing method in some embodiments of the present application. In fig. 19, a connection may be established between the electronic device 100 and each speaker, for example but not limited to, through Bluetooth. In fig. 19, the audio signal played by the speakers may be an audio signal in the electronic device 100 or an audio signal in another device, which is not limited here. In fig. 19, the movement area of the user may be an area surrounded by the speakers constructing the spatial surround sound, such as the area surrounded by speakers 201 to 205 in fig. 10 (A), or may be another area, such as an area outside the area surrounded by speakers 201 to 205 in fig. 10 (A), which is not limited here. In addition, the method shown in fig. 19 may be performed in real time, or may be performed when a certain condition is satisfied, for example when it is detected that the distance moved by the user is greater than a certain threshold, which is not limited here. As shown in fig. 19, the sound processing method includes the following steps:
S1901, the electronic device 100 screens K speakers from the N speakers with its orientation as a reference, where the K speakers are used to construct spatial surround sound.
In this embodiment, the electronic device 100 may determine one speaker in its direction, and then sequentially determine the remaining required speakers based on a preset arrangement manner (such as a 5.1.X or 7.1.X arrangement manner) of the speakers required for constructing space surround sound, so as to obtain K speakers.
In some embodiments, when there is one speaker in the direction of the electronic device 100, or there is one speaker within a preset angular range in the direction thereof, the speaker may be regarded as a desired speaker.
When there is no speaker in the orientation direction of the electronic device 100, or no speaker within the preset angular range of that direction, the two speakers adjacent to that direction on the left and right may be used as the desired speakers.
Further, in determining the remaining required speakers, reference may be made to a process of determining the required speakers in the orientation of the electronic device 100, which is not described herein.
S1902, the electronic device 100 constructs a virtual speaker group based on K speakers, where the virtual speaker group is a combination of virtual speakers obtained by virtualizing K speakers all on a circle centered on the electronic device 100. Alternatively, the distance between the electronic device 100 and one of the K speakers may be taken as a radius. The process of constructing the virtual speaker group may be referred to the description in fig. 11 or 17, and will not be repeated here.
S1903, the electronic device 100 controls the virtual speaker group to play audio data. The process of the electronic device 100 controlling the playing of the audio data is detailed in the foregoing description of fig. 11 or fig. 17, and will not be repeated here.
In this way, the sound perceived by the user is approximately generated at his or her side, so that the user can enjoy the spatial surround sound at any time and any place.
In some embodiments, the electronic device 100 may send, to each speaker, indication information for adjusting the volume, based on the determined gain required to be adjusted for the audio signal corresponding to each speaker. For example, a mapping relationship between the gain to be adjusted and the volume adjustment value of the audio signal corresponding to the speaker may be preset, and after determining the gain to be adjusted of the audio signal corresponding to the speaker, the electronic device 100 may query the mapping relationship to determine the volume adjustment value of the speaker, and further send indication information to the speaker, where the indication information may include the volume adjustment value.
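A minimal sketch of such a preset mapping is shown below; the gain ranges and volume steps are entirely hypothetical and only illustrate the lookup described above:

    # Hypothetical mapping from per-channel gain ranges (in dB) to volume adjustment steps.
    GAIN_TO_VOLUME_STEP = [
        (-12.0, -6.0, -3),   # large attenuation -> lower the volume by 3 steps
        (-6.0,  -2.0, -1),
        (-2.0,   2.0,  0),
        ( 2.0,   6.0,  1),
        ( 6.0,  12.0,  3),
    ]

    def volume_adjustment(gain_db: float) -> int:
        # Look up the volume adjustment value to send to a speaker for a given
        # per-channel gain, mimicking the preset mapping described above.
        for lo, hi, step in GAIN_TO_VOLUME_STEP:
            if lo <= gain_db < hi:
                return step
        return 0

    print(volume_adjustment(4.5))   # -> 1, sent to the speaker as indication information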
In some embodiments, the electronic device 100 may also control the loudness of the audio signals played by the real speakers that are not part of the virtual speaker group to be below a preset loudness value, so as to reduce interference from those speakers while keeping them responsive when they need to be used later. For example, the electronic device 100 may control each real speaker that is not part of the virtual speaker group to adjust its volume to the minimum, or adjust the gain of the audio signals corresponding to these speakers to the minimum, and so on. Of course, the electronic device 100 may also control the real speakers that are not part of the virtual speaker group to pause playback.
It should be noted that, in the method described in the above embodiment, in addition to the speakers located in the horizontal direction in the space, speakers in other directions may be processed to construct corresponding surround sound. For example, the speakers disposed at the top of the space may be processed, and the processing manner may refer to the foregoing manner, which will not be described in detail herein.
3.2 A plurality of speakers are arranged in space, a picture (such as a movie) is generated based on the electronic device, and the electronic device plays its audio data through the speakers arranged in space.
By way of example, fig. 20 (a) shows an application scenario in some embodiments of the present application. As shown in fig. 20 (a), 6 speakers, i.e., speakers SP1, SP2, SP3, SP4, SP5, and SP6, are arranged in the vehicle 200. The user U11 views a movie using the electronic device 100 on the right rear seat of the vehicle 200, and the electronic device 100 establishes a connection with the vehicle 200 by a short-range communication method such as bluetooth. The audio data in the electronic device 100 may be played through speakers in the vehicle 200 to obtain a better audible feel.
Fig. 20 (B) shows another application scenario in some embodiments of the present application. As shown in fig. 20 (B), speakers, i.e., speakers SP1, SP2, SP3, SP4, and SP5, are arranged at fixed positions in the room according to certain requirements (e.g., 5.1.X, etc.). The user U11 views a movie using the electronic device 100 on a seat in a room, and a connection is established between the electronic device 100 and a speaker in the room by a short-range communication method such as bluetooth. Audio data in the electronic device 100 may be played through speakers in the room to obtain a better audible feel.
Fig. 20 (C) shows yet another application scenario in some embodiments of the present application. As shown in fig. 20 (C), speakers SP1, SP2, SP3, and SP4 are arranged in the room, and a projection apparatus 400 is arranged. The user U11 is in a seat in a room, and can use the projection device 400 to project contents such as movies in the electronic device 100 onto the wall 500. The electronic device 100 may establish a connection with speakers in a room via short-range communication such as bluetooth. Audio data in the electronic device 100 may be played through speakers in the room to obtain a better audible feel.
In the scenarios shown in fig. 20 (A) and (B), when the user U11 plays the audio data on the electronic device 100 through speakers external to the electronic device 100, the position of the electronic device 100 is often not coordinated with that of the external speakers, so the picture displayed by the electronic device 100 is often out of sync with the audio data played by the speakers. In the scenario shown in fig. 20 (C), the picture viewed by the user U11 is displayed on the wall 500 while the sound is played through the speakers in the room; the position of the picture on the wall 500 is often not coordinated with the positions of the speakers, so the picture displayed on the wall 500 and the audio data played by the speakers are also often out of sync.
In order to solve the problem, the embodiment of the present application provides a sound processing method, which may construct a virtual speaker group including at least one virtual speaker around the electronic device 100 (or based on a picture generated by the electronic device 100) based on speakers arranged in space, so that audio data in the electronic device 100 may be played by the virtual speaker group, thereby solving the problem of asynchronous audio and video, and improving the listening experience and viewing consistency experience of a user.
It will be appreciated that in the scenario shown in fig. 20 (a) and (B), the camera 300 may also be configured. The camera 300 may acquire images of the user U11 and the electronic device 100 in space, to determine the positions of the head of the user U11 and the electronic device 100 in space, and/or the distances between the electronic device 100 and the respective speakers, etc. from the acquired images. In the scene shown in fig. 20 (C), the camera 300 may be arranged. The camera 300 may acquire images of the user U11, the electronic device 100, and a screen generated based on the electronic device 100 in space, to determine a position of the head of the user U11 and the electronic device 100 in space from the acquired images, a distance between the electronic device 100 and each speaker, a distance between the screen generated based on the electronic device 100 and each speaker, or a position of the screen generated based on the electronic device 100.
In some embodiments, the camera 300 may be connected to a controller (not shown in the figure) for controlling each speaker through a wired network or a wireless network (such as bluetooth, etc.), so that the camera 300 may transmit the image acquired by the camera to the controller, so that the controller processes the image, such as inputting the image into a pre-trained image processing model, so that the model outputs the position of the head of the user U11 and the electronic device 100 in space, and/or the distance between the electronic device 100 and each speaker, etc. By way of example, the image processing model may be, but is not limited to being, trained based on convolutional neural networks (convolutional neural network, CNN). In other embodiments, the camera 300 may be connected to the electronic device 100 through a wireless network (such as bluetooth, etc.), so that the camera 300 may transmit the image acquired by the camera to the electronic device 100, so that the electronic device 100 may process the image, for example, input the image into a pre-trained image processing model, so that the model outputs the position of the head of the user U11 and the electronic device 100 in space, the distance between the electronic device 100 and each speaker, the distance between the screen generated by the electronic device 100 and each speaker, or the position of the screen generated by the electronic device 100.
In some embodiments, the electronic device 100 may establish a connection with each speaker through a wireless network (e.g., Bluetooth). In this case, in addition to determining the position of the electronic device 100 in space and/or the distance between the electronic device 100 and each speaker from the image acquired by the camera 300, they may also be determined based on the wireless communication signals between the electronic device 100 and each speaker. For example, the position of the electronic device 100 in space and/or the distance between the electronic device 100 and each speaker may be determined by a ranging method based on the received signal strength indication (RSSI). It should be understood that the execution entity for determining the distance between the electronic device 100 and each speaker may be the electronic device 100, a controller (not shown in the figure) for controlling each speaker, or another device located in the scene shown in fig. 1, which is not limited here. For example, when the electronic device 100 is the execution entity, the distance between the electronic device 100 and a certain speaker may be determined by "formula one" described in the aforementioned scenario 3.1. When the controller for controlling each speaker is the execution entity, reference may be made to the manner in which the electronic device 100 determines that distance, which is not repeated here.
In addition, after the distances between the electronic device 100 and the speakers are determined, the position of the electronic device 100 may be determined based on the distances between the electronic device 100 and at least three speakers. For example, as shown in fig. 21, if the distance between the electronic device 100 and speaker SP1 is d1, the distance between the electronic device 100 and speaker SP2 is d2, and the distance between the electronic device 100 and speaker SP3 is d3, and the positions of speakers SP1, SP2, and SP3 are known and fixed, then three circles may be drawn with the positions of the speakers as centers and the corresponding distances to the electronic device 100 as radii; the intersection point of the three circles (i.e., position E in the figure) is the position of the electronic device 100.
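The intersection-of-circles construction is ordinary 2-D trilateration. The sketch below, with hypothetical speaker coordinates and distances, solves for position E by subtracting the circle equations pairwise to obtain a linear system:

    def trilaterate(p1, p2, p3, d1, d2, d3):
        # Solve for the 2-D position whose distances to the three known speaker
        # positions p1, p2, p3 are d1, d2, d3 (intersection of the three circles).
        (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
        # Subtracting the circle equations pairwise gives two linear equations A.[x, y] = b.
        a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
        a21, a22 = 2 * (x3 - x2), 2 * (y3 - y2)
        b1 = d1**2 - d2**2 + x2**2 - x1**2 + y2**2 - y1**2
        b2 = d2**2 - d3**2 + x3**2 - x2**2 + y3**2 - y2**2
        det = a11 * a22 - a12 * a21
        x = (b1 * a22 - b2 * a12) / det
        y = (a11 * b2 - a21 * b1) / det
        return x, y

    # Hypothetical speaker positions (metres) and measured distances.
    print(trilaterate((0, 0), (4, 0), (0, 3), d1=2.5, d2=2.5, d3=2.5))   # -> (2.0, 1.5)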
Next, a sound processing method provided in the embodiment of the present application will be described in detail based on the above description.
By way of example, fig. 22 illustrates a flow of a sound processing method in some embodiments of the present application. In fig. 22, a connection may be established between the electronic device 100 and the respective speakers, but is not limited to, via bluetooth. The method shown in fig. 22 may be applied, but is not limited to, in the scene shown in fig. 20 (a) or (B). The subject of execution of the method shown in fig. 22 may be the electronic device 100. As shown in fig. 22, the sound processing method includes the steps of:
S2201, the electronic apparatus 100 determines a target position thereof in a target space in which at least one speaker is arranged.
In this embodiment, the electronic device 100 may determine its position in the target space based on the image acquired by the camera in the space where the user is located, or may determine its position in the target space based on the wireless communication signals between the electronic device and each speaker.
S2202, the electronic device 100 constructs a virtual space matched with the target space according to the target position, wherein the volume of the virtual space is smaller than that of the target space.
In this embodiment, the electronic device 100 may place the target position in a preset spatial model, and associate the target position with a certain component or area in the target space in the spatial model, that is, take the target position as the position of the certain component or area in the target space in the spatial model, so as to construct a virtual space matched with the target space. The virtual space is understood to be a miniaturized target space. For example, the virtual space may be formed by scaling down the target space. The virtual space may be a space that is preset and in which a user can be surrounded. For example, in the scenario shown in fig. 20 (a), the spatial model may be a small virtual vehicle in which the target position may be placed at the position of the display screen of the vehicle body in the vehicle 200. In the scenario shown in fig. 20 (B), the spatial model may be a small virtual room in which the target position may be placed at the position of the wall body directly in front of the user U11 in the room.
S2203, the electronic device 100 constructs a virtual speaker group in the virtual space according to the positions of the speakers in the target space, where the virtual speaker group includes virtual speakers corresponding to the speakers in the target space.
In this embodiment, the electronic device 100 may determine, in the virtual space, the positions of the virtual speakers corresponding to the speakers in the target space based on the ratio of the virtual space to the target space.
For example, taking the scenario shown in fig. 20 (A) as an example, as shown in fig. 23, the virtual space is the virtual vehicle 2301, and the position of the electronic device 100 is the position of the vehicle's display screen in the virtual vehicle 2301. The distance and angle between the vehicle's display screen and each speaker are fixed in the vehicle 200. If the ratio between the virtual vehicle 2301 and the vehicle 200 is 1:10, and the distance between speaker SP1 and the display screen 210 in the vehicle 200 is d1 with angle α, one virtual speaker VSP1 may be placed in the virtual vehicle 2301 at a position whose distance from the electronic device 100 is d1/10 and whose angle is α. The other virtual speakers may be arranged in the virtual vehicle 2301 in the same manner as the virtual speaker VSP1, which is not repeated here.
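A small sketch of this scaled placement is given below; the 1:10 ratio follows the example, while the concrete distance and angle are hypothetical:

    import math

    def to_virtual_position(distance: float, angle_deg: float, scale: float = 10.0):
        # Place the virtual speaker at the same angle as the real speaker but at
        # distance / scale from the electronic device, per the 1:10 example ratio.
        d_virtual = distance / scale
        rad = math.radians(angle_deg)
        return (d_virtual * math.cos(rad), d_virtual * math.sin(rad))

    # Hypothetical in-vehicle speaker: 2.4 m from the display screen at 35 degrees.
    print(to_virtual_position(2.4, 35.0))   # virtual speaker VSP1 about 0.24 m away at 35 degrees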
After the positions of the virtual speakers are determined in the virtual space, the gains to be adjusted for the audio signals corresponding to the speakers in the target space can be determined according to the distances between the virtual speakers and the speakers in the target space, and the gains of the audio signals corresponding to the speakers in the target space can be adjusted to construct a virtual speaker group, so that the speakers in the target space are mapped into the virtual space. The virtual speaker group includes virtual speakers corresponding to the respective speakers in the target space. By way of example, a virtual speaker may be understood as a speaker that is virtually constructed, and a speaker configured in the vehicle 200 may be understood as a real speaker.
In some embodiments, the electronic device 100 may determine the gain to be adjusted for the audio signal corresponding to each speaker through the distance between the speaker in the target space and the virtual speaker in the virtual space and a preset distance model.
For example, taking the scenario shown in fig. 20 (A) as an example, with continued reference to fig. 23, suppose the preset distance model is g = k × d + b, where g is the gain to be adjusted for the audio signal corresponding to a speaker, k and b are constants, and d is the distance between the virtual speaker and the real speaker. If the distance between the virtual speaker VSP1 and speaker SP1 is d2, the gain to be adjusted for the audio signal corresponding to speaker SP1 is g1 = k × d2 + b. The electronic device 100 then adjusts the gain of the audio signal corresponding to speaker SP1 in the audio data to be played by the value g1, so that speaker SP1 is mapped into the virtual vehicle 2301, thereby constructing a virtual speaker corresponding to speaker SP1 in the virtual vehicle 2301. Virtual speakers corresponding to the remaining speakers may be constructed in the virtual vehicle 2301 using the distances between those speakers and their corresponding virtual speakers together with the distance model. In addition, the electronic device 100 may record the gain value of the audio signal corresponding to each speaker, so as to adjust the audio data to be played later.
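The following sketch applies the linear distance model g = k × d + b described above; the constants k and b and the speaker-to-virtual-speaker distances are hypothetical placeholders:

    def distance_model_gain(d: float, k: float = -2.0, b: float = 0.5) -> float:
        # Preset linear distance model g = k*d + b used to map a real speaker onto its
        # virtual counterpart; k and b are illustrative constants, d is the distance (m)
        # between the real speaker and the corresponding virtual speaker.
        return k * d + b

    # Hypothetical distances between each real speaker and its virtual speaker.
    speaker_to_virtual = {"SP1": 1.1, "SP2": 0.9, "SP3": 1.4}
    gain_record = {sp: distance_model_gain(d) for sp, d in speaker_to_virtual.items()}
    print(gain_record)   # recorded and later applied to the matching channels of the audio data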
S2204, the electronic device 100 plays the target audio data by using the virtual speaker group, where the gain of each channel in the target audio data is obtained by the gain that needs to be adjusted during the process of constructing the virtual speaker group based on the audio signals corresponding to each speaker in the target space.
In this embodiment, after the virtual speaker group is constructed, the electronic device 100 may play the target audio data using the virtual speaker group. For example, the electronic device 100 may transmit audio signals corresponding to different channels included in the target audio data to the corresponding speakers for playing. The gain of each channel in the target audio data is obtained by the gain which is required to be adjusted based on the audio signals corresponding to each loudspeaker in the target space in the process of constructing the virtual loudspeaker group.
Therefore, when a user plays the audio data on the electronic device through external speakers, the sound heard by the user seems to be generated from the electronic device and to surround the user, so that the picture played by the electronic device is synchronized with the sound, and the user's listening and viewing consistency experience is improved.
By way of example, fig. 24 illustrates a flow of another sound processing method in some embodiments of the present application. In fig. 24, a connection may be established between the electronic device 100 and each speaker, but is not limited to, through bluetooth. The method shown in fig. 24 may be applied, but is not limited to, in the scene shown in fig. 20 (a) or (B). The subject of execution of the method shown in fig. 24 may be the electronic device 100. As shown in fig. 24, the sound processing method includes the steps of:
S2401, the electronic device 100 determines a first distance between the electronic device and the head of the user, and determines a first position of the head of the user in a target space in which at least one speaker is disposed.
In this embodiment, the electronic device 100 may determine the first distance between itself and the head of the user, and the first position of the head of the user in the target space, based on the image acquired by the camera in the space where the user is located.
S2402, the electronic device 100 constructs a virtual speaker group according to the first distance, the first position, and the positions of the speakers in the target space, where the virtual speaker group includes virtual speakers corresponding to the speakers in the target space, and each virtual speaker is on a circle with the first position as a center and the first distance as a radius.
In this embodiment, the electronic device 100 may construct a circle with the first distance as a radius and the first position as a center, and virtualize each speaker in the target space to the circle. In some embodiments, the electronic device 100 may virtualize each speaker in the target space to its constructed circle based on the distance between the first location and the location of each speaker. For example, a distance model may be preset, and the distance between the first position and the position of each speaker may be input into the distance model, so that the gain that needs to be adjusted for the audio signal corresponding to each speaker may be obtained, and the gain of the audio signal corresponding to each speaker in the target space may be adjusted, so as to construct the virtual speaker group.
For example, taking the scenario shown in fig. 20 (A) as an example, please refer to fig. 25. Suppose the preset distance model is g = k × d + b, where g is the gain to be adjusted for the audio signal corresponding to a speaker, k and b are constants, and d is the distance between the position of the user's head and the real speaker. If the distance between the head of the user U11 and speaker SP1 in the vehicle 200 is d1, the gain to be adjusted for the audio signal corresponding to speaker SP1 is g1 = k × d1 + b. The electronic device 100 then adjusts the gain of the audio signal corresponding to speaker SP1 in the audio data to be played by the value g1, so that speaker SP1 is virtualized onto the circle it has constructed, thereby obtaining the virtual speaker VSP1. Based on the same implementation, the electronic device 100 may virtualize the other speakers in the vehicle 200 onto the circle it has constructed, i.e., build the virtual speaker group.
S2403, the electronic device 100 plays the target audio data using the virtual speaker group, where the gain of each channel in the target audio data is obtained from the gain to be adjusted based on the audio signal corresponding to each speaker in the target space in the process of constructing the virtual speaker group.
In this embodiment, after the virtual speaker group is constructed, the electronic device 100 may play the target audio data using the virtual speaker group. For example, the electronic device 100 may transmit audio signals corresponding to different channels included in the target audio data to the corresponding speakers for playing. The gains of the various channels in the target audio data are obtained by gains which are required to be adjusted in the process of constructing the virtual speaker group based on the audio signals corresponding to the various speakers in the target space.
In some embodiments, after the virtual speaker group is determined, another virtual speaker group may also be constructed from it using a vector base amplitude panning (VBAP) algorithm. The manner of constructing this virtual speaker group may refer to the related description in the foregoing scenario 3.1, which is not repeated here. The newly constructed virtual speaker group may consist of M virtual speakers, where M equals the number of speakers required for constructing spatial surround sound, and the arrangement of the virtual speakers in the group is the same as the arrangement of the speakers required for constructing spatial surround sound. After this new virtual speaker group is constructed, the electronic device 100 may play the target audio data using it. In this way, the user can enjoy spatial surround sound, and the user experience is improved.
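The VBAP step mentioned above can be sketched as follows for the two-speaker (2-D) case; the speaker angles and target direction are hypothetical, and this is only the textbook gain computation, not the exact procedure of the earlier scenario 3.1:

    import math

    def vbap_2d(target_deg: float, spk1_deg: float, spk2_deg: float):
        # Classic two-speaker VBAP panning: express the target direction as a linear
        # combination of the two speaker direction vectors, then normalize the gains.
        def unit(a):
            return (math.cos(math.radians(a)), math.sin(math.radians(a)))
        (l1x, l1y), (l2x, l2y) = unit(spk1_deg), unit(spk2_deg)
        px, py = unit(target_deg)
        det = l1x * l2y - l1y * l2x
        g1 = (px * l2y - py * l2x) / det
        g2 = (l1x * py - l1y * px) / det
        norm = math.hypot(g1, g2)          # keep roughly constant overall power
        return g1 / norm, g2 / norm

    # Hypothetical virtual speakers at -30 and +30 degrees, panning a source to +10 degrees.
    print(vbap_2d(10.0, -30.0, 30.0))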
Therefore, when a user plays the audio data on the electronic device through external speakers, the sound heard by the user seems to be generated from the electronic device and to surround the user, so that the picture played by the electronic device is synchronized with the sound, and the user's listening and viewing consistency experience is improved.
Fig. 26 illustrates, for example, a flow of yet another sound processing method in some embodiments of the present application. In fig. 26, a connection may be established between the electronic device 100 and each speaker, for example but not limited to, through Bluetooth. The method shown in fig. 26 may be applied, but is not limited to, in the scenes shown in fig. 20 (A) or (B). The execution subject of the method shown in fig. 26 may be an audio device control system, which may be used to control each speaker. In fig. 26, S2602 and S2603 may refer to the descriptions of S2202 and S2203 in fig. 22; the difference is that in S2603 the audio device control system records the gain to be adjusted for the audio signal corresponding to each speaker, whereas in S2203 the electronic device 100 may record the gain to be adjusted for the audio signal corresponding to each speaker or directly adjust the gain of the corresponding channel in the audio data to be played. As shown in fig. 26, the sound processing method includes the following steps:
S2601, the audio device control system determines a target position of the electronic device 100 in a target space in which at least one speaker is disposed.
In this embodiment, the audio device control system may determine the position of the electronic device 100 in the target space based on the image acquired by the camera in the space where the user is located, or may determine the position of the electronic device 100 in the target space by using wireless communication signals between the electronic device 100 and each speaker.
S2602, the sound equipment control system constructs a virtual space matched with the target space according to the target position, wherein the volume of the virtual space is smaller than that of the target space.
S2603, the sound equipment control system constructs a virtual speaker group in the virtual space according to the positions of the speakers in the target space, wherein the virtual speaker group comprises virtual speakers corresponding to the speakers in the target space.
S2604, the audio device control system acquires target audio data sent by the electronic device 100, adjusts gains of all channels in the target audio data by using gains required to be adjusted for audio signals corresponding to all speakers, and plays the adjusted target audio data.
In this embodiment, after the audio device control system obtains the target audio data sent by the electronic device 100, the gain of each channel in the target audio data may be adjusted according to the gain required to be adjusted for the audio signal corresponding to each speaker recorded in the process of constructing the virtual speaker group, and the adjusted target audio data may be played.
Therefore, when a user plays the audio data on the electronic device through external speakers, the sound heard by the user seems to be generated from the electronic device and to surround the user, so that the picture played by the electronic device is synchronized with the sound, and the user's listening and viewing consistency experience is improved.
Fig. 27 illustrates, for example, a flow of yet another sound processing method in some embodiments of the present application. In fig. 27, a connection may be established between the electronic device 100 and each speaker, for example but not limited to, through Bluetooth. The method shown in fig. 27 may be applied, but is not limited to, in the scenes shown in fig. 20 (A) or (B). The execution subject of the method shown in fig. 27 may be an audio device control system, which may be used to control each speaker. In fig. 27, S2702 may refer to the description of S2402 in fig. 24, and S2703 may refer to the description of S2604 in fig. 26. As shown in fig. 27, the sound processing method includes the following steps:
S2701, the audio device control system determines a first distance between the electronic device 100 and the head of the user, and determines a first position of the head of the user in a target space in which at least one speaker is arranged.
In this embodiment, the audio device control system may determine a first distance between the electronic device 100 and the head of the user based on the image acquired by the camera in the space where the user is located, and determine a first position of the head of the user in the target space.
S2702, the sound equipment control system constructs a virtual speaker group according to the first distance, the first position and the positions of the speakers in the target space, wherein the virtual speaker group comprises virtual speakers corresponding to the speakers in the target space, and the virtual speakers are all positioned on a circle taking the first position as a circle center and the first distance as a radius.
S2703, the audio equipment control system acquires target audio data sent by the electronic equipment 100, adjusts gains of all channels in the target audio data by utilizing gains required to be adjusted for audio signals corresponding to all speakers, and plays the adjusted target audio data.
In some embodiments, after the virtual speaker group is determined, another virtual speaker group may also be constructed from it using a vector base amplitude panning (VBAP) algorithm. The manner of constructing this virtual speaker group may refer to the related description in the foregoing scenario 3.1, which is not repeated here. The newly constructed virtual speaker group may consist of M virtual speakers, where M equals the number of speakers required for constructing spatial surround sound, and the arrangement of the virtual speakers in the group is the same as the arrangement of the speakers required for constructing spatial surround sound. After this new virtual speaker group is constructed, the audio device control system can play the target audio data using it. In this way, the user can enjoy spatial surround sound, and the user experience is improved.
Therefore, when a user plays the audio data on the electronic device through external speakers, the sound heard by the user seems to be generated from the electronic device and to surround the user, so that the picture played by the electronic device is synchronized with the sound, and the user's listening and viewing consistency experience is improved.
By way of example, fig. 28 illustrates a flow of yet another sound processing method in some embodiments of the present application. In fig. 28, a connection may be established between the electronic device 100 and the respective speakers, but is not limited to, via bluetooth. The method shown in fig. 28 may be applied, but is not limited to, in the scene shown in fig. 20 (C). The subject of execution of the method shown in fig. 28 may be the electronic device 100. In fig. 28, S2803 to S2804 refer to the descriptions in S2203 to S2204 in fig. 22, and are not repeated here. As shown in fig. 28, the sound processing method includes the steps of:
S2801, the electronic device 100 determines the target position, in a target space in which at least one speaker is disposed, of the picture it generates.
In this embodiment, the electronic device 100 may acquire, through an image captured by a camera, the target position of the generated picture in the target space. In addition, the user may also preset the target position in the electronic device 100, which may be determined according to the actual situation and is not limited here.
S2802, the electronic device 100 constructs a virtual space matching the target space according to the target position, where the volume of the virtual space is smaller than the volume of the target space.
In this embodiment, the electronic device 100 may place the target position in a preset spatial model, and associate the target position with a certain component or area in the target space in the spatial model, that is, take the target position as the position of the certain component or area in the target space in the spatial model, so as to construct a virtual space matched with the target space. The virtual space is understood to be a miniaturized target space. For example, the virtual space may be formed by scaling down the target space. The virtual space may be a predetermined space. For example, in the scenario shown in fig. 20 (C), the spatial model may be a small virtual room in which the target position may be placed at a position on the wall 500 immediately in front of the user U11 in the room.
S2803, the electronic device 100 constructs a virtual speaker group in the virtual space according to the positions of the speakers in the target space, where the virtual speaker group includes virtual speakers corresponding to the speakers in the target space.
S2804, the electronic device 100 plays the target audio data using the virtual speaker group, where the gain of each channel in the target audio data is obtained by the gain that needs to be adjusted during the process of constructing the virtual speaker group based on the audio signals corresponding to each speaker in the target space.
Therefore, when a user watches a picture from the electronic device through the projection device and plays the audio data on the electronic device through external speakers, the sound heard by the user seems to be generated at the picture projected by the projection device, so that the picture generated based on the electronic device is synchronized with the sound, and the user's listening and viewing consistency experience is improved.
Fig. 29 illustrates, for example, a flow of yet another sound processing method in some embodiments of the present application. In fig. 29, a connection may be established between the electronic device 100 and the respective speakers, but is not limited to, via bluetooth. The method shown in fig. 29 may be applied, but is not limited to, in the scene shown in fig. 20 (C). The execution subject of the method shown in fig. 29 may be an acoustic device control system, which may be used to control the individual speakers. In fig. 29, S2901 to S2903 may refer to the descriptions in S2801 to S2803 in fig. 28, and S2904 in fig. 29 may refer to the descriptions in S2604 in fig. 26, which are not repeated here. As shown in fig. 29, the sound processing method includes the steps of:
S2901, the acoustic device control system determines a target position in a target space where at least one speaker is arranged based on a screen generated by the electronic device 100.
S2902, the sound equipment control system constructs a virtual space matched with the target space according to the target position, wherein the volume of the virtual space is smaller than that of the target space.
S2903, the audio device control system constructs a virtual speaker group in the virtual space according to the positions of the speakers in the target space, the virtual speaker group including virtual speakers corresponding to the speakers in the target space.
S2904, the audio device control system acquires the target audio data sent by the electronic device 100, adjusts the gain of each channel in the target audio data by using the gain to be adjusted for the audio signal corresponding to each speaker, and plays the adjusted target audio data.
Therefore, when a user watches a picture from the electronic device through the projection device and plays the audio data on the electronic device through external speakers, the sound heard by the user seems to be generated at the picture projected by the projection device, so that the picture generated based on the electronic device is synchronized with the sound, and the user's listening and viewing consistency experience is improved.
In some embodiments, when the distance between the user and a speaker is greater than a preset distance threshold, and/or when the distance between the user and the picture generated based on the electronic device 100 is greater than a preset distance threshold, the time delay corresponding to each speaker in the target space may also be determined, so that the picture seen by the user matches the sound heard by the user, improving the user experience. For example, the time delay of each speaker in the target space may be determined based on the target distance between the user and the picture generated based on the electronic device 100. For example, if the target distance is d1 and the distance between the electronic device 100 and one speaker in the target space is d2, the delay of that speaker is (d2 - d1)/v, where v is the propagation speed of sound in air. In the scenes shown in fig. 20 (A) or (B), the distance between the user and the picture generated based on the electronic device 100 may be the distance between the user U11 and the electronic device 100; in the scene shown in fig. 20 (C), it may be the distance between the user U11 and the wall 500, which may be obtained, but is not limited to being obtained, by the camera 300 in the room. When the calculated delay of a speaker is positive, the speaker is farther away from the user, so the speaker can be controlled to play in advance, for example by the determined delay; when the calculated delay of a speaker is negative, the speaker is closer to the user, so the speaker can be controlled to play later, for example by the absolute value of the determined delay.
After determining the corresponding time delay of each speaker, the electronic device 100 or the audio device control system may control each speaker to play audio data in advance or in delay according to the corresponding time delay. Therefore, the pictures seen by the user are matched with the sounds heard by the user, and the user experience is improved.
Further, one of the determined distances between the picture generated by the target device and the speakers in the target space can be selected as the reference distance, and the moment at which the picture generated by the target device appears can be determined from the reference distance, thereby improving the effect of audio-visual synchronization. The reference distance may be, for example, the largest of the determined distances between the picture generated by the target device and the speakers in the target space. For example, the delay time of the generated picture relative to the sound produced by the speaker corresponding to the reference distance can be determined based on the reference distance and the propagation speed of sound; then, after the speaker corresponding to the reference distance starts playing the corresponding audio data, the target device is controlled to display the corresponding picture when the delay time is reached. For example, if the determined delay time is 3 s and the moment at which the speaker corresponding to the reference distance plays the corresponding audio data is t, the moment at which the picture generated by the target device appears is (t + 3).
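A minimal sketch of the delay arithmetic in the two preceding paragraphs is given below, with hypothetical distances; it computes per-speaker delays via (d2 - d1)/v and, as an assumption consistent with the text, takes the picture delay as reference distance divided by the speed of sound:

    SPEED_OF_SOUND = 343.0   # m/s, propagation speed of sound in air

    def speaker_delay(d_speaker: float, d_reference: float) -> float:
        # Delay of one speaker, per delay = (d2 - d1) / v.
        # Positive -> the speaker is farther and should play in advance by this amount;
        # negative -> the speaker is closer and should play later by the absolute value.
        return (d_speaker - d_reference) / SPEED_OF_SOUND

    # Hypothetical distances: user-to-picture 3.0 m; two speakers at 2.0 m and 5.5 m.
    d_picture = 3.0
    for name, d in {"SP1": 2.0, "SP2": 5.5}.items():
        print(name, round(speaker_delay(d, d_picture), 4), "s")

    # Picture timing: delay the picture by reference_distance / v relative to the
    # moment t at which the reference-distance speaker starts playing.
    reference_distance = 5.5
    picture_delay = reference_distance / SPEED_OF_SOUND
    print("picture appears", round(picture_delay, 4), "s after that speaker starts")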
4. Scenario in which a new energy vehicle accelerates.
Generally, when a user is using a new energy vehicle (hereinafter simply referred to as a "vehicle"), the vehicle can cyclically play sound-wave audio according to its own driving state. For example, the vehicle can gradually increase the volume of its speakers to a maximum value when accelerating and gradually decrease the volume to a minimum value when decelerating. In this way, however, the sound-wave audio only changes in volume and has no spatial change, i.e., no spatially localized sound wave is formed, which makes the sound-wave audio played by the vehicle deviate from the actual driving state to a large extent.
In addition, the sound-wave audio can be divided into extremely short segments in units of milliseconds, and the corresponding segments can be selected according to the speed of the vehicle and other parameters; the selected segments are then superposed and synthesized, and the synthesized data is played to restore a realistic sound-wave effect. However, in this way, the sound-wave audio still only changes in volume and has no spatial change, i.e., no spatially localized sound wave is formed, and the user experience is poor.
In order to solve the above problems, the embodiments of the present application provide a sound processing method, which can make the sound-wave audio change spatially while the user is using the vehicle, so that a Doppler effect appears inside the vehicle; the sound-wave audio played by the vehicle thus matches the real driving state, making the hearing experience more realistic and improving the user experience.
By way of example, fig. 30 shows a hardware structure of a vehicle. As shown in fig. 30, the vehicle 200 may be provided with the electronic device 100 and the speaker 210. The electronic device 100 may transmit the sound-wave audio to the speaker 210 for playback through the speaker 210. By way of example, the electronic device 100 may be, but is not limited to, an in-vehicle terminal. The number and positions of the speakers 210 may be configured as required, which is not limited here.
In addition, components necessary for the normal operation of the vehicle 200, such as various sensors, may be disposed in the vehicle 200, which is not limited herein. In some embodiments, the vehicle 200 may have sensors configured therein for sensing vehicle motion conditions, such as: a speed sensor, an acceleration sensor, etc.
By way of example, fig. 31 shows a sound processing method. It will be appreciated that the method may be, but is not limited to being, performed by an electronic device (such as an in-vehicle terminal, etc.) configured in a vehicle. As shown in fig. 31, the sound processing method may include the steps of:
S3101, the electronic device 100 determines the current driving parameters of the vehicle 200, the driving parameters including one or more of the driving speed, the rotational speed, and the opening degree of the accelerator pedal.
In this embodiment, after the sensor in the vehicle 200 senses the running parameter of the vehicle 200, the running parameter may be transmitted to the electronic device 100.
S3102, the electronic device 100 determines first audio data corresponding to the driving parameters according to the driving parameters.
In this embodiment, the electronic device 100 may determine the first audio data corresponding to the driving parameter according to the driving parameter and the preconfigured original audio data.
For example, the electronic device 100 may first obtain audio particles derived from the original audio data. Wherein each audio particle may correspond to a travel speed of the vehicle. By way of example, audio particles may be understood as data formed by dividing the original audio data into very short segments (e.g., segments in milliseconds, etc.). The original audio data may be default audio data or audio data selected by the user, which is not limited herein. When the original audio data is audio data selected by the user himself, a selection portal may be configured on the electronic device 100 for selection by the user.
Then, the electronic device 100 may determine the audio particle corresponding to the current driving parameter according to the mapping relationship between driving parameters and audio particles. Finally, the determined audio particle is time-scaled using the current acceleration of the vehicle 200 so as to adjust the data length of the audio particle, thereby matching the playing speed of the audio particle with the current driving state. The first audio data is the time-scaled audio particle.
For example, when the driving parameter is the driving speed, suppose the mapping relationship between driving speed and audio particles is: at speed a1 the audio particle is b1, and at speed a2 the audio particle is b2. If the driving speed of the vehicle determined at time t1 is a2, the currently required audio particle is determined from this mapping to be audio particle b2. If the driving speed of the vehicle 200 at time t0 was a0, the acceleration of the vehicle 200 at time t1 is (a2 - a0)/(t1 - t0). The determined acceleration is then used to query a preset mapping relationship between acceleration and scaling value, and the scaling value corresponding to the current acceleration is determined. Finally, the audio particle b2 may be processed by a time-scale modification (TSM) algorithm based on the scaling value, so as to complete the scaling transformation of audio particle b2 and obtain the first audio data.
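For illustration, a minimal sketch of this step is given below, under assumed names and lookup tables. np.interp is used only as a simple stand-in for a real TSM routine (e.g., a WSOLA-style algorithm), which would preserve the tone while changing the data length; none of the constants come from the patent itself.

```python
# Hypothetical sketch: pick an audio particle by driving speed, derive a scaling value
# from the acceleration, and adjust the particle's data length.
import numpy as np

SPEED_TO_PARTICLE = {                      # speed interval (km/h) -> audio particle (PCM samples)
    (0.0, 20.0): np.random.randn(441),
    (20.0, 25.0): np.random.randn(441),
    (25.0, 300.0): np.random.randn(441),
}
ACCEL_TO_SCALE = [(0.5, 1.0), (1.5, 0.9), (3.0, 0.8)]   # accel (m/s^2) upper bound -> scale

def pick_particle(speed_kmh: float) -> np.ndarray:
    for (lo, hi), particle in SPEED_TO_PARTICLE.items():
        if lo <= speed_kmh < hi:
            return particle
    return list(SPEED_TO_PARTICLE.values())[-1]

def scale_for(accel: float) -> float:
    for upper, scale in ACCEL_TO_SCALE:
        if accel <= upper:
            return scale
    return ACCEL_TO_SCALE[-1][1]

def first_audio_data(v1_kmh, v0_kmh, t1, t0):
    accel = (v1_kmh - v0_kmh) / 3.6 / (t1 - t0)          # acceleration in m/s^2
    particle = pick_particle(v1_kmh)
    n_out = int(len(particle) * scale_for(accel))        # adjust the data length
    return np.interp(np.linspace(0, len(particle) - 1, n_out),
                     np.arange(len(particle)), particle)

stretched = first_audio_data(v1_kmh=72.0, v0_kmh=65.0, t1=1.0, t0=0.0)
```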
As a possible implementation, the original audio data may first be scaled using different scaling values. The scaled audio data is then sliced separately. Each audio particle obtained by slicing corresponds to a certain audio particle in the original audio data; each audio particle in the original audio data corresponds to a particle group, the particle group contains at least one scaled audio particle, and different audio particles in the group correspond to different scaling values. Since each driving speed corresponds to one audio particle in the original audio data, each driving speed also corresponds to one such particle group. For example, one audio particle may correspond to one speed interval, i.e., all speeds in that interval correspond to the same particle; for instance, the speed interval corresponding to audio particle a may be (20 km/h, 25 km/h).

Alternatively, instead of scaling the original audio data and then slicing it, the original audio data may be sliced first and the resulting audio particles scaled with different scaling values. The order used can be chosen according to the actual situation and is not limited here.

For example, if the original audio data is scaled using scaling values x1 and x2 and the scaled audio data is sliced, then the audio particle b0 in the original audio data may correspond to the audio particle b1 obtained with scaling value x1 and the audio particle b2 obtained with scaling value x2, and the particle group corresponding to b0 consists of b1 and b2. The time points corresponding to b0, b1 and b2 are the same; equivalently, b1 is obtained by scaling b0 with x1, and b2 is obtained by scaling b0 with x2.

Further, after the current driving speed of the vehicle 200 is determined, a particle group may be determined according to that speed. Then, according to the current acceleration, the preset mapping relationship between acceleration and scaling value is queried to determine the scaling value corresponding to the current acceleration. Finally, based on this scaling value, the relationship between each audio particle in the particle group and its scaling value is queried, and the required audio particle is determined from the group; that audio particle is the first audio data.
When the running parameter is the rotation speed or the opening degree of the accelerator pedal, reference may be made to the case where the running parameter is the running speed, and details thereof will not be repeated here.
S3103, the electronic device 100 adjusts the gain of each channel in the first audio data according to the driving parameter, so as to obtain the second audio data.
In this embodiment, the electronic device 100 may determine the gain to be adjusted according to the running speed and a preset gain adjustment model, and adjust the gain of each channel in the first audio data to obtain the second audio data. Illustratively, the gain adjustment model may be a linear model, such as: y=kx+b, y is the gain to be adjusted, k and b are constants, and x is the acceleration. Wherein the acceleration in the linear model may be determined by the relation between the running speed, time and acceleration. At this time, it can be understood that the acceleration of the vehicle is determined according to the running speed, and then the gain of each channel in the first audio data is adjusted according to the acceleration.
In some embodiments, to prevent the occurrence of a sudden change in volume, the range of each gain adjustment may be set. When the determined gain to be adjusted exceeds the preset range, the maximum value of the preset range can be used as the gain to be adjusted at the time.
In some embodiments, to prevent the volume from changing too frequently, a condition may be set for adjusting the gain, for example: the gain is adjusted only when the change in the driving speed exceeds a preset speed value (e.g., 3 km/h); otherwise the gain is not adjusted. In other words, the gain is adjusted when the change in the running speed of the vehicle 200 exceeds a certain speed value.
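The following is a hedged sketch of S3103 combining the linear gain model, the per-step cap, and the speed-change threshold described above; all constants and the function name are assumptions, not values from the patent.

```python
# Gain adjustment driven by acceleration, with a step limit and a minimum speed change.
K, B = 0.05, 1.0               # linear model constants in y = k*x + b (assumed)
MAX_STEP = 0.2                 # largest allowed gain change per adjustment (assumed)
MIN_SPEED_DELTA_KMH = 3.0      # e.g. 3 km/h, as mentioned above

def next_channel_gain(prev_gain, speed_now_kmh, speed_prev_kmh, dt_s):
    if abs(speed_now_kmh - speed_prev_kmh) < MIN_SPEED_DELTA_KMH:
        return prev_gain                                    # keep the volume stable
    accel = (speed_now_kmh - speed_prev_kmh) / 3.6 / dt_s   # acceleration in m/s^2
    target = K * accel + B                                  # y = k*x + b
    step = max(-MAX_STEP, min(MAX_STEP, target - prev_gain))
    return prev_gain + step

gain = next_channel_gain(1.0, 62.0, 55.0, 1.0)   # the gain then scales that channel's samples
```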
S3104, the electronic apparatus 100 determines a target speed at which the sound field moves in the target direction, based on the travel parameter.
In this embodiment, the electronic apparatus 100 may determine the acceleration of the vehicle 200 using the traveling speed of the vehicle 200. And then, inquiring a mapping relation between the preset acceleration and the speed of the sound field moving towards the target direction by utilizing the determined acceleration, and determining the target speed of the sound field moving towards the target direction. In some embodiments, the target direction may be a front-to-rear direction of the vehicle 200.
S3105, the electronic device 100 adjusts the gain of each channel in the second audio data according to the target speed, and plays the adjusted second audio data using the speakers in the target speaker group, where the target speaker group includes at least two speakers and is used to control the sound field to move in the target direction at the target speed.
In some embodiments, the initial position of the sound field may be preset; for example, the initial position may be a position in front of the driver in the vehicle 200. When the second audio data is played, the position of the sound field may be gradually moved from the initial position toward the rear of the vehicle 200 according to the target speed. By way of example, the position of the sound field may be understood as the position of the sound source perceived by the user.
For example, as shown in fig. 32, suppose two speakers are arranged in front of the driver in the vehicle 200 and the vehicle 200 is accelerating: speaker SP1 is arranged in front of and to the left of the driver, and speaker SP2 is arranged in front of and to the right of the driver. In (A) of fig. 32, the region at position 3201 may be the initial position of the sound field, i.e., the position of the sound field when speakers SP1 and SP2 both play sound with the default gains of their corresponding audio signals. During the acceleration of the vehicle 200, the position of the sound field at the next moment, such as the region at position 3202 in (B) of fig. 32, may be determined from the target speed of the sound-field movement. At this time, a virtual speaker VSP1 may be created at position 3202 by adjusting the gains of the audio signals corresponding to speakers SP1 and SP2. Specifically, the gain to be adjusted for the audio signal corresponding to speaker SP1 is used to adjust the gain of the corresponding channel in the second audio data, and the gain to be adjusted for the audio signal corresponding to speaker SP2 is used to adjust the gain of the corresponding channel in the second audio data, thereby completing the adjustment of the gain of each channel in the second audio data. The electronic device 100 may then play the second audio data using speakers SP1 and SP2. To the driver, the sound then appears to be played from position 3202, and a movement of the sound field in space is achieved. Speakers SP1 and SP2 form the target speaker group. In addition to processing the second audio data based on the gains to be adjusted for the audio signals corresponding to the speakers, the electronic device 100 may instead adjust the volumes of the speakers according to those gains and play the second audio data, likewise achieving movement of the sound field. In some embodiments, when one virtual speaker is derived from multiple real speakers, a vector base amplitude panning (VBAP) algorithm may be used, but this is not limited; the process of constructing a virtual speaker based on the VBAP algorithm may refer to the description in the foregoing scenario 3.1 and is not repeated here. In addition, the gain of the audio signal corresponding to each speaker in the target speaker group may be determined by a preset distance gain model, and the second audio data may be adjusted based on these gains, thereby virtualizing one speaker. For example, with continued reference to (B) of fig. 32, if the distance between the user U11 and position 3201 is L1 and the distance between the user U11 and position 3202 is L2, then when a speaker is virtualized at position 3202 the gain to be adjusted for the audio signals corresponding to speakers SP1 and SP2 may be g = L2/L1. Here the distance gain model is gi = x2/x1, where x1 is the distance between the initial position of the sound field and the reference point, x2 is the distance between the current position of the sound field and the reference point, and the position of user U11 in (B) of fig. 32 is the reference point.
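As a rough illustration of how such a virtual speaker position could be rendered, the sketch below combines the distance-gain model g = x2/x1 from this scene with a constant-power pan between the two real speakers; the pan is a simplified stand-in for the VBAP computation, and all coordinates are assumptions.

```python
# Distance gain plus a simple two-speaker pan to place a virtual source between SP1 and SP2.
import numpy as np

def distance_gain(reference, initial_pos, current_pos):
    x1 = np.linalg.norm(np.subtract(initial_pos, reference))   # initial sound-field distance (L1)
    x2 = np.linalg.norm(np.subtract(current_pos, reference))   # current sound-field distance (L2)
    return x2 / x1                                              # g = x2 / x1

def pan_gains(pan):            # pan in [0, 1]: 0 = all SP1, 1 = all SP2
    theta = pan * np.pi / 2
    return np.cos(theta), np.sin(theta)                        # constant-power approximation

user = (0.0, 0.0)                                   # reference point (listener position)
g = distance_gain(user, initial_pos=(0.0, 1.0), current_pos=(0.0, 1.6))
g_sp1, g_sp2 = pan_gains(0.5)                       # virtual source midway between SP1 and SP2
gain_sp1, gain_sp2 = g * g_sp1, g * g_sp2           # per-channel gains for the second audio data
```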
In controlling the sound field movement, the movement may be performed by other means than the one described in fig. 32, which is not limited herein. For example, when a plurality of speakers are arranged on both sides of the vehicle 200, one virtual speaker may be virtually formed on each side, respectively, and the second audio data may be played using the virtual speaker.
For example, as shown in fig. 33, take the case in which two speakers are arranged on each side of the vehicle 200 and the vehicle 200 is accelerating: speaker SP1 is arranged in front of and to the left of the driver, speaker SP2 in front of and to the right of the driver, speaker SP3 directly to the left of the driver, and speaker SP4 directly to the right of the driver. In (A) of fig. 33, the region at position 3301 may be the initial position of the sound field, i.e., the position of the sound field when speakers SP1, SP2, SP3 and SP4 all play sound with the default gains of their corresponding audio signals. During the acceleration of the vehicle 200, the position of the sound field at the next moment, such as the region at position 3302 in (B) of fig. 33, may be determined from the target speed of the sound-field movement. At this time, a virtual speaker VSP1 may be created on the left side of the vehicle 200 by adjusting the gains of the audio signals corresponding to speakers SP1 and SP3, and a virtual speaker VSP2 may be created on the right side of the vehicle 200 by adjusting the gains of the audio signals corresponding to speakers SP2 and SP4. The gains to be adjusted for the audio signals corresponding to speakers SP1, SP2, SP3 and SP4 may be determined in the manner described for fig. 32, for example based on the distance gain model, which is not repeated here. Further, the gain of the corresponding channel in the second audio data is adjusted using the gain to be adjusted for the audio signal corresponding to each of speakers SP1, SP2, SP3 and SP4. Next, the electronic device 100 may play the second audio data through speakers SP1, SP2, SP3 and SP4. To the driver, the sound then appears to be played by virtual speakers VSP1 and VSP2, and a movement of the sound field in space is achieved. Speakers SP1, SP2, SP3 and SP4 form the target speaker group. In addition to processing the second audio data based on the gains to be adjusted for the audio signals corresponding to the speakers, the electronic device 100 may instead adjust the volumes of the speakers according to those gains, likewise achieving movement of the sound field.
In some embodiments, the virtual position of the sound source of the target audio data may be determined based on the target speed. Then, the speakers used to control the sound-field movement are selected from the vehicle according to the virtual position. Next, according to the virtual position, the target gains to be adjusted for the audio signals corresponding to the selected speakers can be determined, giving F target gains, where F ≥ 2. The gains of the channels in the second audio data can then be adjusted according to the F target gains to obtain the target audio data. Finally, the target audio data may be played using the selected speakers, which form the target speaker group.
In some embodiments, while the sound field is being moved, the second audio data may also be Doppler-processed according to the target speed, the position of the user, the initial position of the sound field, and so on, so that the tone of the sound heard by the user changes over time, improving the user experience.
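A minimal sketch of such Doppler processing is given below, under the assumption that the sound field recedes from the listener at the target speed v (in m/s): the perceived frequency scales by c/(c + v), approximated here with naive resampling (which also lengthens the signal slightly). The names and the resampling shortcut are assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s

def doppler_away(samples: np.ndarray, v_away: float) -> np.ndarray:
    factor = SPEED_OF_SOUND / (SPEED_OF_SOUND + v_away)   # < 1: tone drops as the source recedes
    n_out = int(len(samples) / factor)                     # more samples at the same rate = lower tone
    return np.interp(np.linspace(0, len(samples) - 1, n_out),
                     np.arange(len(samples)), samples)
```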
In this way, while the user drives the vehicle, the movement of the sound field inside the vehicle is controlled by means of the speakers in the vehicle, so that the sound wave audio changes spatially and a Doppler effect appears inside the vehicle. The sound wave audio played by the vehicle thus matches the real driving state, the hearing experience is more realistic, and the user experience is improved.
In some embodiments, to provide a matching visual experience, the color of the atmosphere lamp in the vehicle 200 may also be controlled to change gradually with the duration of the vehicle 200's acceleration. For example, as shown in fig. 34, the color of the atmosphere lamp may be controlled to change gradually from light to dark as the acceleration time increases, e.g., from light yellow to dark yellow and finally to red during acceleration. In some embodiments, the speed of the color change of the atmosphere lamp may be controlled to be the same as the target speed of the sound-field movement, so that the spatial hearing and the spatial vision inside the vehicle 200 correspond, improving the user experience. In some embodiments, the atmosphere lamp in the vehicle 200 may be a light strip capable of presenting a gradual color change.
In some embodiments, in order to make the sound produced by the target audio data more pleasant to listen to, different base noise (i.e., background noise) may also be added in different speed intervals. For example, different audio can be selected as the background and mixed into the playback in different speed ranges. For instance, when the driving speed of the vehicle is less than 50 km/h, the audio particles extracted from audio 1 are used as the background noise and mixed with the target audio data for playback; when the driving speed is greater than 50 km/h and less than 100 km/h, the audio particles extracted from audio 2 are used as the background noise and mixed with the target audio data for playback. Audio 1 and audio 2 may be preset audio, and different speed intervals may correspond to different audio particles, where these audio particles serve mainly as background noise.
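An illustrative sketch of this selection and mixing is shown below; the audio arrays, the 50 km/h threshold handling, and the 0.3 background level are assumptions used only for demonstration.

```python
import numpy as np

def pick_background(speed_kmh: float, audio1: np.ndarray, audio2: np.ndarray) -> np.ndarray:
    # audio 1 below 50 km/h, audio 2 from 50 km/h up to 100 km/h (assumed handling of the boundary)
    return audio1 if speed_kmh < 50.0 else audio2

def mix_background(target: np.ndarray, background: np.ndarray, bg_level: float = 0.3) -> np.ndarray:
    n = min(len(target), len(background))
    return target[:n] + bg_level * background[:n]    # simple additive background mix
```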
In some embodiments, the foregoing method may be performed by an electronic device (such as a mobile phone or the like) located in the vehicle and separate from the vehicle, in addition to an electronic device (such as an in-vehicle terminal or the like) configured in the vehicle. When executed by an electronic device separate from the vehicle, the placement locations of the speakers in the vehicle may be pre-configured in the electronic device so that the electronic device may determine the gains to be adjusted for the audio signals corresponding to the respective speakers. In this implementation, the running speed of the vehicle may be transmitted from the vehicle to the electronic device, or may be perceived by the electronic device itself, which is not limited herein. In addition, in this implementation manner, the electronic device may first adjust the gain of each channel in the second audio data, and then send the adjusted audio data to the vehicle for playing.
In addition, part of the foregoing method may be performed by the vehicle or by an electronic device integrated in the vehicle (such as a vehicle-mounted terminal), and the remaining part by an electronic device separate from the vehicle (such as a mobile phone), as appropriate for the actual situation; that is, the execution subject of each step in the foregoing method may be adaptively adjusted as required, and the adjusted solution is still within the protection scope of the present application. For the solution after adjusting the execution subject, reference may be made to the description of the foregoing method, which is not repeated here.
5. Scenario: driving while navigating with an electronic device in the vehicle, where the driver experiences driving fatigue.
By way of example, fig. 35 illustrates an application scenario in some embodiments of the present application. As shown in fig. 35, while driver A drives the vehicle 200 toward a destination, driver A can navigate to the destination using the electronic device 100 located in the vehicle 200. When driver A becomes tired, the characteristic parameters (such as tone, gain, etc.) of the audio data broadcast by the electronic device 100 during navigation may be changed, so that the auditory impact helps the driver regain attention and drive safely.
In fig. 35, the electronic apparatus 100 is located in the vehicle 200, and may be an apparatus integrated in the vehicle 200, such as an in-vehicle terminal, or may be an apparatus separate from the vehicle 200, such as a mobile phone of the driver a, etc., which is not limited herein. When the electronic device 100 is integrated in the vehicle 200, the electronic device 100 may directly broadcast audio data that it needs to broadcast using speakers in the vehicle 200. When the electronic device 100 is disposed apart from the vehicle 200, a connection may be established between the electronic device 100 and the vehicle 200 by, but not limited to, short-range communication (e.g., bluetooth, etc.). When the electronic device 100 is separately disposed from the vehicle 200, the electronic device 100 may transmit the audio data to be broadcasted to the vehicle 200 and broadcast the audio data through a speaker on the vehicle 200, or the electronic device 100 may broadcast the audio data to be broadcasted through a built-in speaker.
In addition, an image pickup device such as a camera may be provided inside the vehicle 200 to collect face data of the driver. In addition, a speaker may be further disposed in the vehicle 200, and the navigation sound to be broadcast in the electronic device 100 may be broadcast through the speaker on the vehicle 200. In some embodiments, a sensor (e.g., radar, camera, etc.) for collecting traffic information may be provided on the exterior of the vehicle 200.
By way of example, fig. 36 illustrates a sound processing method in some embodiments of the present application. In fig. 36, the electronic device 100 may be a device integrated in the vehicle 200, such as an in-vehicle terminal, or may be a device separate from the vehicle 200, such as a mobile phone of the driver a, or the like. As shown in fig. 36, the method may include the steps of:
S3601, the electronic device 100 determines the fatigue level of the driver.
In the present embodiment, an image pickup device such as a camera or the like may be provided in the vehicle 200. The image acquisition device can acquire the face data of the driver A in real time or periodically (such as acquiring every 2 seconds, 3 seconds or 5 seconds, etc.), for example: eyes, mouth, etc. Wherein the vehicle 200 may transmit the face data of the driver a, which is acquired by the image acquisition device thereon, for a certain period of time (e.g., 5 seconds, etc.), to the electronic apparatus 100. For example, the vehicle 200 may cache face data collected for a short period of time (5 s or 10 s) based on a dynamic sliding window. For example, the vehicle 200 may use data of a certain period (e.g., 1s to 5s,2s to 6s, or 3s to 7s, etc.) in the video it collects as the required face data of the driver a.
After the electronic device 100 acquires the face data of the driver a, the face data acquired by the electronic device may be input into a fatigue monitoring model trained in advance, so as to output the fatigue level of the driver a from the fatigue monitoring model. In some embodiments, the fatigue monitoring model may be, but is not limited to being, trained based on convolutional neural networks (convolutional neural network, CNN).
As a possible implementation manner, the fatigue level of the driver a may be determined based on a mapping relationship between the number of blinks, the number of yawns, the number of nods, and the like of the driver a and the fatigue level within a preset period of time.
For example, taking the number of blinks as an example, table 1 shows the mapping relationship between the number of blinks and the fatigue level. When it is detected, based on the face data of driver A, that driver A blinks 10 times within 10 seconds, it can be determined from table 1 that the fatigue level of driver A is 3. It will be appreciated that a higher fatigue level indicates that driver A is more fatigued during the preset period of time.
TABLE 1
Number of blinks n in 10 seconds    Fatigue level
n < 5                               1
5 ≤ n < 8                           2
n ≥ 8                               3
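For illustration, Table 1 can be written directly as a rule-based lookup; the sketch below (function name assumed) could serve as a fallback or cross-check for the CNN-based fatigue monitoring model.

```python
def fatigue_level_from_blinks(blinks_in_10s: int) -> int:
    if blinks_in_10s < 5:
        return 1
    if blinks_in_10s < 8:
        return 2
    return 3

assert fatigue_level_from_blinks(10) == 3   # matches the 10-blink example above
```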
S3602, the electronic device 100 determines a target adjustment value of a first characteristic parameter according to the fatigue level, where the first characteristic parameter is a characteristic parameter of the audio data to be played currently.
In this embodiment, after determining the fatigue level of the driver a, the electronic device 100 may query the mapping relationship between the preset fatigue level and the adjustment value of the characteristic parameter, and determine the target adjustment value of the characteristic parameter of the audio data to be played currently. In some embodiments, the characteristic parameters may include: tone and/or loudness, etc.
As a possible implementation manner, the electronic device 100 may determine the target adjustment value according to a relational expression corresponding to the fatigue level and the preset feature parameter.
Illustratively, the higher the fatigue level, the higher the target adjustment value of the tone, and the stronger the auditory stimulus given to the user; likewise, the higher the fatigue level, the higher the target adjustment value of the loudness, and the greater the perceived loudness. For example, if the tone corresponds to the relational expression s = 0.2×x² + 1 and the loudness corresponds to the relational expression g = 0.5×x + 1, where x is the fatigue level, then when the fatigue level is 1 the target adjustment value of the tone is 1.2 and the target adjustment value of the loudness is 1.5.
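The example relational expressions above, written out directly (the coefficients come from the example only, not from a fixed specification):

```python
def target_adjustments(fatigue_level: int) -> tuple[float, float]:
    s = 0.2 * fatigue_level ** 2 + 1    # tone: s = 0.2 * x^2 + 1
    g = 0.5 * fatigue_level + 1         # loudness: g = 0.5 * x + 1
    return s, g

print(target_adjustments(1))  # (1.2, 1.5) for fatigue level 1, as in the example
```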
S3603, the electronic device 100 processes the audio data to be played currently according to the target adjustment value to obtain target audio data, wherein the value of the characteristic parameter of the target audio data is higher than the value of the first characteristic parameter.
In this embodiment, the electronic device 100 may adjust the tone and/or loudness of the audio data of the navigation sound that currently needs to be broadcast, so as to obtain the target audio data, where the value of the characteristic parameter of the target audio data is higher than the value of the first characteristic parameter. For example, when the characteristic parameter is loudness and the loudness is expressed as a normalized value (such as a magnification factor), if the target adjustment value is 1.5, the electronic device 100 may adjust the loudness of the audio data of the navigation sound to 1.5 times the original loudness; if the target adjustment value is 10 and the loudness is expressed in dB, the adjusted loudness is the sum of the original loudness and the target adjustment value. When the characteristic parameter is the tone, if the target adjustment value is 1.2, the electronic device 100 may raise the tone of the audio data of the navigation sound currently to be broadcast to 1.2 times the original tone based on a pitch-shifting algorithm. Illustratively, the pitch-shifting algorithm may be a time-domain method such as the synchronized overlap-add method (SOLA), the synchronized overlap-add with fixed synthesis method (SOLAFS), the time-domain pitch-synchronized overlap-add method (TD-PSOLA), or the waveform similarity overlap-add method (WSOLA), or a frequency-domain method such as the pitch-synchronized overlap-add algorithm (PSOLA).
In some embodiments, in order to ensure the definition of the currently required navigation sound, the tone of the audio data of the currently required navigation sound may be processed in a manner of changing the tone and not changing the speed.
For ease of understanding, the following takes a time-domain method as an example of changing the tone without changing the speed. When a time-domain method is used, the effect of changing the tone without changing the speed can generally be achieved by combining "changing the speed without changing the tone" with "resampling": the audio data of the navigation sound to be broadcast is first subjected to the speed-change (tone-preserving) processing, and then resampled.
For the speed-change (tone-preserving) processing, as shown in fig. 37 (a), the audio data of the navigation sound currently to be broadcast may first be divided into frames in the original time domain x. Next, as shown in fig. 37 (b), one frame of data (i.e., x_m) is taken out and added to the time domain y. Then, as shown in fig. 37 (c), another frame of data (i.e., x_{m+1}) is taken at a fixed interval of H_a sampling points. Finally, as shown in fig. 37 (d), the frame taken out in fig. 37 (b) (i.e., x_m) and the frame taken out in fig. 37 (c) (i.e., x_{m+1}) are superposed by waveform overlap-add to obtain the speech reconstructed from x_m and x_{m+1}, i.e., the audio data on the time domain y. It should be understood that during the reconstruction of the speech, one frame of data is taken at every fixed interval of H_a sampling points, and the taken frames are superposed to obtain the reconstructed audio data of the navigation sound to be broadcast, where the value of H_a may be preset. In addition, when the audio data is reconstructed in this way, the number of frames contained in the reconstructed audio data decreases and the number of sampling points decreases, but the sampling rate is the same as that of the original audio data; therefore the speech becomes faster during broadcasting, achieving speed change without tone change.
For the resampling, a corresponding resampling factor P/Q can be selected to resample by P/Q times, so that the speech speed and the tone after resampling become Q/P times the original, where P is the up-sampling factor and Q is the down-sampling factor. The resampling process may include up-sampling and down-sampling. Up-sampling interpolates (P-1) sampling points between every two adjacent sampling points in the original signal, so that the pitch period becomes P times the original and the duration becomes P times the original; that is, the fundamental frequency becomes 1/P of the original, the tone is lowered to 1/P of the original, and the speech speed becomes 1/P of the original. Down-sampling keeps one sampling point out of every Q sampling points in the original signal (discarding the other Q-1), so that the pitch period becomes 1/Q of the original; that is, the fundamental frequency becomes Q times the original, the tone becomes Q times the original, and the speech speed becomes Q times the original. By resampling the speed-changed (tone-preserved) audio data with the resampling factor P/Q, its speech speed and tone can be modulated to Q/P times the original. The resampling factor P/Q may be obtained from the adjustment value corresponding to the tone; for example, when the target adjustment value corresponding to the tone is 1.5, the resampling factor is P/Q = 1/1.5 = 2/3.
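The sketch below illustrates the overall "change the tone, keep the speed" chain under stated assumptions: a plain overlap-add (OLA) time stretch stands in for the SOLA/WSOLA/TD-PSOLA methods named above, and is followed by P/Q resampling. Frame and hop sizes are assumptions, and amplitude normalization of the overlapped windows is omitted for brevity; here the stretch makes the signal longer so that the subsequent 2/3 resampling restores the original duration while raising the tone.

```python
import numpy as np

def ola_stretch(x, stretch, frame=1024, hop_out=256):
    """Change duration by roughly `stretch` while keeping the tone (simplified TSM)."""
    hop_in = max(1, int(hop_out / stretch))          # analysis hop H_a
    frames = [x[i:i + frame] for i in range(0, len(x) - frame, hop_in)]
    out = np.zeros(hop_out * (len(frames) - 1) + frame)
    win = np.hanning(frame)
    for k, f in enumerate(frames):                   # overlap-add reconstruction on domain y
        out[k * hop_out:k * hop_out + frame] += f * win
    return out

def resample(x, p, q):
    """P/Q resampling by linear interpolation; tone and speed become Q/P times."""
    n_out = max(1, int(len(x) * p / q))
    return np.interp(np.linspace(0, len(x) - 1, n_out), np.arange(len(x)), x)

def change_tone_keep_speed(x, tone_ratio):
    stretched = ola_stretch(x, stretch=tone_ratio)   # make the signal tone_ratio times longer
    return resample(stretched, p=1, q=tone_ratio)    # P/Q = 1/1.5 = 2/3 restores the duration

shifted = change_tone_keep_speed(np.random.randn(44100), tone_ratio=1.5)
```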
S3604, the electronic device 100 plays the target audio data.
In this embodiment, the electronic device 100 may play the target audio data after obtaining the target audio data from the audio data of the navigation sound that is currently required to be played. The value of the characteristic parameter of the target audio data is higher than the value of the first characteristic parameter (namely the characteristic parameter of the audio data of the navigation sound which is required to be broadcasted currently), so that the aim of reminding the driver can be achieved. For example, when the tone of the audio data being played is high and/or the loudness of the sound is high, the sound heard by the driver will be relatively harsh, so that the purpose of stimulating the driver can be achieved, thereby improving the attention of the driver. In some embodiments, when the electronic device 100 is not integrated in the vehicle 200, the electronic device 100 may play the target audio data through its own speaker, or may transmit the target audio data to the vehicle 200 and be played by the speaker of the vehicle 200. When the electronic device 100 is integrated in the vehicle 200, the electronic device 100 may play the target audio data through a speaker of the vehicle 200.
Therefore, when the driver is detected to be tired, the characteristic parameters (such as tone, loudness and the like) of the audio data broadcasted by the electronic device 100 in a navigation manner can be changed according to the fatigue level of the driver, so that the broadcasted audio data can impact the driver in an auditory sense, and the attention of the driver is improved, and safe driving is realized.
In some embodiments, the electronic device 100 may further determine a corresponding prompt voice according to the fatigue level, and broadcast the target audio data and the prompt voice based on a preset broadcasting sequence, so that the broadcasting mode and the language are more vivid and humanized, and the user experience is improved. In addition, if the navigation sound does not need to be broadcasted currently, but the corresponding prompt voice is determined according to the fatigue grade, the electronic device 100 may directly play the prompt voice. For example, the electronic device 100 may query a mapping relationship between a preset fatigue level and a prompt voice according to the fatigue level, and determine the currently required prompt voice. Alternatively, the prompt voice corresponding to each fatigue level may be preset by the user, or may be a template sentence preset in the electronic device 100.
For example, as shown in Table 2, when the fatigue level is 2, the prompt voice may be determined to be "Attention! The driver is moderately tired; please open the window for ventilation." At this time, if the target audio data is "Please turn left 50 meters ahead", the audio data to be broadcast by the electronic device 100 may be "Please turn left 50 meters ahead. Attention! The driver is moderately tired; please open the window for ventilation."
TABLE 2
In addition, if there is no navigation sound to be broadcasted currently, the electronic device 100 may determine a corresponding alert voice based on the fatigue level, and broadcast the alert voice.
In addition, the electronic device 100 may determine the prompt voice to be broadcast according to the fatigue level together with the map information in the navigation. For example, with continued reference to Table 2, when the fatigue level is 3, the prompt voice to be broadcast is "Attention! Attention! The driver is extremely tired; you may stop and rest at the xxx intersection/supermarket/transfer station xxx meters away." When the electronic device 100 determines from the map information in the navigation that there is a service area 500 meters away, it can determine that the prompt voice to be broadcast is "Attention! Attention! The driver is extremely tired; you may stop and rest at the service area 500 meters away." Here, the electronic device 100 splices the "service area 500 meters away" determined from the map information into the template prompt voice "Attention! Attention! The driver is extremely tired; you may stop and rest at the xxx intersection/supermarket/transfer station xxx meters away" to obtain the final prompt voice to be broadcast.
For the audio splicing process, pulse code modulation (pulse code modulation, PCM) data of one piece of audio data is inserted at a certain point in time in PCM data of another piece of audio data, i.e., the splicing of the two pieces of audio data is completed. For example, assuming that one piece of audio data a is [1,2,3,4,5] and the other piece of audio data B is [7,8,9], if it is necessary to insert the audio data B between "3" and "4" in the audio data a, "7", "8", "9" may be inserted between "3" and "4", thereby splicing the audio data a and B together.
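The splicing example above, expressed directly on PCM sample lists:

```python
audio_a = [1, 2, 3, 4, 5]
audio_b = [7, 8, 9]
spliced = audio_a[:3] + audio_b + audio_a[3:]   # insert B between "3" and "4" of A
assert spliced == [1, 2, 3, 7, 8, 9, 4, 5]
```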
In some embodiments, to further increase the driver's attention, the electronic device 100 may also determine the color and/or flicker frequency of a signal lamp provided in the vehicle 200 according to the fatigue level, and control the signal lamp in the vehicle 200 to operate with the determined color and/or flicker frequency. This creates a visual impact on the driver, further improving the driver's attention and enabling safe driving, and the visual warning is synchronized with the audible navigation sound. For example, the electronic device 100 may query a preset mapping relationship between the fatigue level and the signal lamp based on the determined fatigue level, and determine the color and/or flicker frequency of the signal lamp. Illustratively, the higher the fatigue level, the brighter the color of the signal lamp and the higher its flicker frequency may be. For example, Table 3 shows a mapping relationship between the fatigue level and the color and flicker frequency of the signal lamp; when the determined fatigue level is 2, it can be determined that the color of the signal lamp is yellow and the flicker frequency is 60 times per minute.
TABLE 3
In some embodiments, when the vehicle 200 is in an automatic driving state, the driver's attention is generally not required. However, the driver is often required to take control of the vehicle when the road conditions are poor (e.g., on accident-prone road segments) or at critical road segments where the user needs to be alerted (e.g., intersections requiring a turn). Therefore, to improve safety during automatic driving, the electronic device 100 may broadcast the target audio data in combination with the road condition information outside the vehicle 200 when the vehicle 200 is in the automatic driving state. In addition, when the vehicle 200 is not in the automatic driving state, if the current road conditions are poor or the vehicle is at a critical road segment where the user needs to be reminded, the electronic device 100 may also play the target audio data to remind the driver to concentrate.
As one possible implementation, after the driver triggers the autopilot function on the vehicle 200, the vehicle 200 may notify the electronic device 100 of the information that it is in an autopilot state. Thus, the electronic device 100 can learn that the vehicle 200 is in the automatic driving state.
In addition, the vehicle 200 may collect road condition information outside it using its external sensors (such as radar, cameras, etc.) and transmit the collected information to the electronic device 100. After the electronic device 100 obtains the road condition information outside the vehicle 200, it can broadcast the target audio data when the road conditions are poor.
By way of example, fig. 38 illustrates a sound processing method in some embodiments of the present application. In fig. 38, the electronic device 100 is a device separate from the vehicle 200, such as a mobile phone, and a connection is established between the electronic device 100 and the vehicle 200 by a short-range communication method such as bluetooth. In fig. 38, a driver uses the electronic device 100 for navigation. In fig. 38, S3801, S3802, S3804, S3805 may refer to the related description in fig. 36, and will not be described here. As shown in fig. 38, the method may include the steps of:
S3801, the vehicle 200 acquires face data of the driver.
S3802, the vehicle 200 determines the fatigue level of the driver from the face data of the driver.
S3803, the vehicle 200 transmits the fatigue level of the driver to the electronic device 100.
In this embodiment, after the vehicle 200 determines the fatigue level of the driver, the fatigue level may be transmitted to the electronic device 100.
In other embodiments, the vehicle 200 may also directly transmit the face data of the driver acquired in step S3801 to the electronic apparatus 100. Further, the electronic device 100 may determine the fatigue level of the driver from the face data of the driver.
S3804, the electronic device 100 determines a target adjustment value of a first feature parameter according to the fatigue level, where the first feature parameter is a feature parameter of the audio data to be played currently.
S3805, the electronic device 100 processes the audio data to be played currently according to the target adjustment value to obtain target audio data, wherein the value of the characteristic parameter of the target audio data is higher than the value of the first characteristic parameter.
S3806, the electronic device 100 transmits the target audio data to the vehicle 200.
In this embodiment, after determining the target audio data, the electronic device 100 may send the target audio data to the vehicle 200.

S3807, the vehicle 200 plays the target audio data.

In this embodiment, after the vehicle 200 acquires the target audio data, the target audio data may be played.
In other embodiments, in step S3806, the electronic device 100 may play the target audio data through its own speaker, i.e., the electronic device 100 does not need to send the target audio data to the vehicle 200.
Therefore, when the driver is detected to be tired, the characteristic parameters (such as tone, loudness and the like) of the audio data broadcasted by the electronic device 100 in a navigation manner can be changed according to the fatigue level of the driver, so that the broadcasted audio data can impact the driver in an auditory sense, and the attention of the driver is improved, and safe driving is realized.
By way of example, fig. 39 illustrates a sound processing method in some embodiments of the present application. In fig. 39, the electronic device 100 is a device separate from the vehicle 200, such as a mobile phone, and a connection is established between the electronic device 100 and the vehicle 200 by a short-range communication method such as bluetooth. In fig. 39, the driver uses the electronic device 100 for navigation. In fig. 39, S3901 to S3906, reference may be made to the foregoing related description, and the description is omitted here. As shown in fig. 39, the method may include the steps of:
S3901, the vehicle 200 acquires face data of the driver.
S3902, the vehicle 200 determines a fatigue level of the driver from the face data of the driver.
S3903, the vehicle 200 determines a target adjustment value of a first characteristic parameter according to the fatigue level, wherein the first characteristic parameter is a characteristic parameter of the audio data to be played currently.
S3904, the electronic device 100 transmits audio data to be played to the vehicle 200.
S3905, the vehicle 200 processes the audio data to be played according to the target adjustment value to obtain target audio data, wherein the value of the characteristic parameter of the target audio data is higher than that of the first characteristic parameter.
S3906, the vehicle 200 plays the target audio data.
In other embodiments, in step S3906, the vehicle 200 may also send the target audio data to the electronic device 100, so that the electronic device 100 plays the target audio data.
Therefore, when the driver is detected to be tired, the characteristic parameters (such as tone, loudness and the like) of the audio data broadcasted by the electronic device 100 in a navigation manner can be changed according to the fatigue level of the driver, so that the broadcasted audio data can impact the driver in an auditory sense, and the attention of the driver is improved, and safe driving is realized.
By way of example, fig. 40 illustrates a sound processing method in some embodiments of the present application. In fig. 40, the electronic device 100 is a device separate from the vehicle 200, such as a mobile phone, and a connection is established between the electronic device 100 and the vehicle 200 by a short-range communication method such as bluetooth. In fig. 40, a driver uses the electronic device 100 for navigation. In fig. 40, S4001 to S4007 can be referred to the foregoing related description, and will not be repeated here. As shown in fig. 40, the method may include the steps of:
S4001, the vehicle 200 acquires face data of the driver.
S4002, the vehicle 200 determines the fatigue level of the driver from the face data of the driver.
S4003, the vehicle 200 determines a target adjustment value of a first characteristic parameter according to the fatigue level, wherein the first characteristic parameter is a characteristic parameter of the audio data to be played currently.
S4004, the vehicle 200 transmits the target adjustment value to the electronic apparatus 100.
S4005, the electronic device 100 processes the audio data to be played according to the target adjustment value to obtain target audio data, wherein the value of the characteristic parameter of the target audio data is higher than the value of the first characteristic parameter.
S4006, the electronic apparatus 100 transmits the target audio data to the vehicle 200.
S4007, the vehicle 200 plays the target audio data.
In other embodiments, in step S4006, the electronic device 100 may play the target audio data through its own speaker, i.e., the electronic device 100 does not need to send the target audio data to the vehicle 200.
Therefore, when the driver is detected to be tired, the characteristic parameters (such as tone, loudness and the like) of the audio data broadcasted by the electronic device 100 in a navigation manner can be changed according to the fatigue level of the driver, so that the broadcasted audio data can impact the driver in an auditory sense, and the attention of the driver is improved, and safe driving is realized.
It will be appreciated that in the embodiments illustrated in fig. 38 to fig. 40, the data that can be exchanged between the electronic device 100 and the vehicle 200 includes, but is not limited to, the face data of the driver, the fatigue level of the driver, the target adjustment value of the first characteristic parameter, the audio data to be played, the target audio data, and the like. It should also be understood that the processes of determining the fatigue level of the driver, determining the target adjustment value of the first characteristic parameter, processing the audio data to be played, and so on, may be performed on the electronic device 100 or on the vehicle 200. For example, after the vehicle 200 acquires the face data of the driver, the fatigue level of the driver may be determined by the vehicle 200, or the vehicle 200 may transmit the face data to the electronic device 100 and the fatigue level may be determined by the electronic device 100. For another example, the vehicle 200 may determine the target adjustment value of the first characteristic parameter according to the fatigue level of the driver and send it to the electronic device 100, or the electronic device 100 may itself determine the target adjustment value of the first characteristic parameter according to the fatigue level of the driver. This is not limited in this application. In some possible implementations, the execution subject of each step in the foregoing embodiments may be adaptively adjusted according to the actual situation, and the adjusted solution is still within the protection scope of the present application.
6. Scenario: the user selects multiple pieces of audio data to be superimposed and played.
Generally, when people rest, playing white noise can help them fall asleep. However, playing white noise alone gives users a poor listening experience. Therefore, some other sounds can be played at the same time as the white noise, for example songs that the user likes. At present, however, when white noise and other sounds are played simultaneously, the two are simply mixed, so the two sounds blend poorly, and the listening experience for the user is still relatively poor.
In view of this, embodiments of the present application provide a sound processing method. The method reconstructs the white noise selected by the user based on the background sound (i.e., the other sound) selected by the user, so that the white noise and the background sound can be fused together more naturally, bringing the user a better listening experience.
By way of example, fig. 41 shows a sound processing method. It is to be appreciated that the method may be performed by any apparatus, device, platform, cluster of devices having computing, processing capabilities, such as, but not limited to, by a speaker, a cell phone, etc. In some embodiments, the method may be performed when the user turns on a target function (e.g., white noise function, etc.), and the user has a need to play audio data. For example, when the method is performed by a device having a display screen such as a mobile phone, a user may start a target function in a system of the device or some Application (APP) on the device, and the user may play a song using the device. When the method is executed by equipment such as a sound box and the like without a display screen, a user can control the equipment such as the sound box and the like through other equipment connected with the equipment such as the sound box and the like so as to start target functions on the equipment such as the sound box and the like, and the user can play songs by using the equipment such as the sound box and the like.
As shown in fig. 41, the sound processing method may include the steps of:
S4101, acquiring first audio data and second audio data.
In this embodiment, the first audio data may be background sound, and the second audio data may be white noise. By way of example, the background sound may be, but is not limited to, a song.
When the user selects the background sound and the white noise, the first audio data and the second audio data can be acquired from the network or the local database based on the selection of the user. For example, an Application (APP) associated with playing audio data may be configured on an electronic device (such as a cell phone, etc.), where a user may select background sounds and white noise.
In addition, when a user selects a background sound, the background sound may be acquired from a network or a local database based on the user's selection. Meanwhile, the mapping relation between the preset background sound and the white noise can be queried based on the background sound, and the white noise matched with the background sound can be obtained from a network or a local database.
In some embodiments, the first duration of the first audio data may be equal to the second duration of the second audio data, such that both may be played in synchronization.
When the second time length of the second audio data is longer than the first time length of the first audio data, the data with the time length equal to the first time length can be cut from the second audio data, and the cut data is used as the required second audio data. For example, when the first time period is 10 seconds and the second time period is 20 seconds, the first 10 seconds of data in the second audio data may be taken as the required data, or the 5 th to 15 th seconds of data in the second audio data may be taken as the required data.
When the second time length of the second audio data is smaller than the first time length of the first audio data, the plurality of second audio data can be spliced, the data with the time length equal to the first time length is intercepted from the spliced data, and the intercepted data is used as the required second audio data.
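A sketch of this duration matching is shown below (array names are assumptions): trim the white noise when it is longer than the background sound, and splice copies then trim when it is shorter.

```python
import numpy as np

def match_duration(white_noise: np.ndarray, background: np.ndarray) -> np.ndarray:
    n = len(background)
    if len(white_noise) >= n:
        return white_noise[:n]                       # cut a segment of equal duration
    repeats = -(-n // len(white_noise))              # ceiling division: number of copies needed
    return np.tile(white_noise, repeats)[:n]         # splice several copies, then cut
```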
S4102, acquiring target audio features of the first audio data, where the target audio features include: loudness at various times and location points of various beats.
In this embodiment, for the loudness of each time, the amplitude of the waveform of each time may be determined according to the waveform diagram of the first audio data in the time domain, so as to determine the loudness of each time. Where one amplitude is the loudness of one moment.
For the position points of each beat, the first audio data can be input into a machine learning model which is obtained through training in advance so as to obtain the position points of each beat; the machine learning model can be obtained based on deep learning neural network training. In addition, the first audio data may be processed based on a beat detection algorithm (such as librosa, etc.) to obtain a location point of each beat in the first audio data.
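One possible way to extract these target audio features is sketched below using librosa, which the text names as an example beat-detection library; the file path is a placeholder, and RMS energy per frame is used here as a simple proxy for the loudness at each moment.

```python
import librosa

y, sr = librosa.load("background_sound.wav", sr=None, mono=True)
frame_loudness = librosa.feature.rms(y=y)[0]               # loudness (RMS) per analysis frame
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)   # estimated tempo and beat frames
beat_times = librosa.frames_to_time(beat_frames, sr=sr)    # position points of each beat (seconds)
```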
S4103, processing the second audio data according to the target audio characteristics to obtain third audio data.
In this embodiment, the target loudness corresponding to each time in the second audio data may be determined based on the loudness of each time in the first audio data and in combination with a preset proportional relationship between the noise loudness and the music loudness. Further, the loudness of each time in the second audio data may be adjusted, so that the loudness of each time is adjusted to the determined target loudness corresponding to each time. For example, if the loudness of the first time in the first audio data is 10 db, the ratio between the preset noise loudness and the music loudness is 1/2, so that it can be determined that the target loudness of the first time in the second audio data is 5 db. Further, the loudness of the second audio data at the first time may be adjusted to 5 db.
In addition, the pitch of the second audio data may be adjusted based on the position points of the respective beats so that the pitch of the second audio data matches the tempo of the first audio data. For example, when the first audio data is relaxed for a certain period of time, the tone of the second audio data may be reduced for that period of time, so that the second audio data is also gradually relaxed.
As a possible implementation, it may be determined whether to adjust the tone of the second audio data based on the time interval between two adjacent beats in the first audio data and a preset reference beat, and whether to raise or lower the tone when the tone of the second audio data needs to be adjusted.
For example, assume that the preset reference tempo is: beats per minute was 30 beats. When the time interval between two adjacent beats in the first audio data is 1 second, it can be determined that the corresponding beats of the two adjacent beats are 60 beats per minute. At this time, the determined tempo is larger than the reference tempo, indicating that the tempo of the first audio data is faster between the two adjacent beats. Accordingly, the tone in the second audio data can be raised in the same period of time, so that the emotion expressed by the first audio data and the second audio data is the same in the period of time.
Further, after determining that the tone of the second audio data needs to be adjusted, a target tone adjustment value for the data between the position points of the two beats can be determined from the rhythm determined by the two adjacent beats and a preset mapping relationship between rhythm and tone adjustment. Next, the tone of the data between the position points of the two beats in the second audio data may be adjusted based on the target tone adjustment value using a pitch-shifting algorithm. Illustratively, when the target tone adjustment value is 0.8, the tone of that data may be reduced to 0.8 times the original tone based on the pitch-shifting algorithm. In some embodiments, the tone may be raised by extracting a certain number of sampling points from the data to be adjusted in a down-sampling manner; conversely, the tone may be lowered by inserting a certain number of sampling points into the data to be adjusted in an up-sampling manner.
As yet another possible implementation manner, it may be determined whether to adjust the sound speed of the second audio data (i.e., the audio playing speed) based on the time interval between two adjacent beats in the first audio data and a preset reference tempo, and, when the sound speed of the second audio data needs to be adjusted, whether to increase the sound speed or decrease the sound speed. For the manner of determining whether to raise or lower the speed of sound, reference may be made to the foregoing manner of determining whether to raise or lower the tone, and no further description is given here.
Further, after it is determined that the sound speed of the second audio data needs to be adjusted, a target sound speed adjustment value for the data between the position points of the two beats may be determined according to the tempo determined by the two adjacent beats and a preset mapping relationship between tempo and sound speed adjustment. Next, the sound speed of the data between the position points corresponding to the two beats in the second audio data may be adjusted based on the target sound speed adjustment value. For example, when the target sound speed adjustment value is 0.8, the sound speed of the data between the position points corresponding to the two beats in the second audio data may be reduced to 0.8 times the original sound speed. In some embodiments, the purpose of increasing the sound speed may be achieved by extracting (removing) a certain number of sample points from the data to be adjusted, i.e., down-sampling; correspondingly, the purpose of decreasing the sound speed may be achieved by inserting a certain number of sample points into the data to be adjusted, i.e., up-sampling.
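A minimal sketch of the sample-removal/insertion approach described above is shown below. Note that this naive resampling also shifts the pitch; if the pitch must be preserved, a phase-vocoder time stretch could be used instead, which is a different technique from the one sketched here.

```python
# Minimal sketch: changing the sound speed of a beat segment by removing or inserting sample points.
import numpy as np

def change_speed(segment, speed_factor):
    # speed_factor > 1 plays faster (samples removed), < 1 plays slower (samples inserted)
    n_out = int(round(len(segment) / speed_factor))
    query_idx = np.linspace(0, len(segment) - 1, num=n_out)
    return np.interp(query_idx, np.arange(len(segment)), segment)
```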
It will be appreciated that the pitch and the sound speed of the second audio data may be adjusted simultaneously, or only one of them may be adjusted, which is not limited herein.
After processing the second audio data based on the target audio feature, the third audio data may be obtained, and S4104 may be performed.
S4104, playing target audio data, the target audio data being obtained based on the first audio data and the third audio data.
In this embodiment, the first audio data and the third audio data may be subjected to audio mixing processing by an audio mixing algorithm to obtain the target audio data. For example, when the types of the first audio data and the third audio data are both floating point (float), the first audio data and the third audio data may be directly superimposed and mixed to obtain the target audio data. When the types of the first audio data and the third audio data are not float, the first audio data and the third audio data may be processed by a mixing algorithm such as adaptive weighted mixing or linear superposition averaging to obtain the target audio data.
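The sketch below illustrates the two mixing cases mentioned above in simplified form; the weights and the clipping step are illustrative assumptions, not the embodiment's actual mixing algorithm.

```python
# Minimal sketch: mixing the first audio data with the third audio data.
import numpy as np

def mix(first, third, w_first=0.5, w_third=0.5):
    n = min(len(first), len(third))
    if first.dtype.kind == "f" and third.dtype.kind == "f":
        mixed = first[:n] + third[:n]                      # direct superposition for float data
        return np.clip(mixed, -1.0, 1.0)                   # keep the result within full scale
    # otherwise fall back to a linear weighted average
    return (w_first * first[:n].astype(np.float64) +
            w_third * third[:n].astype(np.float64))
```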
Therefore, the second audio data is modified based on the audio characteristics of the first audio data, so that the first audio data and the second audio data can be fused together more naturally, and better hearing experience is brought to a user.
7. Scene of making a video or a moving picture.
Generally, in the process of making a video or a moving picture, a spatial sound effect can be added to an object in the video or the moving picture, so that when the video or the moving picture is watched later, the user can immersively experience sound similar to that in the real world, thereby bringing a better viewing experience. In some embodiments, the video may be obtained by editing an original video, or may be generated from multiple pictures, which is not limited herein. The moving picture may be understood as a file in the graphics interchange format (graphics interchange format, GIF).
In view of this, the embodiment of the present application further provides a sound processing method. When a user makes a video or a moving picture on an electronic device, spatial audio may be added to a target object in the video or the moving picture according to the user's own needs, so that the sound of the target object moves along with the movement of the target object, making the listening experience of the user more realistic and improving the viewing experience. The sound processing method imposes no requirement on the environment or the information acquisition equipment, and the audio position of an object in the video or the moving picture is consistent with the actual position of the object, so that no split between what is heard and what is seen occurs when the user subsequently watches the video, which improves the user experience.
By way of example, fig. 42 shows a sound processing method. It is understood that the method may be performed by any apparatus, device, platform, cluster of devices having computing, processing capabilities. As shown in fig. 42, the method may include the steps of:
S4201, determining N pictures, wherein N is more than or equal to 2.
In this embodiment, the N pictures may be pictures selected by the user. For example: the user can select N pictures from the electronic device such as a mobile phone, so as to make video from the N pictures.
The N pictures may also be pictures taken by the user over a period of time. For example, after the user takes pictures using an electronic device such as a mobile phone, the pictures taken within one week, one month, or one year may be determined as the required N pictures.
The N pictures may also be pictures extracted from a target video selected by the user according to a preset sampling frequency. In the process of extracting the N pictures from the target video, the time corresponding to each extracted picture may be recorded. For example, if the sampling frequency is one picture every 1 second (s), and the time of the first picture is 0 s, then the time of the second picture is 1 s, the time of the third picture is 2 s, and so on.
The N pictures may also be pictures extracted from the motion pictures. In some embodiments, a moving picture may be understood as being formed by stitching a plurality of pictures, and thus, N pictures may be a plurality of pictures that make up the moving picture.
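For the case, described above, in which the N pictures are extracted from a target video at a preset sampling frequency, the following is a minimal sketch using OpenCV; the file name and the one-picture-per-second frequency are assumptions for illustration.

```python
# Minimal sketch: extracting pictures from the target video once per second and
# recording the time corresponding to each extracted picture.
import cv2

cap = cv2.VideoCapture("target_video.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
pictures, times = [], []
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % int(round(fps)) == 0:        # keep one picture every second
        pictures.append(frame)
        times.append(frame_idx / fps)           # time of this picture: 0 s, 1 s, 2 s, ...
    frame_idx += 1
cap.release()
```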
S4202, determining the time when each picture in the N pictures appears in the target video, wherein the target video is obtained based on the N pictures.
In this embodiment, the N pictures may be uniformly arranged over the playing duration, based on a default playing duration required for playing the N pictures or on a playing duration set by the user, and according to a preset sequence, so that the pictures are played at equal intervals. For example, if N=10 and the playing duration set by the user is 9 s, one picture may be placed at 0 s, 1 s, 2 s, …, and 9 s, respectively, for playing. The preset order may be, for example, the chronological order in which the pictures were taken or extracted, or an order specified by the user. Optionally, the duration of the target video may be the default playing duration required for playing the N pictures, or the playing duration set by the user.
In some embodiments, when N pictures are extracted from the video selected by the user, a corresponding time of each picture in the video may be used as a time when each picture appears in the target video. At this time, the target video may be the same as the video selected by the user.
In some embodiments, when N pictures are extracted from the moving pictures, a time when each picture appears in the moving picture may be used as a time when each picture appears in the target video. At this time, the target video can be understood as the moving picture. In addition, it is also possible to set a time for each picture individually and to reproduce these pictures as video or moving pictures later. Alternatively, the duration of the target video may be the duration required to play the moving picture.
In some embodiments, audio data adapted to the N pictures may be screened out based on the pictures, and the time when each of the N pictures appears in the target video may then be determined according to the determined audio data. For example, the N pictures may be input into an artificial intelligence (artificial intelligence, AI) model (e.g., a machine learning model, a neural network model, etc.) to be processed by the AI model to obtain audio data adapted to the pictures. The audio data may be data stored in a local database, or may be audio data on a network, which is not limited herein. Optionally, the duration of the target video may be the duration of the screened-out audio data.
When the acquired audio data is long, a segment may be intercepted from it as the required audio data. For example, but not limited to, the climax part of the audio data may be taken as the required audio data.
After the desired audio data is obtained, the audio data may be analyzed to determine the location points of the individual beats in the audio data and/or the location points of each bar. The position point of each beat is understood to be the point in time of the start position of each beat, and the position point of each bar is understood to be the point in time of the start position of each bar. Illustratively, the position points of the respective beats in the audio data and/or the position points of each bar may be extracted by an AI model, a beat extraction algorithm, or the like.
Then, the playing duration of the determined audio data can be obtained, and the N pictures can be uniformly arranged at equal intervals over the playing duration. Next, the appearance time of at least a part of the N pictures is adjusted based on the determined position points of each beat and/or each bar, so that the appearance time of at least a part of the N pictures coincides with the position points of some beats or some bars. In this way, a visual change is presented at the auditory key points, i.e., the user sees a new picture at the auditory key points, so that viewing and listening produce a consistent sense of impact, further improving the user experience.
When the position point of each beat is used to adjust the appearance time of at least a part of the N pictures, for any one picture, when no picture is set at the position point of one beat closest to the appearance time of the picture, the appearance time of the picture can be adjusted to the position point.
For example, as shown in fig. 43, assume that a total of 5 beats of position points are determined, and there are 4 pictures. As shown in fig. 43 (a), after 4 pictures are arranged at equal intervals, the timing at which the picture 1 appears is at the position point of beat 0, the timing at which the picture 4 appears is at the position point of beat 5, neither picture 2 nor picture 3 appears at the position point of the corresponding beat, and picture 2 is closest to the position point of beat 2, and picture 3 is closest to the position point of beat 3. Therefore, as shown in fig. 43 (B), the timing at which the picture 2 appears can be adjusted to the position point of the beat 2, and the timing at which the picture 3 appears can be adjusted to the position point of the beat 3.
In addition, when pictures are already placed at the position points of two adjacent beats and other pictures still exist between the position points of these two beats, the appearance time of those pictures may be left unadjusted, or those pictures may be evenly distributed between the position points of the two beats; this may be determined according to the actual situation and is not limited herein.
When the position point of each bar is adopted to adjust the appearance time of at least a part of the N pictures, for any picture, when no picture is set at the position point of one bar which is closest to the appearance time of the picture, the appearance time of the picture can be adjusted to the position point. In addition, when the distance between the moment when the picture appears and the position point where the bar ends is smaller than the distance between the moment when the picture appears and the position point where the bar starts, the moment when the picture appears can be adjusted to the position point where the bar ends. The specific adjustment method may refer to the adjustment method when the position points of each beat are adopted, and will not be described herein.
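The beat-based adjustment described above can be summarized by the following minimal sketch, which evenly spaces the pictures and then snaps each picture to the nearest unoccupied beat position point; the snapping rule is a simplified reading of this step, and all times are in seconds.

```python
# Minimal sketch: arranging N pictures at equal intervals and snapping them to beat position points.
def arrange_pictures(n_pictures, duration, beat_times):
    step = duration / max(n_pictures - 1, 1)
    times = [i * step for i in range(n_pictures)]              # equal-interval arrangement
    occupied = set()
    for i, t in enumerate(times):
        nearest = min(beat_times, key=lambda b: abs(b - t))    # closest beat position point
        if nearest not in occupied:
            times[i] = nearest                                 # move the picture onto that beat
            occupied.add(nearest)
    return times
```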
In some embodiments, in addition to screening out the desired audio data based on the N pictures, some audio data specified by the user may also be used as the desired audio data, and the moment at which each of the N pictures appears in the target video is determined in the manner described above.
S4203, determining target objects contained in each of the N pictures to obtain M target objects.
In this embodiment, each of the N pictures may be input into a target detection model obtained by training in advance, so as to detect a target object included in each picture through the target detection model, thereby obtaining the target object included in each picture. By way of example, the target detection model may be, but is not limited to being, trained based on convolutional neural networks (convolutional neural networks, CNN). By way of example, the target object may be understood as an object in a picture that is capable of producing sound, for example, when an aircraft is included in the picture, the target object in the picture may be an aircraft.
As a possible implementation manner, each picture may be further processed based on a target detection algorithm (such as YOLOv4, etc.), so as to obtain a target object contained in each picture.
As another possible implementation manner, after the target object is acquired based on the target detection model or the target detection algorithm, a selection interface of the target object may also be displayed to the user, so that the user can select the target object required by the user. At this time, the target object is the target object which is selected by the user and is required.
As a further possible implementation manner, the target object included in each picture may also be acquired based on a selection operation performed on the picture by the user. For example, after determining the N pictures, each picture may be presented to the user. When a user views a certain picture, the user can mark a target object in the picture in a manual marking mode.
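The following sketch illustrates per-picture target object detection. The ultralytics YOLOv8 package and the model file name are assumptions used only as a stand-in for the pre-trained target detection model or YOLOv4-style algorithm mentioned above.

```python
# Minimal sketch: detecting the target objects contained in each picture.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")   # assumed pre-trained detection model

def detect_targets(picture_paths):
    targets = {}
    for path in picture_paths:
        result = model(path)[0]
        targets[path] = [
            (result.names[int(box.cls[0])], box.xyxy[0].tolist())   # (class name, bounding box)
            for box in result.boxes
        ]
    return targets
```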
S4204, determining spatial positions of M target objects in each picture to obtain (M x N) spatial positions, and determining time lengths of each target object in the target video to obtain M first time lengths.
In this embodiment, to determine the spatial positions of the M target objects in each picture, a three-dimensional coordinate system may be constructed with the position of the device that takes the picture as the center. The center position of each of the N pictures is the origin of the three-dimensional coordinate system. In the three-dimensional coordinate system, the plane formed by the x axis and the y axis may be the plane in which the picture lies, and the z axis may represent depth, which describes the actual distance from the target object to the device taking the picture. The position of the target object in the three-dimensional coordinate system may be represented by (x_i, y_i, z_i). After the coordinate system is determined, the values of x_i and y_i can be determined. The value of z_i may be acquired by a time of flight (ToF) camera on the device taking the picture, or by a pre-trained depth detection model. By way of example, the depth detection model may be, but is not limited to being, obtained based on convolutional neural network training. It should be understood that, in this embodiment, the spatial position of the target object may refer to the position of the target object in the three-dimensional coordinate system.
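A minimal sketch of assembling such a spatial position from a detected bounding box and a depth value is given below; the use of the bounding box center and the sign convention for the y axis are illustrative assumptions, and the depth source (ToF reading or depth model output) is abstracted as a parameter.

```python
# Minimal sketch: building the spatial position (x_i, y_i, z_i) of a target object,
# with the picture center as the origin of the x-y plane and a depth estimate as z.
def spatial_position(bbox, picture_width, picture_height, depth):
    x1, y1, x2, y2 = bbox
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0      # center of the target object in pixels
    x = cx - picture_width / 2.0                    # x relative to the picture center
    y = picture_height / 2.0 - cy                   # y relative to the picture center (up is positive)
    z = depth                                       # distance from the target object to the camera
    return (x, y, z)
```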
In some embodiments, when the N pictures are pictures selected by the user, or pictures taken by the user during a period of time, the N pictures may be sorted according to the time at which each picture was taken. Then, the spatial positions of the target objects contained in each picture can be determined sequentially in chronological order, from the earliest to the latest.
For the kth target object in the ith picture, when the kth target object does not exist in the pictures before the ith picture, the spatial position of the kth target object in each of the pictures before the ith picture may be considered to be at infinity. It should be understood that, in this embodiment, the ith picture may be any one of the N pictures, and the kth target object may be any target object in the ith picture.
For example, referring to fig. 44, fig. 44 (a) shows the (i-1) th picture, fig. 44 (B) shows the i-th picture, fig. 44 (C) shows the (i+1) th picture, and the determined target object is the bird 4301 shown in fig. 44 (B). The bird 4301 is not present in the picture shown in fig. 44 (a), and the photographing time of the picture is before the photographing time of the picture shown in fig. 44 (B), and there is no other picture before the picture shown in fig. 44 (a), and therefore, the bird 4301 can be placed at an infinite position at the spatial position of the picture shown in fig. 44 (a).
When the kth target object does not exist in the (i+1) th picture, a position on a certain boundary on the (i+1) th picture may be taken as a spatial position of the kth target object in the (i+1) th picture. The position on the boundary may be a specified position on the boundary, or may be a position on the boundary of the target object oriented as determined in the i-th picture.
For example, with continued reference to fig. 44, the bird 4301 is not present in the picture shown in fig. 44 (C) (i.e., the (i+1) th picture), and the photographing time of the picture is after the photographing time of the picture shown in fig. 44 (B), therefore, the bird 4301 can be placed at a certain boundary position of the picture shown in fig. 44 (C) at the spatial position of the picture. Since the bird 4301 is moved toward the upper left of the picture in fig. 44 (B), a position at a boundary of the upper left of the picture shown in fig. 44 (C) (such as a position shown in the area 4302) can be regarded as a spatial position of the bird 4301.
When the kth target object exists in the ith picture and the (i+1) th picture and the kth target object does not exist in the (i+2) th picture, the moving direction of the kth target object can be determined according to the spatial positions of the kth target object in the ith picture and the (i+1) th picture, and the position of the (i+2) th picture at the boundary in the moving direction is used as the spatial position of the kth target object in the (i+2) th picture.
For example, referring to fig. 45, fig. 45 (A) shows the ith picture, fig. 45 (B) shows the (i+1) th picture, fig. 45 (C) shows the (i+2) th picture, and the determined target object is the bird 4501 shown in fig. 45 (A) and (B). The bird 4501 is present in both (A) and (B) of fig. 45, but is not present in (C) of fig. 45. From fig. 45 (A) and (B), it can be determined that the moving direction of the bird 4501 is the direction indicated by the arrow. Since the boundary in the direction indicated by the arrow in fig. 45 (C) is the region 42, the region 42 can be regarded as the spatial position of the bird 4501 in the (i+2) th picture.
Further, when the kth target object is not present in the (i+3) th picture, a position is determined outside the (i+3) th picture according to the moving direction, the moving speed and the time interval between the (i+2) th picture and the (i+3) th picture, and the position is taken as the spatial position of the kth target object at the (i+3) th picture.
For example, with continued reference to fig. 45, fig. 45 (D) shows the (i+3) th picture. In fig. 45 (D), the bird 4501 is not present either. The moving direction (i.e., the direction indicated by the arrow in the figure) and the moving speed of the bird 4501 can be determined from (a) and (B) of fig. 45. Then, from the moving direction and the moving speed, and the time intervals between the pictures shown in fig. 45 (C) and (D), it can be determined that in fig. 45 (D), the bird 4501 can move to the position shown in the area 43. Thus, the position shown in region 43 can be taken as the spatial position of the bird 4501 at the (i+3) th picture. Wherein the time interval between two adjacent pictures is described in detail below.
In some embodiments, for the kth target object in the ith picture, when the kth target object does not exist in the pictures before the ith picture, besides setting the spatial position of the kth target object in each picture before the ith picture to infinity, the spatial position may also be determined in a manner similar to that used for determining the spatial position of the kth target object in the (i+j) th picture, where j is greater than or equal to 1. Specifically, for the spatial position of the kth target object in the (i-1) th picture, the kth target object may be placed at a certain position on the boundary of the (i-1) th picture; for example, a position on the boundary of the (i-1) th picture in the direction opposite to the moving direction of the kth target object may be used as its spatial position in the (i-1) th picture. For details, reference may be made to the foregoing manner of determining the spatial position of the kth target object in the (i+1) th picture, which is not repeated herein.
In some embodiments, for the kth target object in the ith picture, when the kth target object does not exist in the (i+1) th picture to the (i+j) th picture, where j is greater than or equal to 1, and the kth target object exists in the (i+j+1) th picture, the spatial position of the kth target object in each of the (i+1) th to (i+j) th pictures may be determined by taking the ith picture as a reference and using the foregoing manner of determining the spatial position of the kth target object in the (i+j) th picture, so as to obtain a set of spatial positions {P_{i+1}, …, P_{i+j}}, where P_{i+j} is the spatial position of the kth target object in the (i+j) th picture.
Meanwhile, the spatial position of the kth target object in each of the (i+1) th to (i+j) th pictures may also be determined by taking the (i+j+1) th picture as a reference, so as to obtain a set of spatial positions {P′_{i+1}, …, P′_{i+j}}, where P′_{i+j} is the spatial position of the kth target object in the (i+j) th picture.
Then, the spatial position of the kth target object in each of the (i+1) th to (i+j) th pictures is determined from the set of spatial positions {P_{i+1}, …, P_{i+j}} and the set of spatial positions {P′_{i+1}, …, P′_{i+j}}.
As a possible implementation manner, a weighted average may be performed on the two spatial positions of the kth target object in the same picture, and the obtained result is taken as the spatial position of the kth target object in that picture. For example, the spatial position of the kth target object in the (i+1) th picture may be (P_{i+1} + P′_{i+1})/2.
As another possible implementation manner, the kth target object has two spatial positions in each of the (i+1) th to (i+j) th pictures. Therefore, for each picture, the distance between two spatial positions of the kth target object in the picture can be determined, so that j distances can be obtained.
Then, the shortest distance may be selected (other distances may also be selected, but the shortest distance is preferable, because two spatial positions that are close to each other indicate that the positions obtained in the two manners agree and are therefore more accurate), and the picture corresponding to the shortest distance is taken as the target picture.
Then, a weighted average may be performed on two spatial positions of the kth target object in the target picture, and the obtained result is taken as the spatial position of the kth target object in the target picture.
Then, the spatial position of the kth target object in the ith picture can be connected with its spatial position in the target picture to obtain a target connecting line, and the spatial position of the kth target object in each picture between the ith picture and the target picture is determined on this target connecting line. For example, from the spatial position of the kth target object in the ith picture, its spatial position in the target picture, and the time interval between the ith picture and the target picture, the moving speed of the kth target object can be determined. From the moving speed and the time interval between the ith picture and any picture between the ith picture and the target picture, the moving distance of the kth target object within that time interval can be determined. Taking the spatial position of the kth target object in the ith picture as a starting point, a position point at that moving distance from the starting point can be found on the target connecting line; this position point is the spatial position of the kth target object in that picture. For determining the spatial position of the kth target object in each picture between the target picture and the (i+j+1) th picture, reference may be made to the manner of determining its spatial position in each picture between the ith picture and the target picture, which is not repeated herein.
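A minimal sketch of placing the target object on the target connecting line according to elapsed time is shown below; positions are 3-element numpy arrays and times are the moments at which the pictures appear, all of which are assumptions about the data layout rather than part of the embodiment.

```python
# Minimal sketch: interpolating the spatial position of the kth target object along
# the line between its position in the ith picture and its position in the target picture.
import numpy as np

def positions_on_line(p_i, p_target, t_i, t_target, picture_times):
    speed = (p_target - p_i) / (t_target - t_i)                  # moving speed of the target object
    return [p_i + speed * (t - t_i) for t in picture_times]      # position at each intermediate picture
```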
It should be understood that, when the N pictures are pictures extracted from the target video selected by the user according to the preset sampling frequency, or are pictures extracted from the dynamic pictures, the manner of determining the spatial position of the target object included in each picture may refer to the manner that the aforementioned N pictures are the pictures selected by the user, which is not described herein again.
To determine the time length for which each target object appears in the target video, for any target object, the duration between the moment at which the target object first appears and the moment at which the last picture finishes playing can be used as the time length of the target object in the target video.
In addition, when determining the time length of each target object appearing in the target video, the time length of the target video can be used as the time length of each target object appearing in the target video for any target object.
S4205, determining the moving speed of each target object between each adjacent picture according to the target positions of M target objects in each picture and the time intervals of each adjacent picture in N pictures.
In this embodiment, after determining the spatial position of each target object in each picture and determining the time when each picture in the N pictures appears, the moving speed of each target object between each adjacent picture can be determined from the target positions of M target objects in each picture and the time intervals when each adjacent picture in the N pictures appears based on the speed calculation formula.
For example, if the position of the target object p in the ith image is P_i = (x_i, y_i, z_i), its position in the (i+1) th image is P_{i+1} = (x_{i+1}, y_{i+1}, z_{i+1}), the moment at which the ith image appears is t_i, and the moment at which the (i+1) th image appears is t_{i+1}, then the moving speed of the target object p between the ith image and the (i+1) th image may be V_i = (P_{i+1} - P_i) / (t_{i+1} - t_i).
S4206, obtaining Q first audio data according to the M target objects, wherein Q is more than or equal to 1 and less than or equal to M, and each first audio data is related to at least one target object.
In this embodiment, the mapping relationship between preset target objects and audio data may be queried based on each target object, so as to determine the identifier of the first audio data corresponding to each target object; based on the determined identifier of each first audio data, the first audio data corresponding to each target object is screened out from a preset audio library to obtain the Q first audio data; at this time, Q=M. For example, at least one audio data may be included in the audio library.
As one possible implementation, the user may also select Q target objects from the M target objects and add associated first audio data for each of the Q target objects. The first audio data added by the user is audio data selected by the user based on their own needs; for example, the user may add the sound made by a train to an airplane, or add the sound made by an airplane to the airplane, and so on. In addition, the first audio data added by the user may be data in a local audio library, or may be data on a network, which is not limited herein.
S4207, adjusting the second time length of each first audio data to be equal to the first time length corresponding to the corresponding target object, so as to obtain Q second audio data.
In this embodiment, for any one of the first audio data, when the second time length of the first audio data is longer than the first time length of the target object corresponding to the first audio data in the target video, the data with the time length equal to the first time length may be cut out from the first audio data, so as to obtain the second audio data. For example, when the first time period is 10 seconds and the second time period is 20 seconds, the first 10 seconds of data in the first audio data may be taken as the second audio data, or the 5 th to 15 th seconds of data in the first audio data may be taken as the second audio data.
When the second duration of the first audio data is smaller than the first duration of the target object corresponding to the first audio data in the target video, a plurality of first audio data can be spliced, and data with the duration equal to the first duration is cut out from the audio data obtained after splicing, so that the second audio data is obtained.
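The cutting and splicing described in this step can be sketched as follows; the sample-rate parameter and the tiling strategy are assumptions for illustration.

```python
# Minimal sketch: cutting or splicing the first audio data so that its duration equals
# the first duration of the corresponding target object, yielding the second audio data.
import numpy as np

def fit_duration(first_audio, sr, first_duration_s):
    n_target = int(round(first_duration_s * sr))
    if len(first_audio) >= n_target:
        return first_audio[:n_target]                       # cut out a segment of the required length
    repeats = int(np.ceil(n_target / len(first_audio)))
    return np.tile(first_audio, repeats)[:n_target]         # splice copies, then cut to length
```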
S4208, respectively processing the second audio data corresponding to each target object according to the spatial position corresponding to each target object and the moving speed of each target object between each adjacent pictures, so as to obtain Q third audio data.
In this embodiment, for any one target object, the corresponding second audio data may be processed by using a head related transfer function (head related transfer function, HRTF) and a doppler algorithm based on each spatial position corresponding to the target object, a moving speed between each adjacent pictures, and an audio parameter of the corresponding second audio data, so as to obtain third audio data corresponding to the target object. Wherein the third audio data is audio data having a spatial sound effect. The audio parameters of the second audio data may include a sampling rate, a number of channels, a bit rate, etc.
For example, taking the kth target object in the ith picture as an example, it is assumed that the kth target object does not appear before the ith picture and moves in a direction away from the origin of the three-dimensional coordinate system after the ith picture. When the positions of the kth target object in the pictures before the ith picture are set to infinity, the audio data corresponding to the kth target object may not be played before the ith picture appears, may start to be played from the ith picture, and after the ith picture, the sound of the target object is controlled to gradually recede at a certain speed.
When the position of the kth target object in the pictures before the ith picture is not set to infinity (for example, when it is set at a boundary position), before the ith picture appears, the sound of the target object may be controlled to gradually approach the user at a certain speed, and after the ith picture, the sound of the target object is controlled to gradually recede at a certain speed.
In some embodiments, when the kth target object is first occurring, the sound size of the audio data corresponding to the target object may be preset, or may be determined based on the spatial position in the picture where the target object is located. For example, the mapping relationship between the preset distance and the sound size can be queried based on the distance between the spatial position of the target object in the picture and the origin of the three-dimensional coordinate system, so as to determine the sound size of the audio data corresponding to the target object in the picture.
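The sketch below is a greatly simplified stand-in for the HRTF and Doppler processing of this step: it shows only a distance-based gain and the Doppler frequency factor for one picture interval, not spatial (binaural) rendering itself. The speed of sound constant and the 1/distance gain law are illustrative assumptions.

```python
# Minimal sketch: per-interval distance gain and Doppler factor for one target object,
# with the listener at the origin of the three-dimensional coordinate system.
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed constant

def doppler_and_gain(position, velocity):
    distance = np.linalg.norm(position)                                 # distance to the listener
    radial_speed = np.dot(velocity, -position) / max(distance, 1e-6)    # speed toward the listener
    doppler_factor = SPEED_OF_SOUND / max(SPEED_OF_SOUND - radial_speed, 1e-6)
    gain = 1.0 / max(distance, 1.0)                                      # farther objects sound quieter
    return doppler_factor, gain
```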
S4209, obtaining the target video according to the Q pieces of third audio data and the N pictures.
In this embodiment, the mixing processing may be performed on the Q third audio data based on a mixing algorithm, so as to obtain spatial environment audio related to N pictures. In addition, in S4202, when audio data adapted to N pictures need to be screened out, or a certain audio data is specified by the user, the audio data may be mixed with Q pieces of third audio data to obtain spatial environmental audio related to N pictures.
After the spatial environment audio related to the N pictures is obtained, the spatial environment audio and the N pictures can be combined by using FFmpeg or JavaCV, so that a video with spatial audio is generated, i.e., the target video is obtained.
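One way to perform this combination is to invoke FFmpeg from Python, as sketched below; the file names, the one-picture-per-second frame rate, and the chosen codecs are assumptions for illustration.

```python
# Minimal sketch: combining the N pictures and the spatial environment audio into the target video.
import subprocess

subprocess.run([
    "ffmpeg", "-y",
    "-framerate", "1",                  # one picture per second, matching the earlier example
    "-i", "picture_%03d.png",           # the N pictures, numbered picture_001.png, picture_002.png, ...
    "-i", "spatial_ambience.wav",       # the mixed spatial environment audio
    "-c:v", "libx264", "-pix_fmt", "yuv420p",
    "-c:a", "aac", "-shortest",
    "target_video.mp4",
], check=True)
```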
In some embodiments, when N pictures are extracted from a certain video, the obtained spatial environment audio may be synthesized with the video corresponding to the N pictures, so as to generate a video with spatial audio effect, i.e. obtain the target video.
Therefore, the finally obtained target video is the video with the spatial sound effect, and in the playing process, the sound which is heard by the user and related to the target object moves along with the movement of the target object, so that the sound follows the picture, and people feel to be personally on the scene.
Next, based on the foregoing, the embodiment of the present application further provides a sound processing method.
By way of example, fig. 46 shows a sound processing method. It is understood that the method may be performed by any apparatus, device, platform, cluster of devices having computing, processing capabilities. As shown in fig. 46, the method may include the steps of:
S4601, acquiring a target parameter, where the target parameter includes environment information associated with the target device and/or status information of the user.
In this embodiment, the environmental information associated with the target device may include one or more of the following:
environmental data of the area where the target device is located; a case where first audio data and second audio data need to be played simultaneously in the environment where the target device is located and are played by the same device, where the first audio data is audio data played continuously within a first time period and the second audio data is audio data played sporadically within the first time period; a target position of the target device in a target space, where at least one speaker is configured in the target space; a target position, in a target space, of a picture generated by the target device, where at least one speaker is configured in the target space; or the running speed of the vehicle on which the target device is mounted.
The status information of the user associated with the target device may include one or more of the following:
a target distance between the target device and the head of the target user; a target position of the head of the target user in a target space, where at least one speaker is configured in the target space; a fatigue level of the user; first audio data and second audio data selected by the user; or a picture or video selected by the user, or audio data added by the user for a target object.
For the manner of obtaining the target parameter, reference may be made to the description in the foregoing embodiments, which is not repeated here.
S4602, processing the original audio data according to the target parameters to obtain target audio data, wherein the target audio data is matched with the environment information and/or the state information.
In this embodiment, after the target parameter is obtained, the original audio data may be processed according to the target parameter so that it matches the target parameter, thereby constructing audio data to be played that is adapted to the current environment or the current state of the user. In this way, the audio data to be played can be fused with the current environment or the current state of the user, improving the user experience. After the original audio data is processed, the target audio data may be obtained, and the target audio data may be matched with the environment information and/or the status information.
S4603, outputting the target audio data.
In this embodiment, after the target audio data is acquired, the target audio data may be output.
In this way, the target audio data is matched with the current environment or the current user state, so that the target audio data can be fused with the current environment or the current user state, and the user experience is improved.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not mean the order of execution, and the execution order of the processes should be determined by the functions and the internal logic, and should not be construed as limiting the implementation process of the embodiments in this application. In addition, in some possible implementations, each step in the foregoing embodiments may be selectively performed according to practical situations, and may be partially performed or may be performed entirely, which is not limited herein. All or part of any features of any of the embodiments of the present application may be freely, and arbitrarily combined without conflict. The combined technical scheme is also within the scope of the application.
It is appreciated that the electronic device 100 referred to in embodiments of the present application may be a cell phone, tablet computer, desktop computer, laptop computer, handheld computer, notebook computer, ultra-mobile personal computer (ultra-mobile personal computer, UMPC), netbook, cellular phone, personal digital assistant (personal digital assistant, PDA), augmented reality (augmented reality, AR) device, virtual reality (virtual reality, VR) device, artificial intelligence (artificial intelligence, AI) device, wearable device, vehicle-mounted device, smart home device, and/or smart city device, among others. Exemplary embodiments of the electronic device include, but are not limited to, electronic devices equipped with iOS, Android, Windows, HarmonyOS, or other operating systems. The specific type of the electronic device is not particularly limited in the embodiments of the present application.
The electronic device 100 according to the embodiment of the present application is described below. Fig. 47 shows a schematic configuration of the electronic device 100. Referring to fig. 47, the electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (universal serial bus, USB) interface 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, keys 190, a motor 191, an indicator 192, a camera 193, a display 194, a subscriber identity module (subscriber identification module, SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It should be understood that the illustrated structure of the embodiment of the present invention does not constitute a specific limitation on the electronic device 100. In other embodiments of the present application, electronic device 100 may include more or fewer components than shown, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units, such as: the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural network processor (neural-network processing unit, NPU), etc. Wherein the different processing units may be separate devices or may be integrated in one or more processors.
The controller can generate operation control signals according to the instruction operation codes and the time sequence signals to finish the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that the processor 110 has just used or recycled. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby improving the efficiency of the system.
In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an integrated circuit (inter-integrated circuit, I2C) interface, an integrated circuit built-in audio (inter-integrated circuit sound, I2S) interface, a pulse code modulation (pulse code modulation, PCM) interface, a universal asynchronous receiver transmitter (universal asynchronous receiver/transmitter, UART) interface, a mobile industry processor interface (mobile industry processor interface, MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (subscriber identity module, SIM) interface, and/or a universal serial bus (universal serial bus, USB) interface, among others.
The I2C interface is a bi-directional synchronous serial bus comprising a serial data line (SDA) and a serial clock line (SCL). In some embodiments, the processor 110 may contain multiple sets of I2C buses. The processor 110 may be coupled to the touch sensor 180K, charger, flash, camera 193, etc., respectively, through different I2C bus interfaces. For example: the processor 110 may be coupled to the touch sensor 180K through an I2C interface, such that the processor 110 communicates with the touch sensor 180K through an I2C bus interface to implement a touch function of the electronic device 100.
The I2S interface may be used for audio communication. In some embodiments, the processor 110 may contain multiple sets of I2S buses. The processor 110 may be coupled to the audio module 170 via an I2S bus to enable communication between the processor 110 and the audio module 170. In some embodiments, the audio module 170 may transmit an audio signal to the wireless communication module 160 through the I2S interface, to implement a function of answering a call through the bluetooth headset.
PCM interfaces may also be used for audio communication to sample, quantize and encode analog signals. In some embodiments, the audio module 170 and the wireless communication module 160 may be coupled through a PCM bus interface. In some embodiments, the audio module 170 may also transmit audio signals to the wireless communication module 160 through the PCM interface to implement a function of answering a call through the bluetooth headset. Both the I2S interface and the PCM interface may be used for audio communication. In some embodiments, the audio module 170 may be used to encode and decode audio signals. In some embodiments, the audio module 170 may also be used to perform audio processing on the audio signal, such as adjusting the gain of the audio signal, and the like.
The UART interface is a universal serial data bus for asynchronous communications. The bus may be a bi-directional communication bus. It converts the data to be transmitted between serial communication and parallel communication. In some embodiments, a UART interface is typically used to connect the processor 110 with the wireless communication module 160. For example: the processor 110 communicates with a bluetooth module in the wireless communication module 160 through a UART interface to implement a bluetooth function. In some embodiments, the audio module 170 may transmit an audio signal to the wireless communication module 160 through a UART interface, to implement a function of playing music through a bluetooth headset.
The MIPI interface may be used to connect the processor 110 to peripheral devices such as a display 194, a camera 193, and the like. The MIPI interfaces include camera serial interfaces (camera serial interface, CSI), display serial interfaces (display serial interface, DSI), and the like. In some embodiments, processor 110 and camera 193 communicate through a CSI interface to implement the photographing functions of electronic device 100. The processor 110 and the display 194 communicate via a DSI interface to implement the display functionality of the electronic device 100.
The GPIO interface may be configured by software. The GPIO interface may be configured as a control signal or as a data signal. In some embodiments, a GPIO interface may be used to connect the processor 110 with the camera 193, the display 194, the wireless communication module 160, the audio module 170, the sensor module 180, and the like. The GPIO interface may also be configured as an I2C interface, an I2S interface, a UART interface, an MIPI interface, etc.
The USB interface 130 is an interface conforming to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, or the like. The USB interface 130 may be used to connect a charger to charge the electronic device 100, and may also be used to transfer data between the electronic device 100 and a peripheral device. And can also be used for connecting with a headset, and playing audio through the headset. The interface may also be used to connect other electronic devices, such as AR devices, etc.
It should be understood that the interfacing relationship between the modules illustrated in the embodiments of the present invention is only illustrative, and is not meant to limit the structure of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also use different interfacing manners, or a combination of multiple interfacing manners in the foregoing embodiments.
The charge management module 140 is configured to receive a charge input from a charger. The charger can be a wireless charger or a wired charger. In some wired charging embodiments, the charge management module 140 may receive a charging input of a wired charger through the USB interface 130. In some wireless charging embodiments, the charge management module 140 may receive wireless charging input through a wireless charging coil of the electronic device 100. The charging management module 140 may also supply power to the electronic device through the power management module 141 while charging the battery 142.
The power management module 141 is used for connecting the battery 142, and the charge management module 140 and the processor 110. The power management module 141 receives input from the battery 142 and/or the charge management module 140 to power the processor 110, the internal memory 121, the display 194, the camera 193, the wireless communication module 160, and the like. The power management module 141 may also be configured to monitor battery capacity, battery cycle number, battery health (leakage, impedance) and other parameters. In other embodiments, the power management module 141 may also be provided in the processor 110. In other embodiments, the power management module 141 and the charge management module 140 may be disposed in the same device.
The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the electronic device 100 may be used to cover a single or multiple communication bands. Different antennas may also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed into a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
The mobile communication module 150 may provide a solution for wireless communication including 2G/3G/4G/5G, etc., applied to the electronic device 100. The mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (low noise amplifier, LNA), etc. The mobile communication module 150 may receive electromagnetic waves from the antenna 1, perform processes such as filtering, amplifying, and the like on the received electromagnetic waves, and transmit the processed electromagnetic waves to the modem processor for demodulation. The mobile communication module 150 can amplify the signal modulated by the modem processor, and convert the signal into electromagnetic waves through the antenna 1 to radiate. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be disposed in the processor 110. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be provided in the same device as at least some of the modules of the processor 110.
The modem processor may include a modulator and a demodulator. The modulator is used for modulating the low-frequency baseband signal to be transmitted into a medium-high frequency signal. The demodulator is used for demodulating the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then transmits the demodulated low frequency baseband signal to the baseband processor for processing. The low frequency baseband signal is processed by the baseband processor and then transferred to the application processor. The application processor outputs sound signals through an audio device (not limited to the speaker 170A, the receiver 170B, etc.), or displays images or video through the display screen 194. In some embodiments, the modem processor may be a stand-alone device. In other embodiments, the modem processor may be provided in the same device as the mobile communication module 150 or other functional module, independent of the processor 110.
The wireless communication module 160 may provide solutions for wireless communication including wireless local area network (wireless local area networks, WLAN) (e.g., wireless fidelity (wireless fidelity, wi-Fi) network), bluetooth (BT), global navigation satellite system (global navigation satellite system, GNSS), frequency modulation (frequency modulation, FM), near field wireless communication technology (near field communication, NFC), infrared technology (IR), etc., as applied to the electronic device 100. The wireless communication module 160 may be one or more devices that integrate at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, modulates the electromagnetic wave signals, filters the electromagnetic wave signals, and transmits the processed signals to the processor 110. The wireless communication module 160 may also receive a signal to be transmitted from the processor 110, frequency modulate it, amplify it, and convert it to electromagnetic waves for radiation via the antenna 2.
In some embodiments, antenna 1 and mobile communication module 150 of electronic device 100 are coupled, and antenna 2 and wireless communication module 160 are coupled, such that electronic device 100 may communicate with a network and other devices through wireless communication techniques. The wireless communication techniques may include the global system for mobile communications (global system for mobile communications, GSM), general packet radio service (general packet radio service, GPRS), code division multiple access (code division multiple access, CDMA), wideband code division multiple access (wideband code division multiple access, WCDMA), time division code division multiple access (time-division code division multiple access, TD-SCDMA), long term evolution (long term evolution, LTE), BT, GNSS, WLAN, NFC, FM, and/or IR techniques, among others. The GNSS may include a global satellite positioning system (global positioning system, GPS), a global navigation satellite system (global navigation satellite system, GLONASS), a beidou satellite navigation system (beidou navigation satellite system, BDS), a quasi zenith satellite system (quasi-zenith satellite system, QZSS) and/or a satellite based augmentation system (satellite based augmentation systems, SBAS).
The electronic device 100 implements display functions through a GPU, a display screen 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 194 is used to display images, videos, and the like. The display 194 includes a display panel. The display panel may employ a liquid crystal display (liquid crystal display, LCD), an organic light-emitting diode (organic light-emitting diode, OLED), an active-matrix organic light-emitting diode (active-matrix organic light-emitting diode, AMOLED), a flexible light-emitting diode (flexible light-emitting diode, FLED), a Mini LED, a Micro LED, a Micro-OLED, a quantum dot light-emitting diode (quantum dot light emitting diodes, QLED), or the like. In some embodiments, the electronic device 100 may include 1 or N display screens 194, N being a positive integer greater than 1.
The electronic device 100 may implement photographing functions through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.
The ISP is used to process data fed back by the camera 193. For example, when photographing, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electric signal, and the camera photosensitive element transmits the electric signal to the ISP for processing and is converted into an image visible to naked eyes. ISP can also optimize the noise, brightness and color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be provided in the camera 193.
The camera 193 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image onto the photosensitive element. The photosensitive element may be a charge coupled device (charge coupled device, CCD) or a Complementary Metal Oxide Semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then transferred to the ISP to be converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard RGB, YUV, or the like format. In some embodiments, electronic device 100 may include 1 or N cameras 193, N being a positive integer greater than 1.
The digital signal processor is used to process digital signals, and can process other digital signals in addition to digital image signals. For example, when the electronic device 100 selects a frequency point, the digital signal processor is used to perform Fourier transform on the frequency point energy, and the like.
Video codecs are used to compress or decompress digital video. The electronic device 100 may support one or more video codecs. In this way, the electronic device 100 may play or record video in a variety of encoding formats, such as: moving picture experts group (moving picture experts group, MPEG) 1, MPEG2, MPEG3, MPEG4, etc.
The NPU is a neural-network (NN) computing processor, and can rapidly process input information by referencing a biological neural network structure, for example, referencing a transmission mode between human brain neurons, and can also continuously perform self-learning. Applications such as intelligent awareness of the electronic device 100 may be implemented through the NPU, for example: image recognition, face recognition, speech recognition, text understanding, etc.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to enable expansion of the memory capabilities of the electronic device 100. The external memory card communicates with the processor 110 through an external memory interface 120 to implement data storage functions. For example, files such as music, video, etc. are stored in an external memory card.
The internal memory 121 may be used to store computer executable program code including instructions. The internal memory 121 may include a storage program area and a storage data area. The storage program area may store an application program (such as a sound playing function, an image playing function, etc.) required for at least one function of the operating system, etc. The storage data area may store data created during use of the electronic device 100 (e.g., audio data, phonebook, etc.), and so on. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (universal flash storage, UFS), and the like. The processor 110 performs various functional applications of the electronic device 100 and data processing by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.
The electronic device 100 may implement audio functions through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, an application processor, and the like. Such as music playing, recording, etc.
The audio module 170 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or a portion of the functional modules of the audio module 170 may be disposed in the processor 110.
The speaker 170A, also referred to as a "horn," is used to convert audio electrical signals into sound signals. The electronic device 100 may listen to music, or to hands-free conversations, through the speaker 170A.
A receiver 170B, also referred to as a "earpiece", is used to convert the audio electrical signal into a sound signal. When electronic device 100 is answering a telephone call or voice message, voice may be received by placing receiver 170B in close proximity to the human ear.
Microphone 170C, also referred to as a "mic" or "mike", is used to convert sound signals into electrical signals. When making a call or transmitting voice information, the user can speak with the mouth near the microphone 170C, inputting a sound signal to the microphone 170C. The electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C, and may implement a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device 100 may also be provided with three, four, or more microphones 170C to enable collection of sound signals, noise reduction, identification of sound sources, directional recording functions, etc.
The earphone interface 170D is used to connect a wired earphone. The earphone interface 170D may be the USB interface 130, a 3.5 mm open mobile terminal platform (open mobile terminal platform, OMTP) standard interface, or a cellular telecommunications industry association of the USA (cellular telecommunications industry association of the USA, CTIA) standard interface.
The pressure sensor 180A is used to sense a pressure signal, and may convert the pressure signal into an electrical signal. In some embodiments, the pressure sensor 180A may be disposed on the display screen 194. There are various types of pressure sensor 180A, such as a resistive pressure sensor, an inductive pressure sensor, a capacitive pressure sensor, and the like. A capacitive pressure sensor may comprise at least two parallel plates made of a conductive material. The capacitance between the electrodes changes when a force is applied to the pressure sensor 180A. The electronic device 100 determines the strength of the pressure from the change in capacitance. When a touch operation is applied to the display screen 194, the electronic device 100 detects the touch operation intensity according to the pressure sensor 180A. The electronic device 100 may also calculate the location of the touch based on the detection signal of the pressure sensor 180A. In some embodiments, touch operations that act on the same touch location but at different touch operation intensities may correspond to different operation instructions. For example: when a touch operation whose intensity is smaller than a first pressure threshold acts on the short message application icon, an instruction for viewing the short message is executed; when a touch operation whose intensity is greater than or equal to the first pressure threshold acts on the short message application icon, an instruction for creating a new short message is executed.
The gyro sensor 180B may be used to determine a motion posture of the electronic device 100. In some embodiments, the angular velocity of electronic device 100 about three axes (i.e., the x, y, and z axes) may be determined by the gyro sensor 180B. The gyro sensor 180B may be used for photographing anti-shake. For example, when the shutter is pressed, the gyro sensor 180B detects the shake angle of the electronic device 100, calculates the distance that the lens module needs to compensate according to the angle, and makes the lens counteract the shake of the electronic device 100 through reverse motion, thereby realizing anti-shake. The gyro sensor 180B may also be used for navigation and somatosensory game scenarios.
The air pressure sensor 180C is used to measure air pressure. In some embodiments, electronic device 100 calculates altitude from barometric pressure values measured by barometric pressure sensor 180C, aiding in positioning and navigation.
The magnetic sensor 180D includes a Hall sensor. The electronic device 100 may detect the opening and closing of a flip cover using the magnetic sensor 180D. In some embodiments, when the electronic device 100 is a flip phone, the electronic device 100 may detect the opening and closing of the flip according to the magnetic sensor 180D. Features such as automatic unlocking upon flip opening are then set according to the detected open or closed state of the holster or of the flip.
The acceleration sensor 180E may detect the magnitude of acceleration of the electronic device 100 in various directions (typically three axes). The magnitude and direction of gravity may be detected when the electronic device 100 is stationary. The acceleration sensor 180E may also be used to recognize the posture of the electronic device, and is applied to landscape/portrait switching, pedometers, and other applications.
A distance sensor 180F for measuring a distance. The electronic device 100 may measure the distance by infrared or laser. In some embodiments, the electronic device 100 may range using the distance sensor 180F to achieve quick focus.
The proximity light sensor 180G may include, for example, a Light Emitting Diode (LED) and a light detector, such as a photodiode. The light emitting diode may be an infrared light emitting diode. The electronic device 100 emits infrared light outward through the light emitting diode. The electronic device 100 detects infrared reflected light from nearby objects using a photodiode. When sufficient reflected light is detected, it may be determined that there is an object in the vicinity of the electronic device 100. When insufficient reflected light is detected, the electronic device 100 may determine that there is no object in the vicinity of the electronic device 100. The electronic device 100 can detect that the user holds the electronic device 100 close to the ear by using the proximity light sensor 180G, so as to automatically extinguish the screen for the purpose of saving power. The proximity light sensor 180G may also be used in holster mode, pocket mode to automatically unlock and lock the screen.
The ambient light sensor 180L is used to sense ambient light level. The electronic device 100 may adaptively adjust the brightness of the display 194 based on the perceived ambient light level. The ambient light sensor 180L may also be used to automatically adjust white balance when taking a photograph. Ambient light sensor 180L may also cooperate with proximity light sensor 180G to detect whether electronic device 100 is in a pocket to prevent false touches.
The fingerprint sensor 180H is used to collect a fingerprint. The electronic device 100 may utilize the collected fingerprint feature to unlock the fingerprint, access the application lock, photograph the fingerprint, answer the incoming call, etc.
The temperature sensor 180J is used to detect temperature. In some embodiments, the electronic device 100 performs a temperature processing strategy using the temperature detected by the temperature sensor 180J. For example, when the temperature reported by temperature sensor 180J exceeds a threshold, the electronic device 100 reduces the performance of a processor located near the temperature sensor 180J, so as to reduce power consumption and implement thermal protection. In other embodiments, when the temperature is below another threshold, the electronic device 100 heats the battery 142 to prevent the low temperature from causing the electronic device 100 to shut down abnormally. In other embodiments, when the temperature is below a further threshold, the electronic device 100 boosts the output voltage of the battery 142 to avoid abnormal shutdown caused by low temperature.
The touch sensor 180K is also referred to as a "touch panel". The touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 form a touch screen, also called a "touchscreen". The touch sensor 180K is used to detect a touch operation acting on or near it. The touch sensor may communicate the detected touch operation to the application processor to determine the touch event type. Visual output related to the touch operation may be provided through the display 194. In other embodiments, the touch sensor 180K may also be disposed on the surface of the electronic device 100 at a location different from that of the display 194.
The bone conduction sensor 180M may acquire a vibration signal. In some embodiments, the bone conduction sensor 180M may acquire a vibration signal of the vibrating bone mass of the human vocal part. The bone conduction sensor 180M may also contact the pulse of the human body to receive the blood pressure pulsation signal. In some embodiments, the bone conduction sensor 180M may also be provided in a headset to form a bone conduction headset. The audio module 170 may parse out a voice signal based on the vibration signal of the vibrating bone mass of the vocal part obtained by the bone conduction sensor 180M, so as to implement a voice function. The application processor may parse heart rate information based on the blood pressure pulsation signal acquired by the bone conduction sensor 180M, so as to implement a heart rate detection function.
In some embodiments, the electronic device 100 may process data collected by at least one sensor based on a pedestrian dead reckoning (pedestrian dead reckoning, PDR) algorithm to obtain a user's motion state, such as direction of movement, speed of movement, and so forth.
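For illustration only (not taken from this application), a single pedestrian dead reckoning update can be sketched as follows: the position estimate is advanced by a stride length along the current heading whenever a step event has been detected from accelerometer data. The function name, fixed stride length, and heading source are assumptions.

```python
import math

def pdr_update(x, y, heading_rad, step_length_m=0.7):
    """Advance an estimated (x, y) position by one detected step.

    Assumes a step event has already been detected from accelerometer peaks
    and that the heading comes from gyroscope/magnetometer fusion.
    """
    x += step_length_m * math.cos(heading_rad)
    y += step_length_m * math.sin(heading_rad)
    return x, y

# Example: three steps heading roughly north-east.
pos = (0.0, 0.0)
for heading in (0.75, 0.80, 0.78):   # headings in radians
    pos = pdr_update(*pos, heading)
print(pos)
```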
The keys 190 include a power-on key, a volume key, etc. The keys 190 may be mechanical keys or touch keys. The electronic device 100 may receive key inputs and generate key signal inputs related to user settings and function control of the electronic device 100.
The motor 191 may generate a vibration cue. The motor 191 may be used for incoming call vibration alerting as well as for touch vibration feedback. For example, touch operations acting on different applications (e.g., photographing, audio playing, etc.) may correspond to different vibration feedback effects. The motor 191 may also correspond to different vibration feedback effects by touching different areas of the display screen 194. Different application scenarios (such as time reminding, receiving information, alarm clock, game, etc.) can also correspond to different vibration feedback effects. The touch vibration feedback effect may also support customization.
The indicator 192 may be an indicator light, and may be used to indicate a charging state, a change in charge level, a message, a missed call, a notification, etc.
The SIM card interface 195 is used to connect a SIM card. The SIM card may be inserted into the SIM card interface 195 or removed from the SIM card interface 195 to enable contact with and separation from the electronic device 100. The electronic device 100 may support 1 or N SIM card interfaces, N being a positive integer greater than 1. The SIM card interface 195 may support Nano SIM cards, Micro SIM cards, and the like. The same SIM card interface 195 may be used to insert multiple cards simultaneously. The types of the plurality of cards may be the same or different. The SIM card interface 195 may also be compatible with different types of SIM cards. The SIM card interface 195 may also be compatible with external memory cards. The electronic device 100 interacts with the network through the SIM card to realize functions such as calls and data communication. In some embodiments, the electronic device 100 employs an eSIM, i.e., an embedded SIM card. The eSIM card can be embedded in the electronic device 100 and cannot be separated from the electronic device 100.
The software system of the electronic device 100 may employ a layered architecture, an event driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. In this embodiment, taking an Android system with a layered architecture as an example, a software structure of the electronic device 100 is illustrated.
Fig. 48 is a block diagram of the software structure of the electronic device 100 according to an embodiment of the present application. The layered architecture divides the software into several layers, each with a clear role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers, which are, from top to bottom, an application layer, an application framework layer, an Android runtime (Android runtime) and system libraries, and a kernel layer. The application layer may include a series of application packages.
As shown in fig. 48, the application package may include applications for cameras, gallery, calendar, phone calls, maps, navigation, WLAN, bluetooth, music, video, short messages, etc.
The application framework layer provides an application programming interface (application programming interface, API) and programming framework for application programs of the application layer. The application framework layer includes a number of predefined functions.
As shown in fig. 48, the application framework layer may include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, and the like.
The window manager is used to manage window programs. The window manager can obtain the size of the display screen, determine whether there is a status bar, lock the screen, capture the screen, and the like.
The content provider is used to store and retrieve data and make such data accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phonebooks, etc.
The view system includes visual controls, such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, a display interface including a text message notification icon may include a view displaying text and a view displaying a picture.
The telephony manager is used to provide the communication functions of the electronic device 100, such as management of call states (including connected, hung up, etc.).
The resource manager provides various resources for the application program, such as localization strings, icons, pictures, layout files, video files, and the like.
The notification manager allows the application to display notification information in the status bar, can be used to convey notification-type messages, and can disappear automatically after a short stay without user interaction. For example, the notification manager is used to notify that a download is complete, provide message alerts, and so on. The notification manager may also present notifications in the form of a chart or scroll-bar text in the status bar at the top of the system, such as notifications of applications running in the background, or present notifications on the screen in the form of a dialog window. For example, text information is prompted in the status bar, a prompt tone is emitted, the electronic device vibrates, or an indicator light blinks.
The Android runtime includes a core library and a virtual machine. The Android runtime is responsible for scheduling and management of the Android system.
The core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.
The application layer and the application framework layer run in a virtual machine. The virtual machine executes java files of the application program layer and the application program framework layer as binary files. The virtual machine is used for executing the functions of object life cycle management, stack management, thread management, security and exception management, garbage collection and the like.
The system library may include a plurality of functional modules. For example: surface manager (surface manager), media library (media library), three-dimensional graphics processing library (e.g., openGL ES), 2D graphics engine (e.g., SGL), etc.
The surface manager is used to manage the display subsystem and provides a fusion of 2D and 3D layers for multiple applications.
Media libraries support a variety of commonly used audio, video format playback and recording, still image files, and the like. The media library may support a variety of audio and video encoding formats, such as MPEG4, h.264, MP3, AAC, AMR, JPG, PNG, etc.
The three-dimensional graphic processing library is used for realizing three-dimensional graphic drawing, image rendering, synthesis, layer processing and the like.
The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is a layer between hardware and software. The inner core layer at least comprises a display driver, a camera driver, an audio driver and a sensor driver.
It should be appreciated that embodiments of the present application may be applicable to Android, iOS, HarmonyOS, or other systems.
It is to be appreciated that the processor in embodiments of the present application may be a central processing unit (central processing unit, CPU), but may also be other general purpose processors, digital signal processors (digital signal processor, DSP), application specific integrated circuits (application specific integrated circuit, ASIC), field programmable gate arrays (field programmable gate array, FPGA) or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof. The general purpose processor may be a microprocessor, but in the alternative, it may be any conventional processor.
The method steps in the embodiments of the present application may be implemented by hardware, or may be implemented by a processor executing software instructions. The software instructions may be comprised of corresponding software modules that may be stored in random access memory (random access memory, RAM), flash memory, read-only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC.
In the above embodiments, the functions may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted through a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or a wireless manner (e.g., infrared, radio, microwave). The computer readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), or the like.
It will be appreciated that the various numerical numbers referred to in the embodiments of the present application are merely for ease of description and are not intended to limit the scope of the embodiments of the present application.

Claims (48)

1. A sound processing method, the method comprising:
acquiring target parameters, wherein the target parameters comprise environment information associated with target equipment and/or state information of a user;
processing the original audio data according to the target parameters to obtain target audio data, wherein the target audio data is matched with the environment information and/or the state information;
outputting the target audio data.
2. The method of claim 1, wherein the target parameter comprises the environmental information, the environmental information comprising environmental data of an area in which the target device is located;
the processing the original audio data according to the target parameters to obtain target audio data specifically includes:
according to the environment data, N sound objects associated with the environment data are determined, wherein N is more than or equal to 1;
acquiring white noise corresponding to each sound object to obtain N pieces of audio data, wherein each piece of audio data is associated with one sound object;
And synthesizing the N pieces of audio data to obtain the target audio data.
3. The method according to claim 2, wherein the obtaining white noise corresponding to each of the sound objects, to obtain N pieces of audio data, specifically includes:
and inquiring an atomic database based on the N sound objects to obtain the N audio data, wherein the atomic database is configured with the audio data of each single object in a specific period of time.
4. The method of claim 2, wherein the environmental data includes environmental sounds;
the obtaining white noise corresponding to each sound object to obtain N audio data specifically includes:
extracting audio data of M sound objects from the environmental sound to obtain M audio data, wherein M is more than or equal to 0 and less than or equal to N;
and when M is less than N, inquiring an atomic database based on the rest sound objects in the N sound objects to obtain (N-M) audio data, wherein the atomic database is configured with the audio data of each single object in a specific period of time.
5. The method of claim 4, further comprising, after obtaining the M pieces of audio data:
And adjusting the gain of the sound channel contained in each audio data in the M audio data to a target value.
6. The method of any of claims 2-5, wherein each of the audio data expresses the same emotion as the emotion expressed by the environmental data.
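For illustration, the processing described in claims 2 to 6 may be sketched as follows, assuming a hypothetical atomic database keyed by sound-object name that returns equal-length mono clips, and assuming the synthesis is a simple sum with peak normalization; none of these names or choices are taken from the claims.

```python
import numpy as np

# Hypothetical "atomic database": one short mono clip per single sound object.
SAMPLE_RATE = 48000
atomic_database = {
    "rain":  np.random.uniform(-0.1, 0.1, SAMPLE_RATE),   # placeholder clips
    "wind":  np.random.uniform(-0.1, 0.1, SAMPLE_RATE),
    "birds": np.random.uniform(-0.1, 0.1, SAMPLE_RATE),
}

def synthesize_white_noise(sound_objects):
    """Fetch one clip per detected sound object and mix them into target audio."""
    clips = [atomic_database[name] for name in sound_objects if name in atomic_database]
    if not clips:
        return np.zeros(SAMPLE_RATE)
    mix = np.sum(clips, axis=0)
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 1.0 else mix   # normalize only if clipping would occur

target_audio = synthesize_white_noise(["rain", "wind"])
```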
7. The method of claim 1, wherein the target parameter includes the environmental information, the environmental information includes first audio data and second audio data that need to be played simultaneously in an environment where the target device is located, and the first audio data and the second audio data are both played through a same device, wherein the first audio data are audio data that are continuously played in a first period of time, and the second audio data are audio data that are accidentally played in the first period of time;
the processing the original audio data according to the target parameters to obtain target audio data specifically includes:
acquiring the second audio data to be played;
extracting third audio data to be played from the first audio data according to the second audio data, and performing target processing on the third audio data to obtain fourth audio data, wherein playing time periods corresponding to the second audio data and the fourth audio data are the same, and the target processing comprises voice elimination or voice reduction;
According to the second audio data, determining a first gain required to be adjusted for the second audio data, and adjusting the gain of each channel in the second audio data based on the first gain to obtain fifth audio data;
determining a second gain to be adjusted for the fourth audio data according to the fourth audio data or the fifth audio data, and adjusting the gain of each channel in the fourth audio data based on the second gain to obtain sixth audio data;
and obtaining the target audio data based on the fifth audio data and the sixth audio data.
8. The method of claim 7, wherein the second audio data is first data or the fourth audio data is first data;
the determining, according to the first data, a gain to be adjusted by the first data specifically includes:
acquiring audio features of the first data, the audio features including one or more of: time domain features, frequency domain features, or music theory features;
and determining the gain to be adjusted for the first data according to the audio characteristics.
9. The method according to claim 7, wherein determining the second gain to be adjusted for the fourth audio data based on the fifth audio data, comprises:
obtaining a maximum loudness value of the fifth audio data;
and determining the second gain according to the maximum loudness value of the fifth audio data and a first proportion, wherein the first proportion is the proportion between the maximum loudness value of the second audio data and the maximum loudness value of the fourth audio data.
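One possible reading of the gain determination in claims 7 to 9 is sketched below, using peak amplitude as the loudness measure and treating the first proportion as the desired peak ratio between the gain-adjusted prompt audio (the fifth audio data) and the gain-adjusted background (the sixth audio data); these interpretations and all names are assumptions, not the claimed method itself.

```python
import numpy as np

def max_loudness(x):
    """Maximum absolute sample value, used here as a simple loudness proxy."""
    return float(np.max(np.abs(x))) + 1e-12

def second_gain(fifth_audio, fourth_audio, first_ratio):
    """Scale the processed background (fourth audio data) so that the ratio between
    the prompt's peak and the background's peak equals the preset proportion."""
    target_peak = max_loudness(fifth_audio) / first_ratio
    return target_peak / max_loudness(fourth_audio)

t = np.arange(0, 0.1, 1 / 48000)
prompt = 0.8 * np.sin(2 * np.pi * 880 * t)        # stands in for the fifth audio data
background = 0.6 * np.sin(2 * np.pi * 220 * t)    # stands in for the fourth audio data
g2 = second_gain(prompt, background, first_ratio=2.0)   # background ends up ~6 dB below
```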
10. The method according to any of claims 7-9, wherein after determining the second gain, the method further comprises:
and correcting the second gain based on the first gain.
11. The method according to any of claims 7-10, wherein after determining the second gain, the method further comprises:
determining that the second gain is greater than a preset gain value;
and updating the second gain to the preset gain value.
12. The method according to any one of claims 7-11, wherein the adjusting the gain of each channel in the fourth audio data based on the second gain specifically includes:
After the fourth audio data play starts, and within a first duration of a first preset time from the moment when the fourth audio data play starts, gradually adjusting the gain of each channel in the fourth audio data to the second gain according to a first preset step length;
and gradually adjusting the gain of each channel in the fourth audio data from the second gain to a preset gain value according to a second preset step length in a second duration which is a second preset time from the time when the fourth audio data is finished.
13. The method according to any one of claims 7-11, wherein the adjusting the gain of each channel in the fourth audio data based on the second gain specifically includes:
gradually adjusting the gain of each channel in the fourth audio data to the second gain according to a first preset step length in a first duration which is a first preset time from the time when the fourth audio data starts to be played before the fourth audio data starts to be played;
and after the fourth audio data is played, and within a second duration of a second preset time from the time when the fourth audio data is played, gradually adjusting the gain of each channel in the fourth audio data from the second gain to a preset gain value according to a second preset step length.
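The gradual gain adjustment of claims 12 and 13 can be illustrated with a block-wise ramp: the gain of the processed background is stepped toward the second gain during the first duration and stepped back toward a preset gain during the second duration. The block counts and step sizes below are illustrative assumptions.

```python
def ramp_gain(current, target, step):
    """Move the per-channel gain toward target by at most one preset step."""
    if current < target:
        return min(current + step, target)
    return max(current - step, target)

# Example: duck the processed background to second_gain around the start of the
# prompt, then restore it to the preset gain near the end (claims 12-13).
second_gain, preset_gain, step = 0.4, 1.0, 0.1
gain, trace = preset_gain, []
for _ in range(8):                      # first duration: ramp toward the second gain
    gain = ramp_gain(gain, second_gain, step)
    trace.append(round(gain, 2))
for _ in range(8):                      # second duration: ramp back to the preset gain
    gain = ramp_gain(gain, preset_gain, step)
    trace.append(round(gain, 2))
print(trace)
```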
14. The method of claim 1, wherein the target parameter comprises the environmental information, the environmental information comprising a target location of the target device in a target space in which at least one speaker is configured;
the processing the original audio data according to the target parameters to obtain target audio data specifically includes:
determining the distance between the target device and N speakers to obtain N first distances, wherein N is a positive integer, and the N speakers and the target device are in the same space;
constructing a target virtual speaker group according to the N first distances and the N speakers, wherein the target virtual speaker group consists of M target virtual speakers, the M target virtual speakers are positioned on a circle taking the position of the target equipment as the center and taking the target distance in the N first distances as the radius, the value of M is equal to the number of speakers required for constructing space surround sound, the arrangement mode of the M target virtual speakers is the same as the arrangement mode of speakers required for constructing space surround sound, and each target virtual speaker is obtained by adjusting the gain of an audio signal corresponding to at least one speaker in the N speakers;
And adjusting the gain of each channel in the original audio data according to the gain required to be adjusted in the audio signals corresponding to the speakers which are in the N speakers and are associated with the target virtual speaker, so as to obtain the target audio data.
15. The method of claim 14, wherein the target distance is a minimum of the N first distances.
16. The method according to claim 14 or 15, wherein said constructing a target virtual speaker group according to said N first distances and said N speakers, specifically comprises:
determining gains to be adjusted for audio signals corresponding to all speakers except a target speaker in the N speakers by taking the target distance as a reference, so as to construct a first virtual speaker group, wherein the first virtual speaker group is a combination of speakers obtained by virtualizing the N speakers to a circle with the target device as a center and the target distance as a radius, and the target speaker is a speaker corresponding to the target distance;
and determining the target virtual speaker group according to the first virtual speaker group and the arrangement mode of speakers required for constructing the space surround sound, wherein a center speaker in the target virtual speaker group is positioned in a preset angle range in the current direction of the target equipment.
17. The method according to claim 14 or 15, wherein said constructing a target virtual speaker group according to said N first distances and said N speakers, specifically comprises:
according to the N speakers, the N first distances, the arrangement mode of the speakers required for constructing space surround sound, the direction of the target device, and the position of the target device, constructing a first virtual speaker group, wherein the first virtual speaker group comprises M first virtual speakers, and each first virtual speaker is obtained by adjusting the gain of an audio signal corresponding to at least one speaker in the N speakers;
determining second distances between the target device and each of the first virtual speakers to obtain M second distances;
and the M first virtual speakers are all virtualized onto a circle taking the position of the target equipment as the center and taking one of the M second distances as the radius, so that the target virtual speaker group is obtained.
18. The method of any of claims 14-17, wherein prior to said determining the distance between the target device and the N speakers, the method further comprises:
And screening the N speakers from the speakers configured in the space of the target equipment according to the direction of the target equipment, the position of the target equipment and the arrangement mode of the speakers required by building the space surround sound, wherein the N speakers are used for building the space surround sound.
19. The method according to any one of claims 14-18, further comprising:
determining a distance between the target device and each speaker in the target space;
determining delay time of each loudspeaker in the target space when playing audio data according to the distance between the target device and each loudspeaker in the target space;
and controlling each loudspeaker in the target space to play the audio data according to the corresponding delay time.
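A minimal sketch of the speaker virtualization and delay alignment in claims 14 to 19, assuming a simple 1/r level falloff so that a speaker farther than the target distance is boosted to sound as if it lay on the circle whose radius equals the target distance (claim 15 takes the minimum first distance), and assuming the delays align wavefront arrival via the speed of sound. The falloff model and all names are assumptions, not the claimed method.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s

def virtualize_speaker_gain(speaker_distance, target_radius):
    """Linear gain that makes a speaker at speaker_distance sound as if it sat on a
    circle of radius target_radius around the listener, under an assumed 1/r falloff."""
    return speaker_distance / target_radius

def playback_delay_s(speaker_distance, max_distance):
    """Delay the nearer speakers so all wavefronts arrive at the listener together."""
    return (max_distance - speaker_distance) / SPEED_OF_SOUND

distances = [1.2, 2.0, 3.5]                 # device-to-speaker first distances (m)
target_radius = min(distances)              # claim 15: the minimum first distance
for d in distances:
    print(round(virtualize_speaker_gain(d, target_radius), 3),
          round(playback_delay_s(d, max(distances)) * 1000, 2), "ms")
```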
20. The method of claim 1, wherein the target parameter comprises the environmental information, the environmental information comprising a target position of a picture generated by the target device in a target space in which at least one speaker is configured;
the processing the original audio data according to the target parameters to obtain target audio data specifically includes:
Constructing a virtual space matched with the target space according to the target position, wherein the volume of the virtual space is smaller than that of the target space;
constructing a target virtual speaker group in the virtual space according to the positions of the speakers in the target space, wherein the target virtual speaker group comprises at least one target virtual speaker, and each target virtual speaker is obtained by adjusting the gain of an audio signal corresponding to one speaker in the target space;
and adjusting the gain of each channel in the original audio data according to the gain required to be adjusted for the audio signal corresponding to the speaker associated with the target virtual speaker in the target space, so as to obtain the target audio data.
21. The method according to claim 20, wherein the constructing a target virtual speaker group in the virtual space according to the positions of the speakers in the target space specifically includes:
determining the position of each target virtual speaker in the target virtual speaker group in the virtual space according to the proportion between the virtual space and the target space;
And determining gains to be adjusted for audio signals corresponding to the target speakers according to the distances between the target virtual speakers and the target speakers corresponding to the target virtual speakers so as to obtain the target virtual speaker group, wherein the target speakers are speakers in the target space.
22. The method according to claim 20 or 21, characterized in that the method further comprises:
determining the distance between a picture generated by the target device and each loudspeaker in the target space;
determining delay time of each loudspeaker in the target space when playing audio data according to the distance between the picture generated by the target device and each loudspeaker in the target space;
and controlling each loudspeaker in the target space to play the audio data according to the corresponding delay time.
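For claims 20 to 22, one way to picture the virtual space construction is to scale each physical speaker position toward the picture position by the virtual-to-target space ratio and then compensate the gain for the changed distance, again under an assumed 1/r model. The coordinate convention and gain model below are illustrative assumptions.

```python
def scale_into_virtual_space(speaker_pos, picture_pos, ratio):
    """Map a physical speaker position into a smaller virtual space centred on the
    picture position, using the virtual/target space ratio (one reading of claim 21)."""
    return tuple(p + ratio * (s - p) for s, p in zip(speaker_pos, picture_pos))

def gain_for_virtual_speaker(speaker_pos, virtual_pos, picture_pos):
    """Assumed 1/r compensation: boost the physical speaker so it sounds as if it
    were at the (closer) virtual position."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return dist(speaker_pos, picture_pos) / dist(virtual_pos, picture_pos)

picture = (0.0, 0.0, 1.5)
speaker = (2.0, 3.0, 1.0)
virtual = scale_into_virtual_space(speaker, picture, ratio=0.5)
print(virtual, gain_for_virtual_speaker(speaker, virtual, picture))
```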
23. The method of claim 1, wherein the target parameter comprises state information of the user, the state information of the user comprising a target distance between the target device and a head of a target user, the head of the target user being at a target location in a target space in which at least one speaker is configured;
The processing the original audio data according to the target parameters to obtain target audio data specifically includes:
constructing a target virtual speaker group according to the target distance, the target position and the positions of the speakers in the target space, wherein the target virtual speaker group comprises at least one target virtual speaker, each target virtual speaker is obtained by adjusting the gain of an audio signal corresponding to one speaker in the target space, and each target virtual speaker is positioned on a circle taking the target position as a circle center and the target distance as a radius;
and adjusting the gain of each channel in the original audio data according to the gain required to be adjusted for the audio signal corresponding to the speaker associated with the target virtual speaker in the target space, so as to obtain the target audio data.
24. The method of claim 23, wherein after constructing a target virtual speaker group based on the target distance, the target position, and the positions of the speakers in the target space, further comprising:
constructing a first virtual speaker group according to the target virtual speaker group, wherein the first virtual speaker group is composed of M virtual speakers, the M virtual speakers are positioned on a circle which takes the target position as a center and takes the target distance as a radius, the value of M is equal to the number of speakers required by constructing space surround sound, the arrangement mode of the M virtual speakers is the same as the arrangement mode of speakers required by constructing space surround sound, and each virtual speaker in the M virtual speakers is obtained by adjusting the gain of an audio signal corresponding to at least one speaker in the target space;
The adjusting the gain of each channel in the original audio data according to the gain to be adjusted in the target space and corresponding to the speaker associated with the target virtual speaker, to obtain the target audio data specifically includes:
and adjusting the gain of each channel in the original audio data according to the gain required to be adjusted for the audio signals which are in the target space and correspond to the speakers associated with the M virtual speakers, so as to obtain the target audio data.
25. The method of claim 1, wherein the target device is located in a vehicle, the target parameter comprising the environmental information, the environmental information comprising one or more of a travel speed, a rotational speed, and an opening degree of an accelerator pedal of the vehicle;
the processing the original audio data according to the target parameters to obtain target audio data specifically includes:
determining first audio data from original audio data according to at least one of the running speed, the rotating speed and the opening of the accelerator pedal, wherein the first audio data is obtained by performing telescopic transformation on target audio particles in the original audio data based on the running speed;
Determining acceleration of the vehicle according to the running speed, adjusting gains of all sound channels in the first audio data according to the acceleration to obtain second audio data, and determining a target speed of a sound field in the vehicle moving towards a target direction;
determining a virtual position of a sound source of the target audio data according to the target speed;
according to the virtual positions, determining target gains to be adjusted of audio signals corresponding to a plurality of speakers in the vehicle, and obtaining F target gains, wherein F is more than or equal to 2;
and according to the F target gains, adjusting the gains of all the channels in the second audio data to obtain the target audio data.
26. The method of claim 25, further comprising, prior to adjusting the gain of each channel in the first audio data based on the travel speed:
determining that the change value of the running speed exceeds a preset speed threshold value; and/or
And determining that the adjustment value corresponding to the gain of each channel in the first audio data is smaller than or equal to a preset adjustment value, wherein when the target adjustment value corresponding to the gain of the target channel in the first audio data is larger than the preset adjustment value, the target adjustment value is updated to be the preset adjustment value.
27. The method according to claim 25 or 26, wherein the target parameter further comprises an acceleration duration of the vehicle, the method further comprising:
and controlling the atmosphere lamp in the vehicle to work according to the acceleration duration.
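A rough sketch of the vehicle-related processing in claims 25 to 27, reading the "telescopic transformation of target audio particles" as granular time-stretching whose rate follows the running speed, and mapping acceleration to a bounded gain adjustment in the spirit of claim 26's preset cap. The mappings, constants, and names are all assumptions.

```python
import numpy as np

def stretch_grain(grain, speed_kmh, base_speed_kmh=60.0):
    """Illustrative 'telescopic transformation': resample an audio grain so it plays
    faster as vehicle speed rises (simple linear-interpolation resampling)."""
    rate = max(0.5, min(2.0, speed_kmh / base_speed_kmh))   # clamp the playback rate
    n_out = max(1, int(len(grain) / rate))
    idx = np.linspace(0, len(grain) - 1, n_out)
    return np.interp(idx, np.arange(len(grain)), grain)

def acceleration_gain(accel_mps2, gain_per_unit=0.05, max_adjust=0.3):
    """Map acceleration to a gain adjustment bounded by a preset adjustment value."""
    return 1.0 + max(-max_adjust, min(max_adjust, gain_per_unit * accel_mps2))

grain = np.sin(2 * np.pi * 220 * np.arange(0, 0.05, 1 / 48000))  # 50 ms engine-tone grain
out = stretch_grain(grain, speed_kmh=120.0) * acceleration_gain(2.5)
```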
28. The method of claim 1, wherein the target parameter comprises status information of the user, the status information comprising a fatigue level of the user;
the processing the original audio data according to the target parameters to obtain target audio data specifically includes:
determining a target adjustment value of a first characteristic parameter according to the fatigue grade, wherein the first characteristic parameter is a characteristic parameter of original audio data to be played currently, and the first characteristic parameter comprises a tone and/or loudness;
and processing the original audio data according to the target adjustment value to obtain target audio data, wherein the value of the characteristic parameter of the target audio data is higher than the value of the first characteristic parameter.
29. The method according to claim 28, wherein said outputting said target audio data, in particular comprises:
determining a first target prompt tone according to the fatigue level;
And outputting the target audio data and the first target prompt voice according to a preset broadcasting sequence.
30. The method according to claim 28 or 29, wherein the method further comprises:
determining a second target prompt tone according to the fatigue level and the map information;
and outputting the second target prompt tone.
31. The method of any one of claims 28-30, wherein the target device is located in a vehicle;
before the outputting the target audio data, the method further includes:
and determining that the vehicle is in an automatic driving state, the road condition of the road section where the vehicle is located is lower than a preset road condition threshold value, and/or determining that the road section where the vehicle is located is a preset road section.
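The fatigue-based adjustment of claims 28 to 31 might be organized around a lookup from fatigue grade to target adjustment values, as in the sketch below; only the loudness part is applied here, a real pitch adjustment would use a pitch-shifting algorithm. The table contents and names are illustrative assumptions.

```python
# Hypothetical mapping from fatigue grade to adjustments of the first characteristic
# parameter (claim 28): a higher grade leads to brighter / louder playback.
FATIGUE_ADJUSTMENTS = {
    1: {"pitch_semitones": 0, "loudness_db": 0.0},
    2: {"pitch_semitones": 1, "loudness_db": 3.0},
    3: {"pitch_semitones": 2, "loudness_db": 6.0},
}

def target_adjustment(fatigue_grade):
    """Return the target adjustment values for the current fatigue grade."""
    return FATIGUE_ADJUSTMENTS.get(fatigue_grade, FATIGUE_ADJUSTMENTS[3])

def apply_loudness(samples, loudness_db):
    """Scale float PCM samples by the requested loudness boost."""
    return [s * 10 ** (loudness_db / 20.0) for s in samples]

adjust = target_adjustment(2)
louder = apply_loudness([0.1, -0.2, 0.05], adjust["loudness_db"])
```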
32. The method of claim 1, wherein the target parameter comprises state information of the user, the state information comprising user-selected first audio data and second audio data;
the processing the original audio data according to the target parameters to obtain target audio data specifically includes:
determining a first audio feature of the first audio data, the first audio feature comprising: loudness at various moments and/or position points of various beats;
Adjusting a second audio feature of the second audio data according to the first audio feature to obtain third audio data, wherein the second audio feature comprises at least one of loudness, tone and speed of sound;
and obtaining the target audio data according to the first audio data and the third audio data.
33. The method of claim 32, wherein the first audio feature comprises: the loudness of each instant of the first audio data, the second audio feature comprising loudness;
the adjusting the second audio feature of the second audio data according to the target audio feature specifically includes:
determining target loudness corresponding to each moment in the second audio data according to the loudness of each moment and the preset loudness proportion;
and adjusting the loudness of each moment in the second audio data to the target loudness corresponding to each moment in the second audio data.
34. The method of claim 32 or 33, wherein the target audio feature comprises: the position point of each beat, the second audio feature comprises a tone and/or a sound speed;
the adjusting the tone of the second audio data according to the target audio feature specifically includes:
For any two adjacent beats in the first audio data, determining a target rhythm corresponding to the any two adjacent beats according to the any two adjacent beats;
determining a target adjustment value of a second audio feature of the second audio data in a position point corresponding to any two adjacent beats according to the target rhythm;
and adjusting the second audio characteristics of the second audio data in the position points corresponding to any two adjacent beats according to the target adjustment value.
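Claims 32 to 34 adjust one track's features to follow another's. A minimal reading of the loudness case in claim 33 is sketched below: the per-frame RMS loudness of the first audio data, scaled by a preset proportion, becomes the target loudness envelope of the second audio data. The frame size and the RMS measure are assumptions.

```python
import numpy as np

def frame_rms(x, frame):
    """Frame-wise RMS loudness of a mono float signal."""
    n = len(x) // frame
    return np.sqrt(np.mean(x[:n * frame].reshape(n, frame) ** 2, axis=1)) + 1e-12

def match_loudness(first, second, ratio=0.5, frame=1024):
    """Adjust the second track so its per-frame loudness follows the first track's
    loudness scaled by a preset proportion (one reading of claim 33)."""
    target = frame_rms(first, frame) * ratio
    current = frame_rms(second, frame)
    n = min(len(target), len(current))
    out = second[:n * frame].reshape(n, frame) * (target[:n] / current[:n])[:, None]
    return out.ravel()

t = np.arange(48000) / 48000
adjusted = match_loudness(np.sin(2 * np.pi * 2 * t), 0.3 * np.sin(2 * np.pi * 440 * t))
```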
35. The method of claim 1, wherein the target parameter comprises status information of the user, the status information comprising one or more of: the picture, the video or the audio data added by the user for the target object;
the processing the original audio data according to the target parameters to obtain target audio data specifically includes:
determining N pictures, wherein N is more than or equal to 2;
determining target objects contained in each picture in the N pictures to obtain M target objects, wherein M is more than or equal to 1;
determining the spatial position of each target object in each picture in the N pictures, and determining the time length of each target object in a target video to obtain M first time lengths, wherein the target video is obtained based on the N pictures;
Determining the moving speed of each target object between each adjacent picture according to the space position of each target object and the moment when each adjacent picture in the N pictures appears in the target video;
q first audio data are obtained according to the M target objects, wherein Q is more than or equal to 1 and less than or equal to M, and one first audio data is at least associated with one target object;
adjusting the second time length of each first audio data to be equal to the first time length corresponding to the corresponding target object so as to obtain Q second audio data;
according to the space position of each target object and the moving speed of each target object between each adjacent picture, respectively processing the second audio data corresponding to each target object to obtain Q third audio data;
and obtaining a target video according to the Q pieces of third audio data and the N pictures, wherein the target video comprises the target audio data, and the target audio data is obtained based on the Q pieces of third audio data.
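For claim 35, one simple way to illustrate turning an object's picture position into audio is to match each first audio clip's length to the object's first duration and then pan it according to the object's horizontal position. The constant-power panning and the normalized coordinate are assumptions not found in the claims.

```python
import numpy as np

def fit_duration(audio, n_samples):
    """Trim or loop a mono clip so its length matches the object's first duration."""
    reps = int(np.ceil(n_samples / len(audio)))
    return np.tile(audio, reps)[:n_samples]

def pan_by_position(audio, x_norm):
    """Constant-power pan from the object's horizontal position x_norm in [-1, 1]
    (picture centre = 0), returning an (n, 2) stereo buffer."""
    theta = (x_norm + 1) * np.pi / 4          # map [-1, 1] -> [0, pi/2]
    return np.stack([audio * np.cos(theta), audio * np.sin(theta)], axis=1)

clip = np.sin(2 * np.pi * 330 * np.arange(0, 0.2, 1 / 48000))
stereo = pan_by_position(fit_duration(clip, 48000), x_norm=0.5)
```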
36. The method of claim 35, wherein the method further comprises:
According to the N pictures, fourth audio data matched with the N pictures are determined;
and taking the position point of at least one part of beats in the fourth audio data as the moment when at least one part of the N pictures appears, and/or taking the position point of the beginning or ending of at least one part of bars in the fourth audio data as the moment when at least one part of the N pictures appears.
37. The method according to claim 35 or 36, wherein said determining the spatial position of each of said target objects in each of said N pictures, in particular comprises:
and determining a first space position of a kth target object in an ith picture based on a preset three-dimensional coordinate system aiming at the kth target object in the ith picture, wherein the center point of the three-dimensional coordinate system is the center position of the ith picture, the ith picture is any one picture in the N pictures, and the kth target object is any one target object in the ith picture.
38. The method of claim 37, wherein the method further comprises:
determining that the kth target object does not exist in the (i+1) th picture;
And taking the first position on the first boundary of the (i+1) th picture as the second spatial position of the kth target object in the (i+1) th picture.
39. The method of claim 38, wherein the first boundary is a boundary in the target orientation of the kth target object in the ith picture, and the first position is an intersection of the first boundary and a straight line that starts at the first spatial position in the (i+1) th picture and extends in the target orientation.
40. The method of claim 38 or 39, further comprising:
determining that the kth target object does not exist in the (i+2) th picture;
determining a first moving speed and a first moving direction of the kth target object according to the first space position, the second space position and a time interval between the ith picture and the (i+1) th picture;
taking a second position outside the (i+2) th picture as a third spatial position of the kth target object in the (i+2) th picture; the second position is a position point in the first moving direction and separated from the second spatial position in the (i+2) th picture by a first target distance, and the first target distance is obtained according to the first moving speed and a time interval between the (i+1) th picture and the (i+2) th picture.
41. The method of any one of claims 37-40, further comprising:
determining that the kth target object does not exist in the (i-1) th picture, wherein i is more than or equal to 2;
and taking the third position on the second boundary of the (i-1) th picture as a fourth spatial position of the kth target object in the (i-1) th picture.
42. The method of claim 41, wherein the second boundary is a boundary in the direction opposite to the target orientation of the kth target object in the ith picture, and the third position is an intersection of the second boundary and a straight line that starts at the first spatial position in the (i-1) th picture and extends in the direction opposite to the target orientation.
43. The method of claim 41 or 42, further comprising:
determining that the kth target object does not exist in the (i-2) th picture, wherein i is more than or equal to 3;
determining a second moving speed and a second moving direction of the kth target object according to the first space position, the fourth space position and a time interval between the ith picture and the (i-1) th picture;
Taking a fourth position outside the (i-2) th picture as a fifth spatial position of the kth target object in the (i-2) th picture; wherein the fourth position is a position point in the opposite direction of the second movement direction and separated from the fourth spatial position in the (i-2) th picture by a second target distance, the second target distance being obtained according to the second movement speed and a time interval between the (i-1) th picture and the (i-2) th picture.
44. The method of any one of claims 37-43, further comprising:
determining that the kth target object does not exist in any of the (i+1) th picture to the (i+j) th picture, j is more than or equal to 2, and the kth target object exists in the (i+j+1) th picture, wherein (i+j+1) is less than or equal to N;
respectively determining the spatial positions of the kth target object in each of the (i+1) th to (i+j) th pictures based on the ith picture, to obtain a first set of spatial positions {P_(i+1), ..., P_(i+j)}, wherein P_(i+j) is the spatial position of the kth target object in the (i+j) th picture; and, based on the (i+j+1) th picture, respectively determining the spatial positions of the kth target object in each of the (i+1) th to (i+j) th pictures, to obtain a second set of spatial positions {P'_(i+1), ..., P'_(i+j)}, wherein P'_(i+j) is the spatial position of the kth target object in the (i+j) th picture;
and determining the spatial position of the kth target object in each of the (i+1) th picture to the (i+j) th picture according to the first spatial set and the second spatial set.
45. The method of claim 44, wherein determining the spatial position of the kth target object in each of the (i+1) th to (i+j) th pictures according to the first spatial set and the second spatial set, specifically comprises:
according to the first space set and the second space set, determining the distance between two space positions of the kth target object in each of the (i+1) th picture to the (i+j) th picture respectively to obtain j distances;
according to the first space set and the second space set, determining the space position of the kth target object in the (i+c) th picture, wherein the (i+c) th picture is a picture corresponding to one of the j distances, and c is more than or equal to 1 and less than or equal to j;
according to the spatial position of the kth target object in the ith picture, the spatial position of the kth target object in the (i+j+1) th picture, the spatial position of the kth target object in the (i+c) th picture, and the time when each of the ith to (i+j+1) th pictures appears in the target video, determining the spatial position of the kth target object in each of the ith to (i+c) th pictures, and determining the spatial position of the kth target object in each of the (i+c) th to (i+j+1) th pictures.
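Claims 44 and 45 bridge pictures in which the object is missing by combining positions derived from the pictures on both sides. One reading is sketched below: the frame at which the forward-derived and backward-derived candidate positions are closest becomes an anchor, and the remaining positions are linearly interpolated through the picture times. The blending of the two candidates and the choice of the minimum distance are assumptions.

```python
import numpy as np

def bridge_missing_positions(p_i, p_next, forward, backward, times):
    """'forward' holds positions extrapolated from picture i, 'backward' those from
    picture i+j+1; the frame where the two candidate tracks are closest becomes an
    anchor, and the rest is linear interpolation over the picture times."""
    forward, backward = np.asarray(forward, float), np.asarray(backward, float)
    gaps = np.linalg.norm(forward - backward, axis=1)     # the j distances
    c = int(np.argmin(gaps))                               # anchor frame index
    anchor = (forward[c] + backward[c]) / 2                # assumed blend of candidates
    t_i, t_c, t_next = times[0], times[1 + c], times[-1]
    out = []
    for t in times[1:-1]:
        if t <= t_c:
            w = (t - t_i) / (t_c - t_i)
            out.append((1 - w) * np.asarray(p_i) + w * anchor)
        else:
            w = (t - t_c) / (t_next - t_c)
            out.append((1 - w) * anchor + w * np.asarray(p_next))
    return out

positions = bridge_missing_positions(
    p_i=[0, 0, 0], p_next=[4, 0, 0],
    forward=[[1, 0, 0], [2, 0.5, 0]], backward=[[1.5, 0, 0], [2.5, 0, 0]],
    times=[0.0, 0.5, 1.0, 1.5],
)
```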
46. An electronic device, comprising:
at least one memory for storing a program;
at least one processor for executing the programs stored in the memory;
wherein the processor is configured to perform the method of any of claims 1-45 when the program stored by the memory is executed.
47. A computer readable storage medium storing a computer program which, when run on an electronic device, causes the electronic device to perform the method of any one of claims 1-45.
48. A computer program product, characterized in that the computer program product, when run on an electronic device, causes the electronic device to perform the method of any of claims 1-45.
CN202210727150.7A 2022-06-24 2022-06-24 Sound processing method and electronic equipment Pending CN117334207A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210727150.7A CN117334207A (en) 2022-06-24 2022-06-24 Sound processing method and electronic equipment
PCT/CN2023/099912 WO2023246563A1 (en) 2022-06-24 2023-06-13 Sound processing method and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210727150.7A CN117334207A (en) 2022-06-24 2022-06-24 Sound processing method and electronic equipment

Publications (1)

Publication Number Publication Date
CN117334207A true CN117334207A (en) 2024-01-02

Family

ID=89277929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210727150.7A Pending CN117334207A (en) 2022-06-24 2022-06-24 Sound processing method and electronic equipment

Country Status (2)

Country Link
CN (1) CN117334207A (en)
WO (1) WO2023246563A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118061906A (en) * 2024-04-17 2024-05-24 深圳唯创知音电子有限公司 Audio playing method, device, equipment and medium for automobile

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105392102B (en) * 2015-11-30 2017-07-25 武汉大学 Three-dimensional sound signal generation method and system for aspherical loudspeaker array
JP6461850B2 (en) * 2016-03-31 2019-01-30 株式会社バンダイナムコエンターテインメント Simulation system and program
CN106657648A (en) * 2016-12-28 2017-05-10 上海斐讯数据通信技术有限公司 Mobile terminal for preventing fatigue driving and realization method thereof
CN107705778B (en) * 2017-08-23 2020-09-15 腾讯音乐娱乐(深圳)有限公司 Audio processing method, device, storage medium and terminal
US11485231B2 (en) * 2019-12-27 2022-11-01 Harman International Industries, Incorporated Systems and methods for providing nature sounds
CN113271380A (en) * 2020-02-14 2021-08-17 斑马智行网络(香港)有限公司 Audio processing method and device
CN114040318A (en) * 2021-11-02 2022-02-11 海信视像科技股份有限公司 Method and equipment for playing spatial audio
CN114140986A (en) * 2021-11-23 2022-03-04 奇瑞汽车股份有限公司 Fatigue driving early warning method, system and storage medium

Also Published As

Publication number Publication date
WO2023246563A9 (en) 2024-02-22
WO2023246563A1 (en) 2023-12-28

Similar Documents

Publication Publication Date Title
CN108538311B (en) Audio classification method, device and computer-readable storage medium
WO2020151387A1 (en) Recommendation method based on user exercise state, and electronic device
CN112397062A (en) Voice interaction method, device, terminal and storage medium
CN113873379B (en) Mode control method and device and terminal equipment
CN113838490B (en) Video synthesis method and device, electronic equipment and storage medium
CN109003621B (en) Audio processing method and device and storage medium
CN110956971B (en) Audio processing method, device, terminal and storage medium
CN114189790B (en) Audio information processing method, electronic device, system, product and medium
WO2023246563A1 (en) Sound processing method and electronic device
CN112784174A (en) Method, device and system for determining pose
CN113554932B (en) Track playback method and device
CN111081275B (en) Terminal processing method and device based on sound analysis, storage medium and terminal
EP4203447A1 (en) Sound processing method and apparatus thereof
CN113742460B (en) Method and device for generating virtual roles
CN115641867B (en) Voice processing method and terminal equipment
WO2023169448A1 (en) Method and apparatus for sensing target
WO2023030098A1 (en) Video editing method, electronic device, and storage medium
CN115359156A (en) Audio playing method, device, equipment and storage medium
CN111722896B (en) Animation playing method, device, terminal and computer readable storage medium
CN114445522A (en) Brush effect graph generation method, image editing method, device and storage medium
CN114449393A (en) Sound enhancement method, earphone control method, device and earphone
CN117133311B (en) Audio scene recognition method and electronic equipment
CN118264867A (en) Video generation method and device
CN117676448A (en) Audio playing method, system and related device
CN118101988A (en) Video processing method, system and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination