CN115705839A - Voice playing method and device, computer equipment and storage medium - Google Patents

Voice playing method and device, computer equipment and storage medium

Info

Publication number
CN115705839A
CN115705839A (application CN202110818669.1A)
Authority
CN
China
Prior art keywords
voice
sound effect
target
sound
template
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110818669.1A
Other languages
Chinese (zh)
Inventor
梁俊斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110818669.1A priority Critical patent/CN115705839A/en
Publication of CN115705839A publication Critical patent/CN115705839A/en
Pending legal-status Critical Current

Landscapes

  • Stereophonic System (AREA)

Abstract

The application relates to a voice playing method and device, computer equipment and a storage medium. The method comprises the following steps: acquiring original voice to be played, and acquiring environmental sound in the playing environment where the playing terminal of the original voice is currently located; performing scene recognition on the environmental sound to obtain the current acoustic scene where the playing terminal is located; acquiring a target spatial sound effect template matched with the current acoustic scene, the target spatial sound effect template being matched with a target voice interaction mode, which is the voice interaction mode expected by the sound receiving object in the current acoustic scene; and processing the original voice according to the target sound effect parameters in the target spatial sound effect template to obtain a target voice, and playing the target voice on the playing terminal. By adopting the method, the listening quality of the played voice can be improved.

Description

Voice playing method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of signal processing technologies, and in particular, to a method and an apparatus for playing a voice, a computer device, and a storage medium.
Background
With the development of signal processing technology, computer equipment can process voice signals in various ways; for example, a spatial sound effect can be added to the voice to be played so that users hear sound with a stronger sense of stereo imaging and spatial layering.
In conventional technology, when a spatial sound effect is added to the voice to be played, the voice is usually processed in a fixed manner to obtain the target voice carrying the spatial sound effect. As a result, the played voice often fails to meet the listening needs of the sound receiving object, and the listening quality is poor.
Disclosure of Invention
In view of the above, it is necessary to provide a voice playing method, apparatus, computer device and storage medium capable of improving the listening quality of the played voice.
A method of voice playback, the method comprising: acquiring original voice to be played, and acquiring environmental sound in a playing environment where a playing terminal of the original voice is currently located; performing scene recognition on the environmental sound to obtain a current acoustic scene where the playing terminal is located; acquiring a target space sound effect template matched with the current acoustic scene, wherein the target space sound effect template is matched with a target voice interaction mode, and the target voice interaction mode is an expected voice interaction mode of a sound receiving object in the current acoustic scene; and processing the original voice to obtain a target voice according to the target sound effect parameters in the target spatial sound effect template so as to play the target voice in the playing terminal.
A voice playback apparatus, the apparatus comprising: the voice acquisition module is used for acquiring original voice to be played and acquiring the environmental sound in the playing environment where the playing terminal of the original voice is currently located; the scene recognition module is used for carrying out scene recognition on the environmental sound to obtain a current acoustic scene where the playing terminal is located; the template acquisition module is used for acquiring a target spatial sound effect template matched with the current acoustic scene, the target spatial sound effect template is matched with a target voice interaction mode, and the target voice interaction mode is an expected voice interaction mode of a sound receiving object in the current acoustic scene; and the voice processing module is used for processing the original voice to obtain a target voice according to the target sound effect parameters in the target spatial sound effect template so as to play the target voice in the playing terminal.
In some embodiments, the target spatial sound effect template comprises a target sound orientation parameter sequence corresponding to a dynamic interaction position relationship; the voice processing module is further used for dividing the original voice into as many voice segments as there are parameters in the target sound orientation parameter sequence; determining, for each voice segment, the corresponding target sound orientation parameter in the target sound orientation parameter sequence according to the order of that voice segment in the original voice; and processing each voice segment according to its target sound orientation parameter to obtain processed voice segments, the processed voice segments forming the target voice in voice order.
In some embodiments, the candidate spatial sound effect template set includes candidate spatial sound effect templates with fixed voice interaction position relationship; and the template acquisition module is also used for selecting a candidate spatial sound effect template with a fixed voice interaction position relation from the candidate spatial sound effect template set as a target spatial sound effect template matched with the current acoustic scene when the current voice interaction position relation corresponding to the current acoustic scene is a fixed interaction position relation.
In some embodiments, the expected voice interaction position relationship comprises an expected voice interaction distance; the template acquisition module is further used for, when the current voice interaction position relationship corresponding to the current acoustic scene is a fixed interaction position relationship, selecting from the candidate spatial sound effect template set a candidate spatial sound effect template whose sound effect parameters correspond to a fixed sound distance that matches the expected voice interaction distance, and using it as the target spatial sound effect template matched with the current acoustic scene.
In some embodiments, the desired voice interaction manner comprises a desired voice interaction positional relationship between a sound receiving object and a sound emitting object; the target sound effect parameters comprise position relation sound effect parameters matched with the expected voice interaction position relation; the voice processing module is further used for performing voice processing on the original voice by using the position relation sound effect parameters in the target space sound effect template to obtain target voice, so that the target voice is matched with the expected voice interaction position relation.
In some embodiments, the desired voice interaction positional relationship comprises a desired voice interaction distance and a desired interaction orientation; the position relation sound effect parameters comprise orientation related sound effect parameters and distance related sound effect parameters; the voice processing module is also used for processing the direction of the original voice by using the direction-related sound effect parameter and processing the sound pressure of the original voice by using the distance-related sound effect parameter to obtain target voice; such that the target voice's orientation matches the desired interaction orientation and the target voice's acoustic pressure matches the desired voice interaction distance.
In some embodiments, the target spatial sound effect template is further matched with a target voice interaction effect, the target voice interaction effect is an expected voice interaction effect of a sound receiving object in the current acoustic scene, and the target sound effect parameters include a voice effect adjustment parameter matched with the target voice interaction effect; the voice processing module is also used for processing the original voice by using the position relation sound effect parameter in the target space sound effect template and the voice effect adjusting parameter to obtain target voice.
In some embodiments, the scene recognition module is further configured to obtain a plurality of sound sub-segments corresponding to the environmental sound, and perform feature extraction on the sound sub-segments to obtain sub-segment features; identifying and obtaining a segment acoustic scene corresponding to the sound sub-segment based on the sub-segment characteristics; counting the segment acoustic scenes corresponding to the sound sub-segments to obtain the number of scenes corresponding to each segment acoustic scene; and selecting the acoustic scene with the largest scene number as the current acoustic scene where the playing terminal is located.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program: acquiring original voice to be played, and acquiring environmental sound in a playing environment where a playing terminal of the original voice is currently located; performing scene recognition on the environmental sound to obtain a current acoustic scene where the playing terminal is located; acquiring a target spatial sound effect template matched with the current acoustic scene, wherein the target spatial sound effect template is matched with a target voice interaction mode, and the target voice interaction mode is an expected voice interaction mode of a sound receiving object in the current acoustic scene; and processing the original voice to obtain a target voice according to the target sound effect parameters in the target spatial sound effect template so as to play the target voice in the playing terminal.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of: acquiring original voice to be played, and acquiring environmental sound in a playing environment where a playing terminal of the original voice is currently located; carrying out scene recognition on the environmental sound to obtain a current acoustic scene where the playing terminal is located; acquiring a target spatial sound effect template matched with the current acoustic scene, wherein the target spatial sound effect template is matched with a target voice interaction mode, and the target voice interaction mode is an expected voice interaction mode of a sound receiving object in the current acoustic scene; and processing the original voice to obtain a target voice according to the target sound effect parameters in the target spatial sound effect template so as to play the target voice in the playing terminal.
According to the voice playing method, apparatus, computer device and storage medium, the original voice to be played is acquired, together with the environmental sound in the playing environment where the playing terminal of the original voice is currently located; scene recognition is performed on the environmental sound to identify the current acoustic scene where the playing terminal is located; a target spatial sound effect template matched with the current acoustic scene is acquired; and the original voice is processed according to the target sound effect parameters in the target spatial sound effect template to obtain the target voice, which is played on the playing terminal.
Drawings
FIG. 1 is a diagram of an exemplary environment in which a voice playback method may be implemented in some embodiments;
FIG. 2 is a flow chart illustrating a voice playback method according to some embodiments;
FIG. 3 is a schematic flow chart illustrating processing of the original speech to obtain a target speech in some embodiments;
FIG. 4 is a schematic diagram of the generation of binaural speech in some embodiments;
FIG. 5 is a schematic diagram of a reverberator in some embodiments;
FIG. 6 is a schematic diagram of the structure of an acoustic scene recognition model in some embodiments;
FIG. 7 is a flow diagram illustrating a method for voice playback in some embodiments;
FIG. 8 is a diagram illustrating a selection of a spatial sound effect template prompt in a dialog interface in some embodiments;
FIG. 9 is a block diagram of a voice playback device in some embodiments;
FIG. 10 is a block diagram that illustrates the internal components of a computing device in some embodiments.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely intended to illustrate the present application and are not intended to limit it.
The voice playing method provided by the application can be applied to the application environment shown in fig. 1. The first terminal 102 and the second terminal 106 communicate with the server 104 through a network. The first terminal 102 may be a terminal corresponding to a sound emitting object, the second terminal 106 may be a terminal corresponding to a sound receiving object, and the second terminal 106 is used for performing voice playing, and thus may be referred to as a playing terminal. The first terminal 102 collects sound of a sound-emitting object to obtain original voice, the original voice is sent to the second terminal 106 through the server 104, the second terminal 106 further collects environmental sound in a current playing environment, and the playing terminal or the server can process the original voice to obtain target voice by combining the environmental sound.
Taking the case where the playing terminal processes the original voice as an example, the playing terminal performs scene recognition on the environmental sound to obtain the current acoustic scene where it is located, acquires a target spatial sound effect template matched with the current acoustic scene, and processes the original voice according to the target sound effect parameters in the target spatial sound effect template to obtain the target voice. The target spatial sound effect template is matched with a target voice interaction mode, which is the voice interaction mode expected by the sound receiving object in the current acoustic scene. The target voice obtained by the playing terminal may be stereo voice with the spatial sound effect added, and the playing terminal may play the target voice through an earphone or a combination of two or more loudspeakers.
The first terminal 102 and the second terminal 106 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 104 may be implemented by an independent server or a server cluster composed of a plurality of servers.
In some embodiments, as shown in fig. 2, a voice playing method is provided, which is described by taking the method as an example applied to the second terminal 106 in fig. 1, i.e. a playing terminal, and includes the following steps:
step 202, obtaining an original voice to be played, and obtaining an environmental sound in a playing environment where a playing terminal of the original voice is currently located.
The original voice to be played refers to the voice that the playing terminal needs to play. It may be of various types, including voice in audio and video, voice in online games, voice in a call, and the like. The current playing environment of the playing terminal refers to the environment corresponding to the physical place where the playing terminal is currently located; for example, if the playing terminal is currently in a forest, the environment corresponding to the forest is its current playing environment, and if it is currently in a bar, the environment corresponding to the bar is its current playing environment. The environmental sound may be all of the sound in the playing environment, or a representative portion of it; for example, the environmental sound corresponding to the forest may be human voices, bird calls, flowing water, and so on in the forest, and the environmental sound corresponding to the bar may be human voices, music, object impacts, and so on in the bar.
Specifically, the playing terminal may obtain an original voice to be played from a local server or a server, and when the original voice needs to be played, collect an environmental sound in a playing environment where the playing terminal is currently located.
In some embodiments, the playing terminal is installed with an application program capable of playing voice, and the application program may be a real-time call application program, so that the playing terminal obtains the real-time voice of a call object in the real-time call process to obtain an original voice to be played, collects an environmental sound in a current playing environment, and processes the original voice in combination with the environmental sound to obtain a target voice.
In some embodiments, the application program installed in the play terminal and capable of playing the voice may also be an audio/video entertainment application, so that in the process of playing the audio or the video, the play terminal determines the audio in the played audio or the video as the original voice to be played, collects the environmental sound in the current playing environment, and processes the original voice in combination with the environmental sound to obtain the target voice.
In some embodiments, the application installed in the play terminal and capable of playing the voice may also be an instant messaging application, the application may receive a voice-type message, may display a play control of the message after receiving the voice-type message, and when receiving a trigger operation of a user on the play control, determine the voice in the message as an original voice to be played, collect an environmental sound in a current play environment, further process the original voice in combination with the environmental sound to obtain a target voice, and play the target voice obtained by the processing.
And step 204, performing scene recognition on the environmental sound to obtain a current acoustic scene where the playing terminal is located.
An acoustic scene is a scene classified according to sound; here, the acoustic scene is determined by the environment in which the playing terminal is located. In some embodiments, since the sounds of different venues generally differ, acoustic scenes may be divided according to at least one of venue category, sound level, or environmental atmosphere; for example, they may be divided into subway scenes, supermarket scenes, bar scenes, seaside scenes, and the like. In other embodiments, considering that sounds in different places often share certain characteristics, acoustic scenes can be defined by the actual characteristics of the sound; for example, since places such as bars and shopping malls are usually noisy while a forest is quiet, acoustic scenes can be divided into noisy scenes, quiet scenes, and ordinary scenes that fall between the two, an ordinary scene being neither noisy nor quiet.
Specifically, the playing terminal can perform scene recognition on the environmental sound in a machine-learning-based manner: audio features are extracted from the environmental sound and input into a trained acoustic scene recognition model, the model performs scene recognition on the environmental sound, and the resulting scene recognition result indicates the current acoustic scene, so that the playing terminal knows which acoustic scene it is currently in.
The audio feature may be the power spectrum of the environmental sound or its mel-frequency cepstral coefficients. The acoustic scene recognition model is a machine learning model that can be used for acoustic scene recognition, that is, for classifying acoustic scenes. The scene recognition model may specifically be a neural network classification model, such as a Recurrent Neural Network (RNN), a Convolutional Neural Network (CNN), a Long Short-Term Memory network (LSTM), a BiLSTM, a Gated Recurrent Unit (GRU), a BiGRU, and the like. A CNN is a type of feed-forward neural network with a deep structure that contains convolution computations. An RNN is a recurrent neural network that takes sequence data as input, recurses along the direction in which the sequence evolves, and chains all of its nodes (recurrent units) together. An LSTM is a time-recurrent neural network designed specifically to address the long-term dependency problem of general RNNs; all RNNs take the form of a chain of repeated neural network modules, and a forward LSTM and a backward LSTM are combined into a BiLSTM. A GRU is a kind of RNN that, like the LSTM, was proposed to address long-term memory and back-propagation gradient problems; a forward GRU and a backward GRU are combined into a BiGRU.
In some embodiments, the acoustic scene recognition model may be deployed locally at the play terminal, and the play terminal may acquire the acoustic scene recognition model from the local storage and input the extracted audio features into the acoustic scene recognition model, so that an acoustic scene recognition result may be obtained quickly. In other embodiments, in order to save the storage space of the playing terminal, the acoustic scene recognition model may also be deployed in the server, the playing terminal may send a scene recognition request carrying the audio feature to the server after obtaining the audio feature, the server inputs the audio feature in the scene recognition request into the acoustic scene recognition model, and returns the obtained scene recognition result to the playing terminal.
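As a concrete illustration of this step, the following Python sketch combines the feature extraction described above with the sub-segment majority vote mentioned in the apparatus summary; the `scene_model` classifier, the one-second sub-segment length and the 13-coefficient MFCC feature are illustrative assumptions, not values taken from the application.

```python
from collections import Counter

import numpy as np
import librosa  # used here only for MFCC feature extraction


def recognize_acoustic_scene(ambient, sr, scene_model, subsegment_seconds=1.0):
    """Classify each sub-segment of the environmental sound, then majority-vote."""
    hop = int(subsegment_seconds * sr)
    votes = []
    for start in range(0, len(ambient) - hop + 1, hop):
        sub = np.asarray(ambient[start:start + hop], dtype=np.float32)
        # 13 MFCCs per frame, averaged over time, as the sub-segment feature
        mfcc = librosa.feature.mfcc(y=sub, sr=sr, n_mfcc=13)
        votes.append(scene_model.predict(mfcc.mean(axis=1)))
    # The scene predicted for the most sub-segments is the current acoustic scene
    return Counter(votes).most_common(1)[0][0] if votes else "ordinary"
```

Here `scene_model` stands in for the trained CNN/RNN-style classifier described above and could live either on the playing terminal or on the server.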
And step 206, acquiring a target spatial sound effect template matched with the current acoustic scene, wherein the target spatial sound effect template is matched with a target voice interaction mode, and the target voice interaction mode is an expected voice interaction mode of a sound receiving object in the current acoustic scene.
A spatial sound effect is an effect achieved by processing audio with certain techniques so that a user hears sound with a stronger sense of stereo imaging and spatial layering. For example, when the auditory scene of a real environment is reproduced through an earphone or a combination of two or more loudspeakers, the listener (that is, the sound receiving object) can clearly perceive the direction, distance and movement trajectory of different acoustic objects, feel enveloped by the sound from all directions, and experience an immersive sense of being in the real environment.
A spatial sound effect template is a template for performing spatial sound effect processing on voice, and contains the various sound effect parameters used for that processing. Spatial sound effect templates can be configured by professionals, and different templates are matched with different voice interaction modes. A voice interaction mode is the way in which the sound receiving object receives the sound of the sound emitting object, and includes at least one of the positional relationship between the two interacting parties and the loudness of the sound. A spatial sound effect template matching a voice interaction mode means that, after processing with that template, the played voice is consistent with the matched interaction mode; for example, if the voice interaction mode requires that the spatial distance between the sound emitting object and the sound receiving object be neither too far nor too close, then the matched spatial sound effect template, after processing the voice, makes the played voice sound neither too far away nor too close.
The sound receiving object refers to an object for receiving sound, and the sound receiving object may be a user corresponding to the playing terminal. The sound emission object may be an object that emits sound. In voice communication, the sound-emitting object is a communication object in the voice communication process, while in audio-video entertainment, the sound-emitting object may be a computer device that emits sound, for example, when music is played through a play terminal, the sound-emitting object may be a play terminal. The desired voice interaction mode of the sound receiving object is a voice interaction mode desired by the sound receiving object, and includes at least one of a positional relationship between both the desired interaction and a magnitude of a desired sound.
Since the sound receiving object has different listening experience requirements in different environments, its expected voice interaction mode differs across acoustic scenes. For example: when the sound receiving object is in a very noisy environment, such as a bar, the listener is more likely to expect close-to-the-ear or very short-range interaction, so that ambient noise does not interfere with listening; when the sound receiving object is in a pleasant, quiet environment, such as a grove, it expects the interaction to be natural and unconstrained, with the other party moving about freely; and when the sound receiving object is in an open environment, such as a court, it expects the sound it hears to carry a slight reverberation effect that matches the surroundings.
Specifically, one or more spatial sound effect templates may be configured for each type of acoustic scene; that is, for each type of acoustic scene, a correspondence between the acoustic scene and spatial sound effect templates is established. After the playing terminal recognizes the current acoustic scene it is in, it selects, according to this correspondence, the spatial sound effect template matched with the current acoustic scene from a plurality of candidate spatial sound effect templates and uses it as the target spatial sound effect template. Because the target spatial sound effect template is matched with the current acoustic scene, it is matched with the target voice interaction mode, namely the voice interaction mode expected by the sound receiving object in the current acoustic scene.
In some embodiments, the playing terminal stores the correspondence between acoustic scene identifiers and spatial sound effect template identifiers; after recognizing the current acoustic scene, the playing terminal obtains the target spatial sound effect template matched with the current acoustic scene according to this correspondence between identifiers. In other embodiments, the correspondence between acoustic scene identifiers and spatial sound effect template identifiers may instead be stored in the server; after the playing terminal recognizes the current acoustic scene, it sends the acoustic scene identifier to the server, and the server obtains the target spatial sound effect template matched with the current acoustic scene according to the correspondence between identifiers and returns it to the playing terminal.
In some embodiments, when multiple spatial sound effect templates match the current acoustic scene, the playing terminal may randomly select one of them as the target spatial sound effect template. In other embodiments, when multiple spatial sound effect templates match the current acoustic scene, the playing terminal may instead display a list of the matched templates, showing their names and brief descriptions, provide a selection control for the user to choose a spatial sound effect template, and determine the template selected by the user as the target spatial sound effect template.
In some embodiments, a plurality of candidate spatial sound effect templates may be configured in advance, including at least one of an ear-contact communication template, a roaming communication template, a lecture template, a whimsical communication template, a surround template, or a fly-in fly-out template. The ear-contact communication template means that the sound emitting object communicates at close range, near the ear of the sound receiving object. The roaming communication template means that the sound emitting object stays within a certain distance range and communicates while moving slowly along a random or preset trajectory. The lecture template means that the sound emitting object is at a medium to long distance, and its voice is resonant and carries a certain reverberation effect. The whimsical communication template means that the position of the sound emitting object is not fixed and its movement trajectory is random, for example: one sentence comes from the front left of the sound receiving object, the next from behind it, and the one after that right next to its ear, producing a surprising listening experience. The surround template means that the sound emitting object keeps a certain distance from the sound receiving object and rotates 360 degrees around it in the horizontal plane while communicating. The fly-in fly-out template means that the sound emitting object approaches the sound receiving object rapidly from far away, or moves away rapidly from a position close to it.
In some embodiments, when the candidate spatial sound effect templates include the ear-contact communication template, the roaming communication template, the lecture template, the whimsical communication template, the surround template, and the fly-in fly-out template, the acoustic scenes may be divided into three categories: noisy scenes, quiet scenes, and ordinary scenes. The noisy scene corresponds to the ear-contact communication template; the quiet scene corresponds to the whimsical communication template, the surround template, and the fly-in fly-out template; and the ordinary scene corresponds to the roaming communication template and the lecture template.
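A minimal sketch of this scene-to-template correspondence follows; the three scene categories and the template names mirror the text above, while the dictionary layout, the identifier strings and the random tie-breaking are illustrative assumptions.

```python
import random

SCENE_TO_TEMPLATES = {
    "noisy":    ["ear_contact_communication"],
    "quiet":    ["whimsical_communication", "surround", "fly_in_fly_out"],
    "ordinary": ["roaming_communication", "lecture"],
}


def pick_target_template(current_scene, user_choice=None):
    """Return the target spatial sound effect template identifier for the scene."""
    candidates = SCENE_TO_TEMPLATES[current_scene]
    if user_choice in candidates:      # template chosen by the user from the displayed list
        return user_choice
    return random.choice(candidates)   # otherwise pick one of the matching templates at random
```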
In some embodiments, the target spatial sound effect template is further matched with a target voice interaction effect, where the target voice interaction effect is an expected voice interaction effect of a sound receiving object in the current acoustic scene.
And step 208, processing the original voice to obtain a target voice according to the target sound effect parameters in the target spatial sound effect template so as to play the target voice in the playing terminal.
Sound effect parameters are parameters used to perform sound effect processing on the voice. They include parameters corresponding to the voice interaction mode, specifically position relation sound effect parameters, which comprise orientation-related sound effect parameters and distance-related sound effect parameters: an orientation-related sound effect parameter may be a sound orientation, and a distance-related sound effect parameter may be a sound distance. The sound effect parameters further include parameters corresponding to the voice interaction effect, specifically parameters for adjusting the sound effect, such as reverberation parameters.
It will be appreciated that the target sound effect parameters include position relation sound effect parameters matched with the expected voice interaction mode, and a voice effect adjustment parameter matched with the expected voice interaction effect.
Specifically, after the playing terminal acquires a target spatial sound effect template matched with the current acoustic scene, the playing terminal processes the original voice according to a target sound effect parameter in the target spatial sound effect template to obtain target voice, and plays the target voice.
In some embodiments, processing the original voice according to the target sound effect parameters in the target spatial sound effect template to obtain the target voice may specifically be: performing voice processing on the original voice according to the position relation sound effect parameters among the target sound effect parameters, so that the target voice matches the expected voice interaction mode.
In some embodiments, processing the original voice according to the target sound effect parameters in the target spatial sound effect template to obtain the target voice may specifically be: performing voice processing on the original voice according to the voice effect adjustment parameter among the target sound effect parameters, so that the target voice matches the expected voice interaction effect.
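The following sketch illustrates one plausible way to apply the two kinds of target sound effect parameters to a mono waveform; the inverse-distance gain and the single-echo stand-in for the voice effect adjustment (reverberation) parameter are assumptions for illustration only, since the application does not fix these formulas here.

```python
import numpy as np


def apply_sound_effect_params(voice, sr, distance_m, ref_distance_m=1.0,
                              reverb_mix=0.0, reverb_delay_s=0.03):
    """Render a fixed interaction distance and a crude reverberation-like adjustment."""
    # Distance-related parameter: farther sources are played back more quietly (1/d law)
    gain = ref_distance_m / max(distance_m, 0.1 * ref_distance_m)
    out = voice * gain
    # Voice effect adjustment parameter: a single attenuated echo as a reverberation stand-in
    if reverb_mix > 0.0:
        delay = max(1, int(reverb_delay_s * sr))
        echo = np.zeros_like(out)
        echo[delay:] = out[:-delay]
        out = (1.0 - reverb_mix) * out + reverb_mix * echo
    return np.clip(out, -1.0, 1.0)
```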
In the above voice playing method, the original voice to be played is acquired, together with the environmental sound in the playing environment where the playing terminal is currently located; scene recognition is performed on the environmental sound to identify the current acoustic scene where the playing terminal is located; a target spatial sound effect template matched with the current acoustic scene is acquired; and the original voice is processed according to the target sound effect parameters in the target spatial sound effect template to obtain the target voice, which is played on the playing terminal.
In some embodiments, the desired voice interaction manner includes a desired voice interaction positional relationship between the sound receiving object and the sound emitting object; the step of obtaining the target spatial sound effect template matched with the current acoustic scene comprises the following steps: acquiring a candidate spatial sound effect template set; the candidate spatial sound effect template set comprises a plurality of candidate spatial sound effect templates corresponding to different voice interaction position relations; and selecting a target spatial sound effect template matched with the current acoustic scene from the candidate spatial sound effect template set, wherein the voice interaction position relation corresponding to the target spatial sound effect template is matched with the expected voice interaction position relation corresponding to the current acoustic scene.
The expected voice interaction positional relationship between the sound receiving object and the sound emitting object is the interaction positional relationship that the sound receiving object expects to have with the sound emitting object. The interaction positional relationship falls into two types: an unchanged positional relationship and a changing positional relationship. An unchanged positional relationship means that the distance between the sound receiving object and the sound emitting object is fixed, and includes a close distance and a long distance: a close distance means the sound emitting object is within a certain range close to the sound receiving object, and a long distance means the sound emitting object is within a certain range far from it. A changing positional relationship means that the positional relationship between the sound receiving object and the sound emitting object changes over time, including at least one of a change in the movement speed of the sound emitting object over time and a change in its movement direction over time.
In this embodiment, a plurality of candidate spatial sound effect templates may be configured in advance, a candidate spatial sound effect template set is formed by these candidate spatial sound effect templates, each candidate spatial sound effect template corresponds to a different voice interaction position relationship, after the play terminal obtains a current acoustic scene, a target spatial sound effect template matched with the current acoustic scene is selected and obtained from the candidate spatial sound effect template set, and since the target spatial sound effect template is matched with the current acoustic scene, a voice interaction position relationship corresponding to the target spatial sound effect template matches with an expected voice interaction position relationship corresponding to the current acoustic scene.
For example, suppose the current acoustic scene is a noisy scene, in which the sound receiving object expects short-distance interaction so that ambient noise does not interfere with listening. The playing terminal may then select the ear-contact communication template from the candidate spatial sound effect template set, since the voice interaction positional relationship corresponding to that template is short-distance interaction and matches the expected voice interaction positional relationship.
In the embodiment, the candidate spatial sound effect template matched with the expected voice interaction position relation corresponding to the current acoustic scene is selected as the target spatial sound effect template from the candidate spatial sound effect template set comprising the candidate spatial sound effect templates corresponding to the plurality of different voice interaction position relations, and the spatial sound effect template suitable for the current acoustic scene can be selected, so that the played voice with high listening quality is obtained.
In some embodiments, the candidate spatial sound effect template set comprises candidate spatial sound effect templates with a changed voice interaction position relationship; selecting a target spatial sound effect template matched with the current acoustic scene from the candidate spatial sound effect template set comprises the following steps: and when the current voice interaction position relation corresponding to the current acoustic scene is the dynamic interaction position relation, selecting a candidate spatial sound effect template with the changed voice interaction position relation from the candidate spatial sound effect template set as a target spatial sound effect template matched with the current acoustic scene.
The candidate spatial sound effect template set includes candidate spatial sound effect templates with a changing voice interaction positional relationship: when such a template is used to process the original voice, the voice interaction positional relationship corresponding to the voice interaction mode of the resulting target voice changes over time. For example, a candidate spatial sound effect template with a changing voice interaction positional relationship may be any one of the whimsical communication template, the surround template, and the fly-in fly-out template.
The current voice interaction positional relationship corresponding to the current acoustic scene refers to the voice interaction mode expected by the sound receiving object in that scene, and this expectation differs across acoustic scenes. A dynamic interaction positional relationship means that the positional relationship changes dynamically over time, with the sound emitting object moving along a preset or random trajectory; for example, the sound emitting object rotates 360 degrees around the listener, or one sentence comes from the listener's front left, the next from behind, and the one after that right next to the listener's ear, or the sound emitting object approaches the sound receiving object from far away, or moves away from a position close to it.
In this embodiment, because the expected voice interaction modes of the sound receiving objects under different acoustic scenes are different, after the playing terminal identifies and obtains the current acoustic scene, it is equivalent to obtain the current voice interaction position relationship corresponding to the current acoustic scene, and when the current voice interaction position relationship corresponding to the current acoustic scene is the dynamic interaction position relationship, a candidate spatial sound effect template with the changed voice interaction position relationship is selected from the candidate spatial sound effect template set and serves as a target spatial sound effect template matched with the current acoustic scene.
In some embodiments, when the current voice interaction position relationship corresponding to the current acoustic scene is a dynamic interaction position relationship, the current acoustic scene may be a quiet scene.
In some embodiments, when there are multiple candidate spatial sound effect templates with varied voice interaction position relationships, the playing terminal may display a list of candidate spatial sound effect templates with varied voice interaction position relationships for selection by a user, or randomly select one candidate spatial sound effect template with varied voice interaction position relationships as a target spatial sound effect template.
In the above embodiment, when the current voice interaction position relationship corresponding to the current acoustic scene is the dynamic interaction position relationship, the playing terminal may select the candidate spatial sound effect template with the changed voice interaction position relationship from the candidate spatial sound effect template set, and use the candidate spatial sound effect template as the target spatial sound effect template matched with the current acoustic scene, so as to select the spatial sound effect template suitable for the current acoustic scene.
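To make the notion of a changing interaction positional relationship concrete, the sketch below builds simple trajectories for a surround-style and a fly-in-style template as lists of (azimuth, distance) pairs; the step counts, azimuth layout and distances are made-up illustrative values rather than figures from the application.

```python
def surround_orientation_sequence(steps=6, distance_m=1.5):
    """The sound emitting object circles the listener once in the horizontal plane."""
    return [(i * 360.0 / steps, distance_m) for i in range(steps)]


def fly_in_orientation_sequence(steps=6, far_m=8.0, near_m=0.5, azimuth_deg=0.0):
    """The sound emitting object approaches the listener from far away (steps >= 2)."""
    return [(azimuth_deg, far_m + (near_m - far_m) * i / (steps - 1)) for i in range(steps)]
```

Such a list plays the role of the target sound orientation parameter sequence described next.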
In some embodiments, the target spatial sound effect template comprises a target sound orientation parameter sequence corresponding to the dynamic interaction position relationship; as shown in fig. 3, processing the original speech to obtain the target speech according to the target sound effect parameters in the target spatial sound effect template includes:
step 302, the original voice is divided into as many voice segments as there are parameters in the target sound orientation parameter sequence.
The sound direction parameter refers to a parameter related to the sound direction, and the sound direction parameter may be a specific sound direction. The sound orientation parameters corresponding to the dynamic interaction position relation are dynamically changed along with time, and the plurality of sound orientation parameters changing along with time form a target sound orientation parameter sequence according to a time sequence. The number of parameters in the target sound bearing parameter sequence refers to the number of sound bearing parameters corresponding to different times in the target sound bearing parameter sequence, for example, if the sound bearing parameters corresponding to 6 different times are included in the target sound bearing parameter sequence, the number of parameters in the target sound bearing parameter sequence is 6.
Specifically, the playing terminal segments the original voice according to the number of parameters in the target sound orientation parameter sequence to obtain voice segments with the same number of parameters as the number of parameters in the target sound orientation parameter sequence. For example, if the number of parameters in the target sound orientation parameter sequence is 6, the original speech is segmented to obtain 6 speech segments. The segmentation mode may be random segmentation, or equal-duration segmentation is performed on the original voice according to the number of parameters in the target sound direction parameter sequence, for example, the original voice is 6 minutes, and is required to be segmented into 6 voice segments, so that the original voice may be segmented into 6 voice segments with a duration of 1 minute.
And step 304, determining the target sound orientation parameters corresponding to the voice segments in the target sound orientation parameter sequence according to the sequence of the voice segments in the original voice.
The sequence of the voice segments in the original voice refers to a sequence in which the plurality of voice segments are sorted according to the time sequence in the original voice.
Specifically, because the sound direction parameters in the target sound direction parameter sequence are arranged in time sequence, each sound direction parameter has a corresponding sequence, and the number of the voice segments is the same as the number of the parameters in the target sound direction parameter sequence, each voice segment may correspond to one sound direction parameter, so that the playing terminal may determine the sound direction parameters in the target sound direction parameter sequence having the same sequence as the sequence of the voice segments in the original voice as the target sound direction parameters corresponding to the voice segment.
And step 306, processing the voice segments according to the target sound orientation parameters corresponding to the voice segments to obtain processed voice segments, and forming target voice by each processed voice segment according to the voice sequence.
Specifically, the playing terminal processes the direction of the voice segment according to the target sound direction parameter corresponding to the voice segment, generates a stereo voice corresponding to the voice segment as a processed voice segment, and forms a target voice according to the voice sequence for each processed voice segment.
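Steps 302 to 306 can be sketched as follows, assuming a mono waveform and a rendering callback (such as the HRIR-based one sketched further below); the equal-duration split via np.array_split and the (2, samples) stereo layout are implementation assumptions.

```python
import numpy as np


def apply_orientation_sequence(voice, orientation_sequence, render_with_orientation):
    """Steps 302-306: segment, pair each segment with its orientation, render, rejoin."""
    n = len(orientation_sequence)
    segments = np.array_split(voice, n)   # near-equal-duration split into n segments
    processed = [
        render_with_orientation(segment, orientation)
        for segment, orientation in zip(segments, orientation_sequence)
    ]
    # Each processed segment is assumed to be stereo, shaped (2, samples); join along time
    return np.concatenate(processed, axis=1)
```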
In some embodiments, the playing terminal may process the orientation of the voice segment using an HRTF (Head-Related Transfer Function) to generate the stereo voice corresponding to the voice segment. The HRTF is a sound-source-position-dependent function, i.e., the response of the sound transmission path, which integrates the ITD (interaural time difference), the IID (interaural intensity difference), and the spectral characteristics of sound reflections from the body. The time-domain impulse response data corresponding to the HRTF is the HRIR (Head-Related Impulse Response), and the most commonly used HRIR data sets are the CIPIC (Center for Image Processing and Integrated Computing) data set and the MIT (Massachusetts Institute of Technology) data set; for example, the CIPIC data set contains measurements from 45 subjects, each measured at 25 horizontal orientations and 50 vertical orientations, i.e., time-domain binaural measurement data for 1250 orientations per subject.
The HRTF-based stereo generation is to convolve an original mono input signal u (n) with target HRIR data h (n), and output a binaural signal y (n) with reference to the following formula (1):
y(n) = u(n) * h(n) = Σ_m u(m) · h(n − m)    (1)
in determining HRIR data, the target position parameters may be matched to positions in the HRIR dataset, and HRIR data corresponding to positions that match may be determined as target HRIR data.
Since h (n) is divided into left and right channel HRIR data, y (n) is also generated corresponding to the left and right channel signal results, as shown in fig. 4. Referring to FIG. 4, convolution of the original speech and the left channel HRIR data yields a left channel speech signal and convolution of the original speech and the right channel HRIR data yields a right channel speech signal.
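A minimal sketch of formula (1) and the left/right convolution of Fig. 4 follows; the `hrir_bank` lookup from orientation to an (h_left, h_right) pair stands in for a CIPIC- or MIT-style HRIR data set and is an assumption, while the convolution itself follows the text.

```python
import numpy as np


def render_with_orientation(segment, orientation, hrir_bank):
    """Convolve a mono segment with the left/right HRIR chosen for the orientation."""
    h_left, h_right = hrir_bank[orientation]   # nearest-orientation HRIR pair (assumed lookup)
    left = np.convolve(segment, h_left)        # y_L(n) = u(n) * h_L(n), formula (1)
    right = np.convolve(segment, h_right)      # y_R(n) = u(n) * h_R(n)
    return np.stack([left, right])             # binaural signal, shape (2, samples)
```

It can serve as the rendering callback in the previous sketch, for example via functools.partial(render_with_orientation, hrir_bank=bank).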
In the above embodiment, the original voice is segmented into a plurality of voice segments, and each voice segment is processed through different target sound orientation parameters in the target sound orientation parameter sequence, so that stereo voices in different orientations can be generated for each different voice segment, and thus, a target voice which is matched with the target spatial sound effect template and has a dynamically changed interaction position relationship can be obtained.
In some embodiments, the candidate spatial sound effect template set comprises candidate spatial sound effect templates with fixed voice interaction position relation; selecting and obtaining a target spatial sound effect template matched with the current acoustic scene from the candidate spatial sound effect template set comprises the following steps: and when the current voice interaction position relation corresponding to the current acoustic scene is a fixed interaction position relation, selecting a candidate spatial sound effect template with the fixed voice interaction position relation from the candidate spatial sound effect template set as a target spatial sound effect template matched with the current acoustic scene.
The candidate spatial sound effect template set includes candidate spatial sound effect templates with a fixed voice interaction positional relationship: when such a template is used to process the original voice, the voice interaction positional relationship corresponding to the voice interaction mode of the resulting target voice is fixed, and this positional relationship may be an interaction distance. The interaction distance corresponding to a template with a fixed voice interaction positional relationship may be a short distance or a long distance. Such a template may be any one of the ear-contact communication template, the roaming communication template, and the lecture template; the interaction distance corresponding to the ear-contact communication template is short, while those of the roaming communication template and the lecture template may be long.
The current voice interaction position relation corresponding to the current acoustic scene refers to an expected voice interaction mode of a sound receiving object in the current acoustic scene. The expected voice interaction pattern of the sound receiving object is different in different acoustic scenes. The fixed interaction positional relationship means that the positional relationship is fixed, and the interaction distance between the sound emitting object and the sound receiving object is fixed in the fixed interaction positional relationship.
In this embodiment, because the expected voice interaction modes of the sound receiving objects under different acoustic scenes are different, after the playing terminal identifies and obtains the current acoustic scene, it is equivalent to obtain the current voice interaction position relationship corresponding to the current acoustic scene, and when the current voice interaction position relationship corresponding to the current acoustic scene is a fixed interaction position relationship, a candidate spatial sound effect template with a fixed voice interaction position relationship is selected from the candidate spatial sound effect template set and used as a target spatial sound effect template matched with the current acoustic scene.
In the above embodiment, when the current voice interaction position relationship corresponding to the current acoustic scene is the fixed interaction position relationship, the playing terminal may select the candidate spatial sound effect template with the fixed voice interaction position relationship from the candidate spatial sound effect template set, and use the candidate spatial sound effect template as the target spatial sound effect template matched with the current acoustic scene, so as to select the spatial sound effect template suitable for the current acoustic scene.
In some embodiments, the desired voice interaction positional relationship comprises a desired voice interaction distance; when the current voice interaction position relation corresponding to the current acoustic scene is a fixed interaction position relation, selecting a candidate spatial sound effect template with a fixed voice interaction position relation from the candidate spatial sound effect template set, and taking the candidate spatial sound effect template as a target spatial sound effect template matched with the current acoustic scene, wherein the candidate spatial sound effect template comprises: and when the current voice interaction position relation corresponding to the current acoustic scene is a fixed interaction position relation, selecting a candidate spatial sound effect template which has a fixed sound distance corresponding to the sound effect parameter and is matched with the expected voice interaction distance from the candidate spatial sound effect template set, and using the candidate spatial sound effect template as a target spatial sound effect template matched with the current acoustic scene.
The expected voice interaction distance refers to a voice interaction distance expected by a sound receiving object under a current acoustic scene. In some embodiments, when the current acoustic scene is a noisy scene, the sound receiving object expects close-distance communication to avoid interference of ambient noise to the listening process, i.e. the expected voice interaction distance in the current acoustic scene is close. The close distance may be a distance between the sound receiving object and the sound emitting object within a preset distance range. The preset distance range can be set as required.
A fixed sound distance indicated by the sound effect parameters means that the sound distance carried in the sound effect parameters is the same at every time point. The sound distance matching the expected voice interaction distance means that the sound distance value indicated by the sound effect parameters is consistent with the expected voice interaction distance.
Specifically, the expected voice interaction distance of the sound receiving object differs across acoustic scenes: in some acoustic scenes the expected voice interaction distance is a close distance, while in others it is a long distance or a medium distance. A long distance may mean that the distance between the sound receiving object and the sound emitting object is greater than a preset distance threshold, which can be set as required; a medium distance lies between the close distance and the long distance. In this embodiment, because the expected voice interaction distances differ across acoustic scenes, when the current voice interaction position relationship corresponding to the current acoustic scene is a fixed interaction position relationship, the playing terminal may select, from the candidate spatial sound effect template set, a candidate spatial sound effect template whose sound distance is fixed and matches the expected voice interaction distance, and use it as the target spatial sound effect template matched with the current acoustic scene.
In some embodiments, when the current acoustic scene is a noisy scene and the sound receiving object expects close-range communication, the playing terminal may select the close-to-ear communication template from the candidate spatial sound effect template set as the target spatial sound effect template.
In the above embodiment, the candidate spatial sound effect template whose sound distance indicated by the sound effect parameters is fixed and matches the expected voice interaction distance is selected from the candidate spatial sound effect template set and used as the target spatial sound effect template matched with the current acoustic scene, so that the selected template fits the listening distance expected in the current acoustic scene.
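As an illustration of this selection logic, the following Python sketch picks the candidate template whose sound distance is fixed and closest to the expected voice interaction distance. The field names (`distance_fixed`, `distance_m`) and the tolerance value are assumptions made for the example, not part of the embodiment.

```python
# Minimal sketch of fixed-distance template selection; field names are illustrative assumptions.
def select_fixed_distance_template(candidates, expected_distance_m, tolerance_m=0.3):
    """Return the candidate template with a fixed sound distance matching the expected distance."""
    best, best_gap = None, float("inf")
    for template in candidates:
        if not template.get("distance_fixed", False):
            continue  # skip templates whose sound distance varies over time
        gap = abs(template["distance_m"] - expected_distance_m)
        if gap <= tolerance_m and gap < best_gap:
            best, best_gap = template, gap
    return best

candidates = [
    {"name": "close_to_ear", "distance_fixed": True, "distance_m": 0.1},
    {"name": "lecture", "distance_fixed": True, "distance_m": 5.0},
    {"name": "fly_in_fly_out", "distance_fixed": False},
]
# In a noisy scene the expected interaction distance is short, so the close-to-ear template wins.
print(select_fixed_distance_template(candidates, expected_distance_m=0.15))
```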
In some embodiments, the expected voice interaction mode includes an expected voice interaction position relationship between the sound receiving object and the sound emitting object, and the target sound effect parameters include position relation sound effect parameters matched with the expected voice interaction position relationship. Processing the original voice according to the target sound effect parameters in the target spatial sound effect template to obtain the target voice comprises: performing voice processing on the original voice by using the position relation sound effect parameters in the target spatial sound effect template to obtain the target voice, so that the target voice matches the expected voice interaction position relationship.
The position relation sound effect parameters are sound effect parameters related to position, and include azimuth-related sound effect parameters and distance-related sound effect parameters. The position relation sound effect parameters matching the expected voice interaction position relationship means that the position indicated by these parameters is consistent with the position indicated by the expected voice interaction position relationship.
In this embodiment, when the playing terminal processes the original voice according to the target sound effect parameters in the target spatial sound effect template, it may specifically perform voice processing on the original voice by using the position relation sound effect parameters in the template to obtain the target voice. Because the position relation sound effect parameters in the target spatial sound effect template match the expected voice interaction position relationship, the obtained target voice also matches the expected voice interaction position relationship.
In this embodiment, a target voice matching the expected voice interaction position relationship can be obtained; the target voice is adapted to the current acoustic scene, and its listening quality is high.
In some embodiments, the expected voice interaction position relationship comprises an expected voice interaction distance and an expected interaction azimuth, and the position relation sound effect parameters comprise azimuth-related sound effect parameters and distance-related sound effect parameters. Performing voice processing on the original voice by using the position relation sound effect parameters in the target spatial sound effect template to obtain the target voice comprises: processing the azimuth of the original voice by using the azimuth-related sound effect parameters, and processing the sound pressure of the original voice by using the distance-related sound effect parameters, to obtain a target voice whose azimuth matches the expected interaction azimuth and whose sound pressure matches the expected voice interaction distance.
The expected voice interaction distance refers to the voice interaction distance expected by the sound receiving object in the current acoustic scene, and the expected interaction azimuth refers to the voice interaction azimuth expected by the sound receiving object in the current acoustic scene. The azimuth-related sound effect parameters are sound effect parameters related to azimuth, and may specifically be azimuth data. The distance-related sound effect parameters are sound effect parameters related to distance, and may specifically be distance data or sound pressure data. Sound pressure characterizes loudness: the greater the sound pressure, the louder the sound; conversely, the smaller the sound pressure, the quieter the sound.
In this embodiment, the target sound effect parameters including position relation sound effect parameters matched with the expected voice interaction position relationship may specifically mean: the target sound effect parameters include distance-related sound effect parameters matched with the expected voice interaction distance and azimuth-related sound effect parameters matched with the expected interaction azimuth.
Specifically, the playing terminal processes the azimuth of the original voice by using the azimuth-related sound effect parameters to generate a stereo voice, and then processes the sound pressure of the stereo voice by using the distance-related sound effect parameters to obtain the target voice. Since the azimuth-related sound effect parameters match the expected interaction azimuth and the distance-related sound effect parameters match the expected voice interaction distance, the azimuth of the obtained target voice matches the expected interaction azimuth and its sound pressure matches the expected voice interaction distance.
In some embodiments, the playing terminal may process the azimuth of the original voice by using the azimuth-related sound effect parameters as follows: match the azimuth indicated by the azimuth-related sound effect parameters against the azimuths in the HRIR database, and convolve the HRIR data of the matching azimuth with the original voice to generate the stereo voice.
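A minimal sketch of this azimuth rendering step, assuming an HRIR database indexed by azimuth where each entry holds a left-ear and right-ear impulse response; the database contents below are placeholders, and nearest-azimuth lookup is just one simple way to realize the matching described above.

```python
import numpy as np

# Hypothetical HRIR database: azimuth (degrees) -> (left_ir, right_ir); real data would be measured.
hrir_db = {
    0:  (np.array([1.0, 0.3]), np.array([1.0, 0.3])),
    90: (np.array([0.2, 0.1]), np.array([1.0, 0.5])),
}

def render_azimuth(mono, target_azimuth_deg):
    """Convolve the mono voice with the HRIR pair whose azimuth is closest to the target azimuth."""
    nearest = min(hrir_db, key=lambda az: abs(az - target_azimuth_deg))
    left_ir, right_ir = hrir_db[nearest]
    left = np.convolve(mono, left_ir)
    right = np.convolve(mono, right_ir)
    return np.stack([left, right])  # two-channel (stereo) signal

stereo = render_azimuth(np.random.randn(480), target_azimuth_deg=85)
print(stereo.shape)
```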
In some embodiments, the closer the sound source is to the listener, the higher the perceived sound pressure; conversely, the farther the sound source is from the listener, the lower the perceived sound pressure. The sound pressure level difference between a distance r1 and a distance r2 follows formula (2) below, where lp2 is the sound pressure corresponding to distance r2 and lp1 is the sound pressure corresponding to distance r1:
lp2 = lp1 - 20·lg(r2/r1)    (2)
In practical application, the sound pressure value lp1 at a reference distance r1 can be measured, the sound pressure lp2 at a target distance r2 can be derived through the above relationship, and the result can be mapped onto the corresponding sound signal, thereby realizing the auditory perception of different distances. For example, when r2/r1 equals 2, the sound pressure is attenuated by about 6 dB.
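The following sketch applies formula (2) as a gain on the signal: moving the virtual source from r1 to r2 changes the level by -20·lg(r2/r1) dB, which corresponds to a linear amplitude factor of r1/r2. The function names are illustrative.

```python
import numpy as np

def distance_gain_db(r1, r2):
    """Level change in dB when the virtual source moves from distance r1 to r2 (formula (2))."""
    return -20.0 * np.log10(r2 / r1)

def apply_distance(signal, r1, r2):
    """Scale the signal amplitude so that its perceived level matches the target distance r2."""
    return signal * 10.0 ** (distance_gain_db(r1, r2) / 20.0)  # equals signal * (r1 / r2)

print(round(distance_gain_db(1.0, 2.0), 2))  # doubling the distance lowers the level by about 6.02 dB
```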
In other embodiments, the relationship between the sound pressure LP and the sound signal x (n) is as shown in equation (3) below, where LP0 is the offset value. According to the corresponding relation, the amplitude of the input signal can be adjusted to adjust the sound pressure, and further the effect of the distance change between the listener and the sound source is achieved:
LP = 20·lg(|x(n)|) + LP0    (3)
In the above embodiment, the azimuth of the original voice is processed by using the azimuth-related sound effect parameters, and the sound pressure of the original voice is processed by using the distance-related sound effect parameters. The azimuth of the resulting target voice matches the expected interaction azimuth and its sound pressure matches the expected voice interaction distance, so a target voice of high listening quality is obtained.
In some embodiments, the target spatial sound effect template is further matched with a target voice interaction effect, the target voice interaction effect is an expected voice interaction effect of a sound receiving object in a current acoustic scene, and the target sound effect parameters include a voice effect adjustment parameter matched with the target voice interaction effect; performing voice processing on the original voice by using the position relation sound effect parameter in the target space sound effect template to obtain the target voice, wherein the step of performing voice processing on the original voice comprises the following steps: and processing the original voice by using the position relation sound effect parameters and the voice effect adjusting parameters in the target space sound effect template to obtain the target voice.
The expected voice interaction effect refers to a voice interaction effect expected by a sound receiving object in a current acoustic scene, and the voice interaction effect may specifically be a reverberation effect. The voice effect adjustment parameter refers to a parameter for adjusting the effect of sound. The speech effect adjustment parameter may specifically be an attenuation factor or a filter parameter or the like.
Specifically, when the playing terminal performs voice processing on the original voice by using the position relation sound effect parameters in the target spatial sound effect template, it may first process the original voice with the position relation sound effect parameters to obtain a stereo voice, and then process the voice effect of that stereo voice with the voice effect adjustment parameters to obtain the target voice, whose voice interaction effect matches the expected voice interaction effect.
Fig. 5 is a schematic diagram illustrating voice effect adjustment performed with voice effect adjustment parameters according to some embodiments. Referring to fig. 5, the voice effect adjustment parameters are the parameters of a reverberator, and the playing terminal performs reverberation processing on the voice through the reverberator to obtain a target voice with a reverberation effect. Specifically, the reverberator contains three branches, which respectively produce a direct speech signal, an early reflected speech signal and a late reflected speech signal; the signals output by the three branches are superposed to obtain the final target signal with reverberation. Wherein:
branching one: multiplying the original voice x (n) by an attenuation factor to obtain a direct voice signal;
and a branch II: the original voice x (n) passes through an 18-point filter, and the filtering result is multiplied by an early reflection attenuation factor to obtain an early reflection voice signal;
and branch three: the original voice x (n) passes through an 18-point filter, then passes through the weighted sum of 6 low-pass comb filters, then passes through an all-pass filter, and finally the filtering result is multiplied by a late-stage reflection attenuation factor to obtain a late-stage reflection voice signal.
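A simplified sketch of this three-branch structure follows. The 18-point filter, the six comb filters and the all-pass filter are stand-ins with made-up delays and gains, so the code illustrates the topology of fig. 5 rather than the exact reverberator of the embodiment.

```python
import numpy as np
from scipy.signal import lfilter

def comb(x, delay, g):
    """Feedback comb filter: y[n] = x[n] + g * y[n - delay]."""
    b = np.zeros(delay + 1); b[0] = 1.0
    a = np.zeros(delay + 1); a[0] = 1.0; a[delay] = -g
    return lfilter(b, a, x)

def allpass(x, delay, g):
    """Schroeder all-pass filter."""
    b = np.zeros(delay + 1); b[0] = -g; b[delay] = 1.0
    a = np.zeros(delay + 1); a[0] = 1.0; a[delay] = -g
    return lfilter(b, a, x)

def reverberate(x, fir=None, direct_gain=0.8, early_gain=0.4, late_gain=0.3):
    """Sum of a direct branch, an early-reflection branch and a late-reflection branch."""
    if fir is None:
        fir = np.ones(18) / 18.0          # placeholder for the 18-point filter
    filtered = np.convolve(x, fir)[: len(x)]
    direct = direct_gain * x                                            # branch one
    early = early_gain * filtered                                       # branch two
    combs = sum(comb(filtered, d, 0.7) for d in (37, 43, 53, 59, 113, 347)) / 6.0
    late = late_gain * allpass(combs, 89, 0.6)                          # branch three
    return direct + early + late

y = reverberate(np.random.randn(8000))
print(y.shape)
```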
In the above embodiment, the target sound effect parameters include voice effect adjustment parameters matched with the target voice interaction effect. When processing the original voice, the playing terminal can use both the position relation sound effect parameters and the voice effect adjustment parameters in the target spatial sound effect template to obtain the target voice. The resulting target voice matches the expected voice interaction effect in the current acoustic scene, has high listening quality, and better satisfies the needs of the sound receiving object.
In some embodiments, performing scene recognition on the environmental sound, and obtaining the current acoustic scene where the play terminal is located includes: acquiring a plurality of sound sub-segments corresponding to the environmental sound, and performing feature extraction on the sound sub-segments to obtain sub-segment features; identifying and obtaining a segment acoustic scene corresponding to the sound sub-segment based on the sub-segment characteristics; counting the segment acoustic scenes corresponding to the sound sub-segments to obtain the number of scenes corresponding to each segment acoustic scene; and selecting the acoustic scene with the largest scene number as the current acoustic scene where the playing terminal is located.
The sound sub-segments are obtained by segmenting the environmental sound; the segmentation may be performed at a target time interval. The sub-segment features may be power spectra or mel-frequency cepstral coefficients.
Specifically, the playing terminal may segment the environmental sound into a plurality of sound sub-segments and input each sound sub-segment into a trained acoustic scene recognition model to recognize the acoustic scene corresponding to that sub-segment, i.e. its segment acoustic scene. Each sound sub-segment corresponds to one segment acoustic scene, so the environmental sound corresponds to a plurality of segment acoustic scenes. The playing terminal then counts the segment acoustic scenes to obtain the number of scenes corresponding to each segment acoustic scene, and selects the segment acoustic scene with the largest count as the current acoustic scene where the playing terminal is located.
For example, suppose the environmental sound is divided into 6 sound sub-segments: sound sub-segment 1, sound sub-segment 2, sound sub-segment 3, sound sub-segment 4, sound sub-segment 5 and sound sub-segment 6. The segment acoustic scene corresponding to sound sub-segment 1 is scene A, that of sound sub-segment 2 is scene A, that of sound sub-segment 3 is scene B, that of sound sub-segment 4 is scene A, that of sound sub-segment 5 is scene A, and that of sound sub-segment 6 is scene C. Counting gives 4 occurrences of scene A, 1 of scene B and 1 of scene C, so scene A is taken as the current acoustic scene corresponding to the environmental sound, i.e. the current acoustic scene where the playing terminal is located.
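A minimal sketch of this voting step: count the per-sub-segment scene labels and keep the most frequent one. The label names are taken from the example above.

```python
from collections import Counter

def vote_scene(segment_scenes):
    """Return the segment acoustic scene that occurs most often among the sound sub-segments."""
    counts = Counter(segment_scenes)
    scene, _ = counts.most_common(1)[0]
    return scene

# Matches the example: four sub-segments recognized as scene A, one as B, one as C.
print(vote_scene(["A", "A", "B", "A", "A", "C"]))  # -> "A"
```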
In some embodiments, the acoustic scene recognition model may be trained by collecting environmental sounds in different acoustic scenes, segmenting them into sound sub-segments and determining the corresponding training labels, inputting each sound sub-segment into the model, and training with the label of each sub-segment as the expected output until a training stop condition is satisfied, at which point the trained acoustic scene recognition model is obtained. The training stop condition may be that the model parameters no longer change, that the loss reaches a minimum, that the number of training iterations reaches the maximum, and so on.
Fig. 6 shows a model structure diagram of the acoustic scene recognition model in some specific embodiments. Referring to fig. 6, the model comprises five layers: the first layer is a Dense Convolutional Network (DenseNet), the second to fourth layers are GRU (Gated Recurrent Unit) networks whose network parameters differ from one another, and the fifth layer is a softmax layer, which may also adopt a DenseNet structure. The final output of the model may be a preset scene identifier; for example, with 5 scene types the output identifier of a training sample is a 5-bit binary code, so 00100 indicates that the sample corresponds to the third scene type. The final output may also be the probability of each scene, in which case the scene with the highest probability among all categories is taken as the recognition result.
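The following PyTorch sketch mirrors the layer ordering of fig. 6 only loosely: a small convolutional front-end stands in for the DenseNet block, followed by three GRU layers with different widths and a softmax output over the scene classes. All layer sizes and the input feature dimension are assumptions for illustration, not the embodiment's actual configuration.

```python
import torch
import torch.nn as nn

class AcousticSceneModel(nn.Module):
    """Simplified stand-in for the five-layer recognizer: conv front-end, three GRUs, softmax head."""
    def __init__(self, n_features=40, n_scenes=5):
        super().__init__()
        self.frontend = nn.Conv1d(n_features, 64, kernel_size=3, padding=1)  # stands in for DenseNet
        self.gru1 = nn.GRU(64, 128, batch_first=True)
        self.gru2 = nn.GRU(128, 96, batch_first=True)
        self.gru3 = nn.GRU(96, 64, batch_first=True)
        self.head = nn.Linear(64, n_scenes)

    def forward(self, x):                      # x: (batch, frames, n_features)
        h = self.frontend(x.transpose(1, 2)).transpose(1, 2)
        h, _ = self.gru1(h)
        h, _ = self.gru2(h)
        h, _ = self.gru3(h)
        return torch.softmax(self.head(h[:, -1]), dim=-1)  # probability per scene class

probs = AcousticSceneModel()(torch.randn(2, 100, 40))
print(probs.argmax(dim=-1))  # index of the most probable scene for each input sub-segment
```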
In this embodiment, the plurality of sound sub-segments corresponding to the environmental sound are obtained, scene recognition is performed on each sound sub-segment to obtain the plurality of segment acoustic scenes corresponding to the environmental sound, the number of occurrences of each segment acoustic scene is counted, and finally the segment acoustic scene with the largest count is selected as the current acoustic scene where the playing terminal is located, so that occasional misclassified sub-segments do not change the overall recognition result.
In some embodiments, a voice playing method is provided. Specifically, the playing terminal collects environmental sound, performs acoustic scene recognition on the collected environmental sound, selects a spatial sound effect template matched with the recognized current acoustic scene, performs virtual stereo generation from the original voice according to the sound effect parameters in the spatial sound effect template, and finally plays the generated stereo. The original voice signal may be a collected mono voice signal of the sound emitting object. In some specific embodiments, the virtual stereo generation process is as follows: for the collected mono signal, based on the target azimuth, reverberation parameter and distance parameter among the sound effect parameters of the matched spatial sound effect template, stereo is first generated using HRTF techniques, the distance-related volume is then adjusted, and finally reverberation processing is performed to obtain the two-channel stereo signal.
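Putting the stages together, the toy sketch below walks through azimuth rendering, distance attenuation and a reverberation tail in one function. The constant-power pan, the 1/d attenuation and the single delayed echo are crude stand-ins for the HRTF rendering, formula (2) and the reverberator of fig. 5, respectively; they only illustrate the order of the processing stages.

```python
import numpy as np

def virtual_stereo(mono, azimuth_deg, distance_m, reverb_gain):
    """Toy end-to-end rendering: azimuth panning, distance attenuation, then a simple reverb tail."""
    # Azimuth: crude constant-power pan as a stand-in for HRTF/HRIR rendering.
    theta = np.deg2rad(azimuth_deg)
    left, right = mono * np.cos(theta / 2), mono * np.sin(theta / 2)
    stereo = np.stack([left, right])
    # Distance: attenuate relative to a 1 m reference, matching the linear factor implied by formula (2).
    stereo = stereo * (1.0 / max(distance_m, 0.1))
    # Reverb: add a delayed, attenuated copy as a placeholder for the reverberator of fig. 5.
    tail = np.zeros_like(stereo)
    tail[:, 480:] = reverb_gain * stereo[:, :-480]
    return stereo + tail

out = virtual_stereo(np.random.randn(48000), azimuth_deg=60, distance_m=2.0, reverb_gain=0.3)
print(out.shape)  # (2, 48000)
```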
The application further provides an application scenario, and the application scenario applies the voice playing method. In the application scenario, the sound emitting object and the sound receiving object perform a real-time voice call through a network. Specifically, referring to fig. 7, the application of the voice playing method in the application scenario is as follows:
step 702, the playing terminal acquires real-time voice of a sound emitting object as original voice through a network, and collects environmental sound of the current playing environment.
Step 704, the playing terminal segments the environmental sound to obtain a sound sub-segment.
Step 706, extracting the power spectrum or mel-frequency cepstrum coefficient from the sound sub-segments to obtain the sub-segment characteristics corresponding to the sound sub-segments.
Step 708, inputting the sub-segment characteristics into the acoustic scene recognition model to obtain a segment acoustic scene corresponding to the sound sub-segment.
Step 710, counting the segment acoustic scenes corresponding to the sound sub-segments to obtain the number of scenes corresponding to each segment acoustic scene, and selecting the segment acoustic scene with the largest number of scenes as the current acoustic scene where the playing terminal is located.
The acoustic scenes include three types: a noisy scene, a quiet scene and a common scene, where a common scene is a scene between a noisy scene and a quiet scene.
Step 712, acquiring a target spatial sound effect template matched with the current acoustic scene, wherein the target spatial sound effect template matches a target voice interaction mode, and the target voice interaction mode is the expected voice interaction mode of the sound receiving object in the current acoustic scene.
Six candidate spatial sound effect templates may be configured in advance: a close-to-ear communication template, a roaming mobile communication template, a lecture type template, a surprise type communication template, a surround type template and a fly-in fly-out type template. The close-to-ear communication template means that the sound emitting object communicates close to the ear of the sound receiving object; the roaming mobile communication template means that the sound emitting object stays within a certain distance range and communicates with the sound receiving object while moving slowly along a random or preset motion trajectory; the lecture type template means that the sound emitting object is at a medium to long distance, with a powerful voice accompanied by a certain reverberation effect; the surprise type communication template means that the position of the sound emitting object is not fixed and its motion trajectory is random, for example one sentence appears at the front left of the sound receiving object, the next appears behind it, and the one after that is close to its ear, providing a surprising listening experience; the surround type template means that the sound emitting object keeps a certain distance from the sound receiving object and communicates while rotating 360 degrees around it in the horizontal plane; the fly-in fly-out type template means that the sound emitting object approaches the position of the sound receiving object from afar at a relatively high speed, or moves away from a position close to the sound receiving object at a relatively high speed.
Each spatial sound effect template comprises a series of sound image directions, distances, reverberation parameters and the like, virtual stereo sound generation is realized through related technologies according to the parameters, and the generated stereo sound is played through earphones or multiple speakers, wherein the multi-speaker playing relates to Upmix conversion from two channels to multiple channels and crosstalk cancellation technology.
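One way to represent such templates is a plain mapping from template name to its sound image azimuth trajectory, distance and reverberation setting, plus a mapping from acoustic scene to the templates offered for it. All numeric values and key names below are invented for illustration and are not prescribed by the embodiment.

```python
# Hypothetical parameter sets; real templates would carry measured or designed values.
spatial_templates = {
    "close_to_ear":   {"azimuth_deg": [90],                   "distance_m": 0.1,        "reverb": 0.05},
    "roaming_mobile": {"azimuth_deg": [0, 30, 60, 90, 120],   "distance_m": 1.5,        "reverb": 0.10},
    "lecture":        {"azimuth_deg": [0],                    "distance_m": 6.0,        "reverb": 0.40},
    "surprise":       {"azimuth_deg": [315, 180, 90],         "distance_m": 1.0,        "reverb": 0.10},
    "surround":       {"azimuth_deg": list(range(0, 360, 30)),"distance_m": 1.0,        "reverb": 0.15},
    "fly_in_fly_out": {"azimuth_deg": [0],                    "distance_m": [8.0, 0.5], "reverb": 0.10},
}

# A noisy scene maps to the close-to-ear template; quiet and common scenes offer several choices.
scene_to_templates = {
    "noisy":  ["close_to_ear"],
    "quiet":  ["surprise", "surround", "fly_in_fly_out"],
    "common": ["lecture", "surprise"],
}
print(scene_to_templates["noisy"])
```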
When the current acoustic scene is a noisy scene, the playing terminal acquires the close-to-ear communication template as the target spatial sound effect template. When the current acoustic scene is a quiet scene, the playing terminal can offer three spatial sound effect templates, namely the surprise type communication template, the surround type template and the fly-in fly-out type template, and provide a selection interface from which the user chooses one as the target spatial sound effect template. When the current acoustic scene is a common scene, the playing terminal can offer two spatial sound effect templates, namely the lecture type template and the surprise type communication template, and provide a selection interface from which the user chooses one as the target spatial sound effect template.
Alternatively, the common scene can be further subdivided: for a natural scene such as a forest, the playing terminal may select the roaming mobile communication template as the target spatial sound effect template; for an open scene such as a large court, the playing terminal may select the lecture type template as the target spatial sound effect template.
Step 714, according to the target azimuth in the spatial sound effect template, the playing terminal may determine the azimuth data matched with the target azimuth from the HRIR database, and convolve the HRIR data corresponding to that azimuth with the original voice signal to generate virtual stereo.
Step 716, the playing terminal adjusts the loudness of the stereo according to the distance parameter in the spatial sound effect template, and performs reverberation processing according to the reverberation parameter in the spatial sound effect template, to obtain a stereo signal with a reverberation effect.
In this application scenario, by combining the real acoustic environment with virtual spatial acoustics, the different listening requirements of users in different acoustic scenes can be satisfied and a brand-new listening experience provided. The on-site listening experience is presented through spatial sound effects, and the voice listening quality is noticeably improved.
The application also provides another application scene, and the application scene applies the voice playing method. In the application scenario, the sound emitting object and the sound receiving object perform an instant voice call through a network.
In this application scenario, the playing terminal takes a voice message sent by the sound emitting object as the original voice. When the playing terminal receives the playing operation triggered by the sound receiving object on the original voice, it collects the environmental sound in the current playing environment, performs scene recognition on the environmental sound to obtain the current acoustic scene where the playing terminal is located, acquires the target spatial sound effect template matched with the current acoustic scene, processes the voice message according to the target sound effect parameters in the target spatial sound effect template to obtain the target voice, and plays the target voice. In this application scenario, when the playing terminal finds that several spatial sound effect templates match the current acoustic scene, it can pop up a prompt box for selecting a spatial sound effect template and prompt the user to choose one of them as the target spatial sound effect template.
Fig. 8 is a schematic diagram, in some embodiments, of the terminal displaying a prompt box for selecting a spatial sound effect template on the session interface. In this embodiment, when the user clicks a voice message 802, the playing terminal receives the user's playing operation on the voice message; after the current acoustic scene is determined, the spatial sound effect templates matched with the current acoustic scene are acquired, here the surprise type communication template, the surround type template and the fly-in fly-out type template, and a prompt box 804 for selecting a spatial sound effect template is displayed on the session interface 800. The user can click the text corresponding to any spatial sound effect template to select it, and the playing terminal takes the template selected by the user as the target spatial sound effect template. The prompt box 804 can also show an option of not using a spatial sound effect template; when the user clicks this option, the playing terminal plays the voice message without any spatial sound effect processing.
In some embodiments, when the session interface contains a plurality of voice messages to be played, acoustic scene recognition is executed when the playing operation on the current voice message is received, and the prompt box for selecting a spatial sound effect template pops up once the current acoustic scene is recognized. After the user selects the target spatial sound effect template, the playing terminal continues to perform acoustic scene recognition for the next voice message: if the recognized acoustic scene remains unchanged (i.e. it is the same as that of the previous voice message), the spatial sound effect template the user selected for the previous voice message is automatically adopted as the target spatial sound effect template for this voice message; if the recognized current acoustic scene has changed, a prompt box for selecting a spatial sound effect template corresponding to the changed acoustic scene pops up.
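A small sketch of this per-message decision: keep the previously chosen template while the recognized scene stays the same, otherwise ask the user again. The `ask_user_to_pick` callback stands for the prompt box and is hypothetical.

```python
def choose_template(recognized_scene, state, ask_user_to_pick):
    """Reuse the last user choice while the acoustic scene is unchanged; re-prompt when it changes."""
    if state.get("scene") == recognized_scene and state.get("template") is not None:
        return state["template"]
    template = ask_user_to_pick(recognized_scene)   # pops up the selection prompt box
    state["scene"], state["template"] = recognized_scene, template
    return template

state = {}
pick = lambda scene: "surround" if scene == "quiet" else "close_to_ear"  # stand-in for the UI prompt
print(choose_template("quiet", state, pick))   # prompts -> "surround"
print(choose_template("quiet", state, pick))   # same scene -> reuses "surround"
print(choose_template("noisy", state, pick))   # scene changed -> prompts again -> "close_to_ear"
```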
It should be understood that although the steps in the flowcharts of figs. 2-8 are shown sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, these steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in figs. 2-8 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In some embodiments, as shown in fig. 9, there is provided a speech playing apparatus 900, which may be a part of a computer device using a software module or a hardware module, or a combination of the two, and specifically includes:
a voice obtaining module 902, configured to obtain an original voice to be played, and obtain an environmental sound in a playing environment where a playing terminal of the original voice is currently located;
a scene recognition module 904, configured to perform scene recognition on the environmental sound to obtain a current acoustic scene where the playing terminal is located;
a template obtaining module 906, configured to obtain a target spatial sound effect template matched with a current acoustic scene, where the target spatial sound effect template is matched with a target voice interaction mode, and the target voice interaction mode is an expected voice interaction mode of a sound receiving object in the current acoustic scene;
the voice processing module 908 is configured to process the original voice according to the target sound effect parameter in the target spatial sound effect template to obtain a target voice, so as to play the target voice in the play terminal.
The voice playing device acquires original voice to be played, acquires environmental sound in a playing environment where a playing terminal of the original voice is currently located, performs scene recognition on the environmental sound, recognizes a current acoustic scene where the playing terminal is located, acquires a target space sound effect template matched with the current acoustic scene, processes the original voice according to target sound effect parameters in the target space sound effect template to obtain target voice, and plays the target voice in the playing terminal.
In some embodiments, the desired voice interaction manner includes a desired voice interaction positional relationship between the sound receiving object and the sound emitting object; the template acquisition module is also used for acquiring a candidate spatial sound effect template set; the candidate spatial sound effect template set comprises a plurality of candidate spatial sound effect templates corresponding to different voice interaction position relations; and selecting a target spatial sound effect template matched with the current acoustic scene from the candidate spatial sound effect template set, wherein the voice interaction position relation corresponding to the target spatial sound effect template is matched with the expected voice interaction position relation corresponding to the current acoustic scene.
In some embodiments, the candidate spatial sound effect template set comprises candidate spatial sound effect templates with a changed voice interaction position relationship; the template obtaining module is further used for selecting a candidate spatial sound effect template with a changed voice interaction position relation from the candidate spatial sound effect template set as a target spatial sound effect template matched with the current acoustic scene when the current voice interaction position relation corresponding to the current acoustic scene is a dynamic interaction position relation.
In some embodiments, the target spatial sound effect template comprises a target sound orientation parameter sequence corresponding to the dynamic interaction position relationship. The voice processing module is further configured to segment the original voice into as many voice segments as there are parameters in the target sound orientation parameter sequence; determine, according to the order of the voice segments in the original voice, the target sound orientation parameter corresponding to each voice segment in the target sound orientation parameter sequence; and process each voice segment according to its corresponding target sound orientation parameter to obtain processed voice segments, the processed voice segments forming the target voice in voice order.
In some embodiments, the candidate spatial sound effect template set comprises candidate spatial sound effect templates with fixed voice interaction position relation; and the template acquisition module is also used for selecting a candidate spatial sound effect template with a fixed voice interaction position relation from the candidate spatial sound effect template set to serve as a target spatial sound effect template matched with the current acoustic scene when the current voice interaction position relation corresponding to the current acoustic scene is a fixed interaction position relation.
In some embodiments, the desired voice interaction location relationship comprises a desired voice interaction distance; the template obtaining module is further used for selecting a candidate spatial sound effect template which has a fixed sound distance corresponding to the sound effect parameter and is matched with the expected voice interaction distance from the candidate spatial sound effect template set to serve as a target spatial sound effect template matched with the current acoustic scene when the current voice interaction position relation corresponding to the current acoustic scene is a fixed interaction position relation.
In some embodiments, the desired voice interaction manner includes a desired voice interaction positional relationship between the sound receiving object and the sound emitting object; the target sound effect parameters comprise position relation sound effect parameters matched with the expected voice interaction position relation; the voice processing module is further used for performing voice processing on the original voice by using the position relation sound effect parameters in the target space sound effect template to obtain target voice, so that the target voice is matched with the expected voice interaction position relation.
In some embodiments, the desired voice interaction positional relationship comprises a desired voice interaction distance and a desired interaction orientation; the position relation sound effect parameters comprise orientation related sound effect parameters and distance related sound effect parameters; the voice processing module is also used for processing the direction of the original voice by using the direction-related sound effect parameter and processing the sound pressure of the original voice by using the distance-related sound effect parameter to obtain target voice; so that the target voice's azimuth matches the desired interaction azimuth and the target voice's sound pressure matches the desired voice interaction distance.
In some embodiments, the target spatial sound effect template is further matched with a target voice interaction effect, the target voice interaction effect is an expected voice interaction effect of a sound receiving object in a current acoustic scene, and the target sound effect parameters include a voice effect adjustment parameter matched with the target voice interaction effect; the voice processing module is further used for processing the original voice by using the position relation sound effect parameters and the voice effect adjusting parameters in the target space sound effect template to obtain the target voice.
In some embodiments, the scene recognition module is further configured to obtain a plurality of sound sub-segments corresponding to the environmental sound, and perform feature extraction on the sound sub-segments to obtain sub-segment features; identifying and obtaining a segment acoustic scene corresponding to the sound sub-segment based on the sub-segment characteristics; counting the segment acoustic scenes corresponding to the sound sub-segments to obtain the number of scenes corresponding to each segment acoustic scene; and selecting the acoustic scene with the largest scene number as the current acoustic scene where the playing terminal is located.
For the specific limitation of the voice playing apparatus, reference may be made to the above limitation on the voice playing method, which is not described herein again. The modules in the voice playing device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In some embodiments, a computer device is provided, which may be a play terminal, and its internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for communicating with an external terminal in a wired or wireless manner, and the wireless manner can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a voice playing method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on a shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the structure shown in fig. 10 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer devices to which the solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In some embodiments, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the above method embodiments when executing the computer program.
In some embodiments, a computer-readable storage medium is provided, in which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In some embodiments, a computer program product or computer program is provided that includes computer instructions stored in a computer readable storage medium. The computer instructions are read by a processor of the computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps of the above-described method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by hardware instructions of a computer program, which may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical storage, or the like. Volatile Memory can include Random Access Memory (RAM) or external cache Memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as such a combination contains no contradiction, it should be considered within the scope of this disclosure.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, and these are all within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (15)

1. A method for playing speech, the method comprising:
acquiring original voice to be played, and acquiring environmental sound in a playing environment where a playing terminal of the original voice is currently located;
performing scene recognition on the environmental sound to obtain a current acoustic scene where the playing terminal is located;
acquiring a target space sound effect template matched with the current acoustic scene, wherein the target space sound effect template is matched with a target voice interaction mode, and the target voice interaction mode is an expected voice interaction mode of a sound receiving object in the current acoustic scene;
and processing the original voice to obtain a target voice according to the target sound effect parameters in the target spatial sound effect template so as to play the target voice in the playing terminal.
2. The method of claim 1, wherein the desired voice interaction manner comprises a desired voice interaction positional relationship between a sound receiving object and a sound emitting object; the step of obtaining the target spatial sound effect template matched with the current acoustic scene comprises the following steps:
acquiring a candidate spatial sound effect template set; the candidate spatial sound effect template set comprises a plurality of candidate spatial sound effect templates corresponding to different voice interaction position relations;
and selecting a target spatial sound effect template matched with the current acoustic scene from the candidate spatial sound effect template set, wherein the voice interaction position relation corresponding to the target spatial sound effect template is matched with the expected voice interaction position relation corresponding to the current acoustic scene.
3. The method according to claim 2, wherein the candidate spatial sound effect template set comprises candidate spatial sound effect templates with varied voice interaction position relationship; selecting a target spatial sound effect template matched with the current acoustic scene from the candidate spatial sound effect template set comprises the following steps:
and when the current voice interaction position relation corresponding to the current acoustic scene is a dynamic interaction position relation, selecting a candidate spatial sound effect template with a changed voice interaction position relation from the candidate spatial sound effect template set as a target spatial sound effect template matched with the current acoustic scene.
4. The method according to claim 3, wherein the target spatial sound effect template comprises a target sound orientation parameter sequence corresponding to the dynamic interaction position relationship;
the step of processing the original voice to obtain the target voice according to the target sound effect parameters in the target spatial sound effect template comprises the following steps:
dividing the original voice into voice segments with the parameter quantity in the target sound orientation parameter sequence;
determining a target sound orientation parameter corresponding to the voice fragment in the target sound orientation parameter sequence according to the sequence of the voice fragment in the original voice;
and processing the voice segments according to the target sound orientation parameters corresponding to the voice segments to obtain processed voice segments, wherein each processed voice segment forms the target voice according to a voice sequence.
5. The method according to claim 2, wherein the candidate spatial sound effect template set comprises candidate spatial sound effect templates with fixed voice interaction position relationship; the step of selecting and obtaining a target spatial sound effect template matched with the current acoustic scene from the candidate spatial sound effect template set comprises the following steps:
and when the current voice interaction position relation corresponding to the current acoustic scene is a fixed interaction position relation, selecting a candidate spatial sound effect template with a fixed voice interaction position relation from the candidate spatial sound effect template set as a target spatial sound effect template matched with the current acoustic scene.
6. The method of claim 5, wherein the desired voice interaction location relationship comprises a desired voice interaction distance;
when the current voice interaction position relationship corresponding to the current acoustic scene is a fixed interaction position relationship, selecting a candidate spatial sound effect template with a fixed voice interaction position relationship from the candidate spatial sound effect template set, and using the candidate spatial sound effect template as a target spatial sound effect template matched with the current acoustic scene comprises the following steps:
and when the current voice interaction position relation corresponding to the current acoustic scene is a fixed interaction position relation, selecting a candidate spatial sound effect template with a fixed sound distance corresponding to the sound effect parameter from the candidate spatial sound effect template set, wherein the candidate spatial sound effect template is matched with the expected voice interaction distance according to the sound distance, and is used as a target spatial sound effect template matched with the current acoustic scene.
7. The method of claim 1, wherein the desired voice interaction manner comprises a desired voice interaction positional relationship between a sound receiving object and a sound emitting object; the target sound effect parameters comprise position relation sound effect parameters matched with the expected voice interaction position relation;
the step of processing the original voice to obtain the target voice according to the target sound effect parameters in the target spatial sound effect template comprises the following steps:
and performing voice processing on the original voice by using the position relation sound effect parameter in the target space sound effect template to obtain a target voice, so that the target voice is matched with the expected voice interaction position relation.
8. The method of claim 7, wherein the desired voice interaction positional relationship comprises a desired voice interaction distance and a desired interaction orientation; the position relation sound effect parameters comprise orientation related sound effect parameters and distance related sound effect parameters;
the method comprises the following steps of utilizing position relation sound effect parameters in the target space sound effect template to carry out voice processing on the original voice to obtain target voice, wherein the step of obtaining the target voice comprises the following steps:
processing the direction of the original voice by using the direction-related sound effect parameters, and processing the sound pressure of the original voice by using the distance-related sound effect parameters to obtain target voice; such that the target voice's position matches the desired interaction position and the target voice's acoustic pressure matches the desired voice interaction distance.
9. The method according to claim 7, wherein the target spatial sound effect template is further matched with a target voice interaction effect, the target voice interaction effect is an expected voice interaction effect of a sound receiving object in the current acoustic scene, and the target sound effect parameters comprise a voice effect adjustment parameter matched with the target voice interaction effect;
the method comprises the following steps of utilizing position relation sound effect parameters in the target space sound effect template to carry out voice processing on the original voice to obtain target voice, wherein the step of obtaining the target voice comprises the following steps:
and processing the original voice by using the position relation sound effect parameters in the target space sound effect template and the voice effect adjustment parameters to obtain the target voice.
10. The method of claim 1, wherein the performing scene recognition on the environmental sound to obtain a current acoustic scene where the play terminal is located comprises:
acquiring a plurality of sound sub-segments corresponding to the environmental sound, and performing feature extraction on the sound sub-segments to obtain sub-segment features;
identifying and obtaining a segment acoustic scene corresponding to the sound sub-segment based on the sub-segment characteristics;
counting the segment acoustic scenes corresponding to the sound sub-segments to obtain the number of scenes corresponding to each segment acoustic scene;
and selecting the acoustic scene with the largest scene number as the current acoustic scene where the playing terminal is located.
11. A voice playback apparatus, characterized in that the apparatus comprises:
the voice acquisition module is used for acquiring original voice to be played and acquiring the environmental sound in the playing environment where the playing terminal of the original voice is currently located;
the scene recognition module is used for carrying out scene recognition on the environmental sound to obtain a current acoustic scene where the playing terminal is located;
the template acquisition module is used for acquiring a target spatial sound effect template matched with the current acoustic scene, the target spatial sound effect template is matched with a target voice interaction mode, and the target voice interaction mode is an expected voice interaction mode of a sound receiving object in the current acoustic scene;
and the voice processing module is used for processing the original voice to obtain a target voice according to the target sound effect parameters in the target spatial sound effect template so as to play the target voice in the playing terminal.
12. The apparatus of claim 11, wherein the desired voice interaction manner comprises a desired voice interaction positional relationship between a sound receiving object and a sound emitting object; the template acquisition module is also used for acquiring a candidate spatial sound effect template set; the candidate spatial sound effect template set comprises a plurality of candidate spatial sound effect templates corresponding to different voice interaction position relations; and selecting a target spatial sound effect template matched with the current acoustic scene from the candidate spatial sound effect template set, wherein the voice interaction position relation corresponding to the target spatial sound effect template is matched with the expected voice interaction position relation corresponding to the current acoustic scene.
13. The apparatus according to claim 12, wherein the candidate spatial sound effect template set comprises candidate spatial sound effect templates with varied voice interaction position relationship; and the template acquisition module is also used for selecting a candidate spatial sound effect template with a changed voice interaction position relation from the candidate spatial sound effect template set as a target spatial sound effect template matched with the current acoustic scene when the current voice interaction position relation corresponding to the current acoustic scene is a dynamic interaction position relation.
14. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any one of claims 1 to 10 when executing the computer program.
15. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 10.
CN202110818669.1A 2021-07-20 2021-07-20 Voice playing method and device, computer equipment and storage medium Pending CN115705839A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110818669.1A CN115705839A (en) 2021-07-20 2021-07-20 Voice playing method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110818669.1A CN115705839A (en) 2021-07-20 2021-07-20 Voice playing method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115705839A true CN115705839A (en) 2023-02-17

Family

ID=85178349

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110818669.1A Pending CN115705839A (en) 2021-07-20 2021-07-20 Voice playing method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115705839A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116453492A (en) * 2023-06-16 2023-07-18 成都小唱科技有限公司 Method and device for switching jukebox airport scenes, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
US20220159403A1 (en) System and method for assisting selective hearing
CN109644314B (en) Method of rendering sound program, audio playback system, and article of manufacture
US20210287651A1 (en) Encoding reverberator parameters from virtual or physical scene geometry and desired reverberation characteristics and rendering using these
CN107168518B (en) Synchronization method and device for head-mounted display and head-mounted display
Jot et al. Rendering spatial sound for interoperable experiences in the audio metaverse
US11109177B2 (en) Methods and systems for simulating acoustics of an extended reality world
CN113299312B (en) Image generation method, device, equipment and storage medium
US11611840B2 (en) Three-dimensional audio systems
US11223920B2 (en) Methods and systems for extended reality audio processing for near-field and far-field audio reproduction
US20230364513A1 (en) Audio processing method and apparatus
Geronazzo et al. Applying a single-notch metric to image-guided head-related transfer function selection for improved vertical localization
Chatterjee et al. ClearBuds: wireless binaural earbuds for learning-based speech enhancement
US20240098416A1 (en) Audio enhancements based on video detection
CN112312297B (en) Audio bandwidth reduction
Gupta et al. Augmented/mixed reality audio for hearables: Sensing, control, and rendering
CN113439447A (en) Room acoustic simulation using deep learning image analysis
CN113784274A (en) Three-dimensional audio system
CN105075294B (en) Audio signal processor
CN113316078B (en) Data processing method and device, computer equipment and storage medium
CN114049871A (en) Audio processing method and device based on virtual space and computer equipment
CN115705839A (en) Voice playing method and device, computer equipment and storage medium
Veluri et al. Semantic hearing: Programming acoustic scenes with binaural hearables
CN111696566A (en) Voice processing method, apparatus and medium
Lv et al. A TCN-based primary ambient extraction in generating ambisonics audio from Panorama Video
CN115696170A (en) Sound effect processing method, sound effect processing device, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40081516

Country of ref document: HK