CN108109630B - Audio processing method and device and media server - Google Patents


Info

Publication number: CN108109630B
Application number: CN201611037628.4A
Authority: CN (China)
Prior art keywords: conversation, audio, audio frequency, members, setting
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Other versions: CN108109630A
Other languages: Chinese (zh)
Inventor: 牛超
Current and original assignee: ZTE Corp
Application filed by ZTE Corp
Priority applications: CN201611037628.4A; PCT/CN2017/082884 (WO2018094968A1)
Publication of application: CN108109630A; application granted; publication of grant: CN108109630B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316: Speech enhancement by changing the amplitude
    • G10L21/0364: Speech enhancement by changing the amplitude for improving intelligibility
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04S: STEREOPHONIC SYSTEMS
    • H04S1/00: Two-channel systems
    • H04S3/00: Systems employing more than two channels, e.g. quadraphonic

Abstract

The invention discloses an audio processing method and apparatus. The method includes: sampling the voice data of each member of a stereo multi-person conversation to determine an audio frequency value for each member; and setting, according to the audio frequency values, an audio feature for each member's speech, where the audio feature is the volume proportion of each sound channel when that member speaks. During a stereo multi-person conversation, the voice data of each member is sampled and each member's audio feature is set according to his or her audio frequency value. Different audio features present different stereo effects, giving listeners the sense of a virtual seat, as if each member speaks from a distinct direction. The speaking member is therefore easier to identify, and the user experience improves.

Description

Audio processing method and device and media server
Technical Field
The present invention relates to the field of communications, and in particular, to an audio processing method and apparatus, and a media server.
Background
Stereo multi-person conversation technology is mature, and teleconferencing is one of its common application scenarios. In existing teleconferences in the CS (Circuit Switched) and PS (Packet Switched) domains, each user can only hear the speech itself; because a teleconference has many users, a listener cannot tell which user is speaking, and the user experience is poor.
Disclosure of Invention
The present invention provides an audio processing method, an audio processing apparatus, and a media server to solve the following problem in the prior art: in a stereo multi-person conversation with many users, a listener cannot tell which user is speaking, so the user experience is poor.
To solve the above technical problem, in one aspect, the present invention provides an audio processing method, including: sampling the voice data of each member of a stereo multi-person conversation to determine an audio frequency value for each member; and setting, according to the audio frequency values, an audio feature for each member's speech, where the audio feature is the volume proportion of each sound channel when that member speaks.
Optionally, setting the audio feature of each member according to the audio frequency value includes: judging whether any audio frequency values fall within the same preset audio frequency range; if so, setting different audio features for the members within that range; if not, setting the same or different audio features for the members.
Optionally, setting the audio feature of each member according to the audio frequency value includes: sorting the audio frequency values of all members; and setting different audio features for members whose audio frequency values are adjacent in the sorted order.
Optionally, the audio feature is set for a member as follows: acquiring the volume information of each of the member's sound channels; selecting a channel volume proportion from multiple pre-computed channel volume proportions; and adjusting the volume of each of the member's sound channels according to the selected proportion.
Optionally, after the audio features have been set according to the audio frequency values, the method further includes: acquiring, when any member speaks, the audio feature of the speaking member; and adjusting, according to that audio feature, the audio fed to the audio players of all members other than the speaker.
In another aspect, the present invention further provides an audio processing apparatus, including: a sampling module, configured to sample the voice data of each member of a stereo multi-person conversation to determine an audio frequency value for each member; and a setting module, configured to set, according to the audio frequency values, an audio feature for each member's speech, where the audio feature is the volume proportion of each sound channel when that member speaks.
Optionally, the setting module includes: a judging unit, configured to judge whether any audio frequency values fall within the same preset audio frequency range; and a first setting unit, configured to set different audio features for the members within that range when such values exist, and to set the same or different audio features for the members when they do not.
Optionally, the setting module includes: a sorting unit, configured to sort the audio frequency values of all members; and a second setting unit, configured to set different audio features for members whose audio frequency values are adjacent in the sorted order.
Optionally, the setting module is specifically configured to set the audio feature for a member as follows: acquiring the volume information of each of the member's sound channels; selecting a channel volume proportion from multiple pre-computed channel volume proportions; and adjusting the volume of each of the member's sound channels according to the selected proportion.
Optionally, the apparatus further includes: an acquiring module, configured to acquire, when any member speaks, the audio feature of the speaking member; and an input module, configured to adjust, according to that audio feature, the audio fed to the players of all members other than the speaker.
In another aspect, the present invention further provides a media server, including: a collector, configured to sample the voice data of each member of a stereo multi-person conversation; and a processor, configured to determine an audio frequency value for each member from the collected data and to set, according to the audio frequency values, an audio feature for each member's speech, where the audio feature is the volume proportion of each sound channel when that member speaks.
Optionally, the processor is specifically configured to judge whether any audio frequency values fall within the same preset audio frequency range; to set different audio features for the members within that range when such values exist; and to set the same or different audio features for the members when they do not. Alternatively, the processor is further configured to sort the audio frequency values of the members and to set different audio features for members whose audio frequency values are adjacent in the sorted order.
During a stereo multi-person conversation, the voice data of each member is sampled and each member's audio feature is set according to his or her audio frequency value. Different audio features present different stereo effects, giving listeners the sense of a virtual seat, as if each member speaks from a distinct direction, so the speaking member is easier to identify and the user experience improves. This solves the prior-art problem that, in a stereo multi-person conversation with many users, a listener cannot tell which user is speaking.
Drawings
FIG. 1 is a flow chart of a method of processing audio in a first embodiment of the invention;
FIG. 2 is a schematic diagram of a processing apparatus for audio according to a second embodiment of the present invention;
FIG. 3 is a schematic diagram of a preferred structure of an audio processing device according to a second embodiment of the present invention;
FIG. 4 is a topological diagram of the relationship between a media server and conference members according to a fourth embodiment of the present invention;
FIG. 5 is a schematic diagram of a virtual space location setting according to a fourth embodiment of the present invention;
FIG. 6 is a schematic diagram of another virtual space position setting in the fourth embodiment of the present invention.
Detailed Description
To solve the prior-art problem that, in a stereo multi-person conversation with many users, a listener cannot tell which user is speaking and the user experience is poor, the present invention provides an audio processing method, an audio processing apparatus, and a media server. The invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the invention and do not limit it.
The first embodiment of the present invention provides an audio processing method, the flow of which is shown in fig. 1, including steps S102 to S104:
S102: the voice data of each member of the stereo multi-person conversation is sampled to determine an audio frequency value for each member. Here, stereo denotes a sound source with at least two channels, as opposed to mono. A mono source has a single channel, so the listener hears the sound without any sense of direction; a stereo source has multiple channels, so the listener perceives a sense of space, which is why a multi-channel source is called stereo. For example, the sound played in a movie theater is stereo. The subsequent audio adjustment is possible only because the source is stereo.
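The description does not specify how the audio frequency value is computed from the sampled voice data. As an illustrative sketch only (the function name and the zero-crossing approach are assumptions, not part of the patent), a member's dominant voice frequency could be estimated from the zero-crossing rate of a sampled segment:

```python
import math

def estimate_frequency(samples, sample_rate):
    """Estimate the dominant frequency of a signal from its
    zero-crossing rate: each full cycle crosses zero twice."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    duration = len(samples) / sample_rate
    return crossings / (2 * duration)

# Synthetic 440 Hz tone sampled at 8 kHz for one second.
tone = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000)]
print(estimate_frequency(tone, 8000))  # approximately 440 Hz
```

A real implementation would use voiceprint analysis, as the fourth embodiment notes; the point here is only that each member's voice reduces to a single comparable frequency value.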
S104: according to the audio frequency values, an audio feature is set for each member's speech, where the audio feature is the volume proportion of each sound channel when that member speaks.
In this embodiment, during a stereo multi-person conversation, the voice data of each member is sampled and each member's audio feature is set according to his or her audio frequency value. Different audio features present different stereo effects and give listeners the sense of a virtual seat, as if each member speaks from a distinct direction, so the speaking member is easier to identify and the user experience improves. This solves the prior-art problem that, in a stereo multi-person conversation with many users, a listener cannot tell which user is speaking.
In practice, the audio feature of each member can be set according to the audio frequency values in several ways; two cases are described below.
In the first case: judge whether any audio frequency values fall within the same preset audio frequency range. If so, those users' voice characteristics are similar and listeners may easily confuse them, so a different audio feature is set for each member within that range.
For example, when the audio frequency values of two members both fall within the same preset range, their audio features must not be adjusted to be the same; otherwise, other members listening to them cannot tell which of the two is speaking. Different audio features must therefore be set for each member within the same preset range, and the more distinct those features are, the stronger the impression that, say, one speaker is on the left and the other on the right.
If no audio frequency values fall within the same preset range, the members are unlikely to be confused, so the same audio feature may be set for every member and listeners can still distinguish users by their audio frequency alone. Of course, different audio features may also be set for different members to separate the speakers even more clearly.
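The first case can be sketched as follows; the preset ranges, member names, and helper function are hypothetical illustrations rather than the patented procedure:

```python
def group_by_range(freqs, ranges):
    """Map each preset (low, high) frequency range to the member ids
    whose audio frequency value falls inside it."""
    groups = {}
    for member, f in freqs.items():
        for lo, hi in ranges:
            if lo <= f < hi:
                groups.setdefault((lo, hi), []).append(member)
                break
    return groups

ranges = [(60, 150), (150, 300)]          # assumed preset ranges, Hz
freqs = {"A": 110, "B": 120, "C": 210}    # sampled frequency values
print(group_by_range(freqs, ranges))
# A and B share a range, so they must receive different audio features
```

Any range that ends up holding more than one member triggers the "set different audio features" branch described above.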
In the second case: when setting the audio features according to the audio frequency values, the audio frequency values of the members may be sorted; different audio features are then set for members whose values are adjacent in the sorted order, so that members with similar audio frequencies can still be distinguished. In a concrete setup, three or four adjacent members may each be given a different audio feature to separate the speakers even better, so that listeners perceive different speakers in different directions and the user experience improves.
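The second case, sorting members by frequency and cycling through a small set of distinct audio features, can be sketched like this (the position labels and member frequencies are assumed values, not from the patent):

```python
def assign_positions(freqs, positions):
    """Sort members by audio frequency value and walk through the
    available positions cyclically, so that frequency-adjacent
    members never share the same audio feature."""
    ordered = sorted(freqs, key=freqs.get)
    return {m: positions[i % len(positions)] for i, m in enumerate(ordered)}

positions = ["left", "center", "right"]   # assumed distinct audio features
freqs = {"A": 180, "B": 95, "C": 170, "D": 240}
print(assign_positions(freqs, positions))
```

With three or more positions in the cycle, any run of adjacent members in the sorted order gets pairwise-distinct features, matching the goal stated above.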
During setup, the audio feature is set for a member as follows: acquire the volume information of each of the member's sound channels; select a channel volume proportion from multiple pre-computed channel volume proportions; and adjust the volume of each of the member's sound channels according to the selected proportion.
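A minimal sketch of the adjustment step, assuming the volume information and the selected proportion are parallel per-channel lists (the function name and sample values are illustrative):

```python
def apply_proportion(channel_volumes, proportion):
    """Scale each channel's volume by the chosen per-channel proportion.

    channel_volumes and proportion are parallel lists, one entry per
    sound channel (e.g. [left, right])."""
    return [v * p for v, p in zip(channel_volumes, proportion)]

# One of several pre-computed channel volume proportions (assumed values):
print(apply_proportion([1.0, 1.0], [0.8, 0.2]))  # [0.8, 0.2]
```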
After the audio features have been set according to the audio frequency values, whenever any member speaks, the audio feature of the speaking member can be acquired, and the audio fed to the audio players of all members other than the speaker is adjusted according to that feature.
A second embodiment of the present invention provides an audio processing apparatus, which is schematically shown in fig. 2 and includes: a sampling module 10, configured to sample voice data of each conversation member of a stereo multi-person conversation to determine an audio frequency value of each conversation member; and the setting module 20 is coupled to the sampling module 10 and configured to set, according to the audio frequency value, an audio characteristic of each conversation member when speaking, where the audio characteristic is a volume proportion of each channel when the conversation member speaks.
The setting module 20 may include: the judging unit is used for judging whether audio frequency values in the same preset audio frequency range exist or not; the first setting unit is used for setting different audio characteristics for each conversation member in the same preset audio frequency range under the condition that the audio frequency values in the same preset audio frequency range exist; and under the condition that the audio frequency values in the same preset audio frequency range do not exist, setting the same or different audio characteristics for each conversation member.
The setting module 20 may further include: the sorting unit, configured to sort the audio frequency values of all members; and the second setting unit, configured to set different audio features for members whose audio frequency values are adjacent in the sorted order.
The setting module is specifically configured to set audio features for the session members as follows: acquiring volume information of each sound channel of the conversation members; acquiring a channel volume proportion from the calculated multiple channel volume proportions; and adjusting the volume information of each sound channel of the conversation members according to the sound channel volume proportion.
Fig. 3 shows a preferred structure of the above processing apparatus for audio in a teleconference, which further includes: an obtaining module 30, coupled to the setting module 20, configured to obtain, when any conversation member speaks, an audio feature of the speaking conversation member; and an input module 40, coupled to the obtaining module 30, for adjusting and inputting the audio of the players of the conversation members except the conversation member who speaks according to the audio characteristics.
A third embodiment of the present invention provides a media server, including: the collector is used for sampling voice data of each conversation member of the stereo multi-person conversation; and the processor determines the collected audio frequency value of each conversation member and sets the audio characteristics of each conversation member when speaking according to the audio frequency value, wherein the audio characteristics are the volume proportion of each sound channel when the conversation member speaks.
The processor is specifically used for judging whether audio frequency values in the same preset audio frequency range exist or not; under the condition that audio frequency values in the same preset audio frequency range exist, different audio features are set for each conversation member in the same preset audio frequency range; under the condition that the audio frequency values in the same preset audio frequency range do not exist, the same or different audio characteristics are set for each conversation member; or, the method is also used for sequencing the audio frequency values of the conversation members and setting different audio characteristics for the conversation members adjacent to the audio frequency values in the sequencing.
A fourth embodiment of the present invention provides a method for processing audio in a teleconference, which mainly improves the mixing function of the media server responsible for mixing audio in CS-domain and PS-domain teleconferences. The embodiment builds a model of virtual spatial positions for the conference members (these are set within virtual positions; one virtual position may contain one or more virtual spatial positions), i.e. each participant is given the feeling of a virtual seat, so that the members' voices in the teleconference are distinguished by a sense of direction combined with the characteristics of each member's voice. The method includes steps (1) to (3).
(1) The media server samples the sound data of each conference member.
As shown in fig. 4, after the teleconference is established, the 7 participating terminals and the media server on the core-network side form a star topology for the voice data streams: each terminal sends its uplink voice data to the media server over RTP, and the media server sends the conference's downlink data to each terminal. The media server can sample a terminal's uplink voice data during the period just after that member joins the conference (for example, the first 10 seconds), because the parties typically exchange simple small talk such as "Hello? Hi!" at that point. Members are sampled in the order in which they join the teleconference.
(2) The media server sorts the members' voices from low frequency to high frequency.
After obtaining the sampled data, the media server invokes its voiceprint-recognition software or hardware, analyzes the uplink voice data of each member, computes each member's frequency value, and produces a low-to-high ordering.
(3) The media server mixes the audio according to the sorted result, using the directional mixing function.
After the sorted result is obtained, the media server completes the mixing process according to a preset mixing rule (mixing here is the process of applying the audio-feature adjustment).
The mixing rule may consider two dimensions: member position (which virtual position each member should occupy) and member voice characteristics. Specifically, the positions can be divided into six: far left, near left, far middle, near middle, far right, and near right. The members' voice characteristics can be divided into a high-frequency group and a low-frequency group, or simply grouped by gender (whose voices tend to be higher or lower in frequency).
A PS-domain voice conference call supports at most 6 remote parties, i.e. up to 6 terminals participate besides the local one. Through voiceprint analysis, 3 members are assigned to the low-frequency group and 3 to the high-frequency group; the low-frequency group is placed on the near side of the virtual space (close to the local terminal) and the high-frequency group on the far side.
Within each frequency group, the members are assigned to the left, middle, and right positions in order of increasing frequency, giving exactly 6 virtual spatial positions, as shown in fig. 5. When mixing, the sound intensity is split at 1/3 each across the left, center, and right positions, while the far and near sides may be split 0.5:0.5, 0.45:0.55, 0.4:0.6, and so on. This maximally separates the members' voices from one another, increases the signal-to-noise ratio, and makes each voice easier to recognize.
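The 1/3 column split combined with a near/far split (here the assumed 0.45:0.55 variant) can be expressed as a per-member mix weight; the function name and default share are illustrative:

```python
def mix_weight(depth, near_share=0.55):
    """Per-member weight in the mix for the 3-column by 2-row seating
    grid: each of the left/center/right columns receives 1/3 of the
    intensity, split between its near and far seats by near_share."""
    depth_share = near_share if depth == "near" else 1.0 - near_share
    return (1.0 / 3.0) * depth_share

print(round(mix_weight("near"), 4))  # 0.1833
print(round(mix_weight("far"), 4))   # 0.15
```

The six weights sum to 1, so the overall loudness of the mixed conference signal stays constant regardless of which seat a speaker occupies.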
For a 3-way teleconference, it suffices to mix at 1/3 each for the left, center, and right positions. If there are more than 6 conference members, the positions can be subdivided: the angle between the directions of two adjacent virtual spatial positions is 180/((N-1)/2-1) degrees, where N is the total number of participants. In this embodiment N is 9, so the angle is 60 degrees; the layout is shown in fig. 6.
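The angle formula can be checked directly; `adjacent_angle` is an illustrative name for the patent's expression 180/((N-1)/2-1):

```python
def adjacent_angle(total_participants):
    """Angle in degrees between the directions of two adjacent virtual
    spatial positions, per the formula 180 / ((N-1)/2 - 1)."""
    n = total_participants
    return 180 / ((n - 1) / 2 - 1)

print(adjacent_angle(9))  # 60.0, matching the embodiment's example
```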
It should be noted that the voice data of a member assigned to the leftmost virtual spatial position is mixed only into the left channel, and that of a member assigned to the rightmost virtual spatial position only into the right channel. The voice data of the other members, at neither the leftmost nor the rightmost position, is mixed into both channels according to the angle value, computed as follows:
First confirm whether the virtual position is offset to the left or to the right. If it is offset to the left, determine the angle A (0 to 90 degrees) between that direction and the left horizontal; then the right-channel weight is Tan(A/2)/2, and the left-channel weight is 1 minus the right-channel weight. If it is offset to the right, determine the angle B (0 to 90 degrees) between that direction and the right horizontal; then the left-channel weight is Tan(B/2)/2, and the right-channel weight is 1 minus the left-channel weight.
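The left/right weighting rule can be sketched as follows, with the tangent taken on half the offset angle as the text specifies (the function name and argument convention are assumptions):

```python
import math

def channel_weights(angle_deg, side):
    """Return (left, right) channel weights for a member whose virtual
    position is offset angle_deg (0-90) from the horizontal on the
    given side, using the rule: minor channel = tan(angle/2) / 2."""
    minor = math.tan(math.radians(angle_deg) / 2) / 2
    major = 1 - minor
    return (major, minor) if side == "left" else (minor, major)

print(channel_weights(0, "left"))  # (1.0, 0.0): fully in the left channel
left, right = channel_weights(90, "left")
print(round(left, 3), round(right, 3))  # 0.5 0.5: straight ahead
```

At the extremes the rule behaves sensibly: an offset of 0 degrees keeps the voice entirely in one channel, and an offset of 90 degrees (straight ahead) splits it evenly.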
In this embodiment, after the sound data of each conference member is acquired, a virtual role can be assigned to each member, i.e. a corresponding virtual position is set for each member, as if user A were seated in chair 1, user B in chair 2, user C in chair 3, and so on. This places users with similar sound data in different directions. Once the desired virtual position of each member is set, each member's sound is tuned, i.e. an audio feature is set for each member, so that each member indeed appears to sit in the corresponding virtual seat.
Overall, this embodiment makes each member of the teleconference distinguishable and provides a better user experience.
Although the preferred embodiments of the present invention have been disclosed for illustrative purposes, those skilled in the art will appreciate that various modifications, additions and substitutions are possible, and the scope of the invention should not be limited to the embodiments described above.

Claims (10)

1. A method for processing audio, comprising:
sampling voice data of each conversation member of a stereo multi-person conversation to determine an audio frequency value of each conversation member;
setting an audio characteristic of each conversation member when speaking according to the audio frequency value, wherein the setting comprises:
judging whether audio frequency values in the same preset audio frequency range exist or not;
if the conversation members exist, different audio features are set for each conversation member in the same preset audio frequency range;
if not, setting the same or different audio characteristics for each conversation member; wherein, the audio features are the volume proportion of each sound channel when the conversation members speak.
2. The processing method according to claim 1, wherein setting the audio characteristics of the individual conversation members when speaking based on the audio frequency value comprises:
sequencing the audio frequency values of all the conversation members;
and setting different audio characteristics for the conversation members adjacent to the audio frequency value in the sorting.
3. The processing method according to claim 1, wherein audio features are set for the conversation members as follows:
acquiring volume information of each sound channel of the conversation members;
acquiring a channel volume proportion from the calculated multiple channel volume proportions;
and adjusting the volume information of each sound channel of the conversation members according to the sound channel volume proportion.
4. The method of any one of claims 1 to 3, further comprising, after setting the audio characteristics for the conversation members' speech according to the audio frequency values:
when any conversation member speaks, acquiring the audio characteristic of the speaking conversation member; and
adjusting, according to that audio characteristic, the audio fed to the audio players of the conversation members other than the speaking member.
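The playback step in claim 4 can be sketched as a per-listener mix: on a speech event, look up the speaker's stereo proportion and apply it to the frame delivered to every other member. All names and the mono-frame representation are hypothetical.

```python
# Hypothetical sketch of claim 4: when a member speaks, feed the
# proportionally adjusted stereo audio to every other member's player.

def mix_for_listeners(speaker, members, characteristics, frame):
    """speaker: id of the talking member; members: all member ids;
    characteristics: {member_id: (left_p, right_p)};
    frame: mono samples from the speaker.
    Returns {listener_id: (left_samples, right_samples)}."""
    left_p, right_p = characteristics[speaker]
    out = {}
    for listener in members:
        if listener == speaker:
            continue  # the speaker's own player receives no adjusted copy
        out[listener] = ([s * left_p for s in frame],
                         [s * right_p for s in frame])
    return out

chars = {"alice": (1.0, 0.0), "bob": (0.0, 1.0)}
mixed = mix_for_listeners("alice", ["alice", "bob"], chars, [0.5, 0.5])
# bob hears alice fully on the left channel.
```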
5. An apparatus for processing audio, comprising:
a sampling module configured to sample voice data of each conversation member of a stereo multi-person conversation to determine an audio frequency value of each conversation member; and
a setting module configured to set an audio characteristic for each conversation member's speech according to the audio frequency values, wherein the audio characteristic is the volume proportion of each sound channel when the conversation member speaks; the setting module comprising:
a judging unit configured to determine whether any audio frequency values fall within the same preset audio frequency range; and
a first setting unit configured to set a different audio characteristic for each conversation member within the same preset audio frequency range when such audio frequency values exist, and to set the same or different audio characteristics for the conversation members when they do not.
6. The apparatus of claim 5, wherein the setting module comprises:
a sorting unit configured to sort the audio frequency values of all conversation members; and
a second setting unit configured to set different audio characteristics for conversation members whose audio frequency values are adjacent in the sorted order.
7. The apparatus of claim 5, wherein the setting module is specifically configured to set the audio characteristic for a conversation member as follows:
acquiring volume information of each sound channel of the conversation member; selecting a channel volume proportion from a plurality of calculated channel volume proportions; and adjusting the volume information of each sound channel of the conversation member according to the selected channel volume proportion.
8. The apparatus of any one of claims 5 to 7, further comprising:
an acquisition module configured to acquire, when any conversation member speaks, the audio characteristic of the speaking conversation member; and
an input module configured to adjust, according to that audio characteristic, the audio fed to the audio players of the conversation members other than the speaking member.
9. A media server, comprising:
a collector configured to sample voice data of each conversation member of a stereo multi-person conversation; and
a processor configured to determine an audio frequency value of each conversation member from the sampled data and to set an audio characteristic for each conversation member's speech according to the audio frequency values, wherein the audio characteristic is the volume proportion of each sound channel when the conversation member speaks; the processor being specifically configured to: determine whether any audio frequency values fall within the same preset audio frequency range; if so, set a different audio characteristic for each conversation member within that same preset audio frequency range; and if not, set the same or different audio characteristics for the conversation members.
10. The media server of claim 9,
wherein the processor is further configured to sort the audio frequency values of the conversation members and to set different audio characteristics for conversation members whose audio frequency values are adjacent in the sorted order.
CN201611037628.4A 2016-11-23 2016-11-23 Audio processing method and device and media server Active CN108109630B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201611037628.4A CN108109630B (en) 2016-11-23 2016-11-23 Audio processing method and device and media server
PCT/CN2017/082884 WO2018094968A1 (en) 2016-11-23 2017-05-03 Audio processing method and apparatus, and media server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611037628.4A CN108109630B (en) 2016-11-23 2016-11-23 Audio processing method and device and media server

Publications (2)

Publication Number Publication Date
CN108109630A CN108109630A (en) 2018-06-01
CN108109630B true CN108109630B (en) 2022-01-25

Family

ID=62194750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611037628.4A Active CN108109630B (en) 2016-11-23 2016-11-23 Audio processing method and device and media server

Country Status (2)

Country Link
CN (1) CN108109630B (en)
WO (1) WO2018094968A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633091B (en) * 2020-12-09 2021-11-16 北京博瑞彤芸科技股份有限公司 Method and system for verifying real meeting
CN115361474A (en) * 2022-08-18 2022-11-18 上海复旦通讯股份有限公司 Method for auxiliary recognition of sound source in telephone conference

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101118746A (en) * 2007-09-17 2008-02-06 腾讯科技(深圳)有限公司 Double track based audio data calibration method and multi-people voice talking system thereof
CN102651178A (en) * 2011-02-28 2012-08-29 中兴通讯股份有限公司 Prenatal education auxiliary device and application method thereof and user terminal
CN102969003A (en) * 2012-11-15 2013-03-13 东莞宇龙通信科技有限公司 Image pickup sound extracting method and device
CN105141730A (en) * 2015-08-27 2015-12-09 腾讯科技(深圳)有限公司 Volume control method and device
CN105741833A (en) * 2016-03-14 2016-07-06 腾讯科技(深圳)有限公司 Voice communication data processing method and device
CN105791606A (en) * 2016-04-15 2016-07-20 深圳美亚美科技有限公司 Multi-people session control system

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4712231A (en) * 1984-04-06 1987-12-08 Shure Brothers, Inc. Teleconference system
US6125115A (en) * 1998-02-12 2000-09-26 Qsound Labs, Inc. Teleconferencing method and apparatus with three-dimensional sound positioning
FI109158B (en) * 2000-06-26 2002-05-31 Nokia Corp Portable device and method for providing a user with information about the functions of the portable device
JP2013097076A (en) * 2011-10-29 2013-05-20 Shimon Gomi Foreign language voice range self-discovery system
US9706302B2 (en) * 2014-02-05 2017-07-11 Sennheiser Communications A/S Loudspeaker system comprising equalization dependent on volume control
CN104410379B (en) * 2014-10-29 2019-05-14 深圳市金立通信设备有限公司 A kind of volume adjusting method
CN104363510B (en) * 2014-10-29 2019-04-30 深圳市金立通信设备有限公司 A kind of playback terminal
CN104811318A (en) * 2015-04-15 2015-07-29 南京农业大学 Method for controlling voice communication through voice
CN104898091B (en) * 2015-05-29 2017-07-25 复旦大学 Microphone array self calibration sonic location system based on iteration optimization algorithms

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101118746A (en) * 2007-09-17 2008-02-06 腾讯科技(深圳)有限公司 Double track based audio data calibration method and multi-people voice talking system thereof
CN102651178A (en) * 2011-02-28 2012-08-29 中兴通讯股份有限公司 Prenatal education auxiliary device and application method thereof and user terminal
CN102969003A (en) * 2012-11-15 2013-03-13 东莞宇龙通信科技有限公司 Image pickup sound extracting method and device
CN105141730A (en) * 2015-08-27 2015-12-09 腾讯科技(深圳)有限公司 Volume control method and device
CN105741833A (en) * 2016-03-14 2016-07-06 腾讯科技(深圳)有限公司 Voice communication data processing method and device
CN105791606A (en) * 2016-04-15 2016-07-20 深圳美亚美科技有限公司 Multi-people session control system

Also Published As

Publication number Publication date
WO2018094968A1 (en) 2018-05-31
CN108109630A (en) 2018-06-01

Similar Documents

Publication Publication Date Title
EP2158752B1 (en) Methods and arrangements for group sound telecommunication
JP4255461B2 (en) Stereo microphone processing for conference calls
US8606249B1 (en) Methods and systems for enhancing audio quality during teleconferencing
US6850496B1 (en) Virtual conference room for voice conferencing
US20080159507A1 (en) Distributed teleconference multichannel architecture, system, method, and computer program product
US7848738B2 (en) Teleconferencing system with multiple channels at each location
US7433716B2 (en) Communication apparatus
WO2000048379A1 (en) Method and system for providing spatialized audio in conference calls
US6813360B2 (en) Audio conferencing with three-dimensional audio encoding
US20080273476A1 (en) Device Method and System For Teleconferencing
EP1973320A1 (en) Conference system with adaptive mixing based on collocation of participants
TW200828867A (en) Communication system
CN108109630B (en) Audio processing method and device and media server
CN102457700B (en) Audio data transmission method and system
US20080232569A1 (en) Teleconferencing System with Multi-channel Imaging
US7068792B1 (en) Enhanced spatial mixing to enable three-dimensional audio deployment
US8526589B2 (en) Multi-channel telephony
Rothbucher et al. Backwards compatible 3d audio conference server using hrtf synthesis and sip
Hyrkas et al. Spatialized Audio and Hybrid Video Conferencing: Where Should Voices be Positioned for People in the Room and Remote Headset Users?
US7697675B2 (en) Multiparty call of portable devices with party positioning identification
JP2017519379A (en) Object-based teleconferencing protocol
JP2004274147A (en) Sound field fixed multi-point talking system
CN114978574A (en) Communication system and corresponding method
Reynolds et al. SPATIALIZED AUDIO CONFERENCES-IMS Integration and Traffic Modelling
WO2022152403A1 (en) Method and system for handling a teleconference

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant