CN113380260B - Audio processing method and device - Google Patents


Publication number
CN113380260B
CN113380260B (application number CN202010115396.XA; publication of application CN113380260A)
Authority
CN
China
Prior art keywords
audio
conference
target
watermark
target online
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010115396.XA
Other languages
Chinese (zh)
Other versions
CN113380260A (en)
Inventor
高毅
罗程
李斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010115396.XA priority Critical patent/CN113380260B/en
Publication of CN113380260A publication Critical patent/CN113380260A/en
Application granted granted Critical
Publication of CN113380260B publication Critical patent/CN113380260B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/018 Audio watermarking, i.e. embedding inaudible data in the audio signal
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00 Data switching networks
    • H04L12/02 Details
    • H04L12/16 Arrangements for providing special services to substations
    • H04L12/18 Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
    • H04L12/1813 Arrangements for providing special services to substations for broadcast or conference, e.g. multicast for computer conferences, e.g. chat rooms
    • H04L12/1831 Tracking arrangements for later retrieval, e.g. recording contents, participants activities or behavior, network status
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/40 Support for services or applications
    • H04L65/403 Arrangements for multi-party communication, e.g. for conferences
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L2209/00 Additional information or applications relating to cryptographic mechanisms or cryptographic arrangements for secret or secure communication H04L9/00
    • H04L2209/60 Digital content management, e.g. content distribution
    • H04L2209/608 Watermarking

Abstract

The application relates to an audio processing method and device. The method comprises the following steps: participating in a target online conference through the currently logged-in user identification; acquiring conference audio generated in the target online conference; acquiring an audio watermark corresponding to the user identification, the audio watermark being generated according to the identity data of the conference member identified by the user identification; and adding the audio watermark to the conference audio to obtain target audio of the target online conference. The audio watermark added to the target audio is used to locate the conference member who generated the target audio if the target audio is leaked. The scheme provided by the application can improve the security of conference content.

Description

Audio processing method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to an audio processing method and apparatus.
Background
With the development of computer technology and audio processing technology, more and more scenarios involve the processing of audio signals. For example, in a voice interaction scenario, a user may record the pictures displayed on a screen, or record the audio, so that the recorded video or audio can be played back when desired.
However, audio or video recorded in such scenarios often includes sensitive information or privacy-related content. When a recording is transmitted between users, that sensitive or private content may be leaked, which reduces the security of the recorded content.
Disclosure of Invention
Based on this, it is necessary to provide an audio processing method and apparatus that address the technical problem of low security of recorded content.
An audio processing method, comprising:
participating in the target online conference through the currently logged-in user identification;
acquiring conference audio generated in the target online conference;
acquiring an audio watermark corresponding to the user identification, the audio watermark being generated according to the identity data of the conference member identified by the user identification;
adding the audio watermark to the conference audio to obtain target audio of the target online conference, the audio watermark added to the target audio being used to locate the conference member who generated the target audio if the target audio is leaked.
An audio processing apparatus, comprising:
the participation module is used for participating in the target online conference through the currently logged-in user identification;
the acquisition module is used for acquiring conference audio generated in the target online conference, and for acquiring an audio watermark corresponding to the user identification, the audio watermark being generated according to the identity data of the conference member identified by the user identification;
the adding module is used for adding the audio watermark to the conference audio to obtain target audio of the target online conference, the audio watermark added to the target audio being used to locate the conference member who generated the target audio if the target audio is leaked.
A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the above-described audio processing method.
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the above-described audio processing method.
According to the audio processing method and device, the computer-readable storage medium, and the computer equipment, after the currently logged-in user identification participates in the target online conference, the conference audio generated in the target online conference and the audio watermark corresponding to the user identification can be obtained, and the audio watermark is then added to the conference audio to obtain target audio in which the conference content of the target online conference is recorded. Because the audio watermark is generated according to the identity data of the conference member, objectively, the user who leaked the audio can be traced back by detecting the watermark in the audio, which improves the security of the conference content; subjectively, a user who knows that leaked audio can be traced back through its watermark will give up the idea of freely distributing recordings of the conference content, which reduces the probability of leakage and likewise improves the security of the conference content.
An audio processing method, comprising:
acquiring target audio corresponding to a target online conference; the target audio is audio obtained by adding an audio watermark to the conference audio of the target online conference, the audio watermark being generated according to the identity data of a conference member;
performing a separation operation on the target audio to obtain the audio watermark;
analyzing the audio watermark to obtain the identity data;
and determining the conference member corresponding to the identity data among the conference members participating in the target online conference.
An audio processing apparatus, comprising:
the acquisition module is used for acquiring target audio corresponding to the target online conference; the target audio is audio obtained by adding an audio watermark to the conference audio of the target online conference, the audio watermark being generated according to the identity data of a conference member;
the separation module is used for performing a separation operation on the target audio to obtain the audio watermark;
the determining module is used for analyzing the audio watermark to obtain the identity data, and for determining the conference member corresponding to the identity data among the conference members participating in the target online conference.
A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the above-described audio processing method.
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the above-described audio processing method.
According to the audio processing method, the audio processing device, the computer-readable storage medium, and the computer equipment, after the target audio corresponding to the target online conference is acquired, a separation operation can be performed on the target audio to obtain the audio watermark, and the identity data from which the audio watermark was generated can then be recovered by analysis, so that the conference member corresponding to the identity data is determined among the conference members who participated in the target online conference. This is possible because the target audio was obtained by adding the audio watermark to the conference audio of the target online conference, and the audio watermark was generated according to the identity data of a conference member. Objectively, the user who leaked the audio can be traced back, which improves the security of the conference content; subjectively, a user who knows that leaked audio can be traced back through its watermark will give up the idea of freely distributing recordings of the conference content, which reduces the probability of leakage and likewise improves the security of the conference content.
Drawings
FIG. 1 is a diagram of an application environment for an audio processing method in one embodiment;
FIG. 2 is a flow chart of an audio processing method according to an embodiment;
FIG. 3 is a data flow diagram of audio processing in one embodiment;
FIG. 4 is a data flow diagram of audio processing in another embodiment;
FIG. 5 is a flow chart illustrating steps of audio processing in one embodiment;
FIG. 6 is a schematic diagram of watermark information for an audio watermark in one embodiment;
FIG. 7 is a flow chart of an audio processing method according to another embodiment;
FIG. 8 is a flow chart of adding and detecting watermarks in one embodiment;
FIG. 9 is a block diagram of an audio processing device in one embodiment;
FIG. 10 is a block diagram of an audio processing device according to another embodiment;
FIG. 11 is a block diagram of an audio processing device according to another embodiment;
FIG. 12 is a block diagram of a computer device in one embodiment;
fig. 13 is a block diagram of a computer device in another embodiment.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present application more apparent, the present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
FIG. 1 is a diagram of an application environment for an audio processing method in one embodiment. Referring to fig. 1, the application environment of the audio processing method includes a terminal 110 and a server 120. The terminal 110 and the server 120 are connected through a network. The number of terminals 110 is more than one. The terminal 110 may be a desktop terminal or a mobile terminal, and the mobile terminal may be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers. It should be noted that, in the embodiment of the present application, the terminal 110 is configured to perform an audio processing method. The terminal 110 may have an application running thereon, and the terminal 110 may also perform an audio processing method through the application.
In one embodiment, as shown in fig. 2, an audio processing method is provided. The present embodiment is mainly exemplified by the application of the method to the terminal 110 (or the application running on the terminal 110) in fig. 1 described above. Referring to fig. 2, the audio processing method specifically includes the steps of:
s202, participating in the target online conference through the currently logged-in user identification.
Wherein the user identification is used to uniquely identify a user. An application running on the terminal may log in through the user identification, thereby operating with the identity of the user identified by the user identification. The user identifier may specifically be a user account registered in the application program. User accounts vary from user to user.
The target online conference is the online conference that is currently being participated in. An online conference is a virtual place where more than one user interacts with information over the Internet. The online conference may specifically be a conference that performs voice interaction over the Internet, such as an online audio conference or an online video conference. The online conference may be a two-person conference or a multi-person conference; here, "multi-person" in a multi-person conference means two or more persons.
Specifically, the terminal or an application running on the terminal may participate in the target online conference through the currently logged-in user identification. The conference member identified by the currently logged-in user identification may be the initiator of the target online conference or a participant invited to join it. The application here may be a dedicated conferencing application. In addition, a meeting held in a social session of a social application program can also be the target online conference in the embodiments of the application.
In one embodiment, participating in a target online meeting with a currently logged-in user identification includes: establishing a connection with a server through signaling based on the currently logged-in user identification so as to participate in the target online conference; and receiving, from the server through signaling, the identity data of the conference member identified by the user identification. The identity data of the different conference members participating in the target online conference differ from one another.
Wherein the signaling is a control instruction in the communication system. The signaling may direct the establishment of a temporary communication channel between designated terminals and maintain proper operation of the network itself.
Specifically, when a conference is initiated, each conference member logs in to a terminal or an application running on the terminal through a user identification, and the terminal or application exchanges information with a server through signaling and establishes a connection based on the currently logged-in user identification. That is, there is a signaling path between each terminal and the server for transmitting signaling. The server may allocate physical channels for the conference, that is, physical resources for conference communication such as radio frequencies, connection lines, fiber-optic lines in the server network, and CPU and memory resources. The server may also establish a logical channel for the conference, that is, logical resources that are not the physical materials of a conference room, such as creating a chat room for a social session or a meeting room for an audio-video conference.
Further, the server may also generate identity data for each conference member in the conference. The identity data of each conference member is different, so the audio watermarks generated from the identity data also differ. In particular, the server may assign each conference member a unique user number as that member's identity data. For example, if the target online conference is an audio-video conference, then after a user initiates it the server may create a conference room in which the conference is held, assign a room number to the conference room, and assign a member number to each participating conference member. Alternatively, because the user initiates the audio-video conference through an application program, a unique user number (i.e., the user identification) can be allocated to the user upon registration with the application program, and that user number uniquely identifies one user. The server can then send each conference member's identity data to the corresponding terminal through signaling over the signaling paths. For example, in fig. 1, each terminal receives a different user number 12341, …, 12344, and so on.
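As a concrete illustration of the identity-allocation step just described, the following sketch shows a minimal in-memory registry; the class and method names (ConferenceServer, create_room, join) and the starting numbers are illustrative assumptions, not part of the patent.

```python
import itertools

class ConferenceServer:
    """Assigns a room number per conference and a unique member number
    per joining user, mirroring the signaling exchange described above.
    This is a hypothetical sketch, not the patent's actual server."""

    def __init__(self):
        self._room_counter = itertools.count(1000)     # illustrative start value
        self._member_counter = itertools.count(12341)  # matches the Fig. 1 example numbers
        self.rooms = {}

    def create_room(self):
        room_id = next(self._room_counter)
        self.rooms[room_id] = {}
        return room_id

    def join(self, room_id, user_account):
        # Each conference member receives distinct identity data, so that
        # watermarks generated from it differ between members.
        member_no = next(self._member_counter)
        self.rooms[room_id][user_account] = member_no
        return member_no

server = ConferenceServer()
room = server.create_room()
ids = [server.join(room, u) for u in ("alice", "bob", "carol")]
```

Because the counters never repeat, no two members of any conference receive the same identity data, which is the property the watermark scheme relies on.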
In a particular embodiment, a conference member's identity data may be personal identity data, for example when an individual attends an audio-video conference within an enterprise on his or her own behalf. The identity data may also be organizational identity data, for example when a participant attends an inter-enterprise audio-video conference on behalf of an enterprise.
In this embodiment, independent identity data is allocated to each conference member and uniquely identifies one user, so the audio watermarks generated from the identity data differ between members. Reverse tracking based on audio watermark detection is therefore possible: a user who leaks audio can be traced back, which improves the security of the conference content.
S204, acquiring conference audio generated in the target online conference.
Wherein the conference audio is audio generated based on sound signals in the target online conference. It will be appreciated that each conference member participating in the targeted online conference may generate a sound signal during the conference. These sound signals may all serve as sources of conference audio. For example, in an audio-video conference, each conference member may speak in the conference, and conference member voices (i.e., sound signals) generated during the speaking may be sources of conference audio.
Specifically, the terminal or an application running on the terminal may separately receive the sound signals generated by each conference member during the target online conference, then process and mix them into conference audio. Alternatively, the terminal or application may directly receive audio in which the sound signals generated by each conference member have already been mixed, and process it to serve as the conference audio.
In another embodiment, when the local conference member generates a sound signal during the target online conference, the terminal or the application running on the terminal may also mix the local sound signal with the conference audio in the above embodiment to be the final complete conference audio.
S206, acquiring an audio watermark corresponding to the user identifier; the audio watermark is generated from the identity data of the conference member identified by the user identification.
A digital watermark is protection information embedded in a carrier file using a computer algorithm; an audio watermark is protection information embedded in an audio carrier file. In this embodiment, the audio carrier file is the conference audio of the target online conference. Identity data is data reflecting the identity of a user.
In one embodiment, obtaining an audio watermark corresponding to a user identification includes: and generating an audio watermark corresponding to the user identifier according to the identity data.
The user identifications correspond one-to-one with the identity data, so an audio watermark generated from particular identity data is the audio watermark corresponding to that user identification. In particular, the terminal may directly use the identity data as the watermark information of the audio watermark, for example using the user number as the watermark information. The terminal can also combine the identity data with other information, such as organization information of the conference member or time information, to form the watermark information, as long as one conference member can be uniquely identified from the watermark information.
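The requirement above, that any watermark layout works as long as one conference member can be uniquely recovered from it, can be sketched as a hypothetical payload format; the function name, field choices, and bit widths below are assumptions for illustration only.

```python
def watermark_bits(user_number: int, minutes_since_epoch: int,
                   user_bits: int = 32, time_bits: int = 32) -> str:
    """Pack identity data (plus optional time information) into a
    fixed-length bit string. The exact layout is an illustrative
    assumption; any layout that uniquely identifies one conference
    member satisfies the requirement stated in the text."""
    assert user_number < 2 ** user_bits
    assert minutes_since_epoch < 2 ** time_bits
    # Zero-padded binary fields, concatenated into one payload.
    return (format(user_number, f"0{user_bits}b")
            + format(minutes_since_epoch, f"0{time_bits}b"))

bits = watermark_bits(12341, 29_000_000)
```

Fixed-width fields make the payload trivially parseable on the detection side: the first 32 bits always decode back to the user number.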
In one embodiment, the terminal or an application running on the terminal generates the audio watermark from the identity data of the conference member identified by the locally logged-in user identification as soon as the identity data is obtained. The audio watermark can then be used directly whenever it is needed later, which improves processing efficiency.
S208, adding the audio watermark to the conference audio to obtain target audio of the target online conference; the audio watermark added to the target audio is used to locate the conference member who generated the target audio if the target audio is leaked.
Specifically, the terminal or an application running on the terminal may add the audio watermark to the conference audio using a watermark embedding algorithm, such as coefficient quantization, a spatial-domain algorithm, a transform-domain algorithm, a least-significant-bit (LSB) algorithm, echo hiding, or phase coding. The watermark embedding algorithm can add a specific noise watermark, echo watermark, or out-of-band signal that is not easily perceived by the human ear to the conference audio, or can modify the amplitude or phase of the conference audio, or apply a notch at a specific frequency.
The audio watermarking process may be performed in the time domain, the frequency domain, or any transform domain (e.g., the discrete wavelet transform (DWT), the discrete cosine transform (DCT), or the modulated complex lapped transform (MCLT)).
In a specific embodiment, the sampled data of the conference audio is represented in binary form, so an audio watermark in the form of a binary code can be obtained and added to the conference audio to obtain the target audio.
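Because the sampled data is in binary form, one of the algorithm families named above, least-significant-bit (LSB) coding, can be sketched directly on the samples. This is a deliberately minimal illustration of the technique, not the patent's actual embedding scheme.

```python
def embed_lsb(samples, bits):
    """Overwrite the least significant bit of each leading sample with
    one watermark bit. Changing only the LSB of 16-bit PCM alters each
    sample by at most 1, which is inaudible in practice."""
    out = list(samples)
    for i, b in enumerate(bits):
        out[i] = (out[i] & ~1) | int(b)  # clear the LSB, then set the payload bit
    return out

# Six 16-bit PCM samples and a six-bit payload.
pcm = [1000, -2000, 3001, 4000, -5001, 6000]
marked = embed_lsb(pcm, "101011")
```

The payload is recoverable by reading each sample's LSB back, while every sample differs from the original by at most one quantization step.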
In further embodiments, the audio watermark may be added within a particular frequency range of the conference audio. For example, adding the watermark to the low-frequency band of the conference audio prevents it from being destroyed if the sampling rate is later reduced. In addition, avoiding frequencies below 200 Hz when adding the watermark prevents interference from DC signal components.
In addition, the audio watermark has the advantages of concealment, stability, and security: it is not easy to tamper with, does not affect playback of the audio signal, and minimizes damage to audio quality, so that the human ear cannot hear the additional watermark information and normal conversation and playback quality are unaffected.
It can be appreciated that, because the audio watermark added to the target audio is generated according to the identity data of a conference member, when the target audio is leaked, the conference member who generated it can be located by detecting the audio watermark in the target audio and tracing back from it.
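The detection-side reverse tracking can be sketched as follows, assuming an LSB-style embedding and a registry mapping user numbers to conference members; all names and bit widths here are illustrative assumptions, not the patent's specified scheme.

```python
def extract_lsb(samples, n_bits):
    """Read the watermark bits back out of the sample LSBs."""
    return "".join(str(s & 1) for s in samples[:n_bits])

def locate_member(samples, members, user_bits=16):
    """Parse the leaked audio's watermark into a user number and look up
    which conference member that identity data identifies."""
    user_number = int(extract_lsb(samples, user_bits), 2)
    return members.get(user_number)

# Leaked samples whose LSBs spell user number 12341 (16-bit binary).
leaked = [(100 & ~1) | int(b) for b in format(12341, "016b")]
culprit = locate_member(leaked, {12341: "alice", 12342: "bob"})
```

Separation and analysis here are the inverse of embedding: the same fixed layout that made the watermark unique makes the conference member recoverable.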
According to the audio processing method, after the currently logged-in user identification participates in the target online conference, the conference audio generated in the target online conference and the audio watermark corresponding to the user identification can be obtained, and the audio watermark is then added to the conference audio to obtain target audio in which the conference content of the target online conference is recorded. Because the audio watermark is generated according to the identity data of a conference member, objectively, the user who leaked the audio can be traced back by detecting the watermark in the audio, which improves the security of the conference content; subjectively, a user who knows that leaked audio can be traced back through its watermark will give up the idea of freely distributing recordings of the conference content, which reduces the probability of leakage and likewise improves the security of the conference content.
In one embodiment, acquiring conference audio generated in a target online conference includes: receiving voice data packets generated by each conference member in the target online conference, the number of conference members in the target online conference being more than one; processing each voice data packet through more than one logical channel to obtain the individual conference voices; and mixing the conference voices to obtain first conference audio generated in the target online conference.
Wherein the target online conference is a multi-person conference, i.e. the number of conference members in the target online conference is more than one. Each conference member in the target online conference may generate a sound signal in the target online conference that is typically communicated in the form of a data packet as it passes between the terminal and the server.
Specifically, when the terminal or an application program running on the terminal participates in the target online conference through the locally logged-in user identifier according to user operation, a local microphone can be started, a sound signal generated in the process of the target online conference is picked up through the local microphone, and the sound signal is transmitted to the server in a voice data packet after being coded. After receiving the voice data packet transmitted by the terminal or the application program running on the terminal, the server can directly send the voice data packet to the terminal or the application program running on the terminal corresponding to other conference members participating in the target online conference. Thus, the terminal or the application program running on the terminal receives the voice data packet generated by each conference member in the target online conference.
It will be appreciated that, in general, when the number of conference members of the target online conference is more than one, the number of voice data packets received by the terminal is also more than one, and typically one conference member corresponds to one voice data packet. However, in other embodiments some conference members may not generate any sound during the conference, in which case the number of voice data packets is less than the number of conference members.
Because the number of voice data packets received by the terminal or an application running on the terminal is more than one, the terminal or application may create more than one logical channel to process these voice data packets separately.
In one embodiment, processing each voice data packet through more than one logical channel to obtain each conference voice includes: creating the same number of logical channels as there are voice data packets; decoding each voice data packet through its logical channel to obtain voice data; and continuing to perform downlink voice processing on each item of voice data through its logical channel to obtain the conference voice corresponding to each voice data packet.
Specifically, the terminal or an application running on the terminal may create the same number of logical channels as voice data packets after receiving more than one voice data packet. That is, the number of logical channels is the same as the number of voice data packets, so that one logical channel can be used to process one voice data packet.
Further, the terminal or an application running on the terminal can decode one of the voice data packets through one of the logic channels to obtain voice data corresponding to the voice data packet; and continuing to perform downlink voice processing on the voice data through the logic channel to obtain conference voice corresponding to the voice data packet. The downlink voice processing includes processing procedures such as gain control and the like.
For example, referring to FIG. 3, a data flow diagram of audio processing in one embodiment is shown. In this embodiment, the terminal receives the voice data packets of the other M users participating in the target online conference (i.e., M voice data packets), then creates M logical channels and processes the voice data packets in parallel through different logical channels: voice data packet 1 is processed through logical channel 1, voice data packet 2 through logical channel 2, …, and voice data packet M through logical channel M. The processing includes decoding by a decoder and downlink voice processing.
In this embodiment, processing the voice data packets through different logical channels keeps their processing independent, so the processing of one packet cannot affect another.
Further, after each voice data packet is processed through more than one logical channel to obtain each conference voice, the terminal or an application running on the terminal can continue to mix the conference voices to obtain the first conference audio generated in the target online conference. It will be appreciated that the first conference audio is the speech generated during the target online conference by the users participating in it other than the local user.
Thereafter, with continued reference to fig. 3, the terminal or an application running on the terminal may pass the first conference audio to a local speaker to play the first conference audio locally.
In the above embodiment, the voice data packets are processed separately using different logical channels to obtain the conference voice of each conference member, and the conference voices are then mixed to obtain the complete received voice.
In one embodiment, when the local microphone is not turned on, or no valid sound signal is acquired, the terminal or an application running on the terminal may directly add the audio watermark to the first conference audio to obtain the first target audio. It will be appreciated that this applies to situations where the local microphone does not need to be on, or where the conference member participating in the target online conference does not need to speak, for example when the local conference member only needs to listen to other participants, or has no right to speak.
In one embodiment, the audio processing method further comprises: acquiring locally recorded conference voice; mixing the first conference audio with the locally recorded conference voice to obtain second conference audio generated in the target online conference.
Specifically, the terminal or an application running on the terminal may acquire the conference voice recorded by the local microphone, and mix the first conference audio with the locally recorded conference voice to obtain the second conference audio generated in the target online conference. It will be appreciated that the second conference audio is the complete audio generated in the target online conference, obtained by mixing the locally received speech with the speech to be sent locally. The terminal or an application running on the terminal then adds the audio watermark to the second conference audio to obtain the second target audio, yielding the final complete watermarked conference audio. With continued reference to fig. 3, the conference voice output by the plurality of logical channels is mixed, the result is mixed with the conference voice recorded by the microphone, and the audio watermark is then embedded into the secondary mixing result to obtain the target audio, namely the conference audio with the audio watermark added.
In this embodiment, the received voice and the voice to be transmitted are mixed and then the audio watermark is added, so that the complete audio file with the watermark is obtained. Thus, the security of the audio file can be improved by adding the watermark, and the conference content can be completely reproduced by playing the audio file.
In one embodiment, acquiring conference audio generated in a target online conference includes: receiving a voice data packet generated in the target online conference, the voice data packet being obtained after mixing the conference voice generated by each conference member in the target online conference; decoding the voice data packet through a logical channel to obtain voice data; and continuing to perform downlink voice processing on the voice data through the logical channel to obtain the first conference audio.
Specifically, after receiving the voice data packets transmitted by each terminal or by the applications running on each terminal, the server mixes the voice data in the voice data packets and then delivers the result to the terminals, or the applications running on the terminals, corresponding to the other conference members participating in the target online conference. Thus, the terminal or an application running on the terminal receives the voice data packet generated in the target online conference.
Further, after receiving the voice data packet, the terminal or an application running on the terminal may create a logical channel to process it. Specifically, the voice data packet is decoded through the logical channel to obtain the voice data corresponding to the voice data packet, and downlink voice processing continues to be performed on the voice data through the logical channel to obtain the first conference audio generated in the target online conference. The downlink voice processing includes procedures such as gain control.
For example, referring to fig. 4, a data flow diagram of audio processing in another embodiment is shown. In this embodiment, the server mixes the multiple voices to be sent to the terminal, so the receiving end only needs to receive one voice signal, i.e., one voice data packet. After receiving the voice data packet, the terminal creates a logical channel and processes the voice data packet using that logical channel. The processing includes decoding by a decoder and downlink voice processing.
In this embodiment, only one voice signal needs to be processed, which improves the efficiency with which the terminal processes audio and relieves the terminal's computational load.
In one embodiment, receiving the voice data packet generated in the target online conference includes: receiving a voice data packet, corresponding to the target online conference, sent by the server; the voice data packet is obtained by the server mixing the conference voice generated by each conference member in the target online conference and adding a conference watermark; the conference watermark is generated according to the conference identification of the target online conference.
The conference watermark, like the audio watermark, is protection information embedded in an audio carrier file by a computer algorithm. In this embodiment, the audio carrier file is the conference audio of the target online conference. The conference identification is used to uniquely identify a target online conference. Once the conference watermark is added, the conference audio identifies the target online conference in which it was recorded; that is, the watermark can be used to trace the conference audio back to the target online conference from which it was recorded.
Specifically, after receiving the voice data packets transmitted by each terminal or by the applications running on each terminal, the server may mix the voice data in the voice data packets, add the conference watermark to the resulting mixed voice, and then deliver the result to the terminals, or the applications running on the terminals, corresponding to the other conference members participating in the target online conference. Thus, the terminal or an application running on the terminal receives the voice data packet generated in the target online conference. The terminals here are the terminals corresponding to the conference members participating in the target online conference.
When adding the conference watermark to the mixed voice, the server may likewise use a watermark embedding algorithm, which may be a coefficient quantization method, a spatial domain algorithm, a transform domain algorithm, a least significant bit algorithm, an echo hiding algorithm, a phase encoding algorithm, or the like. The watermark embedding algorithm may add to the mixed voice a specific noise watermark, an echo watermark, or an out-of-band signal that is not easily perceived by the human ear, or may modify the amplitude or phase of the mixed voice or notch specific frequencies. The conference watermarking process may be performed in the time domain, the frequency domain, or any transform domain (e.g., wavelet transform DWT, discrete cosine transform DCT, complex modulated lapped transform MCLT, etc.).
In this embodiment, the server adds the conference watermark to the conference voice of the target online conference before issuing the voice data packet to each terminal, and the audio watermark is then added locally. The dual watermark protection of conference watermark plus audio watermark deters leakage of the target audio: if the target audio is leaked, the conference watermark locates the target online conference from which it originated, and the audio watermark then locates, among the conference members of that target online conference, the conference member who generated the target audio.
It may be appreciated that, when leakage is deterred by the dual watermark protection of the conference watermark and the audio watermark as in the present embodiment, the identity data from which the audio watermark is generated need only uniquely identify one conference member within the target online conference; conference members of different online conferences may share the same identity data. That is, a user is uniquely identified by the conference identification and the identity data together.
In other embodiments, the conference watermark may itself be an audio watermark. On the one hand, double protection is then achieved by adding two watermarks to the conference audio, one at the server and one locally. On the other hand, the sound played through the speaker will also carry an audio watermark, so that if this sound is re-recorded by another recording device, it can still be traced back by detecting the audio watermark, and such re-recording leaks can likewise be deterred.
In one embodiment, when the local microphone is not turned on, or no valid sound signal is acquired, the terminal or an application running on the terminal may directly add the audio watermark to the first conference audio to obtain the first target audio. It will be appreciated that this applies to situations where the local microphone does not need to be on, or where the conference member participating in the target online conference does not need to speak, for example when the local conference member only needs to listen to other participants, or has no right to speak.
In one embodiment, the audio processing method further comprises: acquiring locally recorded conference voice; mixing the first conference audio with the locally recorded conference voice to obtain second conference audio generated in the target online conference.
Specifically, the terminal or an application running on the terminal may acquire the conference voice recorded by the local microphone, and mix the first conference audio with the locally recorded conference voice to obtain the second conference audio generated in the target online conference. It will be appreciated that the second conference audio is the complete audio generated in the target online conference, obtained by mixing the locally received speech with the speech to be sent locally. The terminal or an application running on the terminal then adds the audio watermark to the second conference audio to obtain the second target audio, yielding the final complete watermarked conference audio. With continued reference to fig. 4, the first conference audio output by the logical channel is mixed with the conference voice recorded by the microphone, and an audio watermark is then embedded into the mixing result to obtain the target audio, i.e., the conference audio with the audio watermark added.
In this embodiment, the received voice and the voice to be transmitted are mixed and then the audio watermark is added, so that the complete audio file with the watermark is obtained. Thus, the security of the audio file can be improved by adding the watermark, and the conference content can be completely reproduced by playing the audio file.
In one embodiment, the audio processing method further comprises: adding the audio watermark into the first conference audio to obtain first target audio of the target online conference; the first target audio is played.
Specifically, the first conference audio obtained by the terminal or an application running on the terminal through the foregoing embodiments includes the sound signals generated by the other conference members participating in the target online conference. In the normal course of the target online conference, the first conference audio needs to be transmitted to a speaker for playing.
In this embodiment, before the first conference audio is transferred to the speaker, an audio watermark may also be added to the first conference audio to obtain the first target audio, which is then played through the speaker. For example, referring to fig. 5, the server mixes the multiple voices to be sent to the terminal, so the receiving end only needs to receive one voice signal, i.e., one voice data packet. After receiving the voice data packet, the terminal creates a logical channel and processes the voice data packet using that logical channel. The processing includes decoding by a decoder and downlink voice processing. On one hand, the first conference audio output by the logical channel is embedded with an audio watermark to obtain the first target audio, which is played through the speaker; on the other hand, the first conference audio is mixed with the conference voice recorded by the microphone, the audio watermark is embedded into the mixing result, and the second target audio, namely the complete conference audio with the audio watermark added, is obtained.
In this way, the sound played through the speaker will also carry an audio watermark, so that if this sound is re-recorded by another recording device, it can still be traced back by detecting the audio watermark. The manner in which this audio watermark is added may be the same as, or different from, the manner used when generating the second target audio, in which the first conference audio is mixed with the conference voice recorded by the microphone before the audio watermark is embedded into the mixing result.
In one embodiment, adding an audio watermark to conference audio results in target audio for a target online conference, comprising: dividing conference audio into more than one conference audio clip according to a preset time interval; respectively adding the audio watermarks into more than one conference audio fragment to obtain more than one target audio fragment; and splicing more than one target audio clip according to the arrangement sequence of more than one conference audio clip to obtain the target audio of the target online conference.
It will be appreciated that in some embodiments the conference audio of the target online conference is long, and a single audio watermark added to the conference audio may not cover its entire content. To avoid the situation where conference audio leaks after being cut, the audio watermark may be embedded into the conference audio cyclically.
Specifically, the terminal or an application running on the terminal may divide the conference audio into more than one conference audio clip according to a preset time interval, then add the audio watermark to each of the conference audio clips to obtain more than one target audio clip, so that each conference audio clip corresponds to one target audio clip, i.e., each conference audio clip has the audio watermark added. The target audio clips corresponding to the conference audio clips are spliced according to the arrangement order of the conference audio clips to obtain the target audio of the target online conference. Because the audio watermark is embedded cyclically from the beginning of the conference audio, even if the conference audio is cut, at least one audio watermark can still be detected from the remaining audio clips, thereby tracking the user who leaked the conference audio.
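The divide, embed, and splice loop can be sketched as follows (a minimal illustration, assuming a caller-supplied per-segment embedding function; all names are illustrative, not from the patent):

```python
def watermark_conference_audio(samples, sample_rate, interval_s, embed_watermark):
    """Split audio into fixed-length clips of interval_s seconds,
    watermark each clip with embed_watermark, and splice the watermarked
    clips back together in their original order."""
    seg_len = sample_rate * interval_s
    out = []
    for start in range(0, len(samples), seg_len):
        out.extend(embed_watermark(samples[start:start + seg_len]))
    return out
```

Because every clip is watermarked, any surviving clip of at least one interval still carries a complete watermark after cutting.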
In a specific embodiment, the watermark information of the audio watermark includes the identity data of the conference member, and the audio watermark is converted into a watermark signal before being added to the conference audio. It will be appreciated that the different perspectives used to analyze a signal are referred to as domains; the time and frequency domains are fundamental properties of a signal. A signal described from the time domain perspective is a time domain signal, and a signal described from the frequency domain perspective is a frequency domain signal. Thus, conference audio has a corresponding audio time domain signal and audio frequency domain signal, and the two may be transformed into each other. When watermark information is added to conference audio, it may therefore be based on either the audio time domain signal or the audio frequency domain signal, and the watermark signal converted from the watermark information may be a time domain signal or a frequency domain signal. Of course, in other embodiments, transform domains other than the time and frequency domains are also applicable.
For example, fig. 6 shows the watermark information to be added, where "12345" is the user number or specification information to be embedded, "A" is a marker separating user numbers, and "A12345" constitutes one complete set of audio watermark information. Assuming the preset time interval is M, the conference audio may be divided into conference audio clips of length M starting from the start time, and watermark information is embedded in each conference audio clip. For example, with M = 5 seconds, each 5-second conference audio clip embeds the watermark information "A12345". Thus, from the beginning of the conference audio, "A12345" is continuously embedded into the conference audio in a loop. When embedding the watermark information "A12345" into a conference audio clip, a watermark signal is generated from the watermark information "A12345" and embedded into the conference audio clip; the watermark signal may be a time domain signal, a frequency domain amplitude variation indicating signal, or an amplitude variation indicating signal of any transform domain.
In one embodiment, adding the audio watermark to each conference audio clip to obtain more than one target audio clip includes: generating a time domain watermark signal of the audio watermark; and adding the time domain watermark signal to each conference audio clip sample by sample in the time domain to obtain more than one time domain target audio clip.
Specifically, the terminal or an application running on the terminal can perform watermark addition in the time domain: obtain the audio signal of the conference audio in the time domain, generate a time domain watermark signal from the watermark information of the audio watermark, and then add the time domain watermark signal to the time domain audio signal of each clip of the conference audio, obtaining, for each clip, a target audio clip with the time domain watermark signal added. Since the time domain watermark signal is added to a time domain audio signal, the addition can be performed sample by sample in the time domain.
In a specific embodiment, watermark addition in the time domain is taken as an example. The watermark information "A12345" of the audio watermark includes the marker A and the user number 12345. Converting the watermark information "A12345" into binary according to ASCII codes gives: 01000001, 00110001, 00110010, 00110011, 00110100, 00110101. These binary numbers can then be arranged into a vector SigVec containing 48 bit values, i.e., SigVec = 010000010011000100110010001100110011010000110101.
Assuming the audio sampling rate of the conference audio is 24 kHz and the conference audio is divided into one clip every 5 seconds, each conference audio clip contains 24000 x 5 samples. If each sample embeds one bit of information, a 5-second audio clip can repeatedly embed SigVec 24000 x 5 / 48 = 2500 times. Thus, each bit in SigVec can be repeated 2500 times to obtain an extended bit vector ESigVec, whose length equals the number of samples of a 5-second conference audio clip.
In addition, to minimize the watermark signal's impact on the sound quality of the conference audio, each 0 in the ESigVec signal may be replaced by -1, and the whole vector multiplied by an amplitude factor alpha, for example 0.01, to obtain the desired watermark signal. At this point, the value corresponding to bit 0 is -0.01 and the value corresponding to bit 1 is 0.01. The terminal or an application running on the terminal may then add each value of the resulting watermark signal to the corresponding audio sample of the conference audio clip to obtain the target audio clip with the watermark signal embedded. By repeating the above operations for each conference audio clip, the "A12345" watermark information can be embedded in every conference audio clip.
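The time-domain embedding steps above can be sketched as follows. This is a simplified illustration of the described scheme, not the patent's exact implementation; the tail of a segment is zero-padded here when the segment length is not an exact multiple of the bit count, and samples are assumed to be floats normalized to [-1, 1]:

```python
def make_watermark_signal(info, samples_per_segment, alpha=0.01):
    """Build the time-domain watermark signal for one audio segment:
    ASCII text -> bit vector (SigVec) -> each bit repeated to fill the
    segment (ESigVec) -> bit 0 mapped to -alpha, bit 1 to +alpha."""
    bits = [int(b) for ch in info for b in format(ord(ch), "08b")]
    repeat = samples_per_segment // len(bits)  # e.g. 24000*5 // 48 = 2500
    signal = []
    for bit in bits:
        signal.extend([alpha if bit else -alpha] * repeat)
    signal.extend([0.0] * (samples_per_segment - len(signal)))  # pad tail
    return signal

def embed_watermark(segment, info, alpha=0.01):
    """Add the watermark signal to each audio sample of the segment."""
    wm = make_watermark_signal(info, len(segment), alpha)
    return [s + w for s, w in zip(segment, wm)]
```

With "A12345" and a 5-second, 24 kHz segment, the bit vector is 48 bits long and each bit spans 2500 consecutive samples, matching the figures in the text.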
This embodiment provides a way of adding the watermark in the time domain that minimizes damage to the audio quality, so that the human ear cannot hear the additional watermark information and normal conversation and playback sound quality are unaffected.
Further, after the audio watermark is added to each conference audio clip, the terminal or an application running on the terminal can concatenate the watermarked clips in order, regenerating the complete conference audio carrying the watermark information, which can be stored as an audio file for subsequent playback.
In the above embodiment, the audio watermark is cyclically embedded into the conference audio; when the conference audio is cut, at least one copy of the watermark information can still be detected from the remaining audio clips, so users who leak the conference audio can be tracked and the security of the conference content improved.
In one embodiment, the target online conference is an audio-video conference, and participating in the target online conference through the currently logged-in user identification includes: initiating and participating in the audio-video conference through the currently logged-in user identification; and setting the security level of the audio-video conference higher than a preset security level according to user operation, or setting the recording mode of the audio-video conference to the audio watermark adding state according to user operation.
It will be appreciated that adding an audio watermark to conference audio is an optional operation. An administrator of the conference or an initiator of the target online conference has the right to choose whether or not to add an audio watermark to the conference audio. When a manager of the conference or an initiator of the target online conference selects to add an audio watermark to conference audio of the target online conference, namely, when an audio watermark adding function of the target online conference is started, conference members participating in the target online conference can add the audio watermark to the conference audio when recording the conference audio. The administrator of the conference may specifically be a user who logs in to the application program with an administrator identity. Such as a conference side administrator, etc.
Specifically, the conference member identified by the currently logged-in user identification may be the initiator of the target online conference or a participant invited to join it. When that conference member is the initiator of the target online conference, the terminal or an application running on the terminal can set the security level of the audio-video conference higher than a preset security level according to user operation, or set the recording mode of the audio-video conference to the audio watermark adding state according to user operation. When the security level of the audio-video conference is higher than the preset security level, the recording mode of the audio-video conference is automatically set to the audio watermark adding state. The current terminal can transmit the setting result to the server, and the server notifies, through signaling, the terminals corresponding to the conference members participating in the audio-video conference, setting the recording mode of the audio-video conference to the audio watermark adding state. When the recording mode of the audio-video conference is the audio watermark adding state, the audio watermark is added whenever the conference audio is recorded.
For example, with continued reference to fig. 1, a conference administrator may log in to the server through a terminal to control whether the audio watermarking function is enabled. The conference initiator has administrator authority over the conferences it initiates and can log in to the server through the application running on the terminal to configure the conference's audio watermarking function: the initiator may specify the security level of the conference when initiating it, or directly and manually enable the audio watermark function. The server then informs each participant's terminal through signaling whether to enable the audio watermarking function.
In this embodiment, whether to add a watermark to conference audio can be configured as needed, so the watermark can be set flexibly according to the conference content: conferences involving privacy or sensitive information are watermarked, while other content need not be, so that terminal resources are used reasonably and waste is avoided.
In one embodiment, as shown in fig. 7, an audio processing method is provided. The embodiment is mainly exemplified by the method applied to computer equipment. The computer device may be the terminal 110 (or an application running on the terminal 110, or the server 120) described above in fig. 1. Referring to fig. 7, the audio processing method specifically includes the steps of:
s702, acquiring target audio corresponding to a target online conference; the target audio is audio obtained by adding an audio watermark into conference audio of the target online conference; the audio watermark is generated from the identity data of the conference member.
S704, separating the target audio to obtain an audio watermark.
In particular, the computer device may perform a separation operation on the target audio to obtain an audio watermark and conference audio other than the audio watermark. The watermark extraction algorithm can be a coefficient quantization method, a spatial domain algorithm, a transform domain algorithm, a least significant bit algorithm and the like, and the watermark extraction algorithm adopted when the separation operation is executed is matched with the watermark embedding algorithm adopted when the watermark information is added.
S706, analyzing the audio watermark to obtain the identity data.
Specifically, the computer device may parse the audio watermark to obtain the identity data according to an inverse process of generating the audio watermark from the identity data.
S708, determining conference members corresponding to the identity data among conference members participating in the target online conference.
Because the correspondence between conference members and identity data has been established in advance, when the computer device obtains the identity data it can query the established correspondence, comparing the obtained identity data with the identity data in the correspondence to find the matching conference member. In this way, the conference member corresponding to the identity data can be determined among the conference members participating in the target online conference.
According to the audio processing method, after the target audio corresponding to the target online conference is obtained, the target audio can be separated to obtain the audio watermark, and the identity data from which the audio watermark was generated can be parsed out, so that the conference member corresponding to the identity data is determined among the conference members participating in the target online conference. Because the target audio is obtained by adding an audio watermark to the conference audio of the target online conference, and the audio watermark is generated from the identity data of a conference member, objectively, users who leak audio can be traced back, improving the security of the conference content; subjectively, a user who knows that leaked audio can be traced back through its watermark will abandon the idea of casually recording and spreading the conference content, reducing the probability of leakage and further improving the security of the conference content.
In one embodiment, the target audio is conference audio to which the audio watermark has been added in a segmented manner; performing the separation operation on the target audio to obtain the audio watermark includes: sequentially intercepting target audio clips from the target audio through an intercepting window according to a preset step size, the window length of the intercepting window being the segment length used when the audio watermark was added to the conference audio of the target online conference in a segmented manner; and sequentially performing the separation operation on the target audio clips until the audio watermark is obtained.
The window length of the intercepting window is the segment length used when the audio watermark was added to the conference audio of the target online conference in a segmented manner, namely the preset time interval at which the conference audio was divided in the previous embodiment. In other words, the window length of the intercepting window is the duration of one conference audio clip to which an audio watermark was added. The preset step size is the time interval by which the intercepting window moves each time it intercepts a clip from the conference audio. Sequentially intercepting target audio clips from the target audio according to the preset step size avoids the problem that the complete watermark cannot be extracted when the conference audio of the target online conference is intercepted at an arbitrary position.
It will be appreciated that, when the audio watermark was added to the conference audio, the conference audio was divided into more than one conference audio clip and the audio watermark was added to each clip separately. Accordingly, when extracting the watermark, extraction should also be performed on clips of the same length so that the complete audio watermark can be extracted.
Specifically, the computer device may sequentially intercept target audio clips from the target audio through the intercepting window according to the preset step size, and sequentially perform the separation operation on the target audio clips until the audio watermark is obtained. For example, if one watermark is embedded every M seconds when the audio watermark is added to the conference audio, the intercepting window is M seconds. If the preset step size is N seconds, then starting from the start time of the target audio, an M-second target audio clip is intercepted every N seconds for watermark extraction, e.g., a 5-second target audio clip every 0.1 s.
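The stepping of the intercepting window can be sketched as follows (names and parameter choices are illustrative; e.g., a 5-second window advanced in 0.1-second steps at a 24 kHz sample rate):

```python
def iter_capture_windows(samples, sample_rate, window_s, step_s):
    """Yield successive candidate clips of window_s seconds, advancing
    the intercepting window by step_s seconds after each clip."""
    win = int(sample_rate * window_s)
    step = int(sample_rate * step_s)
    for start in range(0, len(samples) - win + 1, step):
        yield samples[start:start + win]
```

Each yielded clip would then be fed to the separation operation until a complete watermark is recovered from one of them.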
In other embodiments, the window length of the intercepting window may be slightly greater or less than the preset time interval for dividing conference audio in the previous embodiments. For example, a watermark is embedded every 5 seconds, and the audio clip is intercepted at the time of actual detection by 4.9 seconds or 5.1 seconds.
In a specific embodiment, a way of extracting the audio watermark is provided for target audio obtained by adding the audio watermark in the time domain in the previous embodiment. Specifically, the computer device may divide the target audio into 5-second segments in the time domain. Since each bit was embedded by repeating it over 2500 samples in the previous embodiment, the corresponding 2500 samples are summed. Because the amplitude of the audio samples obeys a Gaussian distribution with a mean of 0, by the central limit theorem the sum approaches α when the embedded bit is 1, so a threshold PTh may be set: if the sum is greater than PTh, the bit is 1. Similarly, if the sum is less than 0 and approaches -α, the bit is 0, so a negative threshold NTh may be set: if the sum is less than NTh, the bit is 0. In the same way, all 48 bits in a target audio segment can be obtained, and the bit information is restored to the watermark information of the audio watermark through ASCII code mapping.
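The threshold-based bit decision and the ASCII mapping above can be sketched as follows. This is an illustrative sketch, not the patent's implementation: samples are assumed to be floats, and the function names, as well as the small `samples_per_bit` used in the usage note, are hypothetical (the embodiment uses 2500 samples per bit and 48 bits per segment):

```python
def decode_bits(segment, samples_per_bit, pth, nth):
    """Sum the samples of each bit period; a sum above PTh decodes to 1,
    a sum below NTh decodes to 0, anything between is undecided."""
    bits = []
    for i in range(0, len(segment), samples_per_bit):
        s = sum(segment[i:i + samples_per_bit])
        if s > pth:
            bits.append(1)
        elif s < nth:
            bits.append(0)
        else:
            return None  # sum fell between NTh and PTh: extraction fails
    return bits

def bits_to_text(bits):
    """Map each group of 8 bits (MSB first) back to an ASCII character."""
    return ''.join(
        chr(int(''.join(str(b) for b in bits[i:i + 8]), 2))
        for i in range(0, len(bits), 8)
    )
```

With 48 bits per segment, `bits_to_text` recovers 6 ASCII characters of watermark information from each successfully decoded segment.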
In the above embodiment, since the audio watermark is cyclically embedded in the conference audio, at least one piece of watermark information can still be detected from the remaining audio segments even when the conference audio is clipped, so that the user who leaked the conference audio can be traced and the security of the conference content can be improved.
In a specific embodiment, a conference application runs on the terminal, through which a user can participate in an audio-video conference. It should be noted that a conference application generally provides a recording function for recording and playing back important content. However, recorded audio of an audio-video conference may be illegally copied and spread, causing leakage of sensitive information and thereby reducing the security of the conference content.
Specifically, the first terminal logs in to and runs the conference application through the user identification according to a user operation, and then initiates the target conference. The first terminal exchanges information with the call control server through signaling and establishes a connection. The call control server allocates a physical channel for the voice call of the target conference, creates logical channels, and assigns a unique user number to each user accessing the target conference. The server then transmits each user number through signaling to the second terminal of the corresponding user accessing the target conference.
The first terminal may also set the security level of the target conference higher than a preset security level according to a user operation, or set the recording mode of the target conference to an audio-watermarking state according to a user operation. The setting result is transmitted to the call control server, and the call control server controls the second terminals of the users accessing the target conference to set the recording mode of the target conference to the audio-watermarking state, that is, the audio-watermarking function is enabled when the conference application records.
Referring to fig. 8, the conference application running on a terminal (including the first terminal and the second terminal) may obtain the conference audio generated in the target conference and obtain the audio watermark corresponding to the locally logged-in user identification. The terminal divides the conference audio into more than one conference audio segment according to the preset time interval, adds the audio watermark to each conference audio segment to obtain more than one target audio segment, and splices the target audio segments according to the arrangement order of the conference audio segments to obtain the target audio of the target conference.
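The divide, watermark, and splice procedure can be sketched as follows. This is a minimal illustration under stated assumptions — audio and the time-domain watermark signal are flat lists of float samples, the watermark signal spans one segment, and the function name is hypothetical:

```python
def embed_watermark(audio, watermark, segment_len):
    """Divide the conference audio into segments of segment_len samples,
    add the time-domain watermark signal sample-wise to each segment,
    and splice the results back in the original order."""
    out = []
    for start in range(0, len(audio), segment_len):
        seg = audio[start:start + segment_len]
        # sample-wise addition; a short final segment is watermarked
        # only over the samples it actually has
        marked = [a + w for a, w in zip(seg, watermark)]
        out.extend(marked + seg[len(marked):])
    return out
```

Because the same watermark is added to every segment, the watermark repeats cyclically through the target audio, which is what later allows it to be recovered from an arbitrarily clipped excerpt.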
When conference audio carrying the audio watermark is leaked, referring to the lower diagram of fig. 8, after the computer device (including the terminal and the call control server) acquires the target audio corresponding to the target conference, it may sequentially intercept target audio segments from the target audio through the intercepting window according to the preset step size, perform the separation operation on the target audio segments in turn until the audio watermark is obtained, and then parse the audio watermark to obtain the identity data and determine, among the conference members participating in the target conference, the conference member corresponding to the identity data.
In this way, since the audio watermark is generated according to the identity data of a conference member, objectively, a user who leaks audio can be traced back by detecting the watermark in the audio, which improves the security of the conference content and provides an effective means of legal or administrative supervision. Subjectively, a user who knows that leaked audio can be traced back via its watermark will give up the intention of arbitrarily spreading recordings of the conference content, which reduces the probability of leakage, improves the security of the conference content, and reduces legal risks or business losses. In addition, the audio processing method can be applied to different wideband voice calls, and the audio watermark can still be detected in audio that has undergone certain sound-effect processing or clipping.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of execution of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the above embodiments may include a plurality of sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with at least a portion of other steps or of the sub-steps or stages of other steps.
As shown in fig. 9, in one embodiment, an audio processing apparatus 900 is provided. Referring to fig. 9, the audio processing apparatus 900 includes: a participation module 901, an acquisition module 902, and an addition module 903.
A participation module 901, configured to participate in a target online conference through a currently logged-in user identifier.
An acquisition module 902, configured to acquire conference audio generated in a target online conference; acquiring an audio watermark corresponding to a user identifier; the audio watermark is generated from the identity data of the conference member identified by the user identification.
An adding module 903, configured to add an audio watermark to conference audio, to obtain target audio of a target online conference; the audio watermark added in the target audio is used for locating conference members generating the target audio when the target audio is leaked.
In one embodiment, the obtaining module 902 is further configured to receive a voice data packet generated by each conference member in the target online conference; the number of conference members in the target online conference is more than one; processing each voice data packet through more than one logic channel to obtain each conference voice; mixing the conference voices to obtain first conference audio generated in the target online conference.
In one embodiment, the obtaining module 902 is further configured to create the same number of logical channels as voice data packets; decoding each voice data packet through each logic channel to obtain voice data; and continuing to perform downlink voice processing on each voice data through each logic channel to obtain conference voice corresponding to each voice data packet.
In one embodiment, the obtaining module 902 is further configured to receive a voice data packet generated in the target online meeting; the voice data packet is obtained after conference voice generated by each conference member in the target online conference is mixed; decoding the voice data packet through a logic channel to obtain voice data; and continuing to perform downlink voice processing on the voice data through the logic channel to obtain a first conference audio.
In one embodiment, the obtaining module 902 is further configured to receive a voice data packet sent by the server and corresponding to the target online conference; the voice data packet is obtained by mixing conference voice generated by each conference member in the target online conference by the server and adding a conference watermark; the meeting watermark is generated according to the meeting identification of the target online meeting.
In one embodiment, the obtaining module 902 is further configured to obtain locally recorded conference voice, and to mix the first conference audio with the locally recorded conference voice to obtain second conference audio generated in the target online conference.
As shown in fig. 10, in one embodiment, the adding module 903 is further configured to add an audio watermark to the first conference audio, to obtain a first target audio of the target online conference. The audio processing apparatus 900 further comprises a playing module 904 for playing the first target audio.
In one embodiment, the participation module 901 is further configured to establish a connection with the server through signaling based on the currently logged-in user identification, so as to participate in the target online conference; receiving the identity data of the conference member identified by the user identification, which is issued by the server through signaling; the identity data of the different conference members participating in the target online conference are different. The obtaining module 902 is further configured to generate an audio watermark corresponding to the user identifier according to the identity data.
In one embodiment, the adding module 903 is further configured to divide the conference audio into more than one conference audio piece at a preset time interval; respectively adding the audio watermarks into more than one conference audio fragment to obtain more than one target audio fragment; and splicing more than one target audio clip according to the arrangement sequence of more than one conference audio clip to obtain the target audio of the target online conference.
In one embodiment, the adding module 903 is further configured to generate a time domain watermark signal of the audio watermark; and adding the time domain watermark signal and each conference audio fragment according to the time domain sample point by bit to obtain more than one time domain target audio fragment.
In one embodiment, the target online meeting is an audio video meeting. The participation module 901 is also used for initiating and participating in an audio/video conference through the currently logged-in user identification; setting the security level of the audio-video conference to be higher than a preset security level according to user operation; or, setting the recording mode of the audio-video conference to be an audio watermarking state according to the user operation.
According to the above audio processing apparatus, after participating in the target online conference through the currently logged-in user identification, the apparatus can acquire the conference audio generated in the target online conference and the audio watermark corresponding to the user identification, and then add the audio watermark to the conference audio to obtain target audio recording the conference content of the target online conference. Because the audio watermark is generated according to the identity data of the conference member, objectively, a user who leaks audio can be traced back by detecting the watermark in the audio, which improves the security of the conference content; subjectively, a user who knows that leaked audio can be traced back via its watermark will give up the intention of arbitrarily spreading recordings of the conference content, which reduces the probability of leakage and improves the security of the conference content.
As shown in fig. 11, in one embodiment, an audio processing device 1100 is provided. Referring to fig. 11, the audio processing apparatus 1100 includes: an acquisition module 1101, a separation module 1102 and a determination module 1103.
An obtaining module 1101, configured to obtain target audio corresponding to a target online conference; the target audio is audio obtained by adding an audio watermark into conference audio of the target online conference; the audio watermark is generated from the identity data of the conference member.
And the separation module 1102 is configured to perform a separation operation on the target audio to obtain an audio watermark.
A determining module 1103, configured to parse the audio watermark to obtain identity data; and determining conference members corresponding to the identity data among conference members participating in the target online conference.
In one embodiment, the target audio is audio obtained by adding the audio watermark in segments to the conference audio of the target online conference. The separation module 1102 is further configured to sequentially intercept target audio segments from the target audio through the intercepting window according to a preset step size, the window length of the intercepting window being the segment length used when the audio watermark was added to the conference audio of the target online conference in segments, and to perform the separation operation on the target audio segments in turn until the audio watermark is obtained.
After acquiring the target audio corresponding to the target online conference, the audio processing apparatus can perform the separation operation on the target audio to obtain the audio watermark, and further parse the audio watermark to obtain the identity data from which it was generated, thereby determining, among the conference members participating in the target online conference, the conference member corresponding to the identity data. Because the target audio is obtained by adding an audio watermark to conference audio of the target online conference, and the audio watermark is generated according to the identity data of a conference member, objectively, a user who leaks audio can be traced back, which improves the security of the conference content; subjectively, a user who knows that leaked audio can be traced back via its watermark will give up the intention of arbitrarily spreading recordings of the conference content, which reduces the probability of leakage and improves the security of the conference content.
FIG. 12 illustrates an internal block diagram of a computer device in one embodiment. The computer device may be specifically the terminal 110 of fig. 1. As shown in fig. 12, the computer device includes a processor, a memory, a network interface, an input device, a display screen, a microphone, and a speaker connected by a system bus. The memory includes a nonvolatile storage medium and an internal memory. The nonvolatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement an audio processing method. The internal memory may also store a computer program that, when executed by the processor, causes the processor to perform the audio processing method. The display screen of the computer device may be a liquid crystal display or an electronic ink display. The input device of the computer device may be a touch layer covering the display screen; keys, a trackball, or a touch pad provided on the housing of the computer device; or an external keyboard, touch pad, mouse, or the like.
FIG. 13 illustrates an internal block diagram of a computer device in one embodiment. The computer device may be specifically the server 120 of fig. 1. As shown in fig. 13, the computer device includes a processor, a memory, and a network interface connected by a system bus. The memory includes a nonvolatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system, and may also store a computer program that, when executed by a processor, causes the processor to implement an audio processing method. The internal memory may also have stored therein a computer program which, when executed by the processor, causes the processor to perform the audio processing method.
It will be appreciated by those skilled in the art that the structures shown in fig. 12 and 13 are merely block diagrams of portions of structures associated with the inventive arrangements and are not limiting of the computer device to which the inventive arrangements may be implemented, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, the audio processing apparatus provided by the present application may be implemented in the form of a computer program that is executable on a computer device as shown in fig. 12 or 13. The memory of the computer device may store various program modules constituting the audio processing apparatus, such as a participation module 901, an acquisition module 902, and an addition module 903 shown in fig. 9. The computer program constituted by the respective program modules causes the processor to execute the steps in the audio processing method of the respective embodiments of the present application described in the present specification.
For example, the computer apparatus shown in fig. 12 may perform the step of participating in the target online conference by the currently logged-in user identification through the participation module 901 in the audio processing device as shown in fig. 9. Performing, by the acquisition module 902, a step of acquiring conference audio generated in the target online conference; and obtaining an audio watermark corresponding to the user identifier; the audio watermark is generated from the identity data of the conference member identified by the user identification. A step of adding an audio watermark to conference audio through an adding module 903 to obtain target audio of a target online conference; the audio watermark added in the target audio is used for locating conference members generating the target audio when the target audio is leaked.
For another example, the computer apparatus shown in fig. 13 may perform the step of acquiring the target audio corresponding to the target online conference by the acquisition module 1101 in the audio processing device shown in fig. 11; the target audio is audio obtained by adding an audio watermark into conference audio of the target online conference; the audio watermark is generated from the identity data of the conference member. And performing a separation operation on the target audio by the separation module 1102 to obtain an audio watermark. The step of parsing the audio watermark to obtain identity data is performed by the determining module 1103; and determining conference members corresponding to the identity data among conference members participating in the target online conference.
In one embodiment, a computer device is provided comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the above-described audio processing method. The steps of the audio processing method herein may be the steps in the audio processing method of the above-described respective embodiments.
In one embodiment, a computer readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the steps of the above-described audio processing method. The steps of the audio processing method herein may be the steps in the audio processing method of the above-described respective embodiments.
Those skilled in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing relevant hardware, where the program may be stored in a non-volatile computer-readable storage medium, and where the program, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail, but they are not thereby to be construed as limiting the scope of the application. It should be noted that several variations and modifications can be made by those skilled in the art without departing from the concept of the application, and these all fall within the protection scope of the application. Accordingly, the protection scope of the present application shall be subject to the appended claims.

Claims (20)

1. An audio processing method, comprising:
participating in the target online conference through the currently logged-in user identification;
turning on a local microphone;
receiving a voice data packet which is sent by a server and corresponds to the target online conference; the voice data packet is obtained by mixing conference voice generated in the target online conference by users, other than a local user, participating in the target online conference and adding a conference watermark; the conference watermark is generated according to the conference identifier of the target online conference;
Decoding the voice data packet through a logic channel to obtain voice data;
continuing to perform downlink voice processing on the voice data through the logic channel to obtain a first conference audio;
acquiring an audio watermark corresponding to the user identifier; the audio watermark is generated according to the identity data of the conference member identified by the user identification;
adding the audio watermark into the first conference audio to obtain first target audio of the target online conference, and playing the first target audio through a local loudspeaker;
acquiring conference voice of a local user recorded by the local microphone in the process of carrying out the target online conference, and mixing the first conference audio with the conference voice recorded by the local microphone to obtain second conference audio generated in the target online conference;
adding the audio watermark to the second conference audio to obtain second target audio of the target online conference; the audio watermark added in the second target audio is used for positioning conference members generating the second target audio when the second target audio is leaked.
2. The method according to claim 1, wherein the method further comprises:
receiving voice data packets generated in the target online conference by other users participating in the target online conference except the local user; the number of other users is more than one;
processing each voice data packet through more than one logic channel to obtain each conference voice;
and mixing the conference voices to obtain first conference audio generated in the target online conference.
3. The method of claim 2, wherein processing each voice data packet through more than one logical channel to obtain each conference voice comprises:
creating the same number of logical channels as the number of voice data packets;
decoding each voice data packet through each logic channel to obtain voice data;
and continuing to perform downlink voice processing on each voice data through each logic channel to obtain conference voice corresponding to each voice data packet.
4. The method of claim 1, wherein the participating target online meeting via the currently logged-in user identification comprises:
Establishing connection with a server through signaling based on the currently logged-in user identification so as to participate in a target online conference;
receiving the identity data of the conference member identified by the user identification, which is issued by the server through signaling; the identity data of the different conference members participating in the target online conference are different;
the obtaining the audio watermark corresponding to the user identifier comprises the following steps:
and generating an audio watermark corresponding to the user identifier according to the identity data.
5. The method of claim 1, wherein the adding the audio watermark to the first conference audio results in first target audio for the target online conference, comprising:
dividing the first conference audio into more than one conference audio piece according to a preset time interval;
respectively adding the audio watermarks to the more than one conference audio clips to obtain more than one target audio clip;
and splicing the more than one target audio clips according to the arrangement sequence of the more than one conference audio clips to obtain first target audio of the target online conference.
6. The method of claim 5, wherein adding the audio watermark to the more than one conference audio piece, respectively, results in more than one target audio piece, comprising:
Generating a time domain watermark signal of the audio watermark;
and adding the time domain watermark signal and each conference audio fragment according to time domain sample points in a bit-wise manner to obtain more than one time domain target audio fragment.
7. The method of claim 1, wherein the participating target online meeting via the currently logged-in user identification comprises:
initiating and participating in an audio-video conference through a currently logged-in user identifier;
setting the security level of the audio-video conference to be higher than a preset security level according to user operation; or,
and setting the recording mode of the audio-video conference to be an audio watermarking state according to user operation.
8. An audio processing method, comprising:
acquiring a second target audio corresponding to the target online conference; the second target audio is audio obtained by adding an audio watermark to second conference audio of the target online conference; the audio watermark is generated according to the identity data of the conference member;
performing separation operation on the second target audio to obtain the audio watermark;
analyzing the audio watermark to obtain identity data;
and determining conference members corresponding to the identity data among conference members participating in the target online conference.
9. The method of claim 8, wherein the second target audio is audio obtained by adding the audio watermark in segments to the second conference audio of the target online conference; the performing a separation operation on the second target audio to obtain the audio watermark comprises:
sequentially intercepting target audio fragments from the second target audio according to a preset step length through an intercepting window; the window length of the intercepting window is the segment length when the audio watermark is added to the second conference audio of the target online conference in a segmented manner;
and sequentially separating the target audio fragments until the audio watermark is obtained.
10. An audio processing apparatus, comprising:
the participation module is used for participating in the target online conference through the currently logged-in user identification;
the acquisition module is used for starting a local microphone; receiving a voice data packet which is sent by a server and corresponds to the target online conference; the voice data packet is obtained by mixing conference voice generated in the target online conference by users, other than a local user, participating in the target online conference and adding a conference watermark; the conference watermark is generated according to the conference identifier of the target online conference; decoding the voice data packet through a logic channel to obtain voice data; continuing to perform downlink voice processing on the voice data through the logic channel to obtain a first conference audio; acquiring an audio watermark corresponding to the user identifier; the audio watermark is generated according to the identity data of the conference member identified by the user identification;
The adding module is used for adding the audio watermark into the first conference audio to obtain first target audio of the target online conference, and playing the first target audio through a local loudspeaker; acquiring conference voice of a local user recorded by the local microphone in the process of carrying out the target online conference, and mixing the first conference audio with the conference voice recorded by the local microphone to obtain second conference audio generated in the target online conference; adding the audio watermark to the second conference audio to obtain second target audio of the target online conference; the audio watermark added in the second target audio is used for positioning conference members generating the second target audio when the second target audio is leaked.
11. The apparatus of claim 10, wherein the acquisition module is further configured to receive voice data packets generated in the target online conference by other users participating in the target online conference in addition to the local user; the number of other users is more than one; processing each voice data packet through more than one logic channel to obtain each conference voice; and mixing the conference voices to obtain first conference audio generated in the target online conference.
12. The apparatus of claim 11, wherein the acquisition module is further configured to create the same number of logical channels as the number of voice data packets; decoding each voice data packet through each logic channel to obtain voice data; and continuing to perform downlink voice processing on each voice data through each logic channel to obtain conference voice corresponding to each voice data packet.
13. The apparatus of claim 10, wherein the participation module is further configured to establish a connection with a server through signaling based on a currently logged-in user identification to participate in the target online conference; receiving the identity data of the conference member identified by the user identification, which is issued by the server through signaling; the identity data of the different conference members participating in the target online conference are different; the acquisition module is further used for generating an audio watermark corresponding to the user identifier according to the identity data.
14. The apparatus of claim 10, wherein the adding module is further configured to: divide the first conference audio into more than one conference audio clip at a preset time interval; add the audio watermark to each of the more than one conference audio clips to obtain more than one target audio clip; and splice the more than one target audio clips, in the arrangement order of the more than one conference audio clips, to obtain the first target audio of the target online conference.
15. The apparatus of claim 14, wherein the adding module is further configured to: generate a time-domain watermark signal for the audio watermark; and add the time-domain watermark signal to each conference audio clip, sample point by sample point in the time domain, to obtain more than one time-domain target audio clip.
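A minimal sketch of claims 14–15: cut the first conference audio into fixed-length clips, add the time-domain watermark signal to each clip sample point by sample point, then splice the marked clips back in order. The clip length of 4 samples stands in for the "preset time interval", which the claims leave unspecified:

```python
def embed_segmented(audio, watermark_signal, clip_len):
    """Segment audio into clips, watermark each clip sample-wise in the
    time domain, and splice the target clips back in original order."""
    clips = [audio[i:i + clip_len] for i in range(0, len(audio), clip_len)]
    marked = []
    for clip in clips:
        # Sample-wise addition of the time-domain watermark signal.
        marked.append([s + w for s, w in zip(clip, watermark_signal)])
    # Splice the target clips in the arrangement order of the source clips.
    return [s for clip in marked for s in clip]

watermark_signal = [0.01, -0.01, 0.01, -0.01]
audio = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
first_target_audio = embed_segmented(audio, watermark_signal, clip_len=4)
```

Embedding per clip means the watermark repeats throughout the stream, so even a short leaked excerpt still contains at least one complete watermark period.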
16. The apparatus of claim 10, wherein the participation module is further configured to: initiate and participate in an audio-video conference via a currently logged-in user identifier; and set the security level of the audio-video conference higher than a preset security level according to a user operation, or set the recording mode of the audio-video conference to an audio watermarking state according to a user operation.
17. An audio processing apparatus, comprising:
the acquisition module is configured to acquire second target audio corresponding to a target online conference, the second target audio being audio obtained by adding an audio watermark to second conference audio of the target online conference, and the audio watermark being generated according to identity data of a conference member;
the separation module is configured to perform a separation operation on the second target audio to obtain the audio watermark; and
the determining module is configured to parse the audio watermark to obtain the identity data, and to determine, among the conference members participating in the target online conference, the conference member corresponding to the identity data.
18. The apparatus of claim 17, wherein the second target audio is conference audio obtained by adding the audio watermark, in segments, to the second conference audio of the target online conference; and the separation module is further configured to: sequentially intercept target audio fragments from the second target audio through an interception window at a preset step length, the window length of the interception window being the segment length used when the audio watermark was added, in segments, to the second conference audio; and perform the separation operation on the target audio fragments in sequence until the audio watermark is obtained.
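A hypothetical sketch of claim 18: slide an interception window (window length equal to the embedding segment length) over the second target audio in preset steps, trying each fragment until the watermark is recovered. The correlation-based detector and its threshold are assumptions; the claim only requires that fragments be intercepted and tried in sequence:

```python
def correlate(fragment, watermark_signal):
    """Inner product of a fragment with the candidate watermark signal."""
    return sum(f * w for f, w in zip(fragment, watermark_signal))

def find_watermark(audio, watermark_signal, step, threshold):
    """Slide an interception window over the audio at the preset step and
    return the offset of the first fragment whose correlation with the
    watermark signal exceeds the threshold, or None if no fragment does."""
    window_len = len(watermark_signal)  # equals the embedding segment length
    for start in range(0, len(audio) - window_len + 1, step):
        fragment = audio[start:start + window_len]
        if correlate(fragment, watermark_signal) > threshold:
            return start  # watermark separated from this fragment
    return None

watermark_signal = [0.01, -0.01, 0.01, -0.01]
# Silence followed by one watermarked segment, for illustration.
audio = [0.0] * 4 + watermark_signal
pos = find_watermark(audio, watermark_signal, step=2, threshold=3e-4)
```

Stepping the window at a stride smaller than the segment length lets the detector lock onto the watermark even when the leaked recording starts mid-segment.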
19. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 9.
20. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 9.
CN202010115396.XA 2020-02-25 2020-02-25 Audio processing method and device Active CN113380260B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010115396.XA CN113380260B (en) 2020-02-25 2020-02-25 Audio processing method and device

Publications (2)

Publication Number Publication Date
CN113380260A CN113380260A (en) 2021-09-10
CN113380260B true CN113380260B (en) 2023-11-07

Family

ID=77568463

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010115396.XA Active CN113380260B (en) 2020-02-25 2020-02-25 Audio processing method and device

Country Status (1)

Country Link
CN (1) CN113380260B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115396699A (en) * 2022-08-25 2022-11-25 中央宣传部电影技术质量检测所 Management method of movie digital watermark and control method of watermark control software

Citations (1)

Publication number Priority date Publication date Assignee Title
CN108289254A (en) * 2018-01-30 2018-07-17 北京小米移动软件有限公司 Web conference information processing method and device

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US7383228B2 (en) * 1998-08-13 2008-06-03 International Business Machines Corporation Method and system for preventing unauthorized rerecording of multimedia content

Patent Citations (1)

Publication number Priority date Publication date Assignee Title
CN108289254A (en) * 2018-01-30 2018-07-17 北京小米移动软件有限公司 Web conference information processing method and device

Similar Documents

Publication Publication Date Title
CN111935443B (en) Method and device for sharing instant messaging tool in real-time live broadcast of video conference
US20050243168A1 (en) System and process for adding high frame-rate current speaker data to a low frame-rate video using audio watermarking techniques
CN107370610B (en) Conference synchronization method and device
US9191516B2 (en) Teleconferencing using steganographically-embedded audio data
US11334649B2 (en) Method, system and product for verifying digital media
US10673913B2 (en) Content management across a multi-party conference system by parsing a first and second user engagement stream and transmitting the parsed first and second user engagement stream to a conference engine and a data engine from a first and second receiver
US8767922B2 (en) Elimination of typing noise from conference calls
EP3005690B1 (en) Method and system for associating an external device to a video conference session
CN105706073A (en) Call handling
US10904251B2 (en) Blockchain-based copyright protection method and apparatus, and electronic device
CN113380260B (en) Audio processing method and device
US11269976B2 (en) Apparatus and method for watermarking a call signal
CN111199745A (en) Advertisement identification method, equipment, media platform, terminal, server and medium
US20210050026A1 (en) Audio fingerprinting for meeting services
KR102545276B1 (en) Communication terminal based group call security apparatus and method
CN115333865B (en) Client data security management method of video conference system
US10811018B2 (en) System and method for using a unidirectional watermark for information leak identification
US10165365B2 (en) Sound sharing apparatus and method
KR102610860B1 (en) System and method for providing group call security service
CN113259621B (en) Cloud conference step-by-step recording method and system
Deepikaa et al. Coverless VoIP Steganography Using Hash and Hash
CN114564703A (en) Information processing method, device, equipment and storage medium based on webpage
CN111355919B (en) Communication session control method and device
CN114840824A (en) Data processing method, device, terminal, cloud service and storage medium
US10580410B2 (en) Transcription of communications

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40052261
Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant