CN115700881A - Conference terminal and method for embedding voice watermark - Google Patents

Conference terminal and method for embedding voice watermark

Info

Publication number
CN115700881A
Authority
CN
China
Prior art keywords
signal, sound, voice, watermark, conference terminal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110796268.0A
Other languages
Chinese (zh)
Inventor
杜博仁
张嘉仁
曾凯盟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Acer Inc
Original Assignee
Acer Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Acer Inc filed Critical Acer Inc
Priority to CN202110796268.0A priority Critical patent/CN115700881A/en
Publication of CN115700881A publication Critical patent/CN115700881A/en
Pending legal-status Critical Current

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

An embodiment of the invention provides a conference terminal and a method for embedding a sound watermark. In the method, a first speech signal and a first sound watermark signal are received, respectively. The first speech signal relates to the speech content of a speaker at another conference terminal, and the first sound watermark signal corresponds to that other conference terminal. The first speech signal is distributed to a host path to output a second speech signal, and the first sound watermark signal is distributed to an offload path to output a second sound watermark signal. The host path provides more digital signal processing sound effects than the offload path. The second speech signal and the second sound watermark signal are synthesized to output a synthesized sound signal, which is used for audio playback. Thereby, a complete sound watermark signal can be output.

Description

Conference terminal and method for embedding voice watermark
Technical Field
The present invention relates to a voice conference, and in particular, to a conference terminal and a method for embedding a voice watermark.
Background
Teleconferencing enables people in different locations or spaces to converse, and conference-related devices, protocols, and applications have matured considerably. Notably, some real-time conference programs mix a speech signal with a sound watermark signal. However, general speech signal processing techniques (e.g., band filtering, noise suppression, dynamic range compression (DRC), echo cancellation, etc.) are designed for ordinary speech signals: they retain speech components and exclude everything else. If the speech signal and the sound watermark signal pass through the same speech signal processing in the transmission path, the sound watermark signal may be treated as noise or a non-speech signal and be filtered out.
Disclosure of Invention
An embodiment of the invention is directed to a conference terminal and a method for embedding a sound watermark.
According to an embodiment of the invention, the method for embedding a sound watermark is suitable for a conference terminal and includes (but is not limited to) the following steps: a first speech signal and a first sound watermark signal are received, respectively. The first speech signal relates to the speech content of a speaker at another conference terminal, and the first sound watermark signal corresponds to that other conference terminal. The first speech signal is distributed to a host path to output a second speech signal, and the first sound watermark signal is distributed to an offload path to output a second sound watermark signal. The host path provides more digital signal processing (DSP) sound effects than the offload path. The second speech signal and the second sound watermark signal are synthesized to output a synthesized sound signal, which is used for audio playback.
According to an embodiment of the invention, a conference terminal includes (but is not limited to) a radio, a speaker, a communication transceiver, and a processor. The radio is used for receiving sound. The speaker is used for playing sound. The communication transceiver is used for transmitting or receiving data. The processor is coupled to the radio, the speaker, and the communication transceiver, and is configured to: receive a first speech signal and a first sound watermark signal through the communication transceiver, respectively; distribute the first speech signal to a host path to output a second speech signal; distribute the first sound watermark signal to an offload path to output a second sound watermark signal; and synthesize the second speech signal and the second sound watermark signal to output a synthesized sound signal. The first speech signal relates to the speech content of a speaker at another conference terminal, and the first sound watermark signal corresponds to that other conference terminal. The host path provides more digital signal processing sound effects than the offload path. The synthesized sound signal is used for audio playback.
Based on the above, in the conference terminal and the method for embedding a sound watermark according to embodiments of the invention, the terminal provides two transmission paths for the speech signal and the sound watermark signal respectively, so that the sound watermark signal undergoes less signal processing, and the two signals are then synthesized. Therefore, the conference terminal can play back both the speech signal and the sound watermark signal of the far-end speaker intact, while still reducing environmental noise.
Drawings
The accompanying drawings are included to provide a further understanding of the invention, and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments of the invention and together with the description serve to explain the principles of the invention.
FIG. 1 is a schematic diagram of a conferencing system in accordance with an embodiment of the present invention;
FIG. 2 is a flow chart of a method for embedding a sound watermark according to an embodiment of the present invention;
FIG. 3 is a flow chart illustrating the generation of a speech signal and an acoustic watermark signal according to an embodiment of the present invention;
FIG. 4 is a flow chart illustrating the generation of a voice watermark signal according to an embodiment of the present invention;
FIG. 5 is a diagram of an audio processing architecture according to an embodiment of the invention.
Description of the reference numerals
1: conference system;
10a, 10c: conference terminals;
50: cloud server;
11: radio;
13: speaker;
15: communication transceiver;
17: memory;
19: processor;
S_a, S_a′, S_A, S_b′, S_B, S_B′: speech signals;
W_A, W_B: sound watermark signals;
S_B′+W_B: synthesized sound signal;
S210–S290, S310–S370, S410, S430: steps;
251: audio playback system;
271: audio receiving system;
k_w^A, k_w^B: watermark keys;
w_0^A, w_0^B: original watermarks;
APP1, APP2: application programs;
SFX: stream effect;
MFX: mode effect;
EFX: endpoint effect;
OSFX: offload stream effect;
OMFX: offload mode effect;
L1–L4: layers.
Detailed Description
Reference will now be made in detail to exemplary embodiments of the invention, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings and the description to refer to the same or like parts.
Fig. 1 is a schematic diagram of a conference system 1 according to an embodiment of the present invention. Referring to fig. 1, a conference system 1 includes (but is not limited to) a plurality of conference terminals 10a,10c and a cloud server 50.
Each conference terminal 10a,10c may be a wired telephone, a mobile phone, a tablet computer, a desktop computer, a notebook computer, or a smart speaker. Each conference terminal 10a,10c includes, but is not limited to, a radio 11, a speaker 13, a communication transceiver 15, a memory 17, and a processor 19.
The radio 11 may be a dynamic (moving-coil), condenser, or electret condenser microphone, or a combination of other electronic components, analog-to-digital converters, filters, and audio processors capable of receiving sound waves (e.g., human voice, ambient sound, machine operation sound, etc.) and converting them into sound signals. In one embodiment, the radio 11 is used for receiving/recording the voice of a speaker. In some embodiments, the resulting speech signal may include the speaker's voice, the sound emitted by the speaker 13, and/or other ambient sounds.
The speaker 13 may be a horn or a loudspeaker of any type. In one embodiment, the speaker 13 is used to play sound.
The communication transceiver 15 is, for example, a transceiver supporting wired networks such as Ethernet, fiber optics, or cable (and may include, but is not limited to, components such as a connection interface, a signal converter, and a communication protocol processing chip), or wireless networks such as Wi-Fi, fourth-generation (4G), fifth-generation (5G), or later-generation mobile networks (and may include, but is not limited to, components such as an antenna, a digital-to-analog/analog-to-digital converter, and a communication protocol processing chip). In one embodiment, the communication transceiver 15 is used to transmit or receive data.
The memory 17 may be any type of fixed or removable random access memory (RAM), read-only memory (ROM), flash memory, hard disk drive (HDD), solid-state drive (SSD), or the like. In one embodiment, the memory 17 is used for recording program code, software modules, configurations, data (e.g., audio signals), or files.
The processor 19 is coupled to the radio 11, the speaker 13, the communication transceiver 15, and the memory 17. The processor 19 may be a central processing unit (CPU), a graphics processing unit (GPU), another programmable general-purpose or special-purpose microprocessor, a digital signal processor (DSP), a programmable controller, a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), other similar components, or a combination thereof. In one embodiment, the processor 19 is configured to execute all or part of the operations of the conference terminals 10a and 10c, and can load and execute each software module, file, and data recorded in the memory 17.
In one embodiment, the processor 19 includes a main processor 191 and a second processor 193. For example, the main processor 191 is a CPU, and the second processor 193 is a Platform Controller Hub (PCH) or another chip or processor that consumes less power than the CPU. However, in some embodiments, the functions and/or components of the main processor 191 and the second processor 193 may be integrated.
The cloud server 50 connects conference terminals 10a,10c directly or indirectly via the network. The cloud server 50 may be a computer system, a server, or a signal processing device. In one embodiment, the conference terminals 10a,10c may also serve as the cloud server 50. In another embodiment, the cloud server 50 may be a separate cloud server different from the conference terminals 10a,10c. In some embodiments, the cloud server 50 includes (but is not limited to) the same or similar communication transceiver 15, memory 17 and processor 19, and the implementation and functions of these components will not be described in detail.
Hereinafter, the method according to the embodiment of the present invention will be described with reference to various devices, components and modules in the conference system 1. The various processes of the method may be adjusted according to the implementation, and are not limited thereto.
It should be noted that, for convenience of description, the same components may perform the same or similar operations, and are not described in detail again. For example, processors 19 of conference terminals 10a and 10c may implement the same or similar methods as embodiments of the present invention.
Fig. 2 is a flowchart of a method for embedding a sound watermark according to an embodiment of the present invention. Referring to fig. 1 and 2, suppose that the conference terminals 10a, 10c establish a conference call. For example, a conference is established through video software, voice call software, or a phone call, and the speakers can start talking. The processor 19 of the conference terminal 10a may receive a speech signal S_B and a sound watermark signal W_B, respectively, through the communication transceiver 15 (i.e., via a network interface) (step S210). Specifically, the speech signal S_B relates to the speech content of the speaker at the conference terminal 10c (for example, a speech signal obtained by the radio 11 of the conference terminal 10c receiving its speaker's voice), and the sound watermark signal W_B corresponds to the conference terminal 10c.
For example, fig. 3 illustrates a flowchart of generating the speech signal S_B and the sound watermark signal W_B according to an embodiment of the invention. Referring to fig. 3, the cloud server 50 receives, via a network interface, the speech signal S_b′ recorded by the radio 11 of the conference terminal 10c (step S310). The speech signal S_b′ may include the speaker's voice, the sound played by the speaker 13, and/or other ambient sounds. The cloud server 50 may perform speech signal processing, such as noise suppression and gain adjustment, on the speech signal S_b′ (step S330) and thereby generate the speech signal S_B. However, in some embodiments, the speech signal processing may be omitted and the speech signal S_b′ may be used directly as the speech signal S_B.
On the other hand, the cloud server 50 may generate the sound watermark signal W_B for the conference terminal 10c based on the speech signal S_B. Specifically, fig. 4 illustrates a flowchart of generating the sound watermark signal W_B according to an embodiment of the invention. Referring to fig. 4, the cloud server 50 may evaluate parameters suitable for the watermark (e.g., gain, time difference, and/or frequency band) through a psychoacoustic model (step S410). The psychoacoustic model is a mathematical model simulating the human auditory mechanism, from which frequency bands inaudible to the human ear can be derived. The cloud server 50 may then generate the sound watermark signal W_B from the original watermark w_0^B and a watermark key k_w^B (step S430). It should be noted that the key algorithm used in step S430 serves data security and integrity protection. In some embodiments, the watermark key k_w^B may be omitted, and the original watermark w_0^B may be used directly as the sound watermark signal W_B.
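Step S430 can be sketched as follows: a minimal, hypothetical spread-spectrum construction in which the watermark key seeds a pseudo-random chip sequence. The function name, chip length, and gain are illustrative assumptions, not the patented algorithm.

```python
import hashlib

import numpy as np

def generate_watermark_signal(original_watermark: bytes, watermark_key: bytes,
                              chips_per_bit: int = 64, gain: float = 0.01) -> np.ndarray:
    """Spread each watermark bit over a key-derived +/-1 chip sequence (hypothetical sketch)."""
    # Derive a deterministic PRNG seed from the watermark key.
    seed = int.from_bytes(hashlib.sha256(watermark_key).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    # Unpack the watermark bytes into bits and map {0, 1} -> {-1, +1}.
    bits = np.unpackbits(np.frombuffer(original_watermark, dtype=np.uint8))
    symbols = bits.astype(np.float64) * 2.0 - 1.0
    # One chip sequence per bit; only holders of the key can despread it.
    chips = rng.choice([-1.0, 1.0], size=(bits.size, chips_per_bit))
    # Keep the amplitude low, so a psychoacoustically chosen gain can mask it.
    return (gain * symbols[:, None] * chips).ravel()

w_b = generate_watermark_signal(b"terminal-10c", b"watermark-key")
```

Because the chip sequence is derived deterministically from the key, a receiver holding k_w^B can correlate against the same chips to recover w_0^B, while to speech-oriented DSP the signal looks like low-level noise.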
It should be noted that, for how the conference terminal 10a's speech signal S_a′ yields the speech signal S_A and the sound watermark signal W_A, reference may be made to the above description of the speech signal S_b′, the speech signal S_B, and the sound watermark signal W_B, which is not repeated here. For example, the cloud server 50 may generate the sound watermark signal W_A from an original watermark w_0^A and a watermark key k_w^A.
In an embodiment, the original watermark w_0^A and the sound watermark signal W_A are used to identify the conference terminal 10a, and the original watermark w_0^B and the sound watermark signal W_B are used to identify the conference terminal 10c. For example, the sound watermark signal W_A records the sound of an identifier of the conference terminal 10a. However, in some embodiments, the invention does not limit the content of the sound watermark signals W_A and W_B.
Referring to fig. 3, the cloud server 50 may transmit the speech signal S_B and the sound watermark signal W_B to the conference terminal 10a via the network interface, so that the conference terminal 10a receives them (step S370). Similarly, the cloud server 50 may transmit the speech signal S_A and the sound watermark signal W_A to the conference terminal 10c, so that the conference terminal 10c receives them.
In one embodiment, the processor 19 may receive a network packet through the communication transceiver 15 via the network. The network packet includes both the speech signal S_B and the sound watermark signal W_B. The processor 19 may distinguish the speech signal S_B from the sound watermark signal W_B according to an identifier in the network packet. This identifier indicates that one part of the data payload of the network packet is the speech signal S_B and another part is the sound watermark signal W_B. For example, the identifier indicates the start positions of the speech signal S_B and the sound watermark signal W_B in the network packet.
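As a concrete illustration, the identifier could be a fixed-size header field giving the offset at which the watermark bytes begin. This layout (a 4-byte big-endian offset) is an assumption made for the sketch, not a format defined by the patent.

```python
import struct

def split_packet(packet: bytes) -> tuple[bytes, bytes]:
    """Split a payload into speech and watermark bytes using a hypothetical
    4-byte big-endian identifier that stores the watermark's start offset."""
    (watermark_offset,) = struct.unpack_from(">I", packet, 0)
    payload = packet[4:]
    speech = payload[:watermark_offset]     # first part: speech signal S_B
    watermark = payload[watermark_offset:]  # second part: watermark signal W_B
    return speech, watermark

# Build a packet: speech bytes first, watermark bytes starting at offset 6.
packet = struct.pack(">I", 6) + b"speech" + b"wmark"
speech, watermark = split_packet(packet)  # → b"speech", b"wmark"
```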
In one embodiment, the processor 19 may receive a first network packet through the communication transceiver 15 via the network, where the first network packet includes the speech signal S_B, and receive a second network packet through the communication transceiver 15 via the network, where the second network packet includes the sound watermark signal W_B. That is, the processor 19 distinguishes the speech signal S_B from the sound watermark signal W_B by using two or more network packets.
Referring to fig. 2, the processor 19 may distribute the speech signal S_B to a host path to output a speech signal S_B′ (step S231), and distribute the sound watermark signal W_B to an offload path to output the sound watermark signal W_B (step S233). Specifically, the conference terminal 10a may apply one or more digital signal processing (DSP) sound effects to an audio stream, such as equalization, reverb, echo cancellation, gain control, or other audio processing. These effects may be further encapsulated into one or more audio processing objects (APOs), for example a stream effect (SFX), a mode effect (MFX), and an endpoint effect (EFX).
Fig. 5 is a diagram of an audio processing architecture according to an embodiment of the invention. Referring to fig. 5, in this architecture, the first layer L1 contains the applications APP1 and APP2, the second layer L2 is an audio engine, the third layer L3 is a driver, and the fourth layer L4 is hardware. The application APP1 may be called the primary application; for APP1, the audio engine provides the stream effect SFX, the mode effect MFX, and the endpoint effect EFX, and connects to the driver through a system pin. The application APP2 may be called the secondary application; for APP2, the audio engine provides the offload stream effect (OSFX) and the offload mode effect (OMFX), and connects to the driver through an offload pin.
In the embodiment of the invention, the host path provides more digital signal processing (DSP) sound effects than the offload path. It follows that, compared with the speech signal S_B, the sound watermark signal W_B may be unaffected, or less affected, by digital signal processing. For example, the processor 19 performs noise suppression on the speech signal S_B, while the sound watermark signal W_B is not noise-suppressed. Alternatively, the sound watermark signal W_B may only be gain-adjusted and not subjected to speech-related signal processing.
It should be noted that fig. 2 shows the processor 19 performing receiver-side speech signal processing on the speech signal S_B while the sound watermark signal W_B undergoes no receiver-side speech signal processing (i.e., the output of the offload path is still the sound watermark signal W_B). However, in some embodiments, the sound watermark signal W_B may also undergo partial receiver-side speech signal processing (i.e., the output of the offload path is a new sound watermark signal W_B′).
In one embodiment, the host path is configured for primary applications such as voice calls or multimedia playback, for example a media player in a Windows system or telephony software, while the offload path is configured for secondary applications such as alert tones, ring tones, or music playback, for example a pure music player. The processor 19 may link the speech signal S_B with the primary application so that the speech signal S_B is input to the host path of the primary application. Likewise, the processor 19 may link the sound watermark signal W_B with the secondary application so that the sound watermark signal W_B enters the offload path of the secondary application.
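The two-path routing described above can be sketched as two effect chains. The stand-in effects below (a toy noise gate for the host path, gain-only for the offload path) are illustrative assumptions, not the terminal's actual APOs.

```python
import numpy as np

def soft_noise_gate(x: np.ndarray, threshold: float = 0.05) -> np.ndarray:
    """Toy stand-in for speech-oriented noise suppression: mute low-level samples."""
    return np.where(np.abs(x) < threshold, 0.0, x)

def apply_gain(x: np.ndarray, g: float = 0.9) -> np.ndarray:
    return g * x

# Host path: the full speech-oriented DSP chain (stand-ins for SFX/MFX/EFX).
HOST_CHAIN = [soft_noise_gate, apply_gain]
# Offload path: gain adjustment only, so a low-level watermark is not gated away.
OFFLOAD_CHAIN = [apply_gain]

def process(signal: np.ndarray, chain) -> np.ndarray:
    for effect in chain:
        signal = effect(signal)
    return signal

speech = np.array([0.5, -0.4, 0.01, 0.3])          # S_B
watermark = np.full(4, 0.01)                       # W_B, deliberately quiet
speech_out = process(speech, HOST_CHAIN)           # quiet content is gated out
watermark_out = process(watermark, OFFLOAD_CHAIN)  # survives, only scaled
```

The point of the split is visible in the example: a 0.01-amplitude sample is treated as noise and muted on the host path, while the equally quiet watermark passes through the offload path intact apart from the gain.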
In one embodiment, the main processor 191 performs the signal processing on the host path, and the second processor 193 performs the signal processing on the offload path. In other words, the main processor 191 provides the digital signal processing sound effects of the host path to the speech signal S_B, and the second processor 193 provides the digital signal processing sound effects of the offload path to the sound watermark signal W_B. For example, the second processor 193 provides less memory for mode effects than the main processor 191.
Referring to fig. 2, the processor 19 synthesizes the speech signal S_B′ and the sound watermark signal W_B to output a synthesized sound signal S_B′+W_B (step S250). For example, the processor 19 may add the sound watermark signal W_B into the speech signal S_B′ in the time domain through spread spectrum, echo hiding, phase coding, or the like, to form the synthesized sound signal S_B′+W_B. Alternatively, the processor 19 may add the sound watermark signal W_B into the speech signal S_B′ in the frequency domain through modulated carriers, frequency band subtraction, or the like. The synthesized sound signal S_B′+W_B may then be used by the audio playback system 251; for example, the processor 19 plays the synthesized sound signal S_B′+W_B through the speaker 13. Therefore, the audio playback system 251 can output a complete, or less distorted, sound watermark signal W_B.
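In the simplest time-domain case, step S250 reduces to sample-wise addition with a clipping guard. This direct additive mix is a minimal sketch of one option, not the only embedding the paragraph above names.

```python
import numpy as np

def synthesize(speech: np.ndarray, watermark: np.ndarray) -> np.ndarray:
    """Additively embed a (typically low-gain) watermark into speech,
    zero-padding the shorter signal and guarding against clipping."""
    n = max(speech.size, watermark.size)
    out = np.zeros(n, dtype=np.float64)
    out[:speech.size] += speech        # S_B'
    out[:watermark.size] += watermark  # W_B
    return np.clip(out, -1.0, 1.0)     # keep samples within full scale [-1, 1]

mixed = synthesize(np.array([0.2, -0.3, 0.99]), np.array([0.01, 0.01, 0.02]))
# → array([0.21, -0.29, 1.0])
```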
On the other hand, the processor 19 may obtain the speech signal S_a of the local speaker through the audio receiving system 271; for example, the processor 19 records sound through the radio 11 to obtain the speech signal S_a. The processor 19 may perform transmitter-side speech signal processing on the speech signal S_a to output a speech signal S_a′ (step S290) and transmit the speech signal S_a′ to the cloud server 50 through the communication transceiver 15. Similarly, the cloud server 50 may generate the speech signal S_A and the sound watermark signal W_A based on the speech signal S_a′. In addition, the conference terminal 10c may likewise output a complete, or less distorted, sound watermark signal W_A through its speaker 13.
In summary, in the conference terminal and the method for embedding a sound watermark according to embodiments of the invention, the sound watermark signal is synthesized with the speech signal at the output end of the conference terminal, bypassing the system's speech signal processing to embed the sound watermark. The embodiments provide a host path and an offload path so that the sound watermark signal undergoes little or no signal processing. Therefore, the terminal can play back the far-end speaker's speech signal and sound watermark intact while reducing environmental noise.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for embedding a sound watermark is suitable for a conference terminal, and is characterized in that the method for embedding the sound watermark comprises the following steps:
respectively receiving a first voice signal and a first sound watermark signal, wherein the first voice signal is related to the voice content of a speaker corresponding to another conference terminal, and the first sound watermark signal corresponds to the another conference terminal;
distributing the first voice signal to a host path to output a second voice signal, and distributing the first voice watermark signal to an offload path to output a second voice watermark signal, wherein the host path provides more digital signal processing sound effects than the offload path; and
synthesizing the second speech signal and the second audio watermark signal to output a synthesized sound signal, wherein the synthesized sound signal is used for audio playback.
2. The method of claim 1, wherein the step of receiving the first speech signal and the first sound watermark signal respectively comprises:
receiving a network packet via a network, wherein the network packet comprises the first voice signal and the first sound watermark signal; and
identifying the first voice signal and the first sound watermark signal according to an identifier in the network packet.
3. The method of claim 1, wherein the steps of receiving the first speech signal and the first audio watermark signal respectively comprise:
receiving a first network packet via a network, wherein the first network packet comprises the first voice signal; and
receiving a second network packet via the network, wherein the second network packet includes the first sound watermark signal.
4. The method of claim 1, wherein the host path is used for voice call or multimedia playback, and the offload path is used for alert tone, ring tone or music playback.
5. The method of embedding a sound watermark according to claim 1, further comprising:
performing, by a host processor, signal processing on the host path; and
signal processing on the offload path is performed by a second processor.
6. A conference terminal, comprising:
a radio for receiving sound;
a speaker for playing sound;
a communication transceiver for transmitting or receiving data; and
a processor coupled to the radio, the speaker, and the communications transceiver, wherein the processor is configured to:
receiving a first voice signal and a first sound watermark signal through the communication transceiver respectively, wherein the first voice signal is related to the voice content of a speaker corresponding to another conference terminal, and the first sound watermark signal corresponds to the another conference terminal;
distributing the first voice signal to a host path to output a second voice signal, and distributing the first voice watermark signal to an offload path to output a second voice watermark signal, wherein the host path provides more digital signal processing sound effects than the offload path; and
synthesizing the second speech signal and the second audio watermark signal to output a synthesized sound signal, wherein the synthesized sound signal is used for audio playback.
7. The conference terminal of claim 6, wherein said processor is further configured to:
receiving, by the communications transceiver, a network packet over a network, wherein the network packet includes the first voice signal and the first sound watermark signal.
8. The conference terminal of claim 6, wherein the processor is further configured to:
receiving, by the communications transceiver, a first network packet via a network, wherein the first network packet includes the first voice signal; and
receiving, by the communications transceiver, a second network packet via the network, wherein the second network packet includes the first sound watermark signal.
9. The conference terminal of claim 6, wherein the host path is for a voice call or multimedia playback and the offload path is for an alert tone, ring tone or music playback.
10. The conference terminal of claim 6, wherein the processor comprises:
a main processor to perform signal processing on the host path; and
a second processor for performing signal processing on the offload path.
CN202110796268.0A 2021-07-14 2021-07-14 Conference terminal and method for embedding voice watermark Pending CN115700881A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110796268.0A CN115700881A (en) 2021-07-14 2021-07-14 Conference terminal and method for embedding voice watermark


Publications (1)

Publication Number Publication Date
CN115700881A true CN115700881A (en) 2023-02-07

Family

ID=85120431


Country Status (1)

Country Link
CN (1) CN115700881A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination