CN114337908B

CN114337908B - Method and device for generating interference signal of target voice signal

Info

Publication number: CN114337908B
Application number: CN202210011028.XA
Authority: CN
Inventors: 李军锋; 程龙彪; 姚鼎鼎; 顾建军; 颜永红
Original assignee: Institute of Acoustics CAS
Current assignee: Institute of Acoustics CAS
Priority date: 2022-01-05
Filing date: 2022-01-05
Publication date: 2024-04-12
Anticipated expiration: 2042-01-05
Also published as: CN114337908A

Abstract

The application discloses a method and a device for generating an interference signal of a target voice signal. The method comprises the following steps: acquiring a target voice signal to be interfered with; framing the target voice signal to obtain at least one voice frame; processing each voice frame, including performing first processing, second processing and/or third processing on the voice frame to obtain a frequency domain envelope inversion signal, a time domain inversion signal and/or a time domain envelope inversion signal; and determining the interference signal of the target voice signal according to the frequency domain envelope inversion signal, the time domain inversion signal and/or the time domain envelope inversion signal and the preset weight coefficients respectively corresponding to these signals. In the method and the device, three signals related to the target voice signal are constructed, namely the frequency domain envelope inversion signal, the time domain inversion signal and the time domain envelope inversion signal, and the interference signal of the target voice signal is obtained from these three signals and their corresponding preset weight coefficients, which improves the interference effect of the interference signal on the target voice signal.

Description

Method and device for generating interference signal of target voice signal

Technical Field

The invention relates to the technical field of voice signal processing, and more particularly to a method and apparatus for generating an interference signal of a target voice signal.

Background

With the rapid development of the mobile internet, recording-capable devices around us have grown explosively in both variety and number. While these ubiquitous devices bring convenience, voice privacy leakage has become an increasingly serious problem that threatens both national security and people's daily lives.

The main goal of voice privacy protection technology is to reduce the intelligibility of user voice signals picked up by recording devices. Early research focused on protecting voice privacy through physical isolation. In recent years, technologies that actively interfere with potential eavesdropping devices have received more attention. The masking that reduces speech intelligibility under interference falls into two broad categories: energy masking and information masking. Energy masking mainly masks the speech with uncorrelated noise; with the development of speech enhancement technology, however, energy-masked speech is easily recovered. Unlike energy masking, the masking signals used for information masking are all derived from the target voice signal and are therefore more strongly correlated with it. However, it is currently difficult to generate an interference signal (i.e., a masking signal) of a target voice signal for information masking in real time.

Disclosure of Invention

In view of the above problems with existing methods, the application provides a method and a device for generating an interference signal of a target voice signal.

In a first aspect, the present application proposes an interference signal generating method for a target voice signal, including:

acquiring a target voice signal to be interfered;

carrying out frame division processing on the target voice signal to obtain at least one voice frame, wherein the frame length of each voice frame in the at least one voice frame is a random value;

processing each of the at least one speech frame, the processing comprising:

performing first processing on each voice frame in the at least one voice frame to obtain a frequency domain envelope inversion signal;

performing second processing on each voice frame in the at least one voice frame to obtain a time domain inversion signal; and/or

performing third processing on each voice frame in the at least one voice frame to obtain a time domain envelope inversion signal;

and determining the interference signal of the target voice signal according to the frequency domain envelope inversion signal, the time domain inversion signal and/or the time domain envelope inversion signal and preset weight coefficients respectively corresponding to the frequency domain envelope inversion signal and the time domain envelope inversion signal.

In one possible implementation, the first processing of each of the at least one speech frame to obtain a frequency domain envelope inversion signal includes:

performing Fourier transform on each voice frame in the at least one voice frame to obtain a frequency spectrum of each voice frame in the at least one voice frame;

determining a spectral envelope of each of the at least one speech frame from a spectrum of each speech frame;

determining a first fine structure corresponding to each voice frame according to the spectrum envelope of the voice frame;

determining, according to the spectrum envelope of each voice frame, an N-th degree polynomial and/or exponential function fitting curve of the spectrum envelope of each voice frame, wherein N is an integer greater than or equal to 1;

and determining the frequency domain envelope inversion signal according to the frequency spectrum envelope of each voice frame, the respective N-degree polynomial or exponential function fitting curve of the frequency spectrum envelope of each voice frame and the first fine structure.

In one possible implementation, the performing a second processing on each of the at least one speech frame to obtain a time domain inverted signal includes:

and carrying out time domain inversion on each voice frame in the at least one voice frame to obtain a time domain inversion signal.

In one possible implementation, the performing third processing on each of the at least one speech frame to obtain a time-domain envelope inversion signal includes:

carrying out frequency band division on each voice frame in the at least one voice frame to obtain partial target voice signals corresponding to each frequency band;

determining the time domain envelope of the target voice signals according to the partial target voice signals corresponding to the frequency bands;

determining a second fine structure corresponding to the time domain envelope of the partial target voice signal corresponding to each frequency band according to the time domain envelope;

and determining a time domain envelope inversion signal according to the time domain envelope of the part of the target voice signals corresponding to each frequency band and the second fine structure.

In one possible implementation, the determining the interference signal of the target speech signal according to the frequency domain envelope inversion signal, the time domain inversion signal, and/or the time domain envelope inversion signal and the preset weight coefficients corresponding to the frequency domain envelope inversion signal and the time domain envelope inversion signal, respectively, includes:

determining a weighted signal according to the frequency domain envelope inversion signal, the time domain inversion signal and/or the time domain envelope inversion signal and preset weight coefficients respectively corresponding to the frequency domain envelope inversion signal and the time domain envelope inversion signal;

and determining an interference signal of the target voice signal according to the target voice signal and the weighted signal.

In a possible implementation, the determining the weighted signal according to the frequency domain envelope inversion signal, the time domain inversion signal, and/or the time domain envelope inversion signal and the preset weight coefficients corresponding to the frequency domain envelope inversion signal and the time domain envelope inversion signal respectively includes:

multiplying the frequency domain envelope inversion signal, the time domain inversion signal and the time domain envelope inversion signal with preset weight coefficients respectively corresponding to the frequency domain envelope inversion signal and the time domain envelope inversion signal to obtain at least one multiplication result;

and accumulating the at least one multiplication result to obtain a weighted signal.

In one possible implementation, the determining the interference signal of the target speech signal according to the target speech signal and the weighted signal includes:

performing low-pass filtering on the weighted signal with a low-pass filter to obtain a low-pass interference signal, wherein the cut-off frequency of the low-pass filter is set randomly in order to reduce the possibility that the target voice signal is restored after being interfered with by the generated interference signal;

and replacing the part of the low-pass interference signal corresponding to any frequency band with the part of the target voice signal corresponding to that frequency band, to obtain the interference signal of the target voice signal.

In a second aspect, the present application proposes an interference signal generating device for a target voice signal, including:

the transceiver unit is used for acquiring a target voice signal to be interfered with;

the processing unit is used for carrying out frame division processing on the target voice signal to obtain at least one voice frame, and the frame length of each voice frame in the at least one voice frame is a random value;

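the processing unit is configured to process each of the at least one speech frame, where the processing includes: performing first processing on each voice frame in the at least one voice frame to obtain a frequency domain envelope inversion signal; performing second processing on each voice frame in the at least one voice frame to obtain a time domain inversion signal; and/or performing third processing on each voice frame in the at least one voice frame to obtain a time domain envelope inversion signal;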

the processing unit is used for determining the interference signal of the target voice signal according to the frequency domain envelope inversion signal, the time domain inversion signal and/or the time domain envelope inversion signal and preset weight coefficients respectively corresponding to the frequency domain envelope inversion signal, the time domain inversion signal and/or the time domain envelope inversion signal.

In a possible implementation, the processing unit is specifically configured to perform fourier transform on each of the at least one speech frame to obtain a frequency spectrum of each of the at least one speech frame; determining a spectral envelope of each of the at least one speech frame from a spectrum of each speech frame; determining a first fine structure corresponding to each voice frame according to the spectrum envelope of the voice frame; according to the spectrum envelope of each voice frame, determining N times of polynomial and/or exponential function fitting curves of the spectrum envelope of each voice frame, wherein N is an integer greater than or equal to 1; and determining the frequency domain envelope inversion signal according to the frequency spectrum envelope of each voice frame, the respective N-degree polynomial or exponential function fitting curve of the frequency spectrum envelope of each voice frame and the first fine structure.

In a possible implementation, the processing unit is specifically configured to perform time domain inversion on each of the at least one speech frame to obtain a time domain inverted signal.

In one possible implementation, the processing unit is specifically configured to divide a frequency band for each of the at least one speech frame to obtain a portion of the target speech signal corresponding to each frequency band; determining the time domain envelope of the target voice signals according to the partial target voice signals corresponding to the frequency bands; determining a second fine structure corresponding to the time domain envelope of the partial target voice signal corresponding to each frequency band according to the time domain envelope; and determining a time domain envelope inversion signal according to the time domain envelope of the part of the target voice signals corresponding to each frequency band and the second fine structure.

In a possible implementation, the processing unit is specifically configured to determine a weighted signal according to the frequency domain envelope inversion signal, the time domain inversion signal, and/or the time domain envelope inversion signal and preset weight coefficients corresponding to the frequency domain envelope inversion signal, the time domain inversion signal, and/or the time domain envelope inversion signal, respectively; and determining an interference signal of the target voice signal according to the target voice signal and the weighted signal.

In a possible implementation, the processing unit is specifically configured to multiply the frequency domain envelope inversion signal, the time domain inversion signal, and/or the time domain envelope inversion signal with preset weight coefficients corresponding to the frequency domain envelope inversion signal and the time domain envelope inversion signal, respectively, to obtain at least one multiplication result; and accumulating the at least one multiplication result to obtain a weighted signal.

In a possible implementation, the processing unit is specifically configured to perform low-pass filtering on the weighted signal to obtain a low-pass interference signal; and replacing a part of low-pass interference signals corresponding to any frequency band with a part of target voice signals corresponding to the any frequency band to obtain interference signals of the target voice signals.

In a third aspect, the present application also proposes an interfering signal generating device of a target speech signal, comprising at least one processor for executing a program stored in a memory, which when executed causes the device to perform the steps as in the first aspect and in various possible implementations.

In a fourth aspect, the present application also proposes a non-transitory computer readable storage medium, having stored thereon a computer program which, when executed by a processor, implements the steps as in the first aspect and in various possible implementations.

According to the above technical scheme, three signals related to the target voice signal are constructed, namely a frequency domain envelope inversion signal, a time domain inversion signal and a time domain envelope inversion signal, and the three signals are dynamically weighted to obtain a weighted signal. The interference effect of the weighted signal on the target voice signal is stronger than that of any one of the three signals alone, so the interference effect on the target voice signal is improved. The weighted signal is then low-pass filtered to obtain a low-pass interference signal, and the part of the low-pass interference signal corresponding to any frequency band is replaced with the part of the target voice signal corresponding to that frequency band to obtain the interference signal of the target voice signal, which further improves the interference effect of the interference signal on the target voice signal. In addition, the generation of the three signals is randomized, which further reduces the possibility that the target voice signal is restored after being interfered with by the interference signal.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flow chart of an interference signal generating method of a target voice signal according to an embodiment of the present application;

Fig. 2 is a schematic flow chart of the first processing performed on each of at least one voice frame according to an embodiment of the present application;

Fig. 3 is a schematic flow chart of the third processing performed on each of at least one voice frame according to an embodiment of the present application;

Fig. 4 is a schematic flow chart, provided in an embodiment of the present application, of determining an interference signal of a target voice signal according to a frequency domain envelope inversion signal, a time domain inversion signal and/or a time domain envelope inversion signal and the preset weight coefficients respectively corresponding to these signals;

Fig. 5 is a schematic structural diagram of an interference signal generating device of a target voice signal according to an embodiment of the present application;

Fig. 6 is another schematic structural diagram of an interference signal generating device for a target voice signal according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present invention will be described below with reference to the accompanying drawings in the embodiments of the present invention. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.

It should be noted that, in this application, the term "and/or" merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A exists alone, A and B exist together, or B exists alone. The terms "first", "second", "third" and the like in the description and claims of the embodiments of the present application are used to distinguish between different objects and not to describe a particular order of the objects. For example, the first processing, the second processing and the third processing are used to distinguish between different processing operations, not to describe a specific order of processing. In the embodiments of the present application, words such as "exemplary," "for example," or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary," "by way of example," or "such as" is not necessarily to be construed as advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion. In the description of the embodiments of the present application, unless otherwise indicated, "a plurality" means two or more.

To prevent voice privacy leakage, for example where a person outside a conference room can hear what is being said inside, the embodiments of the application provide a method and a device for generating an interference signal of a target voice signal. The method generates an interference signal of the target voice signal, and interfering with the target voice signal by means of this interference signal effectively prevents eavesdropping. For example, the interference signal is played outside the conference room, so that a person outside the conference room cannot accurately recognize the target voice signal, thereby preventing eavesdropping.

Fig. 1 is a flow chart of the method for generating an interference signal of a target voice signal provided in the present application. The flow comprises steps S101-S104, as follows:

S101, acquiring a target voice signal to be interfered with.

In the embodiment of the application, a target voice signal to be interfered is obtained.

S102, carrying out framing processing on the target voice signal to obtain at least one voice frame.

In the embodiment of the application, the target voice signal is framed to obtain at least one voice frame. In order to reduce the possibility that the target voice signal is restored after being interfered with by the generated interference signal, the frame length of each voice frame in the at least one voice frame is a random value. For example, the target voice signal is framed to obtain three voice frames, wherein the frame length of the first voice frame is 3 frames, the frame length of the second voice frame is 6 frames, and the frame length of the third voice frame is 8 frames.
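By way of non-limiting illustration, the random-length framing of S102 might be sketched as follows. The function name `split_into_random_frames`, the bounds `min_len` and `max_len` (counted in samples) and the use of a uniform distribution are assumptions made for this example only; the application states only that each frame length is a random value.

```python
import random

def split_into_random_frames(signal, min_len, max_len, rng=None):
    """Split `signal` into consecutive frames whose lengths are drawn at random.

    Assumption: lengths are uniform integers in [min_len, max_len] samples.
    """
    if min_len < 1 or max_len < min_len:
        raise ValueError("need 1 <= min_len <= max_len")
    rng = rng or random.Random()
    frames = []
    start = 0
    while start < len(signal):
        # Draw a fresh length for every frame; the last frame may be shorter.
        length = rng.randint(min_len, max_len)
        frames.append(signal[start:start + length])
        start += length
    return frames

signal = [float(i) for i in range(100)]
frames = split_into_random_frames(signal, 8, 32, random.Random(0))
```

Because the frame boundaries are redrawn on every run, a party attempting to undo the per-frame processing cannot rely on a fixed segmentation.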

S103, processing each voice frame in the at least one voice frame, wherein the processing comprises the following steps: performing first processing on each voice frame in the at least one voice frame to obtain a frequency domain envelope inversion signal; performing second processing on each voice frame in the at least one voice frame to obtain a time domain inversion signal; and/or performing third processing on each voice frame in the at least one voice frame to obtain a time domain envelope inversion signal.

In an embodiment of the present application, a first process is performed on each of at least one speech frame. The first processing procedure comprises S201-S205, and specifically comprises the following steps:

S201, performing Fourier transform on each voice frame in the at least one voice frame to obtain a frequency spectrum of each voice frame in the at least one voice frame;

S202, determining the spectrum envelope of each voice frame according to the spectrum of each voice frame in the at least one voice frame;

S203, determining a first fine structure corresponding to each voice frame according to the spectrum envelope of the voice frame;

S204, determining, according to the spectrum envelope of each voice frame, an N-th degree polynomial and/or exponential function fitting curve of the spectrum envelope of each voice frame.

In this embodiment of the present application, a least squares method may be used to determine, from the spectrum envelope of each voice frame, the N-th degree polynomial and/or exponential function fitting curve of that spectrum envelope, where N is an integer greater than or equal to 1.

S205, determining the frequency domain envelope inversion signal according to the spectrum envelope of each voice frame, the N-th degree polynomial or exponential function fitting curve of the spectrum envelope of each voice frame, and the first fine structure.

In the embodiment of the application, the spectrum envelope of each voice frame is inverted (mirrored) using its N-th degree polynomial fitting curve or its exponential function fitting curve as the axis of symmetry. The spectrum of each voice frame of the frequency domain envelope inversion signal is then determined from the inverted spectrum envelope and the first fine structure, and an inverse Fourier transform is applied to that spectrum to obtain the frequency domain envelope inversion signal. At this point, a first voice intelligibility interference signal related to the target voice signal has been generated.
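By way of non-limiting illustration, S201-S205 might be sketched as follows for a single voice frame. Several choices here are assumptions made for the example and are not specified in the application: the envelope is estimated by a moving average of the log-magnitude spectrum, the first fine structure is taken as the spectrum divided by the envelope, the fitting curve is a least-squares polynomial of degree N = 1, and the mirroring is carried out in the log domain so that the inverted envelope stays positive. All function names are hypothetical.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse DFT; keeps the real part, as the input is conjugate-symmetric."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real for t in range(n)]

def smooth(values, half_width):
    """Moving average, used here as a stand-in envelope estimator (assumption)."""
    n = len(values)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half_width), min(n, i + half_width + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def fit_line(ys):
    """Least-squares polynomial of degree N = 1 through the points (k, ys[k])."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((k - mean_x) ** 2 for k in range(n))
    sxy = sum((k - mean_x) * (ys[k] - mean_y) for k in range(n))
    slope = sxy / sxx if sxx else 0.0
    return [mean_y + slope * (k - mean_x) for k in range(n)]

def invert_spectral_envelope(frame, half_width=2, eps=1e-12):
    n = len(frame)
    spectrum = dft(frame)                              # S201
    half = n // 2 + 1                                  # non-negative frequencies
    log_mag = [math.log(abs(spectrum[k]) + eps) for k in range(half)]
    envelope = smooth(log_mag, half_width)             # S202 (log envelope)
    fine = [spectrum[k] / math.exp(envelope[k]) for k in range(half)]  # S203
    axis = fit_line(envelope)                          # S204
    # S205: mirror the envelope about the fitted curve, then restore the fine structure.
    flipped = [2 * a - e for a, e in zip(axis, envelope)]
    new_half = [fine[k] * math.exp(flipped[k]) for k in range(half)]
    full = [0j] * n
    for k in range(half):
        full[k] = new_half[k]
    for k in range(1, n - half + 1):                   # rebuild conjugate symmetry
        full[n - k] = new_half[k].conjugate()
    return idft(full)

frame = [math.sin(0.3 * t) + 0.5 * math.sin(1.1 * t) for t in range(32)]
inverted = invert_spectral_envelope(frame)
```

Mirroring about the fitted curve turns spectral peaks into valleys and vice versa while the fine structure (and therefore the phase) is retained, which is what keeps the result correlated with the target frame.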

In the embodiment of the present application, the time domain inversion signal is obtained by performing second processing on each of the at least one voice frame. Specifically, the second processing may be a time domain inversion (time reversal) of each of the at least one voice frame. At this point, a second voice intelligibility interference signal related to the target voice signal has been generated.
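By way of non-limiting illustration, the second processing might be sketched as a per-frame reversal of the sample order. The function name `time_reverse_frames` is hypothetical.

```python
def time_reverse_frames(frames):
    """Reverse the sample order inside each frame, keeping the frame order."""
    return [list(reversed(frame)) for frame in frames]

frames = [[1.0, 2.0, 3.0], [4.0, 5.0]]
reversed_frames = time_reverse_frames(frames)
print(reversed_frames)  # [[3.0, 2.0, 1.0], [5.0, 4.0]]
```

Note that the reversal is local to each frame: the frames themselves stay in their original order, so the result differs from reversing the whole signal.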

In an embodiment of the present application, third processing is performed on each of the at least one voice frame. The third processing comprises S301-S304, as follows:

S301, carrying out frequency band division on each voice frame in at least one voice frame to obtain partial target voice signals corresponding to each frequency band.

In the embodiment of the application, a band-pass filter is used to divide the frequency band of each voice frame in at least one voice frame, so as to obtain a part of target voice signals corresponding to each frequency band.

S302, determining the time domain envelope of the target voice signals according to the partial target voice signals corresponding to the frequency bands.

S303, determining a second fine structure corresponding to the time domain envelope of the partial target voice signal corresponding to each frequency band.

S304, determining a time domain envelope inversion signal according to the time domain envelope and the second fine structure of the part of the target voice signal corresponding to each frequency band.

In the embodiment of the application, the time domain envelope of the partial target voice signal corresponding to each frequency band is time-reversed. A time domain envelope inversion signal is then determined from the inverted time domain envelope and the second fine structure. At this point, a third voice intelligibility interference signal related to the target voice signal has been generated.
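By way of non-limiting illustration, S301-S304 might be sketched as follows for a single voice frame. The following are assumptions made for the example: the band-pass filters are idealised as brick-wall masks on DFT bins, the time domain envelope is estimated by rectifying and smoothing each band, the second fine structure is the band signal divided by its envelope, and the per-band results are summed. All function names and the band edges are hypothetical.

```python
import cmath
import math

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real for t in range(n)]

def split_bands(frame, edges):
    """S301: brick-wall band split; `edges` are DFT bin indices, e.g. [0, 4, 8, 17]."""
    n = len(frame)
    spectrum = dft(frame)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        masked = [0j] * n
        for k in range(n):
            f = min(k, n - k)          # fold negative-frequency bins onto positive ones
            if lo <= f < hi:
                masked[k] = spectrum[k]
        bands.append(idft(masked))
    return bands

def time_envelope(band, half_width=3, eps=1e-9):
    """S302: rectify and smooth (assumed envelope estimator); eps avoids division by zero."""
    n = len(band)
    rect = [abs(v) for v in band]
    out = []
    for i in range(n):
        lo, hi = max(0, i - half_width), min(n, i + half_width + 1)
        out.append(sum(rect[lo:hi]) / (hi - lo) + eps)
    return out

def invert_time_envelope(frame, edges):
    out = [0.0] * len(frame)
    for band in split_bands(frame, edges):
        env = time_envelope(band)
        fine = [b / e for b, e in zip(band, env)]   # S303: second fine structure
        flipped = env[::-1]                         # time-reverse the envelope
        for i in range(len(frame)):
            out[i] += fine[i] * flipped[i]          # S304: recombine per band
    return out

# A frame that is quiet in its first half and loud in its second half.
frame = [(0.1 if t < 16 else 1.0) * math.sin(2 * math.pi * 5 * t / 32) for t in range(32)]
inverted = invert_time_envelope(frame, [0, 4, 8, 17])
```

In this sketch only the slowly varying loudness contour of each band is reversed in time; the carrier within each band is left in place.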

S104, determining the interference signal of the target voice signal according to the frequency domain envelope inversion signal, the time domain inversion signal and/or the time domain envelope inversion signal and preset weight coefficients respectively corresponding to the frequency domain envelope inversion signal and the time domain envelope inversion signal.

In this embodiment of the present application, determining the interference signal of the target voice signal according to the frequency domain envelope inversion signal, the time domain inversion signal and/or the time domain envelope inversion signal and the preset weight coefficients respectively corresponding to these signals comprises S401-S402, as follows:

S401, determining a weighted signal according to the frequency domain envelope inversion signal, the time domain inversion signal and/or the time domain envelope inversion signal and the preset weight coefficients respectively corresponding to the frequency domain envelope inversion signal, the time domain inversion signal and/or the time domain envelope inversion signal.

In the embodiment of the application, the frequency domain envelope inversion signal, the time domain inversion signal and/or the time domain envelope inversion signal are multiplied by preset weight coefficients respectively corresponding to the frequency domain envelope inversion signal, the time domain inversion signal and/or the time domain envelope inversion signal to obtain at least one multiplication result. And accumulating at least one multiplication result to obtain a weighted signal.
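By way of non-limiting illustration, the multiply-and-accumulate step of S401 might be sketched as follows. The function name `weighted_sum` and the weight values are hypothetical; the application does not disclose specific coefficient values.

```python
def weighted_sum(signals, weights):
    """Multiply each signal by its preset weight coefficient and accumulate the products."""
    if len(signals) != len(weights):
        raise ValueError("one weight coefficient is needed per signal")
    out = [0.0] * len(signals[0])
    for sig, w in zip(signals, weights):
        for i, v in enumerate(sig):
            out[i] += w * v
    return out

freq_env_inv = [1.0, 2.0]   # stand-in frequency domain envelope inversion signal
time_inv = [3.0, 4.0]       # stand-in time domain inversion signal
time_env_inv = [5.0, 6.0]   # stand-in time domain envelope inversion signal
weighted = weighted_sum([freq_env_inv, time_inv, time_env_inv], [0.5, 0.25, 0.25])
print(weighted)  # [2.5, 3.5]
```

Since the application allows any subset of the three signals ("and/or"), the same function applies unchanged when only one or two signals and weights are passed.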

S402, determining an interference signal of the target voice signal according to the target voice signal and the weighted signal.

In the embodiment of the application, since the components of the weighted signal above 4 kHz have only a weak interference effect on the target voice signal, the weighted signal is low-pass filtered with a low-pass filter to obtain a low-pass interference signal. The cut-off frequency of the low-pass filter is set randomly in order to reduce the possibility that the target voice signal is restored after being interfered with by the generated interference signal. Then, the part of the low-pass interference signal corresponding to any one frequency band is replaced with the part of the target voice signal corresponding to that frequency band, to obtain the interference signal of the target voice signal. The interference signal is strongly correlated with the target voice signal, and the target voice signal is very unlikely to be restored after being interfered with by the interference signal, so that eavesdropping on the target voice signal is prevented.
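By way of non-limiting illustration, S402 might be sketched as follows. The following are assumptions made for the example: the low-pass filter is idealised as a brick-wall mask on DFT bins, the random cut-off is drawn uniformly from a hypothetical range `cutoff_range` expressed in bins, and the band replacement is performed by copying the target signal's DFT bins for the chosen band. The function name `lowpass_and_replace` is hypothetical.

```python
import cmath
import math
import random

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real for t in range(n)]

def lowpass_and_replace(weighted, target, cutoff_range, band, rng=None):
    """Low-pass the weighted signal at a random cut-off, then restore one band of the target.

    cutoff_range: (lowest, highest) candidate cut-off, in DFT bins (assumption).
    band: (lo, hi) half-open range of DFT bins taken from the target signal.
    """
    rng = rng or random.Random()
    n = len(weighted)
    w_spec, t_spec = dft(weighted), dft(target)
    cutoff = rng.randint(*cutoff_range)      # randomly chosen cut-off frequency
    lo, hi = band
    out = []
    for k in range(n):
        f = min(k, n - k)                    # fold negative-frequency bins
        value = w_spec[k] if f <= cutoff else 0j   # brick-wall low-pass
        if lo <= f < hi:
            value = t_spec[k]                # replace this band with the target's content
        out.append(value)
    return idft(out), cutoff

n = 16
weighted = [math.cos(2 * math.pi * 6 * t / n) for t in range(n)]  # energy at bin 6
target = [math.cos(2 * math.pi * 2 * t / n) for t in range(n)]    # energy at bin 2
interference, cutoff = lowpass_and_replace(weighted, target, (3, 4), (1, 3),
                                           random.Random(1))
```

In this toy case the weighted signal lies entirely above the cut-off and is removed, so the output consists only of the target's content in the replaced band.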

According to the method and the device, three signals related to the target voice signal are constructed, namely a frequency domain envelope inversion signal, a time domain inversion signal and a time domain envelope inversion signal, which serve respectively as a first, a second and a third voice intelligibility interference signal. The three voice intelligibility interference signals are dynamically weighted to obtain a weighted signal whose interference effect on the target voice signal is stronger than that of any one of the three voice intelligibility interference signals alone, so the interference effect on the target voice signal is improved. The weighted signal is then low-pass filtered to obtain a low-pass interference signal, and the part of the low-pass interference signal corresponding to any frequency band is replaced with the part of the target voice signal corresponding to that frequency band to obtain the interference signal of the target voice signal, which further improves the interference effect of the interference signal on the target voice signal. In addition, the generation of the three voice intelligibility interference signals is randomized, which further reduces the possibility that the target voice signal is restored after being interfered with by the interference signal.

Fig. 5 is a schematic structural diagram 500 of an interference signal generating device for a target voice signal provided in the present application, where the schematic structural diagram 500 includes: a transceiver unit 501 and a processing unit 502;

the transceiver unit 501 is configured to obtain a target voice signal to be interfered with;

the processing unit 502 is configured to perform frame segmentation processing on the target voice signal to obtain at least one voice frame, where a frame length of each voice frame in the at least one voice frame is a random value;

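the processing unit 502 is configured to process each of the at least one speech frame, where the processing includes: performing first processing on each voice frame in the at least one voice frame to obtain a frequency domain envelope inversion signal; performing second processing on each voice frame in the at least one voice frame to obtain a time domain inversion signal; and/or performing third processing on each voice frame in the at least one voice frame to obtain a time domain envelope inversion signal;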

the processing unit 502 is configured to determine an interference signal of the target speech signal according to the frequency domain envelope inversion signal, the time domain inversion signal, and/or the time domain envelope inversion signal and preset weight coefficients corresponding to the frequency domain envelope inversion signal, the time domain inversion signal, and/or the time domain envelope inversion signal, respectively.

In a possible implementation, the processing unit 502 is specifically configured to perform fourier transform on each of the at least one speech frame to obtain a spectrum of each of the at least one speech frame; determining a spectral envelope of each of the at least one speech frame from a spectrum of each speech frame; determining a first fine structure corresponding to each voice frame according to the spectrum envelope of the voice frame; according to the spectrum envelope of each voice frame, determining N times of polynomial and/or exponential function fitting curves of the spectrum envelope of each voice frame, wherein N is an integer greater than or equal to 1; and determining the frequency domain envelope inversion signal according to the frequency spectrum envelope of each voice frame, the respective N-degree polynomial or exponential function fitting curve of the frequency spectrum envelope of each voice frame and the first fine structure.

In a possible implementation, the processing unit 502 is specifically configured to perform time domain inversion on each of the at least one speech frame to obtain a time domain inverted signal.
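The time domain inversion of each frame amounts to reversing the sample order within that frame. A minimal sketch follows; keeping the frames in their original order when concatenating is an assumption, since the passage does not describe reassembly.

```python
def time_reverse_frames(frames):
    """Reverse the sample order inside each frame, keeping the frame order."""
    return [frame[::-1] for frame in frames]

frames = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
reversed_frames = time_reverse_frames(frames)
# Concatenate the reversed frames back into one signal.
time_inverted = [s for frame in reversed_frames for s in frame]
print(time_inverted)  # [3, 2, 1, 5, 4, 9, 8, 7, 6]
```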

In a possible implementation, the processing unit 502 is specifically configured to: divide each of the at least one speech frame into frequency bands to obtain the part of the target voice signal corresponding to each frequency band; determine the time domain envelope of the target voice signal from the parts of the target voice signal corresponding to the frequency bands; determine, according to the time domain envelope, a second fine structure corresponding to the time domain envelope of the part of the target voice signal in each frequency band; and determine the time domain envelope inversion signal according to the time domain envelope of the part of the target voice signal in each frequency band and the second fine structure.
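The third processing can be sketched for a single frame as follows. The band split by FFT-bin masking, the envelope taken as the magnitude of the analytic signal, and the inversion rule `max + min - envelope` are all assumptions chosen for illustration; `band_edges` is a hypothetical parameter giving band edges as normalised frequencies in [0, 1], with 1 meaning Nyquist.

```python
import numpy as np

def analytic_signal(x):
    """FFT-based analytic signal (the usual Hilbert-transform construction)."""
    n = x.size
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * gain)

def time_envelope_inversion(frame, band_edges):
    """Invert the temporal envelope of each band and sum the bands."""
    frame = np.asarray(frame, dtype=float)
    spectrum = np.fft.rfft(frame)
    freqs = np.fft.rfftfreq(frame.size, d=0.5)  # 0 .. 1, where 1 is Nyquist
    out = np.zeros_like(frame)
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        # Band split by masking FFT bins (a simple stand-in for a filter bank).
        upper = freqs <= hi if hi >= 1.0 else freqs < hi
        band = np.fft.irfft(spectrum * ((freqs >= lo) & upper), n=frame.size)
        envelope = np.abs(analytic_signal(band))
        # Second fine structure: the band signal with its envelope divided out.
        fine = band / (envelope + 1e-12)
        # Inversion (assumed rule): loud parts become quiet and vice versa.
        inverted_env = envelope.max() + envelope.min() - envelope
        out += inverted_env * fine
    return out
```

With this rule, a frame that grows louder over time yields an output that grows quieter over time in each band, while the fine structure (the rapidly varying carrier) is preserved.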

In a possible implementation, the processing unit 502 is specifically configured to: determine a weighted signal according to the frequency domain envelope inversion signal, the time domain inversion signal, and/or the time domain envelope inversion signal and their respective preset weight coefficients; and determine an interference signal of the target voice signal according to the target voice signal and the weighted signal.

In a possible implementation, the processing unit 502 is specifically configured to: multiply the frequency domain envelope inversion signal, the time domain inversion signal, and/or the time domain envelope inversion signal by their respective preset weight coefficients to obtain at least one multiplication result; and accumulate the at least one multiplication result to obtain a weighted signal.
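The multiply-and-accumulate step can be sketched as below. The function name and the example weight values are hypothetical; the passage says only that the weight coefficients are preset. The sketch also assumes that the component signals have already been brought to a common length.

```python
def weighted_signal(components, weights):
    """Multiply each component signal by its weight and accumulate the products."""
    if len(components) != len(weights):
        raise ValueError("one weight per component signal is required")
    if len({len(c) for c in components}) > 1:
        raise ValueError("component signals must have equal length")
    length = len(components[0]) if components else 0
    out = [0.0] * length
    for signal, weight in zip(components, weights):
        for i, sample in enumerate(signal):
            out[i] += weight * sample
    return out

freq_env_inv = [1.0, 2.0, 3.0]
time_inv = [3.0, 2.0, 1.0]
time_env_inv = [0.0, 1.0, 0.0]
mixed = weighted_signal([freq_env_inv, time_inv, time_env_inv], [0.5, 0.25, 0.25])
print(mixed)  # [1.25, 1.75, 1.75]
```

Because of the "and/or" in the text, fewer than three components may be passed; the function accepts any number of component signals as long as each has a weight.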

In a possible implementation, the processing unit 502 is specifically configured to: perform low-pass filtering on the weighted signal to obtain a low-pass interference signal; and replace the part of the low-pass interference signal corresponding to any frequency band with the part of the target voice signal corresponding to that frequency band, to obtain the interference signal of the target voice signal.
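A minimal NumPy sketch of the low-pass filtering and band replacement is given below. The brick-wall FFT filter, the replacement performed directly on FFT bins, and the parameters `cutoff` and `band` (normalised frequencies in [0, 1], with 1 meaning Nyquist) are assumptions for illustration; the passage does not specify the filter design or how the band is selected.

```python
import numpy as np

def lowpass_and_replace_band(weighted, target, cutoff, band):
    """Low-pass `weighted`, then copy one frequency band over from `target`."""
    weighted = np.asarray(weighted, dtype=float)
    target = np.asarray(target, dtype=float)
    if weighted.shape != target.shape:
        raise ValueError("signals must have the same length")
    n = weighted.size
    freqs = np.fft.rfftfreq(n, d=0.5)  # 0 .. 1, where 1 is Nyquist

    # Ideal (brick-wall) low-pass: zero every bin above the cutoff.
    interference = np.fft.rfft(weighted)
    interference[freqs > cutoff] = 0.0

    # Swap the chosen band of the interference for the same band of the target.
    lo, hi = band
    in_band = (freqs >= lo) & (freqs <= hi)
    interference[in_band] = np.fft.rfft(target)[in_band]

    return np.fft.irfft(interference, n=n)
```

In this sketch the output outside the chosen band comes from the low-pass filtered weighted signal, and inside the band it is exactly the target voice signal's content.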

Fig. 6 is a schematic structural diagram 600 of an interference signal generating device for a target voice signal according to an embodiment of the present application. The apparatus 600 may be a system-on-chip. In the embodiment of the application, the chip system may be formed by a chip, and may also include a chip and other discrete devices. The apparatus 600 includes at least one processor 610 for implementing the methods provided by embodiments of the present application. The apparatus 600 may also include a communication interface 620. In the present embodiment, the communication interface 620 may be a transceiver, a circuit, a bus, a module, or other type of communication interface for communicating with other devices over a transmission medium.

Processor 610 may perform functions performed by processing unit 502 in apparatus 500; the communication interface 620 may be used to perform functions performed by the transceiving unit 501 in the apparatus 500.

When the apparatus 600 is configured to perform the above method, the communication interface 620 is configured to acquire a target voice signal to be interfered; the processor 610 is configured to: perform frame segmentation processing on the target voice signal to obtain at least one voice frame, where a frame length of each voice frame in the at least one voice frame is a random value; process each of the at least one voice frame, the processing comprising: performing first processing on each voice frame in the at least one voice frame to obtain a frequency domain envelope inversion signal; performing second processing on each voice frame in the at least one voice frame to obtain a time domain inversion signal; and/or performing third processing on each voice frame in the at least one voice frame to obtain a time domain envelope inversion signal; and determine the interference signal of the target voice signal according to the frequency domain envelope inversion signal, the time domain inversion signal and/or the time domain envelope inversion signal and their respective preset weight coefficients.

The communication interface 620 may also be used to perform the other steps or operations in the above method embodiments that involve the transceiver unit 501. The processor 610 may also be used to perform the other steps or operations in the above method embodiments that involve the processing unit 502, which are not described in detail herein.

The apparatus 600 may also include at least one memory 630 for storing program instructions and/or data. The memory 630 is coupled to the processor 610. The coupling in the embodiments of the present application is an indirect coupling or communication connection between devices, units, or modules, which may be electrical, mechanical, or in another form, and is used for information interaction between the devices, units, or modules. The processor 610 may operate in conjunction with the memory 630 and may execute program instructions stored in the memory 630. In one possible implementation, at least one of the at least one memory may be integrated with the processor. In another possible implementation, the memory 630 is located outside the apparatus 600.

The specific connection medium between the communication interface 620, the processor 610, and the memory 630 is not limited in this embodiment. In the embodiment of the present application, the memory 630, the processor 610, and the communication interface 620 are connected by a bus 640 in Fig. 6, where the bus is indicated by a thick line; the connections between other components are shown only schematically and are not limiting. The bus may be classified as an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in Fig. 6, but this does not mean that there is only one bus or only one type of bus.

By way of example, the processor 610 may be one or more central processing units (CPUs); where the processor 610 is a CPU, it may be a single-core CPU or a multi-core CPU. The processor 610 may be a general purpose processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. The general purpose processor may be a microprocessor or any conventional processor. The steps of a method disclosed in connection with the embodiments of the present application may be performed directly by a hardware processor, or by a combination of hardware and software modules in the processor.

By way of example, the memory 630 may include, but is not limited to, non-volatile memory such as a hard disk drive (HDD) or a solid-state drive (SSD), random access memory (RAM), erasable programmable read-only memory (EPROM), read-only memory (ROM), or compact disc read-only memory (CD-ROM). The memory may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory in the embodiments of the present application may also be circuitry or any other device capable of implementing a storage function, for storing program instructions and/or data.

The embodiments of the present application provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:

acquiring a target voice signal to be interfered;

processing each of the at least one speech frame, the processing comprising:

The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or of course by means of hardware. Based on this understanding, the foregoing technical solution, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method described in the respective embodiments or in some parts of the embodiments.

It should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting thereof; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims

1. A method for generating an interference signal of a target speech signal, comprising:

acquiring a target voice signal to be interfered;

processing each of the at least one speech frame, the processing comprising:

determining an interference signal of the target voice signal according to the frequency domain envelope inversion signal, the time domain inversion signal and/or the time domain envelope inversion signal and preset weight coefficients respectively corresponding to the frequency domain envelope inversion signal and the time domain envelope inversion signal;

wherein the performing a first process on each of the at least one speech frame to obtain a frequency domain envelope inversion signal includes:

determining the frequency domain envelope inversion signal according to the spectral envelope of each voice frame, the respective polynomial or exponential function fitting curve of the spectral envelope of each voice frame and the first fine structure;

the third processing is performed on each voice frame in the at least one voice frame to obtain a time domain envelope inversion signal, including:

2. The method of claim 1, wherein said second processing each of said at least one speech frame to obtain a time domain inverted signal comprises:

3. The method according to claim 1, wherein the determining the interference signal of the target speech signal according to the frequency domain envelope inversion signal, the time domain inversion signal and/or the time domain envelope inversion signal and the preset weight coefficients corresponding thereto, respectively, comprises:

4. A method according to claim 3, wherein said determining a weighted signal from said frequency domain envelope inversion signal, said time domain inversion signal and/or said time domain envelope inversion signal and their respective corresponding preset weight coefficients comprises:

multiplying the frequency domain envelope inversion signal, the time domain inversion signal and/or the time domain envelope inversion signal with preset weight coefficients respectively corresponding to the frequency domain envelope inversion signal and the time domain envelope inversion signal to obtain at least one multiplication result;

5. A method according to claim 3, wherein said determining an interference signal of said target speech signal from said target speech signal and said weighted signal comprises:

carrying out low-pass filtering on the weighted signals to obtain low-pass interference signals;

6. An interference signal generating apparatus for a target speech signal, comprising:

the processing unit is configured to process each of the at least one speech frame, and the processing includes:

the processing unit is used for determining an interference signal of the target voice signal according to the frequency domain envelope inversion signal, the time domain inversion signal and/or the time domain envelope inversion signal and preset weight coefficients respectively corresponding to the frequency domain envelope inversion signal, the time domain inversion signal and/or the time domain envelope inversion signal;

the processing unit is specifically configured to perform Fourier transform on each of the at least one speech frame, so as to obtain a frequency spectrum of each of the at least one speech frame; determining a spectral envelope of each of the at least one speech frame from a spectrum of each speech frame; determining a first fine structure corresponding to each voice frame according to the spectrum envelope of the voice frame; according to the spectrum envelope of each voice frame, determining an N-th degree polynomial and/or exponential function fitting curve of the spectrum envelope of each voice frame, wherein N is an integer greater than or equal to 1; determining the frequency domain envelope inversion signal according to the spectral envelope of each voice frame, the respective polynomial or exponential function fitting curve of the spectral envelope of each voice frame and the first fine structure;

the processing unit is specifically configured to divide a frequency band for each voice frame in the at least one voice frame, so as to obtain a part of target voice signals corresponding to each frequency band; determining the time domain envelope of the target voice signals according to the partial target voice signals corresponding to the frequency bands; determining a second fine structure corresponding to the time domain envelope of the partial target voice signal corresponding to each frequency band according to the time domain envelope; and determining a time domain envelope inversion signal according to the time domain envelope of the part of the target voice signals corresponding to each frequency band and the second fine structure.

7. An interfering signal generating device for a target speech signal, comprising at least one processor for executing a program stored in a memory, which when executed, causes the device to perform:

the method of any one of claims 1-5.

8. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the method according to any of claims 1-5.