CN117095689A - Digital signal data denoising method and system

Info

Publication number: CN117095689A
Application number: CN202311000192.1A
Authority: CN
Inventors: 胡艳峰; 匡建军; 李晓飞; 阮腾达; 邱敏超; 薛华坤; 谢智清; 李宏露
Original assignee: Xiamen Jiaotest Intelligent Technology Co ltd; CCCC First Highway Xiamen Engineering Co Ltd
Current assignee: Xiamen Jiaotest Intelligent Technology Co ltd; CCCC First Highway Xiamen Engineering Co Ltd
Priority date: 2023-08-09
Filing date: 2023-08-09
Publication date: 2023-11-21

Abstract

The invention provides a digital signal data denoising method and system, and relates to the technical field of artificial intelligence. In the invention, a first audio information feature representation of first audio digitized data is mined; a second audio information feature representation of second audio digitized data is mined; a correlated data combination and an uncorrelated data combination are constructed based on the first audio information feature representation and the second audio information feature representation; a first target audio mining network and a second target audio mining network are formed based on the correlated data combination and the uncorrelated data combination; a first target audio feature representation and a second target audio feature representation are mined using the first target audio mining network and the second target audio mining network; and denoised audio digitized data is output by a target data denoising network based on the first target audio feature representation and the second target audio feature representation. On this basis, the reliability of the denoising process can be improved.

Description

Digital signal data denoising method and system

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a digital signal data denoising method and system.

Background

In a computer, the magnitude of a digital signal is usually represented by a binary number with a finite number of bits. Because digital signals are represented by two physical states, 0 and 1, they are much more resistant to interference than analog signals. Digital signals play an ever larger role in modern signal processing, and almost all complex signal processing depends on them; in other words, as long as the method of solving a problem can be expressed as a mathematical formula, a digital signal representing the physical quantity can be processed by a computer.

Artificial intelligence (AI) is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.

Applying artificial intelligence technology to the denoising of audio digitized data can improve the reliability of the denoising to a certain extent, but the problem of low reliability remains.

Disclosure of Invention

Accordingly, the present invention is directed to a method and a system for denoising digitized signal data, so as to improve the reliability of denoising process to a certain extent.

In order to achieve the above purpose, the embodiment of the present invention adopts the following technical scheme:

a method of denoising digitized signal data, comprising:

extracting a first number of representative data combinations, each of the representative data combinations comprising semantically related first audio digitized data and second audio digitized data, the second audio digitized data having noise data;

for each typical data combination, mining a first audio information feature representation corresponding to the first audio digitized data by using a first initial audio mining network; and for each of the typical data combinations, mining a second audio information feature representation corresponding to the second audio digitized data using a second initial audio mining network;

constructing at least one correlated data combination and at least one uncorrelated data combination based on the corresponding first and second audio information feature representations of each of the typical data combinations, each of the correlated data combinations comprising the first and second audio information feature representations belonging to the same typical data combination, each of the uncorrelated data combinations comprising the first and second audio information feature representations belonging to different typical data combinations;

based on the correlated data combination and the uncorrelated data combination, performing an optimization adjustment operation on the network parameters of the first initial audio mining network and the network parameters of the second initial audio mining network to form a first target audio mining network and a second target audio mining network;

respectively mining different data fragments in the target audio digitized data by utilizing the first target audio mining network and the second target audio mining network to mine out a first target audio feature representation and a second target audio feature representation, wherein the data fragments corresponding to the second target audio feature representation belong to fragments with noise data in the target audio digitized data;

and denoising processing is carried out based on the first target audio feature representation and the second target audio feature representation by utilizing a target data denoising network so as to output denoising audio digitized data corresponding to the target audio digitized data, wherein the target data denoising network is formed by carrying out network optimization operation on an initial data denoising network based on a first audio information feature representation and a second audio information feature representation respectively mined by the first target audio mining network and the second target audio mining network and combining the noiseless audio digitized data serving as labels.

In some preferred embodiments, in the above method for denoising digitized signal data, the step of extracting a first number of typical data combinations includes:

extracting a first number of original audio digitized data, and respectively carrying out noise coarse positioning operation on each original audio digitized data to output a corresponding noise coarse positioning result;

based on the noise coarse positioning result, respectively carrying out segmentation operation on each corresponding original audio digitized data to form first audio digitized data and second audio digitized data which are related in semantic meaning, wherein the second audio digitized data has noise data;

the semantically related first audio digitized data and the second audio digitized data corresponding to the same original audio digitized data are combined to form a typical data combination.

In some preferred embodiments, in the above method for denoising digitized signal data, the step of mining, for each of the typical data combinations, a first audio information feature representation corresponding to the first audio digitized data using a first initial audio mining network includes:

splitting and combining the first audio digitized data to form a typical first ordered set, wherein each typical first segment feature representation included in the typical first ordered set corresponds to a typical first digitized data segment, and that segment corresponds to one audio frame in the first audio corresponding to the first audio digitized data;

according to the typical first ordered set, utilizing the first initial audio mining network to mine out a first audio information feature representation corresponding to the first audio digitized data;

the step of mining out a second audio information feature representation corresponding to the second audio digitized data using a second initial audio mining network for each of the representative data combinations, comprising:

splitting and combining the second audio digitized data to form a typical second ordered set, wherein each typical second segment feature representation included in the typical second ordered set corresponds to a typical second digitized data segment, and that segment corresponds to one audio frame in the second audio corresponding to the second audio digitized data;

and mining a second audio information feature representation corresponding to the second audio digitized data by utilizing the second initial audio mining network according to the typical second ordered set.

In some preferred embodiments, in the above method for denoising digitized signal data, the step of splitting and combining the first audio digitized data to form a typical first ordered set includes:

splitting the first audio digital data to form at least one typical first digital data segment corresponding to the first audio digital data;

performing a feature space mapping operation on each of the at least one representative first digitized data segment to form a corresponding at least one representative first segment feature representation, the representative first segment feature representation corresponding to the representative first digitized data segment;

a representative first ordered set is constructed based on the at least one representative first segment feature representation.
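The three sub-steps above (splitting, feature space mapping, and ordered-set construction) can be sketched as follows. This is only an illustrative sketch: the non-overlapping framing, the zero-padding of a trailing partial frame, the linear projection used as the "feature space mapping", and the names `split_into_frames`, `map_to_feature`, and `build_ordered_set` are all assumptions, since the source does not specify them.

```python
from typing import List


def split_into_frames(samples: List[float], frame_len: int) -> List[List[float]]:
    """Split samples into consecutive, non-overlapping frames (assumed framing).

    A trailing partial frame is zero-padded so every frame has frame_len samples.
    """
    frames = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        frames.append(frame + [0.0] * (frame_len - len(frame)))
    return frames


def map_to_feature(frame: List[float], weights: List[List[float]]) -> List[float]:
    """Hypothetical feature space mapping: a plain linear projection.

    Each row of weights produces one feature value for the frame.
    """
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]


def build_ordered_set(samples: List[float], frame_len: int,
                      weights: List[List[float]]) -> List[List[float]]:
    """Return one feature vector per frame, kept in the temporal order of the frames."""
    return [map_to_feature(f, weights) for f in split_into_frames(samples, frame_len)]


if __name__ == "__main__":
    ordered = build_ordered_set([1.0, 2.0, 3.0, 4.0, 5.0], 2,
                                [[1.0, 1.0], [1.0, -1.0]])
    print(ordered)
```

In a trained system the projection weights would be learned rather than fixed; the fixed matrix here only makes the data flow visible.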

In some preferred embodiments, in the above method for denoising digitized signal data, the step of splitting and combining the second audio digitized data to form a typical second ordered set includes:

determining a first number of data segments, namely the number of typical first digitized data segments corresponding to the first audio digitized data;

performing a split-combining operation on the second audio digitized data based on the first number of data segments to form a representative second data segment cluster, the cluster comprising a number of representative second digitized data segments equal to the first number of data segments;

performing feature space mapping operation on each representative second digitized data segment in the representative second data segment cluster to form at least one representative second segment feature representation to be determined, the representative second segment feature representation to be determined corresponding to the representative second digitized data segment;

performing a noise tagging operation on each of the at least one pending representative second segment feature representation to form at least one tagged representative second segment feature representation, the tagged representative second segment feature representation corresponding to the pending representative second segment feature representation, and the tagged representative second segment feature representation having an identification that characterizes whether noise is carried;

performing a noise-signature embedding operation on each of the at least one tagged representative second segment feature representation to form a corresponding at least one representative second segment feature representation, the representative second segment feature representation corresponding to the tagged representative second segment feature representation;

A representative second ordered set is constructed based on the at least one representative second segment feature representation.
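The noise tagging and noise-signature embedding steps above can be pictured with the following sketch. The source does not say how the tag is decided or how the signature is embedded; here the tag is assumed to come from thresholding a per-segment noise score, and the embedding is assumed to be the addition of one of two offset vectors (`noisy_vec` or `clean_vec`) to the segment feature. All function and parameter names are hypothetical.

```python
from typing import List, Tuple

Feature = List[float]


def tag_noise(features: List[Feature], noise_scores: List[float],
              threshold: float) -> List[Tuple[Feature, bool]]:
    """Attach a flag to each pending feature: True if its score exceeds the threshold.

    The score and threshold are assumptions; the source only requires an
    identification of whether noise is carried.
    """
    return [(feat, score > threshold) for feat, score in zip(features, noise_scores)]


def embed_noise_flag(tagged: List[Tuple[Feature, bool]],
                     noisy_vec: Feature, clean_vec: Feature) -> List[Feature]:
    """Hypothetical noise-signature embedding: add a flag-dependent offset vector."""
    embedded = []
    for feat, is_noisy in tagged:
        offset = noisy_vec if is_noisy else clean_vec
        embedded.append([x + o for x, o in zip(feat, offset)])
    return embedded


if __name__ == "__main__":
    pending = [[1.0, 2.0], [3.0, 4.0]]
    tagged = tag_noise(pending, noise_scores=[0.9, 0.1], threshold=0.5)
    print(embed_noise_flag(tagged, noisy_vec=[1.0, 0.0], clean_vec=[0.0, 1.0]))
```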

In some preferred embodiments, in the above method for denoising digitized signal data, the step of performing a split combining operation in the second audio digitized data based on the first number of data segments to form a typical second cluster of data segments includes:

performing audio evaluation operation on each typical second digitized data segment in the second audio digitized data, and outputting an audio evaluation parameter corresponding to each typical second digitized data segment, wherein the audio evaluation parameter is used for reflecting the noise amount of the typical second digitized data segment;

and screening typical second digitized data fragments in the second audio digitized data based on the order of the audio evaluation parameters from large to small, and constructing typical second data fragment clusters based on the screened typical second digitized data fragments with the number equal to that of the first data fragments.
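A minimal sketch of this screening step is given below. The audio evaluation parameter is not defined in the source, so `noise_score` uses a stand-in (mean absolute sample-to-sample change) purely for illustration. Restoring the selected segments to their original temporal order is also an assumption, made so that the resulting cluster can feed an ordered set.

```python
from typing import List


def noise_score(segment: List[float]) -> float:
    """Hypothetical audio evaluation parameter: mean absolute sample-to-sample change."""
    if len(segment) < 2:
        return 0.0
    return sum(abs(b - a) for a, b in zip(segment, segment[1:])) / (len(segment) - 1)


def select_noisiest(segments: List[List[float]], count: int) -> List[List[float]]:
    """Keep the `count` segments with the largest scores.

    Segments are ranked from large to small score, the top `count` are kept,
    and the kept segments are returned in their original order (an assumption).
    """
    ranked = sorted(range(len(segments)),
                    key=lambda i: noise_score(segments[i]), reverse=True)
    keep = sorted(ranked[:count])
    return [segments[i] for i in keep]


if __name__ == "__main__":
    second_segments = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                       [0.0, 0.5, 0.0], [0.0, 2.0, 0.0]]
    # count plays the role of the first number of data segments
    print(select_noisiest(second_segments, count=2))
```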

In some preferred embodiments, in the above method for denoising digitized signal data, the step of mining, according to the typical first ordered set, a first audio information feature representation corresponding to the first audio digitized data using the first initial audio mining network includes:

According to the typical first ordered set, a first depth mining unit included in the first initial audio mining network is utilized to mine out a corresponding first audio depth feature representation;

according to the first audio depth characteristic representation, a corresponding first audio information characteristic representation is analyzed by using a first fully-connected processing unit included in the first initial audio mining network;

and the step of mining out a second audio information feature representation corresponding to the second audio digitized data using the second initial audio mining network according to the representative second ordered set, comprising:

mining a corresponding second audio depth feature representation by using a second depth mining unit included in the second initial audio mining network according to the typical second ordered set, wherein the second depth mining unit is different from the first depth mining unit;

and analyzing a corresponding second audio information characteristic representation by using a second fully-connected processing unit included in the second initial audio mining network according to the second audio depth characteristic representation.

In some preferred embodiments, in the above-described digital signal data denoising method, the first number of typical data combinations includes a first typical data combination including semantically related first audio digitized data to be processed and second audio digitized data to be processed and a second typical data combination including semantically related first audio digitized data to be analyzed and second audio digitized data to be analyzed;

The step of mining out a first audio information feature representation corresponding to the first audio digitized data using a first initial audio mining network for each of the representative data combinations, comprises:

mining out a first audio information feature representation to be processed corresponding to the first audio digitized data to be processed by using the first initial audio mining network, and mining out a first audio information feature representation to be analyzed corresponding to the first audio digitized data to be analyzed by using the first initial audio mining network;

mining out a second audio information feature representation to be processed corresponding to the second audio digitized data to be processed by using the second initial audio mining network, and mining out a second audio information feature representation to be analyzed corresponding to the second audio digitized data to be analyzed by using the second initial audio mining network;

the step of constructing at least one correlated data combination and at least one uncorrelated data combination based on the first audio information feature representation and the second audio information feature representation corresponding to each of the representative data combinations, comprises:

Performing a combination operation on the second audio information feature representation to be processed and the first audio information feature representation to be processed to form a corresponding one of the relevant data combinations, and performing a combination operation on the second audio information feature representation to be analyzed and the first audio information feature representation to be analyzed to form a corresponding one of the relevant data combinations;

and performing a combination operation on the second audio information feature representation to be processed and the first audio information feature representation to be analyzed to form a corresponding one of the non-correlated data combinations, and performing a combination operation on the second audio information feature representation to be analyzed and the first audio information feature representation to be processed to form a corresponding one of the non-correlated data combinations.

In some preferred embodiments, in the above method for denoising digitized signal data, the step of denoising, using a target data denoising network, based on the first target audio feature representation and the second target audio feature representation, to output denoised audio digitized data corresponding to the target audio digitized data includes:

performing focus feature analysis operation on the second target audio feature representation based on the first target audio feature representation by using a focus fusion unit included in a target data denoising network so as to output a focus audio feature representation corresponding to the second target audio feature representation;

Performing feature restoration operation on the focused audio feature representation by using a feature restoration unit included in the target data denoising network so as to output a restored audio digitized data fragment corresponding to the second target audio feature representation;

and replacing the data segment corresponding to the second target audio feature representation in the target audio digitized data with the restored audio digitized data segment to form denoising audio digitized data corresponding to the target audio digitized data.
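One way to picture the focus fusion and replacement steps is the sketch below. It assumes, without confirmation from the source, that "focus feature analysis" behaves like scaled dot-product attention in which each noisy-segment feature queries the clean-segment features. The feature restoration unit is not specified in the source and is therefore not modeled: `replace_segment` simply takes an already restored sample list. All names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def focus_fuse(noisy_feats: List[Vector], clean_feats: List[Vector]) -> List[Vector]:
    """Assumed attention-style fusion.

    Each noisy feature vector is replaced by a weighted mix of clean feature
    vectors, with weights given by softmax of scaled dot products.
    """
    dim = len(clean_feats[0])
    fused = []
    for query in noisy_feats:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in clean_feats]
        weights = softmax(scores)
        fused.append([sum(w * vec[d] for w, vec in zip(weights, clean_feats))
                      for d in range(dim)])
    return fused


def replace_segment(samples: List[float], start: int,
                    restored: List[float]) -> List[float]:
    """Splice the restored segment over the samples starting at index `start`."""
    return samples[:start] + restored + samples[start + len(restored):]


if __name__ == "__main__":
    print(focus_fuse([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]))
    print(replace_segment([1.0, 2.0, 3.0, 4.0, 5.0], 1, [9.0, 9.0]))
```

Using the clean features as both keys and values lets the semantically related, noise-free context guide reconstruction of the noisy segment, which matches the role the description gives to the first target audio feature representation.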

The embodiment of the invention also provides a digital signal data denoising system, which comprises a processor and a memory, wherein the memory is used for storing a computer program, and the processor is used for executing the computer program so as to realize the digital signal data denoising method.

With the digitized signal data denoising method and system provided by the embodiments of the invention, a first audio information feature representation of the first audio digitized data is mined first; a second audio information feature representation of the second audio digitized data is mined; a correlated data combination and an uncorrelated data combination are constructed based on the first audio information feature representation and the second audio information feature representation; a first target audio mining network and a second target audio mining network are formed based on the correlated data combination and the uncorrelated data combination; a first target audio feature representation and a second target audio feature representation are mined using the first target audio mining network and the second target audio mining network; and denoised audio digitized data is output by the target data denoising network based on the first target audio feature representation and the second target audio feature representation. Because the first target audio feature representation and the second target audio feature representation are mined by target audio mining networks that have undergone network optimization, they are mined with higher precision. Since these representations are the basis of the denoised audio digitized data, their higher precision makes the resulting denoised audio digitized data more reliable. The reliability of the denoising process is thereby improved to a certain extent, which addresses the problem of low reliability in the prior art.

In order to make the above objects, features and advantages of the present invention more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

Fig. 1 is a block diagram of a digital signal data denoising system according to an embodiment of the present invention.

Fig. 2 is a flowchart illustrating steps involved in a method for denoising digitized signal data according to an embodiment of the present invention.

Fig. 3 is a schematic diagram of each module included in the denoising apparatus for digitized signal data according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

As shown in fig. 1, an embodiment of the present invention provides a digital signal data denoising system. Wherein the digitized signal data denoising system can comprise a memory and a processor.

In detail, the memory and the processor are electrically connected directly or indirectly to realize transmission or interaction of data. For example, electrical connection may be made to each other via one or more communication buses or signal lines. The memory may store at least one software functional module (computer program) that may exist in the form of software or firmware. The processor may be configured to execute an executable computer program stored in the memory, so as to implement the denoising method for digitized signal data provided by the embodiment of the present invention.

It should be appreciated that in some possible embodiments, the memory may be, but is not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and the like.

It should be appreciated that in some possible embodiments, the processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), a system on chip (SoC), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

It should be appreciated that in some possible embodiments, the digitized signal data denoising system may be a server with data processing capabilities.

Referring to fig. 2, an embodiment of the present invention further provides a denoising method for digitized signal data, which can be applied to the denoising system for digitized signal data. The method steps defined by the flow related to the digital signal data denoising method can be realized by the digital signal data denoising system.

The specific flow shown in fig. 2 will be described in detail.

In step S110, a first number of typical data combinations is extracted.

In an embodiment of the present invention, the digitized signal data denoising system may extract a first number of typical data combinations. Each of the typical data combinations includes semantically related first audio digitized data and second audio digitized data. The second audio digitized data has noise data; the first audio digitized data may, for example, have no noise data or only a small amount of noise data, and is not itself a target of noise removal.

Step S120, for each of the typical data combinations, mining a first audio information feature representation corresponding to the first audio digitized data by using a first initial audio mining network; and mining a second audio information feature representation corresponding to the second audio digitized data using a second initial audio mining network for each of the representative data combinations.

In the embodiment of the invention, for each typical data combination, the digitized signal data denoising system can mine a first audio information feature representation corresponding to the first audio digitized data by using a first initial audio mining network, which is used to mine features or key information; and, for each typical data combination, it can mine a second audio information feature representation corresponding to the second audio digitized data by using a second initial audio mining network.

Step S130, constructing at least one correlated data combination and at least one uncorrelated data combination based on the first audio information feature representation and the second audio information feature representation corresponding to each of the typical data combinations.

In an embodiment of the present invention, the digitized signal data denoising system may construct at least one correlated data combination and at least one uncorrelated data combination based on the first audio information feature representation and the second audio information feature representation corresponding to each of the typical data combinations. Each of said correlated data combinations comprises a first audio information feature representation and a second audio information feature representation belonging to the same typical data combination, and each of said non-correlated data combinations comprises a first audio information feature representation and a second audio information feature representation belonging to different typical data combinations.
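The pairing rule described here can be sketched as follows: features that come from the same typical data combination form a correlated pair, and features from different typical data combinations form an uncorrelated pair. The function name `build_pairs` and the choice to enumerate every cross pairing are assumptions made for illustration; the source only requires at least one combination of each kind.

```python
from typing import List, Tuple

Feature = List[float]
Pair = Tuple[Feature, Feature]


def build_pairs(first_feats: List[Feature],
                second_feats: List[Feature]) -> Tuple[List[Pair], List[Pair]]:
    """Build correlated and uncorrelated pairs.

    first_feats[i] and second_feats[i] are assumed to come from the i-th
    typical data combination. Same index gives a correlated pair; different
    indices give an uncorrelated pair.
    """
    correlated: List[Pair] = []
    uncorrelated: List[Pair] = []
    for i, first in enumerate(first_feats):
        for j, second in enumerate(second_feats):
            if i == j:
                correlated.append((first, second))
            else:
                uncorrelated.append((first, second))
    return correlated, uncorrelated


if __name__ == "__main__":
    firsts = [[1.0, 0.0], [0.0, 1.0]]    # first audio information features
    seconds = [[0.9, 0.1], [0.1, 0.9]]   # second audio information features
    cor, unc = build_pairs(firsts, seconds)
    print(len(cor), len(unc))
```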

Step S140, performing an optimization adjustment operation on the network parameters of the first initial audio mining network and the network parameters of the second initial audio mining network based on the correlated data combination and the uncorrelated data combination, so as to form a first target audio mining network and a second target audio mining network.

In the embodiment of the invention, the digitized signal data denoising system can perform an optimization adjustment operation on the network parameters of the first initial audio mining network and the network parameters of the second initial audio mining network based on the correlated data combination and the uncorrelated data combination, so as to form a first target audio mining network and a second target audio mining network. For example, the optimization adjustment operation may reduce the distance between the first audio information feature representation and the second audio information feature representation included in a correlated data combination, and increase the distance between the first audio information feature representation and the second audio information feature representation included in an uncorrelated data combination.
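The source states only the direction of the adjustment (correlated pairs closer, uncorrelated pairs farther apart). One common objective with that behavior is a margin-based contrastive loss, sketched below as a stand-in rather than as the loss actually used in the invention. The Euclidean distance, the `margin` parameter, and the function names are assumptions.

```python
import math
from typing import List, Tuple

Feature = List[float]
Pair = Tuple[Feature, Feature]


def euclidean(a: Feature, b: Feature) -> float:
    """Euclidean distance between two feature vectors (assumed distance measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(correlated: List[Pair], uncorrelated: List[Pair],
                     margin: float = 1.0) -> float:
    """Margin-based contrastive loss, averaged over all pairs.

    Correlated pairs are penalized by squared distance, so minimizing pulls
    them together. Uncorrelated pairs are penalized only while closer than
    `margin`, so minimizing pushes them apart.
    """
    total_pairs = len(correlated) + len(uncorrelated)
    if total_pairs == 0:
        raise ValueError("at least one pair is required")
    pull = sum(euclidean(f, s) ** 2 for f, s in correlated)
    push = sum(max(0.0, margin - euclidean(f, s)) ** 2 for f, s in uncorrelated)
    return (pull + push) / total_pairs


if __name__ == "__main__":
    cor = [([0.0, 0.0], [0.1, 0.0])]
    unc = [([0.0, 0.0], [0.2, 0.0])]
    print(contrastive_loss(cor, unc, margin=1.0))
```

Minimizing this value with respect to the parameters of both mining networks would move the feature representations in the direction the description calls for.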

Step S150, utilizing the first target audio mining network and the second target audio mining network to respectively mine different data segments in the target audio digitized data, so as to mine out a first target audio feature representation and a second target audio feature representation.

In the embodiment of the invention, the digitized signal data denoising system can utilize the first target audio mining network and the second target audio mining network to respectively mine different data segments in the target audio digitized data, so as to mine out a first target audio feature representation and a second target audio feature representation. The data segment corresponding to the second target audio feature representation is a segment of the target audio digitized data that has noise data.

Step S160, performing denoising processing based on the first target audio feature representation and the second target audio feature representation by using a target data denoising network, so as to output denoising audio digitized data corresponding to the target audio digitized data.

In the embodiment of the invention, the digitized signal data denoising system can utilize a target data denoising network to perform denoising processing based on the first target audio feature representation and the second target audio feature representation, so as to output denoised audio digitized data corresponding to the target audio digitized data. The target data denoising network is formed by performing a network optimization operation on an initial data denoising network, based on the first audio information feature representation and the second audio information feature representation respectively mined by the first target audio mining network and the second target audio mining network, in combination with noise-free audio digitized data serving as labels. The network optimization operation of the initial data denoising network can be performed together with the network optimization operations of the first initial audio mining network and the second initial audio mining network, or can be performed separately. For example, when the network optimization operations are performed together, the typical data combination may further include noise-free audio digitized data corresponding to the second audio digitized data. A difference calculation can then be performed between the denoised audio digitized data produced by the denoising (that is, restoration) processing and the noise-free audio digitized data to obtain a corresponding error index, and the network parameters of the neural networks are optimally adjusted in the direction that reduces the error index.
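The "difference calculation" and "direction of reducing the error index" can be illustrated with a toy example. The sketch assumes mean squared error as the error index and replaces the whole denoising network with a single hypothetical scalar parameter `gain`, so that the gradient can be written by hand; neither choice comes from the source.

```python
from typing import List


def mean_squared_error(denoised: List[float], reference: List[float]) -> float:
    """Assumed error index: mean squared difference to the noise-free label."""
    return sum((d - r) ** 2 for d, r in zip(denoised, reference)) / len(reference)


def gradient_step(gain: float, noisy: List[float], reference: List[float],
                  learning_rate: float) -> float:
    """One gradient-descent update of a toy one-parameter 'network'.

    The toy network outputs gain * x for each noisy sample x, so the gradient
    of the mean squared error with respect to gain is mean(2 * (gain*x - r) * x).
    """
    grad = sum(2.0 * (gain * x - r) * x
               for x, r in zip(noisy, reference)) / len(reference)
    return gain - learning_rate * grad


if __name__ == "__main__":
    noisy = [1.0, 2.0]
    clean = [2.0, 4.0]   # noise-free audio digitized data used as labels
    gain = 1.0
    before = mean_squared_error([gain * x for x in noisy], clean)
    gain = gradient_step(gain, noisy, clean, learning_rate=0.1)
    after = mean_squared_error([gain * x for x in noisy], clean)
    print(before, after)
```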

Based on the above, the first target audio feature representation and the second target audio feature representation are mined by the first target audio mining network and the second target audio mining network, both of which are formed by network optimization, so they are mined with higher precision. Since these representations are the basis of the denoised audio digitized data, their higher precision makes the resulting denoised audio digitized data more reliable. The reliability of the denoising process is thereby improved to a certain extent, which addresses the problem of low reliability in the prior art.

It should be appreciated that, in some possible embodiments, the step S110 in the implementation content described above, that is, the step of extracting the first number of typical data combinations, may further include the sub-steps described below:

extracting a first number of original audio digitized data, and respectively performing noise coarse positioning operation on each original audio digitized data to output a corresponding noise coarse positioning result, for example, the original audio digitized data can be subjected to noise recognition based on a corresponding neural network formed by network optimization to determine the noise position, or energy change information of the audio in a frequency domain can be calculated to determine whether noise exists or not based on the energy change information, and the method is particularly not limited and can refer to the related prior art;

dividing each piece of original audio digitized data based on its noise coarse positioning result to form semantically related first audio digitized data and second audio digitized data, wherein the second audio digitized data has noise data; that is, the part of the audio digitized data that has noise data can be divided out to form the second audio digitized data, and the remaining part is used as the first audio digitized data, so that the first audio digitized data can be regarded as noise-free at least according to the noise coarse positioning result;

combining the semantically related first audio digitized data and second audio digitized data corresponding to the same original audio digitized data to form a typical data combination, so that a plurality of typical data combinations can be formed for a plurality of pieces of original audio digitized data.
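
By way of non-limiting illustration only, the sub-steps above may be sketched as follows. The sketch assumes that the noise coarse positioning is done with a simple short-time energy threshold in the time domain, which is a simplification of the neural-network or frequency-domain options mentioned above. The function names, the frame length and the `jump_ratio` threshold are hypothetical and do not come from the embodiment.

```python
def frame_energy(samples, frame_len):
    """Split samples into non-overlapping frames and return (frames, mean energy per frame)."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    energies = [sum(x * x for x in frame) / frame_len for frame in frames]
    return frames, energies


def coarse_locate_noise(samples, frame_len=4, jump_ratio=3.0):
    """Flag a frame as noisy when its energy exceeds jump_ratio times the median energy.

    The energy threshold is an assumed stand-in for the coarse positioning step.
    """
    frames, energies = frame_energy(samples, frame_len)
    median = sorted(energies)[len(energies) // 2]
    flags = [energy > jump_ratio * median for energy in energies]
    return frames, flags


def build_typical_combination(samples, frame_len=4, jump_ratio=3.0):
    """Return (first, second): frames judged noise-free and frames judged noisy."""
    frames, flags = coarse_locate_noise(samples, frame_len, jump_ratio)
    first = [frame for frame, noisy in zip(frames, flags) if not noisy]
    second = [frame for frame, noisy in zip(frames, flags) if noisy]
    return first, second


original = [0.1] * 8 + [1.0, -1.0, 1.0, -1.0] + [0.1] * 4
first_audio, second_audio = build_typical_combination(original)
```

Here `first_audio` and `second_audio` come from the same original signal, which is what makes them a single typical data combination in the sense used above.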

It should be appreciated that, in some possible embodiments, step S120 in the foregoing implementation, that is, the step of mining, for each of the typical data combinations, the first audio information feature representation corresponding to the first audio digitized data using the first initial audio mining network, may further include the substeps described below:

Splitting and combining the first audio digitized data to form a typical first ordered set, wherein each typical first segment feature representation included in the typical first ordered set corresponds to one audio frame in first audio corresponding to the first audio digitized data, namely, the typical first segment feature representations correspond to the audio frames one by one;

and according to the typical first ordered set, performing depth mining by using the first initial audio mining network, and mining out a first audio information characteristic representation corresponding to the first audio digital data.

It should be appreciated that in some possible embodiments, the step of splitting and combining the first audio digitized data to form a typical first ordered set may further include the sub-steps described below:

performing a feature space mapping operation on each of the at least one representative first digitized data segment to form a corresponding at least one representative first segment feature representation corresponding to the representative first digitized data segment, that is, performing a feature space mapping operation on the representative first digitized data segment to represent the representative first digitized data segment in a vector form for facilitating subsequent further processing;

A representative first ordered set is constructed based on the at least one representative first segment feature representation, that is, the representative first ordered set includes representative first segment feature representations arranged in order.
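
As a non-limiting illustration, the construction of the first ordered set may be sketched as follows. The sketch assumes that each data segment is one fixed-length frame and that the feature space mapping operation is a plain linear projection; the names `split_into_segments`, `map_to_feature_space`, `build_ordered_set` and the `projection` matrix are hypothetical.

```python
def split_into_segments(samples, frame_len):
    """Cut the digitized audio into consecutive, non-overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def map_to_feature_space(segment, projection):
    """Represent one segment as a vector; projection is a list of rows of length len(segment)."""
    return [sum(w * x for w, x in zip(row, segment)) for row in projection]


def build_ordered_set(samples, frame_len, projection):
    """One segment feature representation per audio frame, kept in temporal order."""
    return [map_to_feature_space(segment, projection)
            for segment in split_into_segments(samples, frame_len)]


ordered_set = build_ordered_set([1, 2, 3, 4, 5, 6], 2, [[1, 0], [0, 1], [1, 1]])
```

Because the list is built frame by frame, the segment feature representations stay in one-to-one correspondence with the audio frames.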

It should be appreciated that in some possible embodiments, the step of mining, using the first initial audio mining network, the first audio information feature representation corresponding to the first audio digitized data according to the exemplary first ordered set further includes the substeps of:

according to the typical first ordered set, a corresponding first audio depth feature representation is mined by using a first depth mining unit included in the first initial audio mining network, wherein the first depth mining unit can be a convolution unit, namely, convolution operation is performed;

and analyzing the corresponding first audio information feature representation by using a first full-connection processing unit included in the first initial audio mining network according to the first audio depth feature representation, namely, performing full-connection processing on the first audio depth feature representation by using the first full-connection processing unit to obtain the corresponding first audio information feature representation.
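
As a non-limiting illustration, a convolution unit followed by a full-connection processing unit may be sketched as follows. The ReLU activation and the mean pooling over frames are assumptions added only to make the sketch complete; the embodiment does not specify them, and all function names, kernels and weights are hypothetical.

```python
def conv1d_over_frames(ordered_set, kernel):
    """Slide the kernel along the frame axis (no padding), channel by channel.

    This is the cross-correlation form commonly used in neural networks.
    """
    k = len(kernel)
    channels = len(ordered_set[0])
    out = []
    for t in range(len(ordered_set) - k + 1):
        window = ordered_set[t:t + k]
        out.append([sum(kernel[j] * window[j][c] for j in range(k))
                    for c in range(channels)])
    return out


def relu(vector):
    return [max(0.0, x) for x in vector]


def fully_connected(vector, weights, bias):
    """weights is a list of rows; each row yields one output value."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def mine_audio_information_feature(ordered_set, kernel, weights, bias):
    # Depth mining unit: convolution plus an assumed ReLU activation.
    depth = [relu(frame) for frame in conv1d_over_frames(ordered_set, kernel)]
    # Assumed mean pooling over frames to obtain one vector per audio clip.
    pooled = [sum(column) / len(depth) for column in zip(*depth)]
    # Full-connection processing unit.
    return fully_connected(pooled, weights, bias)


feature = mine_audio_information_feature(
    [[1, 2], [3, 4], [5, 6]], [0.5, 0.5], [[1, 1], [1, -1]], [0, 1])
```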

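It should be appreciated that, in some possible embodiments, step S120 in the foregoing implementation, that is, the step of mining, for each of the typical data combinations, the second audio information feature representation corresponding to the second audio digitized data using the second initial audio mining network, may further include the substeps described below:

splitting and combining the second audio digitized data to form a typical second ordered set, wherein each typical second segment feature representation included in the typical second ordered set corresponds to one audio frame in the second audio corresponding to the second audio digitized data, namely, the typical second segment feature representations are in one-to-one correspondence with the audio frames;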

and according to the typical second ordered set, performing depth mining by using the second initial audio mining network, and mining out a second audio information characteristic representation corresponding to the second audio digital data.

It should be appreciated that in some possible embodiments, the step of splitting and combining the second audio digitized data to form a typical second ordered set may further include the sub-steps described below:

performing a splitting and combining operation on the second audio digitized data based on the first number of data segments to form a typical second data segment cluster, wherein the typical second data segment cluster includes typical second digitized data segments equal in number to the first number of data segments; illustratively, when the second audio digitized data and the first audio digitized data are determined based on the original audio digitized data, the two may, as far as possible, include the same number of audio frames;

performing a feature space mapping operation on each representative second digitized data segment in the representative second data segment cluster to form at least one pending representative second segment feature representation, each pending representative second segment feature representation corresponding to one representative second digitized data segment; that is, the feature space mapping operation represents the representative second digitized data segment in vector form;

performing a noise tagging operation on each of the at least one pending representative second segment feature representation to form at least one tagged representative second segment feature representation, wherein the tagged representative second segment feature representations correspond one-to-one to the pending representative second segment feature representations, and each tagged representative second segment feature representation has an identification that characterizes whether noise is carried; for example, a first symbol indicates that noise is present and a second symbol indicates that no noise is present, or the identification may be handled in other ways;

performing a noise identification embedding operation on each of the at least one tagged representative second segment feature representation to form a corresponding at least one representative second segment feature representation, each representative second segment feature representation corresponding to one tagged representative second segment feature representation; for example, a feature space mapping operation may be performed on the identification of whether noise is carried to obtain a corresponding identification feature representation, and the identification feature representation may then be added to the pending representative second segment feature representation, for example by superposition or cascade combination;

A representative second ordered set is constructed based on the at least one representative second segment feature representation, that is, the representative second ordered set includes representative second segment feature representations.
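
As a non-limiting illustration, the noise tagging and noise identification embedding sub-steps may be sketched as follows. The lookup table `tag_embeddings` stands in for the feature space mapping of the identification; its values, the function name and the `mode` argument are hypothetical.

```python
def embed_noise_tags(pending_features, noise_flags, tag_embeddings, mode="superpose"):
    """Attach a noise identification to every pending segment feature representation.

    tag_embeddings maps 1 (noise carried) and 0 (no noise) to an identification vector.
    mode="superpose" adds the vectors; mode="cascade" concatenates them.
    """
    ordered_set = []
    for feature, flag in zip(pending_features, noise_flags):
        tag_vector = tag_embeddings[1 if flag else 0]
        if mode == "superpose":
            if len(tag_vector) != len(feature):
                raise ValueError("superposition needs vectors of equal length")
            ordered_set.append([f + t for f, t in zip(feature, tag_vector)])
        elif mode == "cascade":
            ordered_set.append(list(feature) + list(tag_vector))
        else:
            raise ValueError("mode must be 'superpose' or 'cascade'")
    return ordered_set


pending = [[1.0, 2.0], [3.0, 4.0]]
flags = [True, False]
embeddings = {0: [0.0, 0.0], 1: [0.5, -0.5]}
superposed = embed_noise_tags(pending, flags, embeddings)
cascaded = embed_noise_tags(pending, flags, embeddings, mode="cascade")
```

Superposition keeps the feature dimension unchanged, whereas cascade combination enlarges it by the length of the identification vector, so the choice affects the input size expected by the second initial audio mining network.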

It should be appreciated that in some possible embodiments, the step of performing a split combining operation in the second audio digitized data based on the first number of data segments to form a typical second cluster of data segments may further comprise the sub-steps described below:

performing audio evaluation operation on each typical second digitized data segment in the second audio digitized data, and outputting an audio evaluation parameter corresponding to each typical second digitized data segment, wherein the audio evaluation parameter is used for reflecting the noise amount of the typical second digitized data segment, and the noise amount can be the ratio between the energy value of noise and the energy value of non-noise;

screening the typical second digitized data segments in the second audio digitized data based on the audio evaluation parameters ordered from large to small, and constructing the typical second data segment cluster from the screened typical second digitized data segments, whose number is equal to the number of the first data segments; that is, the typical second digitized data segments with the smallest audio evaluation parameters, equal in number to the first data segments, can be screened out to construct the typical second data segment cluster; alternatively, in other embodiments, that number of typical second digitized data segments can be sampled at equal intervals.
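
As a non-limiting illustration, the two screening options above may be sketched as follows. The sketch assumes that the noise energy and non-noise energy of each segment are already available, since the embodiment does not state how they are estimated, and it keeps the segments with the smallest audio evaluation parameters. All function names are hypothetical.

```python
def audio_evaluation_parameter(noise_energy, non_noise_energy):
    """Ratio between the energy of noise and the energy of non-noise in one segment."""
    if non_noise_energy == 0:
        return float("inf")
    return noise_energy / non_noise_energy


def select_segments(segments, parameters, count):
    """Keep the count segments with the smallest parameters, in their original order."""
    ranked = sorted(range(len(segments)), key=lambda i: parameters[i])
    kept = sorted(ranked[:count])
    return [segments[i] for i in kept]


def sample_equally_spaced(segments, count):
    """Alternative option: pick count segments at equal intervals."""
    step = len(segments) / count
    return [segments[int(i * step)] for i in range(count)]


segments = ["a", "b", "c", "d"]
parameters = [0.9, 0.1, 0.5, 0.2]
cluster = select_segments(segments, parameters, 2)
sampled = sample_equally_spaced(segments, 2)
```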

It should be appreciated that in some possible embodiments, the step of mining, using the second initial audio mining network, a second audio information feature representation corresponding to the second audio digitized data according to the exemplary second ordered set further includes the substeps of:

mining, according to the typical second ordered set, a corresponding second audio depth feature representation using a second depth mining unit included in the second initial audio mining network, wherein the second depth mining unit is different from the first depth mining unit; the second depth mining unit can also be a convolution unit, that is, it performs a convolution operation, but the two units differ in the number of convolution kernels and in the connection mode of the convolution kernels; for example, the second depth mining unit may include more convolution kernels, and the connection mode may include parallel connection, cascade connection and the like;

and analyzing the corresponding second audio information feature representation by using a second full-connection processing unit included in the second initial audio mining network according to the second audio depth feature representation, namely, performing full-connection processing on the second audio depth feature representation by using the second full-connection processing unit to form a second audio information feature representation.

It should be appreciated that in some possible embodiments, the first number of typical data combinations includes a first typical data combination including semantically related first audio digitized data to be processed and second audio digitized data to be processed, and a second typical data combination including semantically related first audio digitized data to be analyzed and second audio digitized data to be analyzed, and the step S120 in the implementation of the foregoing, that is, for each of the typical data combinations, the step of mining, using a first initial audio mining network, a first audio information feature representation corresponding to the first audio digitized data, may further include the substeps described below:

and mining the first audio information characteristic representation to be processed corresponding to the first audio digital data to be processed by using the first initial audio mining network, and mining the first audio information characteristic representation to be analyzed corresponding to the first audio digital data to be analyzed by using the first initial audio mining network.

It should be appreciated that in some possible embodiments, the first number of typical data combinations includes a first typical data combination including semantically related first audio digitized data to be processed and second audio digitized data to be processed, and a second typical data combination including semantically related first audio digitized data to be analyzed and second audio digitized data to be analyzed, based on which step S120 in the implementation described above, that is, the step of mining, for each of the typical data combinations, a second audio information feature representation corresponding to the second audio digitized data using a second initial audio mining network, may further include the sub-steps of:

And mining the second audio information characteristic representation to be processed corresponding to the second audio digitized data to be processed by using the second initial audio mining network, and mining the second audio information characteristic representation to be analyzed corresponding to the second audio digitized data to be analyzed by using the second initial audio mining network.

It should be appreciated that, in some possible embodiments, step S130 in the foregoing implementation, that is, the step of constructing at least one relevant data combination and at least one irrelevant data combination based on the first audio information feature representation and the second audio information feature representation corresponding to each of the typical data combinations, may further include the substeps described below:

performing a combination operation on the second audio information feature representation to be processed and the first audio information feature representation to be processed to form a corresponding one of the relevant data combinations, i.e. the relevant data combination may comprise the second audio information feature representation to be processed and the first audio information feature representation to be processed, and performing a combination operation on the second audio information feature representation to be analyzed and the first audio information feature representation to be analyzed to form a corresponding one of the relevant data combinations, i.e. the relevant data combination may comprise the second audio information feature representation to be analyzed and the first audio information feature representation to be analyzed;

And performing a combination operation on the second audio information feature representation to be processed and the first audio information feature representation to be analyzed to form a corresponding one of the non-correlated data combinations, i.e. the non-correlated data combination may comprise the second audio information feature representation to be processed and the first audio information feature representation to be analyzed, and performing a combination operation on the second audio information feature representation to be analyzed and the first audio information feature representation to be processed to form a corresponding one of the non-correlated data combinations, i.e. the non-correlated data combination may comprise the second audio information feature representation to be analyzed and the first audio information feature representation to be processed.
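
As a non-limiting illustration, the pairing rule above may be sketched as follows, under the assumption that the feature representations are held in two lists indexed by typical data combination. The function name `build_combinations` and the placeholder strings are hypothetical.

```python
def build_combinations(first_features, second_features):
    """first_features[i] and second_features[i] come from typical data combination i.

    Pairs drawn from the same typical data combination are correlated;
    pairs drawn from different typical data combinations are uncorrelated.
    """
    correlated, uncorrelated = [], []
    for i, second in enumerate(second_features):
        for j, first in enumerate(first_features):
            if i == j:
                correlated.append((second, first))
            else:
                uncorrelated.append((second, first))
    return correlated, uncorrelated


first = ["first_to_be_processed", "first_to_be_analyzed"]
second = ["second_to_be_processed", "second_to_be_analyzed"]
correlated, uncorrelated = build_combinations(first, second)
```

With two typical data combinations this yields two correlated and two uncorrelated data combinations, matching the four pairings listed above.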

It should be appreciated that, in some possible embodiments, step S140 in the foregoing implementation, that is, the step of performing, based on the relevant data combination and the irrelevant data combination, an optimization adjustment operation on the network parameters of the first initial audio mining network and the network parameters of the second initial audio mining network to form a first target audio mining network and a second target audio mining network, may further include the substeps described below:

For any one of the second audio information feature representations, calculating a first product (for example, a dot product, or a sum of the element-wise products of the parameter values) between the second audio information feature representation and the first audio information feature representation that belongs to the same related data combination as the second audio information feature representation;

for any one second audio information feature representation, respectively calculating a second product between the second audio information feature representation and a first audio information feature representation belonging to the same non-relevant data combination as the second audio information feature representation, so that at least one corresponding second product of the second audio information feature representation can be obtained, wherein the number of the second products is equal to the number of the non-relevant data combinations where the second audio information feature representation is located;

performing optimization adjustment operation on the first products to form first optimization parameters, and performing optimization adjustment operation on each second product to form corresponding second optimization parameters;

determining, based on the first optimization parameter and each second optimization parameter, the local error index parameter corresponding to the second audio information feature representation;

performing an optimization adjustment operation on the network parameters of the first initial audio mining network and the network parameters of the second initial audio mining network based on the local error index parameter corresponding to each second audio information feature representation, to form a first target audio mining network and a second target audio mining network; for example, the local error index parameters corresponding to the second audio information feature representations may be weighted and summed to form a target error index parameter, and the network parameters of the first initial audio mining network and of the second initial audio mining network are then optimized and adjusted in the direction that reduces the target error index parameter.

Wherein it should be understood that, in some possible embodiments, the step of performing an optimization adjustment operation on the first products to form first optimization parameters, and performing an optimization adjustment operation on each of the second products to form corresponding second optimization parameters, may further include the substeps described below:

calculating the ratio between the first product and a pre-configured adjustment parameter to form the first optimization parameter, wherein the adjustment parameter adjusts the degree of attention paid to difficult typical data: a smaller adjustment parameter focuses more on distinguishing a piece of typical data from the other typical data most similar to it; the specific value of the adjustment parameter is not limited and can be configured according to actual requirements;

For each second product, calculating a ratio between the second product and the adjustment parameter to form a second optimization parameter corresponding to the second product.
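
As a non-limiting illustration, the first and second products and their division by the adjustment parameter may be sketched as follows, taking the dot product as the product. The function names and the numeric values are hypothetical.

```python
def dot(a, b):
    """Sum of element-wise products of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def optimization_parameters(second_feature, related_first, unrelated_firsts, adjustment):
    """Return (first optimization parameter, list of second optimization parameters).

    related_first shares a related data combination with second_feature;
    each entry of unrelated_firsts shares a non-related data combination with it.
    """
    first_opt = dot(second_feature, related_first) / adjustment
    second_opts = [dot(second_feature, first) / adjustment for first in unrelated_firsts]
    return first_opt, second_opts


first_opt, second_opts = optimization_parameters(
    [1.0, 2.0], [3.0, 4.0], [[1.0, 0.0], [0.0, -1.0]], adjustment=0.5)
```

Dividing by a smaller adjustment parameter stretches the gaps between the products, which is why smaller values put more weight on the most similar non-related data.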

Wherein it should be understood that, in some possible embodiments, the step of determining, based on the first optimization parameter and each of the second optimization parameters, the local error index parameter corresponding to the second audio information feature representation may further include the substeps described below:

performing exponential operation on the first optimization parameters to obtain first exponential operation values, and performing exponential operation on each second optimization parameter to obtain second exponential operation values;

calculating the sum of the second exponential operation values to obtain a target sum, calculating the difference between the first exponential operation value and the target sum to obtain a target difference, calculating the ratio between the first exponential operation value and the target difference to obtain a target ratio, performing a logarithmic operation on the target ratio to obtain a target logarithmic result, and determining the local error index parameter corresponding to the second audio information feature representation based on the target logarithmic result, wherein the local error index parameter and the target logarithmic result are inversely related; for example, the sum of the local error index parameter and the target logarithmic result is equal to a predetermined target value, and the specific value of the target value is not limited, such as 0, 1, 2, 3 and the like.
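
As a non-limiting illustration, the calculation above may be sketched as follows. With the default `combine="difference"` the sketch follows the wording of the sub-step literally, and it raises an error when the target difference is not positive, because the logarithm is then undefined. The option `combine="sum"` is not described in the embodiment; it is included purely as an assumption, since adding the target sum to the first exponential operation value is the form used by the widely known InfoNCE contrastive loss and keeps the logarithm defined for all inputs. All names are hypothetical.

```python
import math


def local_error_index(first_opt, second_opts, target_value=0.0, combine="difference"):
    """Local error index parameter for one second audio information feature representation."""
    first_exp = math.exp(first_opt)
    target_sum = sum(math.exp(value) for value in second_opts)
    if combine == "difference":      # wording of the sub-step
        denominator = first_exp - target_sum
    elif combine == "sum":           # assumed InfoNCE-style alternative
        denominator = first_exp + target_sum
    else:
        raise ValueError("combine must be 'difference' or 'sum'")
    if denominator <= 0.0:
        raise ValueError("target ratio is not positive, so its logarithm is undefined")
    target_log = math.log(first_exp / denominator)
    # Local error index parameter + target logarithmic result = target value.
    return target_value - target_log


def target_error_index(local_errors, weights):
    """Weighted sum of the local error index parameters."""
    return sum(w * e for w, e in zip(weights, local_errors))


literal = local_error_index(math.log(4.0), [0.0, 0.0])
assumed = local_error_index(math.log(4.0), [0.0, 0.0], combine="sum")
```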

It should be appreciated that, in some possible embodiments, the step S160 in the foregoing implementation, that is, the step of using the target data denoising network to perform denoising processing based on the first target audio feature representation and the second target audio feature representation to output denoising audio digitized data corresponding to the target audio digitized data, may further include the substeps described below:

performing focus feature analysis operation on the second target audio feature representation based on the first target audio feature representation by using a focus fusion unit included in the target data denoising network to output a focus audio feature representation corresponding to the second target audio feature representation, so that semantic information of the focus audio feature representation can be enhanced;

performing feature restoration operation on the focused audio feature representation by using a feature restoration unit included in the target data denoising network to output a restored audio digitized data segment corresponding to the second target audio feature representation, for example, the feature restoration unit may be a decoding network;

and replacing the data segment corresponding to the second target audio feature representation in the target audio digitized data with the restored audio digitized data segment to form denoised audio digitized data corresponding to the target audio digitized data; that is, the noisy data is replaced with noise-free data, so that the denoising processing is realized and the denoised audio digitized data is obtained.
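
As a non-limiting illustration, the final replacement sub-step may be sketched as follows, assuming the target audio digitized data is held as a list of equally handled segments and that the indices of the noisy segments are known. The function name and arguments are hypothetical, and the restored segments would in practice come from the feature restoration unit.

```python
def replace_noisy_segments(target_segments, noisy_indices, restored_segments):
    """Swap each noisy segment for its restored segment and return the flattened audio."""
    denoised = [list(segment) for segment in target_segments]
    for index, restored in zip(noisy_indices, restored_segments):
        if len(restored) != len(denoised[index]):
            raise ValueError("restored segment must match the length of the segment it replaces")
        denoised[index] = list(restored)
    return [sample for segment in denoised for sample in segment]


denoised_audio = replace_noisy_segments([[1, 2], [9, 9], [5, 6]], [1], [[3, 4]])
```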

Wherein it should be understood that, in some possible embodiments, the step of performing, based on the first target audio feature representation, a focus feature analysis operation on the second target audio feature representation to output a focus audio feature representation corresponding to the second target audio feature representation by using a focus fusion unit included in the target data denoising network, further includes the substeps described below:

performing a mapping operation (for example, multiplication) on the second target audio feature representation using a first mapping parameter distribution (which may be a matrix or a vector) carried by a first focusing fusion subunit included in the focus fusion unit of the target data denoising network, to output a corresponding first mapping feature representation; performing mapping operations on the first target audio feature representation using a second mapping parameter distribution and a third mapping parameter distribution carried by the first focusing fusion subunit, to output a corresponding second mapping feature representation and a corresponding third mapping feature representation; multiplying and fusing the transposed result of the second mapping feature representation with the first mapping feature representation to obtain a corresponding similarity parameter; and weighting the third mapping feature representation based on the similarity parameter to output a corresponding first-stage focused audio feature representation;

performing a mapping operation (for example, multiplication) on the first-stage focused audio feature representation using a first mapping parameter distribution (which may be a matrix or a vector) carried by a second focusing fusion subunit included in the focus fusion unit of the target data denoising network, to output a corresponding first mapping feature representation; performing mapping operations on the first target audio feature representation using a second mapping parameter distribution and a third mapping parameter distribution carried by the second focusing fusion subunit, to output a corresponding second mapping feature representation and a corresponding third mapping feature representation; multiplying and fusing the transposed result of the second mapping feature representation with the first mapping feature representation to obtain a corresponding similarity parameter; and weighting the third mapping feature representation based on the similarity parameter to output a corresponding second-stage focused audio feature representation;

and so on in sequence, until: performing a mapping operation (for example, multiplication) on the penultimate-stage focused audio feature representation using a first mapping parameter distribution (which may be a matrix or a vector) carried by the last focusing fusion subunit included in the focus fusion unit of the target data denoising network, to output a corresponding first mapping feature representation; performing mapping operations on the first target audio feature representation using a second mapping parameter distribution and a third mapping parameter distribution carried by the last focusing fusion subunit, to output a corresponding second mapping feature representation and a corresponding third mapping feature representation; multiplying and fusing the transposed result of the second mapping feature representation with the first mapping feature representation to obtain a corresponding similarity parameter; and weighting the third mapping feature representation based on the similarity parameter to output a corresponding final-stage focused audio feature representation;

And performing aggregation operation on the final-stage focusing audio feature representation and the second target audio feature representation to output a focusing audio feature representation corresponding to the second target audio feature representation, wherein the aggregation operation can be superposition or cascade combination and the like.
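
As a non-limiting illustration, the cascaded focusing fusion subunits may be sketched as follows. The sketch treats each feature representation as a list of frame vectors and each mapping parameter distribution as a matrix. The softmax normalization of the similarity parameter is an assumption borrowed from common attention mechanisms and is not stated in the embodiment; the aggregation is done by superposition. All names are hypothetical.

```python
import math


def matmul(a, b):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(row) for row in zip(*m)]


def softmax_rows(m):
    """Normalize each row so that its weights are positive and sum to 1 (assumed step)."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def focus_stage(query_source, first_feature, w1, w2, w3):
    """One focusing fusion subunit."""
    first_map = matmul(query_source, w1)     # from the noisy side or the previous stage
    second_map = matmul(first_feature, w2)   # from the first target audio feature
    third_map = matmul(first_feature, w3)    # from the first target audio feature
    similarity = softmax_rows(matmul(first_map, transpose(second_map)))
    return matmul(similarity, third_map)     # third mapping weighted by similarity


def focus_fusion(first_feature, second_feature, stages):
    """Run the subunits in cascade, then aggregate with the second target audio feature."""
    focused = second_feature
    for w1, w2, w3 in stages:
        focused = focus_stage(focused, first_feature, w1, w2, w3)
    return [[a + b for a, b in zip(f_row, s_row)]
            for f_row, s_row in zip(focused, second_feature)]


identity = [[1.0, 0.0], [0.0, 1.0]]
fused = focus_fusion([[2.0, 3.0]], [[1.0, 0.0], [0.0, 1.0]],
                     [(identity, identity, identity)] * 2)
```

Every stage reads the first target audio feature representation again for its second and third mappings, while only the input to the first mapping is passed on from stage to stage, which mirrors the cascade described above.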

Referring to fig. 3, an embodiment of the present invention further provides a denoising apparatus for digitized signal data, which is applicable to the foregoing denoising system for digitized signal data. Wherein the digitized signal data denoising apparatus (a software functional apparatus) may include:

a typical data extraction module for extracting a first number of typical data combinations, each of the typical data combinations comprising semantically related first audio digitized data and second audio digitized data, the second audio digitized data having noise data;

the information feature mining module is used for mining first audio information feature representations corresponding to the first audio digital data by using a first initial audio mining network for each typical data combination; and for each of the typical data combinations, mining a second audio information feature representation corresponding to the second audio digitized data using a second initial audio mining network;

An information feature combination module for constructing at least one correlated data combination and at least one uncorrelated data combination based on a first audio information feature representation and a second audio information feature representation corresponding to each of the typical data combinations, each of the correlated data combinations comprising a first audio information feature representation and a second audio information feature representation belonging to the same typical data combination, each of the uncorrelated data combinations comprising a first audio information feature representation and a second audio information feature representation belonging to different typical data combinations;

the network optimization adjustment module is used for performing optimization adjustment operation on the network parameters of the first initial audio mining network and the network parameters of the second initial audio mining network based on the related data combination and the non-related data combination so as to form a first target audio mining network and a second target audio mining network;

the audio feature mining module is used for mining different data fragments in the target audio digital data by utilizing the first target audio mining network and the second target audio mining network respectively so as to mine out a first target audio feature representation and a second target audio feature representation, wherein the data fragments corresponding to the second target audio feature representation belong to fragments with noise data in the target audio digital data;

The audio denoising processing module is used for denoising processing based on the first target audio feature representation and the second target audio feature representation by utilizing a target data denoising network so as to output denoising audio digitized data corresponding to the target audio digitized data, wherein the target data denoising network is formed by performing network optimization operation on an initial data denoising network based on a first audio information feature representation and a second audio information feature representation respectively mined by the first target audio mining network and the second target audio mining network and combining noise-free audio digitized data serving as labels.

In summary, the digitized signal data denoising method and system provided by the invention can first mine a first audio information feature representation of the first audio digitized data and a second audio information feature representation of the second audio digitized data; construct a correlated data combination and a non-correlated data combination based on the first audio information feature representation and the second audio information feature representation; form a first target audio mining network and a second target audio mining network based on the correlated data combination and the non-correlated data combination; mine a first target audio feature representation and a second target audio feature representation using the first target audio mining network and the second target audio mining network; and output denoised audio digitized data based on the first target audio feature representation and the second target audio feature representation using the target data denoising network. Based on the above, the first target audio feature representation and the second target audio feature representation are respectively mined by the first target audio mining network and the second target audio mining network, both of which are formed by network optimization, so the mining precision is higher. Because these two feature representations serve as the basis for the denoised audio digitized data, their higher precision makes the resulting denoised audio digitized data more reliable. The reliability of the denoising processing is thereby improved to a certain extent, which addresses the low reliability of the prior art.

The above description covers only the preferred embodiments of the present invention and is not intended to limit the present invention; those skilled in the art may make various modifications and variations to the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for denoising digitized signal data, comprising:

2. The method of denoising digitized signal data of claim 1 wherein the step of extracting a first number of representative data combinations comprises:

3. The method of denoising digitized signal data of claim 1 wherein for each of said typical data combinations, the step of mining a first representation of audio information characteristics corresponding to said first audio digitized data using a first initial audio mining network comprises:

4. The method of denoising digitized signal data of claim 3 wherein the step of splitting and combining the first audio digitized data to form a representative first ordered set comprises:

5. The method of denoising digitized signal data of claim 3 wherein the step of splitting and combining the second audio digitized data to form a representative second ordered set comprises:

6. The method of denoising digitized signal data of claim 5 wherein the step of performing a split combining operation in the second audio digitized data based on the first number of data segments to form a representative second cluster of data segments comprises:

7. The method of denoising digitized signal data of claim 3 wherein the step of mining a first representation of audio information characteristics corresponding to the first digitized audio data using the first initial audio mining network in accordance with the representative first ordered set comprises:

8. The method of denoising digitized signal data of claim 1 wherein the first number of canonical data combinations comprises a first canonical data combination comprising semantically related first audio digitized data to be processed and second audio digitized data to be processed and a second canonical data combination comprising semantically related first audio digitized data to be analyzed and second audio digitized data to be analyzed;

9. The method for denoising digitized signal data of any one of claims 1 to 8, wherein the step of denoising, using a target data denoising network, based on the first target audio feature representation and the second target audio feature representation, to output denoised audio digitized data corresponding to the target audio digitized data comprises:

10. A digitized signal data denoising system comprising a processor and a memory, the memory for storing a computer program, the processor for executing the computer program to implement the method of any one of claims 1-9.