CN114900779B - Audio compensation method, system and electronic equipment

Audio compensation method, system and electronic equipment

Info

Publication number
CN114900779B
Authority
CN
China
Prior art keywords
feature matrix
feature
audio data
taming
neural network
Prior art date
Legal status
Active
Application number
CN202210383817.6A
Other languages
Chinese (zh)
Other versions
CN114900779A (en)
Inventor
李怀子
李建军
袁德中
Current Assignee
Honsenn Technology Co ltd
Original Assignee
Honsenn Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Honsenn Technology Co ltd filed Critical Honsenn Technology Co ltd
Priority to CN202210383817.6A priority Critical patent/CN114900779B/en
Publication of CN114900779A publication Critical patent/CN114900779A/en
Application granted granted Critical
Publication of CN114900779B publication Critical patent/CN114900779B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04R — LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R25/00 — Deaf-aid sets, i.e. electro-acoustic or electro-mechanical hearing aids; Electric tinnitus maskers providing an auditory perception
    • H04R25/48 — Deaf-aid sets using constructional means for obtaining a desired frequency response
    • H04R25/50 — Customised settings for obtaining desired overall acoustical characteristics
    • H04R25/505 — Customised settings using digital signal processing

Landscapes

  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Neurosurgery (AREA)
  • Otolaryngology (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present application discloses an audio compensation method, system and electronic device. The high-dimensional local feature distribution, in the time dimension, of the audio waveform diagrams of a first earphone and a second earphone is extracted through the convolutional neural network of a Taming converter, and a differential feature matrix is obtained by computing the difference between the two. The global associated features of the person's center-of-gravity change data within a preset time period are mined through an encoder to obtain a third feature matrix. Further, the image semantics of the waveforms are reconstructed according to time-sequence confidence through the context expression of the discrete data, so that the distribution similarity, under different dimensional view angles of the high-dimensional feature space, between the feature expression of the differential feature matrix and the feature expression of the third feature matrix constrains the correlation of the local feature descriptions between the feature matrices through the geometric similarity of their distributions, thereby autoregressively strengthening the dependency between feature expressions during the training of the Taming model.

Description

Audio compensation method, system and electronic equipment
Technical Field
The present application relates to the field of audio compensating headphones, and more particularly, to an audio compensating method, system, and electronic device.
Background
A hearing aid is an instrument that helps to improve hearing. In essence, it is a small semiconductor loudspeaker that amplifies relatively weak sounds and transmits them to an earphone, so that a position whose hearing is impaired can hear the sound through amplification.
Currently, a hearing aid requires the hearing curves of the left and right ears to be measured through a hearing test, and hearing at different frequency points is then compensated according to these curves. However, current hearing tests are inconvenient: the user must go to a hospital or a professional institution to have a hearing curve measured before compensation can be applied, which is a burden for people who use hearing aids. Existing hearing aids also target only the speech band and cannot compensate audio over the full frequency band; for example, a person wearing a hearing aid can hardly hear stereo music during exercise because of the influence of motion-related environmental noise. Therefore, in order to better eliminate motion noise, an audio compensation method is desired.
At present, deep learning and neural networks have been widely used in the fields of computer vision, natural language processing, speech signal processing, and the like. In addition, deep learning and neural networks have also shown levels approaching and even exceeding humans in the fields of image classification, object detection, semantic segmentation, text translation, and the like.
The development of deep learning and neural networks provides new ideas and solutions for audio compensation.
Disclosure of Invention
The present application has been made in order to solve the above technical problems. The embodiments of the application provide an audio compensation method, an audio compensation system and an electronic device. The high-dimensional local feature distribution, in the time dimension, of the audio waveform diagrams of a first earphone and a second earphone is extracted through the convolutional neural network of a Taming converter, and a differential feature matrix is obtained by computing their difference; the global associated features of the person's center-of-gravity change data within a preset time period are mined through an encoder to obtain a third feature matrix. Further, the image semantics of the waveforms are reconstructed according to time-sequence confidence through the context expression of the discrete data, and the distribution similarity, under different dimensional view angles of the high-dimensional feature space, between the feature expression of the differential feature matrix and the feature expression of the third feature matrix constrains the correlation of the local feature descriptions between the feature matrices through the geometric similarity of their distributions, thereby autoregressively strengthening the dependency between feature expressions during the training of the Taming model.
According to one aspect of the present application, there is provided an audio compensation method, comprising:
a training phase comprising:
acquiring first audio data transmitted to a first earphone in a preset time period and second audio data transmitted to a second earphone from the first earphone;
acquiring gravity center change data of a person in the preset time period;
respectively passing the waveform diagram of the first audio data and the waveform diagram of the second audio data through a convolutional neural network of a Taming converter to obtain a first feature matrix and a second feature matrix;
calculating the difference according to the position between the first feature matrix and the second feature matrix to obtain a difference feature matrix;
the gravity center change data of the person in the preset time period pass through a context encoder of the Taming converter, which comprises an embedded layer, so as to convert the gravity center change data of the person in the preset time period into a plurality of feature vectors, and the feature vectors are arranged in two dimensions to obtain a third feature matrix;
calculating Euclidean distance between the differential feature matrix and the third feature matrix as a first loss term;
calculating manifold dimension distribution similarity between the differential feature matrix and the third feature matrix as a second loss term, wherein the manifold dimension distribution similarity is determined based on a ratio between a cosine distance between the differential feature matrix and the third feature matrix and a two-norm of a differential matrix between the differential feature matrix and the third feature matrix; and
Calculating a weighted sum of the first and second loss terms as a loss function value to train a convolutional neural network of the Taming converter; and
an inference phase comprising:
acquiring second audio data transmitted to a second earphone;
passing the waveform diagram of the second audio data through the trained convolutional neural network to obtain a feature matrix to be decoded; and
the feature matrix to be decoded is passed through a generator to generate a hearing compensation curve corresponding to the second ear.
According to another aspect of the present application, there is provided an audio compensation system, comprising:
a training module, comprising:
an audio data acquisition unit configured to acquire first audio data propagated to a first headphone and second audio data propagated from the first headphone to a second headphone within a preset period of time;
a gravity center data acquisition unit for acquiring gravity center change data of the person in the preset time period;
the feature extraction unit is used for respectively passing the waveform diagram of the first audio data obtained by the audio data obtaining unit and the waveform diagram of the second audio data obtained by the audio data obtaining unit through a convolutional neural network of a Taming converter to obtain a first feature matrix and a second feature matrix;
A difference unit configured to calculate a per-position difference between the first feature matrix obtained by the feature extraction unit and the second feature matrix obtained by the feature extraction unit to obtain a differential feature matrix;
the encoding unit is used for converting the gravity center change data of the person in the preset time period into a plurality of feature vectors through a context encoder of the Taming converter, wherein the context encoder comprises an embedded layer, and the gravity center change data of the person in the preset time period is obtained by the gravity center data obtaining unit, and the feature vectors are two-dimensionally arranged to obtain a third feature matrix;
a first loss term calculation unit configured to calculate, as a first loss term, a euclidean distance between the differential feature matrix obtained by the differential unit and the third feature matrix obtained by the encoding unit;
a second loss term calculation unit configured to calculate, as a second loss term, manifold dimension distribution similarity between the differential feature matrix obtained by the differential unit and the third feature matrix obtained by the encoding unit, the manifold dimension distribution similarity being determined based on a ratio between a cosine distance between the differential feature matrix and the third feature matrix and a two-norm of a differential matrix between the differential feature matrix and the third feature matrix; and
A training unit for calculating a weighted sum of the first loss term obtained by the first loss term calculation unit and the second loss term obtained by the second loss term calculation unit as a loss function value to train a convolutional neural network of the Taming converter; and
an inference module comprising:
an inferred audio data acquisition unit configured to acquire second audio data propagated to a second headphone;
the decoding feature matrix generating unit is used for passing the waveform diagram of the second audio data obtained by the inferred audio data obtaining unit through the convolutional neural network which is completed through training so as to obtain a feature matrix to be decoded; and
and the decoding unit is used for passing the feature matrix to be decoded obtained by the decoding feature matrix generating unit through a generator to generate a hearing compensation curve corresponding to the second ear.
According to still another aspect of the present application, there is provided an electronic apparatus including: a processor; and a memory having stored therein computer program instructions that, when executed by the processor, cause the processor to perform the audio compensation method as described above.
According to yet another aspect of the present application, there is provided a computer readable medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the audio compensation method as described above.
According to the audio compensation method, system and electronic device described above, the high-dimensional local feature distribution, in the time dimension, of the audio waveform diagrams of the first earphone and the second earphone is extracted through the convolutional neural network of the Taming converter, and the difference between the two is computed to obtain a differential feature matrix; the global associated features of the person's center-of-gravity change data within the preset time period are mined through an encoder to obtain a third feature matrix. Further, the image semantics of the waveforms are reconstructed according to time-sequence confidence through the context expression of the discrete data, so that the distribution similarity, under different dimensional view angles of the high-dimensional feature space, between the feature expression of the differential feature matrix and the feature expression of the third feature matrix constrains the correlation of the local feature descriptions between the feature matrices through the geometric similarity of their distributions, thereby autoregressively strengthening the dependency between feature expressions during the training of the Taming model.
Drawings
The foregoing and other objects, features and advantages of the present application will become more apparent from the following more particular description of embodiments of the present application, as illustrated in the accompanying drawings. The accompanying drawings are included to provide a further understanding of embodiments of the application and are incorporated in and constitute a part of this specification, illustrate the application and not constitute a limitation to the application. In the drawings, like reference numerals generally refer to like parts or steps.
Fig. 1 is a schematic view of a scenario of an audio compensation method according to an embodiment of the present application.
Fig. 2A is a flowchart of a training phase in an audio compensation method according to an embodiment of the present application.
Fig. 2B is a flow chart of an inference phase in an audio compensation method according to an embodiment of the present application.
Fig. 3A is a schematic diagram of a training phase architecture in an audio compensation method according to an embodiment of the present application.
Fig. 3B is a schematic architecture diagram of an inference phase in an audio compensation method according to an embodiment of the present application.
Fig. 4 is a block diagram of an audio compensation system according to an embodiment of the present application.
Fig. 5 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application and not all of the embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein.
Scene overview
As described above, in the prior art, a hearing aid requires the hearing curves of the left and right ears to be measured through a hearing test, and hearing at different frequency points is then compensated according to these curves. However, current hearing tests are inconvenient: the user must go to a hospital or a professional institution to have a hearing curve measured before compensation can be applied, which is a burden for people who use hearing aids. Existing hearing aids also target only the speech band and cannot compensate audio over the full frequency band; for example, a person wearing a hearing aid can hardly hear stereo music during exercise because of the influence of motion-related environmental noise. Therefore, in order to better eliminate motion noise, an audio compensation method is desired.
At present, deep learning and neural networks have been widely used in the fields of computer vision, natural language processing, speech signal processing, and the like. In addition, deep learning and neural networks have also shown levels approaching and even exceeding humans in the fields of image classification, object detection, semantic segmentation, text translation, and the like.
The development of deep learning and neural networks provides new ideas and solutions for audio compensation.
Accordingly, in the solution of the present application, since audio data is first transferred from an audio output device (e.g. a smartphone) to a first earpiece and then from the first earpiece to a second earpiece, a propagation offset arises between the audio data transferred to the first earpiece and the audio data transferred to the second earpiece, and this offset may create different deviations for the person wearing the hearing aid during movement. It is therefore desirable to invoke an audio compensation curve in the music mode to cancel the motion noise, so that the left and right ears can continuously experience the stereo effect. This is essentially a regression problem, i.e. the intelligent regression generation of a hearing compensation curve, based on the audio data arriving at the first earpiece and the audio data arriving at the second earpiece, to compensate the audio signal of the second earpiece, so that the influence of motion noise can be eliminated.
Specifically, in the technical scheme of the application, first audio data transmitted to a first earphone in a preset time period and second audio data transmitted to a second earphone from the first earphone are firstly obtained from the earphones, and gravity center change data of a human body in the preset time period is obtained through a sensor. And then, processing the waveform diagram of the first audio data and the waveform diagram of the second audio data in a convolutional neural network of a Taming converter respectively to extract high-dimensional local feature distribution of the waveform diagrams of the first audio data and the second audio data respectively, so as to obtain a first feature matrix and a second feature matrix. Then, the difference feature matrix can be obtained by calculating the difference according to the position between the first feature matrix and the second feature matrix, so that the difference feature of the audio feature of the first earphone and the audio feature of the second earphone can be obtained, and the hearing compensation curve of the second ear can be better extracted later.
And the obtained gravity center change data of the person in the preset time period is encoded in a context encoder containing an embedded layer of the Taming converter, so that the gravity center change data of the person in the preset time period is converted into a plurality of feature vectors with global gravity center feature associated information. And then, the feature vectors are two-dimensionally arranged to integrate feature association information of gravity center change of people in all preset time periods, so that a third feature matrix is obtained.
It should be appreciated that the Taming model learns the context expression of the discrete data and reconstructs the image semantics according to time-sequence confidence by minimizing the Euclidean distance between the differential feature matrix $M_2$ and the third feature matrix $M_3$. However, in order to improve the dependency of the feature expression of the differential feature matrix on the feature expression of the third feature matrix, further constraints need to be applied to different pipelines of the Taming model under different dimensions and local view angles.
That is, specifically, for the differential feature matrix $M_2$ and the third feature matrix $M_3$ obtained by the Taming model, in addition to minimizing the Euclidean distance between the two, namely $\arg\min \|M_2 - M_3\|_2$, an additional term is added that minimizes the manifold dimension distribution similarity between the two, expressed as:

$$\frac{\cos(M_2, M_3)}{\|M_2 - M_3\|_2}$$

where $\cos(M_2, M_3)$ represents the cosine distance between the differential feature matrix $M_2$ and the third feature matrix $M_3$.
Thus, the constraints of the Taming model are generally expressed as:

$$L = \alpha \cdot \|M_2 - M_3\|_2 + \beta \cdot \frac{\cos(M_2, M_3)}{\|M_2 - M_3\|_2}$$

where $\alpha$ and $\beta$ are weighting hyperparameters and are initially set such that $\alpha > \beta$.
The weighted sum of the two loss terms can then be calculated to train the convolutional neural network of the Taming converter.
In this way, the corrected differential matrix obtained by the Taming model not only reconstructs the image semantics of the waveform according to time-sequence confidence through the context expression of the discrete data, but also uses the distribution similarity, under different dimensional view angles of the high-dimensional feature space, between the feature expression of the differential feature matrix and the feature expression of the third feature matrix to constrain the correlation of the local feature descriptions between the feature matrices through the geometric similarity of their distributions, thereby autoregressively strengthening the dependency between feature expressions during the training of the Taming model.
After training is completed, in the inference process, the second audio data of the second earphone can be directly input into the trained convolutional neural network for feature extraction to obtain a feature matrix to be decoded, and this feature matrix is then decoded and regressed through a generator to obtain a hearing compensation curve corresponding to the second ear.
Based on this, the present application proposes an audio compensation method comprising: a training phase and an inference phase. Wherein the training phase comprises the steps of: acquiring first audio data transmitted to a first earphone in a preset time period and second audio data transmitted to a second earphone from the first earphone; acquiring gravity center change data of a person in the preset time period; respectively passing the waveform diagram of the first audio data and the waveform diagram of the second audio data through a convolutional neural network of a Taming converter to obtain a first feature matrix and a second feature matrix; calculating the difference according to the position between the first feature matrix and the second feature matrix to obtain a difference feature matrix; the gravity center change data of the person in the preset time period pass through a context encoder of the Taming converter, which comprises an embedded layer, so as to convert the gravity center change data of the person in the preset time period into a plurality of feature vectors, and the feature vectors are arranged in two dimensions to obtain a third feature matrix; calculating Euclidean distance between the differential feature matrix and the third feature matrix as a first loss term; calculating manifold dimension distribution similarity between the differential feature matrix and the third feature matrix as a second loss term, wherein the manifold dimension distribution similarity is determined based on a ratio between a cosine distance between the differential feature matrix and the third feature matrix and a two-norm of a differential matrix between the differential feature matrix and the third feature matrix; and computing a weighted sum of the first and second loss terms as a loss function value to train a convolutional neural network of the Taming converter. Wherein the inference phase comprises the steps of: acquiring second audio data transmitted to a second earphone; passing the waveform diagram of the second audio data through the trained convolutional neural network to obtain a feature matrix to be decoded; and passing the feature matrix to be decoded through a generator to generate a hearing compensation curve corresponding to the second ear.
Fig. 1 illustrates a schematic view of a scenario of an audio compensation method according to an embodiment of the present application. As shown in fig. 1, in the training phase of the application scenario, first audio data that propagates to a first earphone (e.g., H1 as illustrated in fig. 1) from an audio output device (e.g., T as illustrated in fig. 1) for a preset period of time and second audio data that propagates to a second earphone (e.g., H2 as illustrated in fig. 1) from the first earphone are acquired from an earphone worn by a moving human body (e.g., P as illustrated in fig. 1), and gravity center change data of the human body for this preset period of time is acquired by a sensor (e.g., R as illustrated in fig. 1) provided in the earphone. Here, the audio output device includes, but is not limited to, a smart phone, a smart bracelet, and the like. Then, the first and second audio data and the center of gravity variation data of the human body for the preset period of time are input into a server (e.g., S as illustrated in fig. 1) in which an audio compensation algorithm is deployed, wherein the server is capable of training a convolutional neural network of the Tamine converter for audio compensation with the first and second audio data and the center of gravity variation data of the human body for the preset period of time based on the audio compensation algorithm.
After the training is completed, in the inference phase, first, second audio data propagated from the first earphone (e.g., H1 as illustrated in fig. 1) to the second earphone (e.g., H2 as illustrated in fig. 1) for a preset period of time is acquired from the earphones worn by the moving human body (e.g., P as illustrated in fig. 1). The second audio data is then input into a server (e.g., S as illustrated in fig. 1) that is deployed with an audio compensation algorithm, wherein the server is capable of processing the second audio data with the audio compensation algorithm to generate a hearing compensation curve corresponding to the second ear.
Having described the basic principles of the present application, various non-limiting embodiments of the present application will now be described in detail with reference to the accompanying drawings.
Exemplary method
Fig. 2A illustrates a flowchart of a training phase in an audio compensation method according to an embodiment of the present application. As shown in fig. 2A, an audio compensation method according to an embodiment of the present application includes: the training stage comprises the following steps: s110, acquiring first audio data transmitted to a first earphone in a preset time period and second audio data transmitted to a second earphone from the first earphone; s120, acquiring gravity center change data of the person in the preset time period; s130, respectively passing the waveform diagram of the first audio data and the waveform diagram of the second audio data through a convolutional neural network of a Taming converter to obtain a first feature matrix and a second feature matrix; s140, calculating the difference between the first feature matrix and the second feature matrix according to the position to obtain a difference feature matrix; s150, the gravity center change data of the person in the preset time period pass through a context encoder of the Taming converter, which comprises an embedded layer, so as to convert the gravity center change data of the person in the preset time period into a plurality of feature vectors, and the feature vectors are arranged in two dimensions so as to obtain a third feature matrix; s160, calculating the Euclidean distance between the differential feature matrix and the third feature matrix as a first loss term; s170, calculating manifold dimension distribution similarity between the differential feature matrix and the third feature matrix as a second loss term, wherein the manifold dimension distribution similarity is determined based on the ratio between the cosine distance between the differential feature matrix and the third feature matrix and the second norm of the differential matrix between the differential feature matrix and the third feature matrix; and S180, calculating a weighted sum of the first loss term and the second loss term as a loss function value to train the convolutional neural network of the Taming converter.
Fig. 2B illustrates a flow chart of an inference phase in an audio compensation method according to an embodiment of the present application. As shown in fig. 2B, the audio compensation method according to the embodiment of the present application further includes: an inference phase comprising the steps of: s210, acquiring second audio data transmitted to a second earphone; s220, passing the waveform diagram of the second audio data through the trained convolutional neural network to obtain a feature matrix to be decoded; and S230, passing the feature matrix to be decoded through a generator to generate a hearing compensation curve corresponding to the second ear.
Fig. 3A illustrates an architectural diagram of a training phase in an audio compensation method according to an embodiment of the present application. As shown in fig. 3A, in the training phase, in the network architecture, first, the acquired waveform diagram of the first audio data (e.g., P1 as illustrated in fig. 3A) and the acquired waveform diagram of the second audio data (e.g., P2 as illustrated in fig. 3A) are respectively passed through a convolutional neural network of a Taming converter (e.g., CNN as illustrated in fig. 3A) to obtain a first feature matrix (e.g., MF1 as illustrated in fig. 3A) and a second feature matrix (e.g., MF2 as illustrated in fig. 3A); next, calculating a per-position difference between the first feature matrix and the second feature matrix to obtain a differential feature matrix (e.g., MF as illustrated in fig. 3A); then, passing the gravity center change data of the person within the preset time period (e.g., Q as illustrated in fig. 3A) through a context encoder of the Taming converter (e.g., E as illustrated in fig. 3A) including an embedded layer to convert the gravity center change data of the person within the preset time period into a plurality of feature vectors (e.g., V as illustrated in fig. 3A), and two-dimensionally arranging the plurality of feature vectors to obtain a third feature matrix (e.g., MF3 as illustrated in fig. 3A); next, a euclidean distance between the differential feature matrix and the third feature matrix is calculated as a first loss term (e.g., LI1 as illustrated in fig. 3A); then, calculating manifold dimension distribution similarity between the differential feature matrix and the third feature matrix as a second loss term (e.g., LI2 as illustrated in fig. 3A); and finally, calculating a weighted sum of the first and second loss terms as a loss function value to train the convolutional neural network of the Taming converter.
Fig. 3B illustrates an architectural diagram of an inference phase in an audio compensation method according to an embodiment of the present application. As shown in fig. 3B, in the inference phase, in the network structure, first, a waveform diagram of the obtained second audio data (e.g., P as illustrated in fig. 3B) is passed through the convolutional neural network (e.g., CN as illustrated in fig. 3B) that is trained to obtain a feature matrix to be decoded (e.g., MF as illustrated in fig. 3B); and then passing the feature matrix to be decoded through a generator (e.g., GE as illustrated in fig. 3B) to generate a hearing compensation curve corresponding to the second ear.
More specifically, in the training phase, in step S110 and step S120, first audio data that propagates to a first headphone and second audio data that propagates from the first headphone to a second headphone within a preset period of time are acquired, and gravity center change data of a person within the preset period of time is acquired. As previously described, since audio data is first transferred from an audio output device (e.g., a smartphone) to a first earpiece and then from the first earpiece to a second earpiece, this results in a propagation offset between the audio data transferred to the first earpiece and the audio data transferred to the second earpiece, and this offset may create a different bias for the person wearing the hearing aid during movement, it is desirable to invoke an audio compensation profile in the music mode to cancel the movement noise so that the left and right ears can continue to experience the stereo effect. This is essentially a regression problem, i.e. the intelligent regression generation of a hearing compensation curve based on the audio data coming in to the first earpiece and the audio data coming in to the second earpiece to compensate the audio signal of the second earpiece, whereby the influence of motion noise can be eliminated.
Specifically, in the technical scheme of the application, first audio data transmitted from an audio output device to a first earphone and second audio data transmitted from the first earphone to a second earphone in a preset time period are firstly obtained from earphones worn by a moving human body, and gravity center change data of the human body in the preset time period is obtained through a sensor arranged in the earphones. Here, the audio output device includes, but is not limited to, a smart phone, a smart bracelet, and the like.
More specifically, in the training phase, in step S130 and step S140, the waveform of the first audio data and the waveform of the second audio data are respectively passed through a convolutional neural network of a Taming converter to obtain a first feature matrix and a second feature matrix, and a per-position difference between the first feature matrix and the second feature matrix is calculated to obtain a differential feature matrix. It should be understood that, in order to extract the high-dimensional correlation characteristics of the audio signal data obtained by the first earphone and the audio signal data of the second earphone in the time sequence dimension, in the technical solution of the present application, the waveform diagram of the first audio data and the waveform diagram of the second audio data are further processed in the convolutional neural network of the Taming converter, so as to extract the high-dimensional local characteristic distribution of the waveform diagrams of the first audio data and the second audio data, so as to obtain a first feature matrix and a second feature matrix. Then, the difference feature matrix can be obtained by calculating the difference between the first feature matrix and the second feature matrix according to the position, so that the difference feature of the audio feature of the first earphone and the audio feature of the second earphone can be obtained, and the hearing compensation curve of the second ear can be better extracted later.
Specifically, in the embodiment of the present application, a process of passing the waveform diagram of the first audio data and the waveform diagram of the second audio data through a convolutional neural network of a Taming converter to obtain a first feature matrix and a second feature matrix includes: each layer of the convolutional neural network of the Taming converter is used for respectively carrying out convolution processing based on a two-dimensional convolution kernel, pooling processing along a channel dimension and activation processing on input data in forward transmission of the layers so as to output the first feature matrix by the last layer of the convolutional neural network, wherein the input of the first layer of the convolutional neural network is a waveform diagram of the first audio data; and performing convolution processing based on a two-dimensional convolution kernel, pooling processing along a channel dimension and activation processing on input data in forward transfer of layers by using each layer of a convolution neural network of the Taming converter to output the second feature matrix by the last layer of the convolution neural network, wherein the input of the first layer of the convolution neural network is a waveform diagram of the second audio data.
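As a purely illustrative, non-limiting sketch of such a backbone, the following PyTorch-style code shows one way the layer operations described above could be organized; the framework choice, the number of layers, the channel counts, the kernel size, and the way the channel dimension is pooled are all assumptions made for readability and are not prescribed by this embodiment.

```python
# Illustrative sketch only (PyTorch assumed); layer sizes are placeholders, not the patented design.
import torch
import torch.nn as nn

class WaveformCNN(nn.Module):
    """Maps a waveform diagram (an image tensor) to a feature matrix."""
    def __init__(self, in_channels: int = 3):
        super().__init__()
        blocks = []
        for c_in, c_out in [(in_channels, 32), (32, 64), (64, 128)]:
            blocks += [
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),  # 2D convolution
                nn.MaxPool2d(kernel_size=2),                       # pooling
                nn.ReLU(),                                         # activation
            ]
        self.backbone = nn.Sequential(*blocks)

    def forward(self, waveform_image: torch.Tensor) -> torch.Tensor:
        x = self.backbone(waveform_image)   # [B, C, H, W]
        return x.mean(dim=1)                # collapse the channel dimension -> feature matrix [B, H, W]
```

In this sketch, the waveform diagram of the first audio data and the waveform diagram of the second audio data would each be passed through the same backbone, and the per-position difference of the two resulting feature matrices then yields the differential feature matrix.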
More specifically, in the training phase, in step S150, the gravity center change data of the person in the preset time period is passed through the context encoder including the embedded layer of the Taming converter to convert the gravity center change data of the person in the preset time period into a plurality of feature vectors, and the plurality of feature vectors are two-dimensionally arranged to obtain a third feature matrix. That is, further, in the technical solution of the present application, the obtained barycenter change data of the person in the preset time period is encoded in the context encoder including the embedded layer of the Taming converter, so as to convert the barycenter change data of the person in the preset time period into a plurality of feature vectors having global barycenter feature association information. And then, the feature vectors are two-dimensionally arranged to integrate feature association information of gravity center change of people in all preset time periods, so that a third feature matrix is obtained.
Specifically, in the embodiment of the present application, the process of converting the gravity center change data of the person in the preset time period into a plurality of feature vectors by the context encoder of the Taming converter, which comprises an embedded layer, includes: first, the gravity center change data of the person in the preset time period are respectively converted into input vectors by the embedded layer of the context encoder of the Taming converter to obtain a sequence of input vectors; then, globally-based context semantic encoding is performed on the sequence of input vectors using the transformer of the context encoder model of the Taming converter to obtain the plurality of feature vectors.
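A minimal, non-limiting sketch of such a context encoder is given below under the same PyTorch assumption; using a linear projection as the embedded layer and the particular transformer settings are illustrative assumptions rather than the patented design.

```python
# Illustrative sketch only; the embedding scheme and transformer settings are assumptions.
import torch
import torch.nn as nn

class CenterOfGravityEncoder(nn.Module):
    """Converts a center-of-gravity time series into a two-dimensional feature matrix."""
    def __init__(self, embed_dim: int = 64, num_heads: int = 4, num_layers: int = 2):
        super().__init__()
        self.embedding = nn.Linear(1, embed_dim)  # "embedded layer": each scalar sample -> an input vector
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, cog_series: torch.Tensor) -> torch.Tensor:
        # cog_series: [B, T] center-of-gravity samples over the preset time period
        tokens = self.embedding(cog_series.unsqueeze(-1))  # sequence of input vectors [B, T, D]
        vectors = self.encoder(tokens)                     # globally context-encoded feature vectors
        return vectors                                     # two-dimensional arrangement: a [T, D] feature matrix per sample
```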
More specifically, in the training phase, in step S160, the Euclidean distance between the differential feature matrix and the third feature matrix is calculated as a first loss term. That is, in the technical solution of the present application, in order to eliminate the motion noise generated during movement, the Euclidean distance between the differential feature matrix $M_2$ and the third feature matrix $M_3$ is minimized so as to learn the contextual expression of the discrete data and reconstruct the image semantics according to time-sequence confidence.
Specifically, in the embodiment of the present application, the process of calculating, as a first loss term, the euclidean distance between the differential feature matrix and the third feature matrix includes: calculating the Euclidean distance between the differential feature matrix and the third feature matrix as a first loss term according to the following formula;
wherein the formula is:

$$\arg\min \|M_2 - M_3\|_2$$

where $M_2$ is the differential feature matrix and $M_3$ is the third feature matrix.
More specifically, in the training phase, in step S170, the manifold dimension distribution similarity between the differential feature matrix and the third feature matrix is calculated as a second loss term, the manifold dimension distribution similarity being determined based on the ratio between the cosine distance between the differential feature matrix and the third feature matrix and the two-norm of the difference matrix between them. It should be appreciated that the Taming model learns the context expression of the discrete data and reconstructs the image semantics according to time-sequence confidence by minimizing the Euclidean distance between the differential feature matrix $M_2$ and the third feature matrix $M_3$; however, in order to improve the dependency of the feature expression of the differential feature matrix on the feature expression of the third feature matrix, further constraints need to be applied to different pipelines of the Taming model under different dimensions and local view angles.
That is, specifically, in the technical solution of the present application, for the differential feature matrix $M_2$ and the third feature matrix $M_3$ obtained by the Taming model, in addition to minimizing the Euclidean distance between the two, namely $\arg\min \|M_2 - M_3\|_2$, an additional term is added that minimizes the manifold dimension distribution similarity between the two.
Specifically, in the embodiment of the present application, calculating, as the second loss term, manifold dimension distribution similarity between the differential feature matrix and the third feature matrix includes: calculating manifold dimension distribution similarity between the differential feature matrix and the third feature matrix as a second loss term according to the following formula;
wherein the formula is:

$$\frac{\cos(M_2, M_3)}{\|M_2 - M_3\|_2}$$

where $\cos(M_2, M_3)$ represents the cosine distance between the differential feature matrix $M_2$ and the third feature matrix $M_3$.
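For concreteness, the two loss terms may be sketched as follows, again assuming PyTorch tensors of matching shape; interpreting $\cos(M_2, M_3)$ as the cosine similarity of the flattened matrices, and adding a small epsilon to the denominator, are assumptions made for this illustration.

```python
# Illustrative loss-term sketches; cos(.,.) is taken over flattened matrices (assumption).
import torch
import torch.nn.functional as F

def euclidean_loss(m2: torch.Tensor, m3: torch.Tensor) -> torch.Tensor:
    # First loss term: two-norm of the difference between the differential and third feature matrices.
    return torch.norm(m2 - m3, p=2)

def manifold_similarity(m2: torch.Tensor, m3: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Second loss term: ratio of the cosine distance between the matrices
    # to the two-norm of their difference matrix.
    cosine = F.cosine_similarity(m2.flatten(), m3.flatten(), dim=0)
    return cosine / (torch.norm(m2 - m3, p=2) + eps)  # eps added for numerical stability (assumption)
```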
More specifically, in the training phase, in step S180, a weighted sum of the first and second loss terms is calculated as a loss function value to train the convolutional neural network of the Taming converter. That is, in the technical solution of the present application, after the first loss term and the second loss term are obtained, the convolutional neural network of the Taming converter is further trained with a weighted sum of the two as the loss function value. It should be appreciated that, in this way, the corrected differential matrix obtained by the Taming model not only reconstructs the image semantics of the waveform according to time-sequence confidence through the context expression of the discrete data, but also uses the distribution similarity, under different dimensional view angles of the high-dimensional feature space, between the feature expression of the differential feature matrix and the feature expression of the third feature matrix to constrain the correlation of the local feature descriptions between the feature matrices through the geometric similarity of their distributions, so that the dependency between feature expressions is strengthened autoregressively in the training process of the Taming model.
Specifically, in the embodiment of the present application, a process for calculating a weighted sum of the first loss term and the second loss term as a loss function value to train a convolutional neural network of the Taming converter includes: calculating a weighted sum of the first and second loss terms as a loss function value to train a convolutional neural network of the Taming converter with the following formula;
wherein the formula is:

$$L = \alpha \cdot \|M_2 - M_3\|_2 + \beta \cdot \frac{\cos(M_2, M_3)}{\|M_2 - M_3\|_2}$$

where $\alpha$ and $\beta$ are weighting hyperparameters and are initially set such that $\alpha > \beta$.
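Putting the pieces together, one possible, purely illustrative training step is sketched below, reusing the hypothetical WaveformCNN, CenterOfGravityEncoder, euclidean_loss and manifold_similarity from the earlier sketches; the optimizer, the concrete values of α and β, and the assumption that the differential and third feature matrices share a shape are illustrative only.

```python
# Illustrative training step; component names refer to the sketches above, not to a released implementation.
import torch

def training_step(cnn, cog_encoder, optimizer,
                  waveform_img_1, waveform_img_2, cog_series,
                  alpha: float = 1.0, beta: float = 0.5) -> float:
    m1 = cnn(waveform_img_1)       # first feature matrix
    m2 = cnn(waveform_img_2)       # second feature matrix
    diff = m1 - m2                 # per-position difference -> differential feature matrix
    m3 = cog_encoder(cog_series)   # third feature matrix (assumed shape-compatible with diff)
    loss = alpha * euclidean_loss(diff, m3) + beta * manifold_similarity(diff, m3)
    optimizer.zero_grad()
    loss.backward()                # train the convolutional neural network (and encoder) end to end
    optimizer.step()
    return loss.item()
```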
After training is completed, an inference phase is entered. In other words, in the process of inference, the second audio data of the second earphone can be directly input into the convolutional neural network after training is completed to perform feature extraction, so as to obtain a feature matrix to be decoded, and then the feature matrix is passed through a generator to obtain a hearing compensation curve corresponding to the second ear through decoding regression.
Specifically, in the embodiment of the present application, first, second audio data propagated to the second headphones is acquired. Then, the waveform diagram of the second audio data is passed through the trained convolutional neural network to obtain a feature matrix to be decoded. Finally, the feature matrix to be decoded is passed through a generator to generate a hearing compensation curve corresponding to the second ear.
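A minimal inference sketch under the same assumptions is shown below; the generator architecture (here a simple fully connected decoder named CurveGenerator) and the number of frequency points on the compensation curve are hypothetical, since this embodiment does not fix them.

```python
# Illustrative inference sketch; CurveGenerator is a hypothetical stand-in for the generator.
import torch
import torch.nn as nn

class CurveGenerator(nn.Module):
    """Decodes a feature matrix into a hearing compensation curve (gain per frequency point)."""
    def __init__(self, num_freq_points: int = 64):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(256),   # input size inferred from the feature matrix (assumption)
            nn.ReLU(),
            nn.Linear(256, num_freq_points),
        )

    def forward(self, feature_matrix: torch.Tensor) -> torch.Tensor:
        return self.decode(feature_matrix)

@torch.no_grad()
def infer_compensation_curve(trained_cnn: nn.Module, generator: CurveGenerator,
                             second_waveform_image: torch.Tensor) -> torch.Tensor:
    feature_to_decode = trained_cnn(second_waveform_image)  # feature matrix to be decoded
    return generator(feature_to_decode)                     # hearing compensation curve for the second ear
```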
In summary, the audio compensation method according to the embodiment of the present application has been illustrated. It extracts the high-dimensional local feature distribution, in the time dimension, of the audio waveform diagrams of the first earphone and the second earphone through the convolutional neural network of the Taming converter, obtains a differential feature matrix by computing the difference between the two, and mines the global associated features of the person's center-of-gravity change data within the preset time period through an encoder to obtain a third feature matrix. Further, the image semantics of the waveforms are reconstructed according to time-sequence confidence through the context expression of the discrete data, so that the distribution similarity, under different dimensional view angles of the high-dimensional feature space, between the feature expression of the differential feature matrix and the feature expression of the third feature matrix constrains the correlation of the local feature descriptions between the feature matrices through the geometric similarity of their distributions, thereby autoregressively strengthening the dependency between feature expressions during the training of the Taming model.
Exemplary System
Fig. 4 illustrates a block diagram of an audio compensation system according to an embodiment of the present application. As shown in fig. 4, an audio compensation system 400 according to an embodiment of the present application includes: training module 410 and inference module 420.
As shown in fig. 4, the training module 410 includes: an audio data obtaining unit 411, configured to obtain first audio data that propagates to a first earphone and second audio data that propagates from the first earphone to a second earphone within a preset period of time; a barycenter data acquisition unit 412 for acquiring barycenter change data of the person within the preset time period; a feature extraction unit 413, configured to pass the waveform diagram of the first audio data obtained by the audio data obtaining unit 411 and the waveform diagram of the second audio data obtained by the audio data obtaining unit 411 through a convolutional neural network of a Taming converter to obtain a first feature matrix and a second feature matrix, respectively; a difference unit 414 for calculating a per-position difference between the first feature matrix obtained by the feature extraction unit 413 and the second feature matrix obtained by the feature extraction unit 413 to obtain a differential feature matrix; an encoding unit 415 configured to convert the barycenter change data of the person within the preset time period obtained by the barycenter data obtaining unit 412 into a plurality of feature vectors through a context encoder including an embedded layer of the Taming converter, and two-dimensionally arrange the plurality of feature vectors to obtain a third feature matrix; a first loss term calculation unit 416 for calculating, as a first loss term, a euclidean distance between the differential feature matrix obtained by the differential unit 414 and the third feature matrix obtained by the encoding unit 415; a second loss term calculation unit 417 for calculating, as a second loss term, manifold dimension distribution similarity between the differential feature matrix obtained by the differential unit 414 and the third feature matrix obtained by the encoding unit 415, the manifold dimension distribution similarity being determined based on a ratio between a cosine distance between the differential feature matrix and the third feature matrix and a two-norm of a differential matrix between the differential feature matrix and the third feature matrix; and a training unit 418 for calculating a weighted sum of the first loss term obtained by the first loss term calculation unit 416 and the second loss term obtained by the second loss term calculation unit 417 as a loss function value to train the convolutional neural network of the Taming converter.
As shown in fig. 4, the inference module 420 includes: an inferred audio data acquisition unit 421 for acquiring second audio data propagated to a second earphone; a decoding feature matrix generating unit 422, configured to pass the waveform diagram of the second audio data obtained by the inferred audio data obtaining unit 421 through the convolutional neural network after training to obtain a feature matrix to be decoded; and a decoding unit 423 for passing the feature matrix to be decoded obtained by the decoding feature matrix generating unit 422 through a generator to generate a hearing compensation curve corresponding to the second ear.
In one example, in the above-mentioned audio compensation system 400, the feature extraction unit 413 is further configured to: each layer of the convolutional neural network of the Taming converter is used for respectively carrying out convolution processing based on a two-dimensional convolution kernel, pooling processing along a channel dimension and activation processing on input data in forward transmission of the layers so as to output the first feature matrix by the last layer of the convolutional neural network, wherein the input of the first layer of the convolutional neural network is a waveform diagram of the first audio data; and performing convolution processing based on a two-dimensional convolution kernel, pooling processing along a channel dimension and activation processing on input data in forward transfer of layers by using each layer of a convolution neural network of the Taming converter to output the second feature matrix by the last layer of the convolution neural network, wherein the input of the first layer of the convolution neural network is a waveform diagram of the second audio data.
In one example, in the above-mentioned audio compensation system 400, the encoding unit 415 is further configured to: respectively converting gravity center change data of a person in the preset time period into input vectors by using an embedded layer of a context encoder of the Taming converter so as to obtain a sequence of the input vectors; and performing global-based context semantic coding on the sequence of input vectors using a transformer of a context encoder model of the Taming transformer to obtain the plurality of feature vectors.
In one example, in the above-mentioned audio compensation system 400, the first loss term calculating unit 416 is further configured to: calculating the Euclidean distance between the differential feature matrix and the third feature matrix as a first loss term according to the following formula;
wherein the formula is:

$$\arg\min \|M_2 - M_3\|_2$$

where $M_2$ is the differential feature matrix and $M_3$ is the third feature matrix.
In one example, in the above-mentioned audio compensation system 400, the second loss term calculating unit 417 is further configured to: calculating manifold dimension distribution similarity between the differential feature matrix and the third feature matrix as a second loss term according to the following formula;
wherein the formula is:

$$\frac{\cos(M_2, M_3)}{\|M_2 - M_3\|_2}$$

where $\cos(M_2, M_3)$ represents the cosine distance between the differential feature matrix $M_2$ and the third feature matrix $M_3$.
In one example, in the above-described audio compensation system 400, the training unit 418 is further configured to: calculating a weighted sum of the first and second loss terms as a loss function value to train a convolutional neural network of the Taming converter with the following formula;
wherein the formula is:

$$L = \alpha \cdot \|M_2 - M_3\|_2 + \beta \cdot \frac{\cos(M_2, M_3)}{\|M_2 - M_3\|_2}$$

where $\alpha$ and $\beta$ are weighting hyperparameters and are initially set such that $\alpha > \beta$.
Here, it will be understood by those skilled in the art that the specific functions and operations of the respective units and modules in the above-described audio compensation system 400 have been described in detail in the above description of the audio compensation method with reference to fig. 1 to 3B, and thus, repetitive descriptions thereof will be omitted.
As described above, the audio compensation system 400 according to the embodiment of the present application may be implemented in various terminal devices, such as a server running an audio compensation algorithm. In one example, the audio compensation system 400 according to embodiments of the present application may be integrated into the terminal device as a software module and/or hardware module. For example, the audio compensation system 400 may be a software module in the operating system of the terminal device, or may be an application developed for the terminal device; of course, the audio compensation system 400 could equally be one of the many hardware modules of the terminal device.
Alternatively, in another example, the audio compensation system 400 and the terminal device may be separate devices, and the audio compensation system 400 may be connected to the terminal device through a wired and/or wireless network and transmit the interaction information in an agreed data format.
Exemplary electronic device
Next, an electronic device according to an embodiment of the present application is described with reference to fig. 5. As shown in fig. 5, the electronic device 10 includes one or more processors 11 and a memory 12. The processor 11 may be a Central Processing Unit (CPU) or another form of processing unit having data processing and/or instruction execution capabilities, and may control other components in the electronic device 10 to perform desired functions.
Memory 12 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random Access Memory (RAM) and/or cache memory (cache), and the like. The non-volatile memory may include, for example, read Only Memory (ROM), hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer readable storage medium that can be executed by the processor 11 to implement the functions of the audio compensation methods of the various embodiments of the present application described above and/or other desired functions. Various contents such as the second feature matrix, the first penalty term, etc. may also be stored in the computer-readable storage medium.
In one example, the electronic device 10 may further include an input device 13 and an output device 14, which are interconnected via a bus system and/or other forms of connection mechanisms (not shown).
The input device 13 may include, for example, a keyboard, a mouse, and the like.
The output device 14 may output various information to the outside, including a hearing compensation curve corresponding to the second ear, and the like. The output device 14 may include, for example, a display, speakers, a printer, and a communication network and the remote output devices connected thereto.
Of course, for simplicity, only some of the components of the electronic device 10 that are relevant to the present application are shown in fig. 5; components such as buses and input/output interfaces are omitted. In addition, the electronic device 10 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage medium
In addition to the methods and apparatus described above, embodiments of the present application may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps of the audio compensation method according to the various embodiments of the present application described in the "exemplary method" section of the present specification.
The computer program product may include program code for performing the operations of embodiments of the present application, written in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's computing device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the steps of the audio compensation method described in the above "exemplary method" section of the present specification.
The computer-readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection having one or more wires, a portable disk, a hard disk, Random Access Memory (RAM), Read-Only Memory (ROM), Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a portable Compact Disk Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present application have been described above in connection with specific embodiments. However, it should be noted that the advantages, benefits, effects, and the like mentioned in the present application are merely examples and not limitations, and they are not to be considered as necessarily possessed by the various embodiments of the present application. Furthermore, the specific details disclosed herein are provided for purposes of illustration and understanding only and are not intended to be limiting, as the application is not limited to implementation with the specific details disclosed above.
The block diagrams of the devices, apparatuses, and systems referred to in this application are only illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. As will be appreciated by those skilled in the art, these devices, apparatuses, and systems may be connected, arranged, and configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including but not limited to" and are used interchangeably therewith. The terms "or" and "and" as used herein refer to, and are used interchangeably with, the term "and/or," unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to."
It is also noted that in the apparatuses, devices, and methods of the present application, the components or steps may be decomposed and/or recombined. Such decomposition and/or recombination should be considered as equivalent solutions of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of the application to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.

Claims (10)

1. An audio compensation method, comprising:
a training phase comprising:
acquiring first audio data transmitted to a first earphone in a preset time period and second audio data transmitted to a second earphone from the first earphone;
acquiring gravity center change data of a person in the preset time period;
respectively passing the waveform diagram of the first audio data and the waveform diagram of the second audio data through a convolutional neural network of a Taming converter to obtain a first feature matrix and a second feature matrix;
calculating a position-wise difference between the first feature matrix and the second feature matrix to obtain a differential feature matrix;
passing the gravity center change data of the person in the preset time period through a context encoder of the Taming converter, which comprises an embedded layer, to convert the gravity center change data into a plurality of feature vectors, and arranging the feature vectors two-dimensionally to obtain a third feature matrix;
calculating Euclidean distance between the differential feature matrix and the third feature matrix as a first loss term;
calculating manifold dimension distribution similarity between the differential feature matrix and the third feature matrix as a second loss term, wherein the manifold dimension distribution similarity is determined based on a ratio between a cosine distance between the differential feature matrix and the third feature matrix and a two-norm of a differential matrix between the differential feature matrix and the third feature matrix; and
calculating a weighted sum of the first and second loss terms as a loss function value to train the convolutional neural network of the Taming converter; and
an inference phase comprising:
acquiring second audio data for inference that is propagated to a second earphone;
passing a waveform diagram of the inferred second audio data through the trained convolutional neural network to obtain a feature matrix to be decoded; and
passing the feature matrix to be decoded through a generator to generate a hearing compensation curve corresponding to the second ear.
2. The audio compensation method of claim 1, wherein passing the waveform map of the first audio data and the waveform map of the second audio data through a convolutional neural network of a Taming converter to obtain a first feature matrix and a second feature matrix, respectively, comprises:
using each layer of the convolutional neural network of the Taming converter to perform, in the forward pass through the layers, convolution processing based on a two-dimensional convolution kernel, pooling processing along the channel dimension, and activation processing on the input data, so that the last layer of the convolutional neural network outputs the first feature matrix, wherein the input of the first layer of the convolutional neural network is the waveform diagram of the first audio data; and
using each layer of the convolutional neural network of the Taming converter to perform, in the forward pass through the layers, convolution processing based on a two-dimensional convolution kernel, pooling processing along the channel dimension, and activation processing on the input data, so that the last layer of the convolutional neural network outputs the second feature matrix, wherein the input of the first layer of the convolutional neural network is the waveform diagram of the second audio data.
3. The audio compensation method of claim 2, wherein converting the gravity center change data of the person within the preset time period into a plurality of feature vectors by a context encoder of the Taming converter including an embedded layer, comprises:
respectively converting gravity center change data of a person in the preset time period into input vectors by using an embedded layer of a context encoder of the Taming converter so as to obtain a sequence of the input vectors; and
performing global context semantic encoding on the sequence of input vectors using a transformer of the context encoder model of the Taming converter to obtain the plurality of feature vectors.
4. The audio compensation method of claim 3, wherein calculating the euclidean distance between the differential feature matrix and the third feature matrix as a first loss term comprises:
calculating the Euclidean distance between the differential feature matrix and the third feature matrix as a first loss term according to the following formula;
wherein, the formula is:
$$\arg\min \left\| M_2 - M_3 \right\|_2$$
where $M_2$ is the differential feature matrix and $M_3$ is the third feature matrix.
5. The audio compensation method of claim 4, wherein calculating manifold dimension distribution similarity between the differential feature matrix and the third feature matrix as a second loss term comprises:
calculating manifold dimension distribution similarity between the differential feature matrix and the third feature matrix as a second loss term according to the following formula;
wherein, the formula is:
$$L_{2} = \frac{\cos(M_2, M_3)}{\left\| M_2 - M_3 \right\|_2}$$
where $\cos(M_2, M_3)$ represents the cosine distance between the differential feature matrix $M_2$ and the third feature matrix $M_3$.
6. The audio compensation method of claim 5, wherein computing a weighted sum of the first and second loss terms as a loss function value to train a convolutional neural network of the Taming converter comprises:
calculating a weighted sum of the first and second loss terms as a loss function value to train a convolutional neural network of the Taming converter with the following formula;
Wherein, the formula is:
$$L = \alpha \left\| M_2 - M_3 \right\|_2 + \beta \cdot \frac{\cos(M_2, M_3)}{\left\| M_2 - M_3 \right\|_2}$$
where α and β are weighting hyperparameters, initially set such that α > β.
7. An audio compensation system, comprising:
a training module, comprising:
an audio data acquisition unit configured to acquire first audio data propagated to a first headphone and second audio data propagated from the first headphone to a second headphone within a preset period of time;
a gravity center data acquisition unit for acquiring gravity center change data of the person in the preset time period;
the feature extraction unit is used for respectively passing the waveform diagram of the first audio data obtained by the audio data obtaining unit and the waveform diagram of the second audio data obtained by the audio data obtaining unit through a convolutional neural network of a Taming converter to obtain a first feature matrix and a second feature matrix;
a difference unit configured to calculate a per-position difference between the first feature matrix obtained by the feature extraction unit and the second feature matrix obtained by the feature extraction unit to obtain a differential feature matrix;
the encoding unit is used for converting the gravity center change data of the person in the preset time period into a plurality of feature vectors through a context encoder of the Taming converter, wherein the context encoder comprises an embedded layer, and the gravity center change data of the person in the preset time period is obtained by the gravity center data obtaining unit, and the feature vectors are two-dimensionally arranged to obtain a third feature matrix;
a first loss term calculation unit configured to calculate, as a first loss term, a Euclidean distance between the differential feature matrix obtained by the differential unit and the third feature matrix obtained by the encoding unit;
a second loss term calculation unit configured to calculate, as a second loss term, manifold dimension distribution similarity between the differential feature matrix obtained by the differential unit and the third feature matrix obtained by the encoding unit, the manifold dimension distribution similarity being determined based on a ratio between a cosine distance between the differential feature matrix and the third feature matrix and a two-norm of a differential matrix between the differential feature matrix and the third feature matrix; and
a training unit for calculating a weighted sum of the first loss term obtained by the first loss term calculation unit and the second loss term obtained by the second loss term calculation unit as a loss function value to train a convolutional neural network of the Taming converter; and
an inference module comprising:
an inferred audio data acquisition unit configured to acquire inferred second audio data propagated to a second headphone;
a decoding feature matrix generating unit configured to pass the waveform pattern of the second audio data for inference obtained by the inferred audio data obtaining unit through the convolutional neural network that is completed through training to obtain a feature matrix to be decoded; and
a decoding unit configured to pass the feature matrix to be decoded obtained by the decoding feature matrix generating unit through a generator to generate a hearing compensation curve corresponding to the second ear.
8. The audio compensation system of claim 7, wherein the feature extraction unit is further configured to:
use each layer of the convolutional neural network of the Taming converter to perform, in the forward pass through the layers, convolution processing based on a two-dimensional convolution kernel, pooling processing along the channel dimension, and activation processing on the input data, so that the last layer of the convolutional neural network outputs the first feature matrix, wherein the input of the first layer of the convolutional neural network is the waveform diagram of the first audio data; and use each layer of the convolutional neural network of the Taming converter to perform, in the forward pass through the layers, convolution processing based on a two-dimensional convolution kernel, pooling processing along the channel dimension, and activation processing on the input data, so that the last layer of the convolutional neural network outputs the second feature matrix, wherein the input of the first layer of the convolutional neural network is the waveform diagram of the second audio data.
9. The audio compensation system of claim 7, wherein the encoding unit is further configured to:
convert the gravity center change data of the person in the preset time period into input vectors by using the embedded layer of the context encoder of the Taming converter to obtain a sequence of input vectors; and perform global context semantic encoding on the sequence of input vectors using a transformer of the context encoder model of the Taming converter to obtain the plurality of feature vectors.
10. An electronic device, comprising:
a processor; and
a memory having stored therein computer program instructions that, when executed by the processor, cause the processor to perform the audio compensation method of any of claims 1-6.
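For readers who want a concrete picture of the training pipeline recited in claims 1 to 6, the following Python/PyTorch sketch illustrates one possible reading of it; all layer sizes, the gravity-centre embedding, the tensor shapes, and the helper names are assumptions introduced here for illustration and do not reproduce the patented implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class WaveformEncoder(nn.Module):
    """Convolutional branch sketch: each layer applies a 2-D convolution,
    pooling along the channel dimension, and an activation (cf. claim 2).
    The number of layers and the channel counts are assumptions."""
    def __init__(self, hidden_channels=(16, 32)):
        super().__init__()
        # After channel-dimension pooling the tensor returns to one channel,
        # so every convolution here takes a single input channel.
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, c, kernel_size=3, padding=1) for c in hidden_channels]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [B, 1, H, W] waveform diagram
        for conv in self.convs:
            x = conv(x)                      # convolution with a two-dimensional kernel
            x = x.mean(dim=1, keepdim=True)  # pooling along the channel dimension
            x = torch.relu(x)                # activation
        return x.squeeze(1)                  # feature matrix, [B, H, W]

class ContextEncoder(nn.Module):
    """Context-encoder sketch: an embedding of each gravity-centre sample
    followed by a transformer encoder (cf. claim 3). Dimensions are assumptions."""
    def __init__(self, dim=64, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Linear(1, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, g: torch.Tensor) -> torch.Tensor:  # g: [B, T, 1] gravity-centre change data
        return self.encoder(self.embed(g))                # third feature matrix, [B, T, dim]

def training_loss(wave1, wave2, gravity, cnn, ctx, alpha=0.7, beta=0.3):
    """One loss evaluation of the sketched pipeline; assumes the differential
    feature matrix and the third feature matrix share the same flattened size."""
    M1, M2 = cnn(wave1), cnn(wave2)
    M_diff = M1 - M2                                    # position-wise difference
    M3 = ctx(gravity)
    first = torch.norm(M_diff - M3)                     # Euclidean distance (first loss term)
    cosine = F.cosine_similarity(M_diff.flatten(1), M3.flatten(1), dim=1).mean()
    second = cosine / (torch.norm(M_diff - M3) + 1e-12)  # similarity ratio (second loss term)
    return alpha * first + beta * second                 # weighted sum (loss function value)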
CN202210383817.6A 2022-04-12 2022-04-12 Audio compensation method, system and electronic equipment Active CN114900779B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210383817.6A CN114900779B (en) 2022-04-12 2022-04-12 Audio compensation method, system and electronic equipment

Publications (2)

Publication Number Publication Date
CN114900779A CN114900779A (en) 2022-08-12
CN114900779B true CN114900779B (en) 2023-06-06

Family

ID=82717933

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210383817.6A Active CN114900779B (en) 2022-04-12 2022-04-12 Audio compensation method, system and electronic equipment

Country Status (1)

Country Link
CN (1) CN114900779B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112767954A (en) * 2020-06-24 2021-05-07 腾讯科技(深圳)有限公司 Audio encoding and decoding method, device, medium and electronic equipment
CN113612808A (en) * 2021-10-09 2021-11-05 腾讯科技(深圳)有限公司 Audio processing method, related device, storage medium, and program product
CN113990334A (en) * 2021-10-28 2022-01-28 深圳市美恩微电子有限公司 Method, system and electronic device for transmitting voice-coded Bluetooth audio

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066559B (en) * 2017-03-30 2019-12-27 天津大学 Three-dimensional model retrieval method based on deep learning
US11024288B2 (en) * 2018-09-04 2021-06-01 Gracenote, Inc. Methods and apparatus to segment audio and determine audio segment similarities
CN109872730B (en) * 2019-03-14 2021-01-12 广州飞傲电子科技有限公司 Distortion compensation method and model building method for audio data and audio output equipment
CN112016591A (en) * 2020-08-04 2020-12-01 杰创智能科技股份有限公司 Training method of image recognition model and image recognition method
CN113035205B (en) * 2020-12-28 2022-06-07 阿里巴巴(中国)有限公司 Audio packet loss compensation processing method and device and electronic equipment
CN112735454A (en) * 2020-12-30 2021-04-30 北京大米科技有限公司 Audio processing method and device, electronic equipment and readable storage medium
CN113593598B (en) * 2021-08-09 2024-04-12 深圳远虑科技有限公司 Noise reduction method and device for audio amplifier in standby state and electronic equipment
CN113851142A (en) * 2021-10-21 2021-12-28 深圳市美恩微电子有限公司 Noise reduction method and system for high-performance TWS Bluetooth audio chip and electronic equipment

Also Published As

Publication number Publication date
CN114900779A (en) 2022-08-12

Similar Documents

Publication Publication Date Title
JP7337953B2 (en) Speech recognition method and device, neural network training method and device, and computer program
US11482207B2 (en) Waveform generation using end-to-end text-to-waveform system
CN110136690B (en) Speech synthesis method, device and computer readable storage medium
US10679612B2 (en) Speech recognizing method and apparatus
US20220165288A1 (en) Audio signal processing method and apparatus, electronic device, and storage medium
CN111309883B (en) Man-machine dialogue method based on artificial intelligence, model training method and device
CN112988979B (en) Entity identification method, entity identification device, computer readable medium and electronic equipment
CN112151030B (en) Multi-mode-based complex scene voice recognition method and device
CN110060657B (en) SN-based many-to-many speaker conversion method
CN105827504A (en) Voice information transmission method, mobile terminal and system
US20220351348A1 (en) Learning device and method for generating image
CN115376495A (en) Speech recognition model training method, speech recognition method and device
JP2020027609A (en) Response inference method and apparatus
CN114783459B (en) Voice separation method and device, electronic equipment and storage medium
Gao A two-channel attention mechanism-based MobileNetV2 and bidirectional long short memory network for multi-modal dimension dance emotion recognition
CN114900779B (en) Audio compensation method, system and electronic equipment
US11069259B2 (en) Transmodal translation of feature vectors to audio for assistive devices
CN115223244B (en) Haptic motion simulation method, device, apparatus and storage medium
CN114495935A (en) Voice control method and system of intelligent device and electronic device
CN114615610B (en) Audio compensation method and system of audio compensation earphone and electronic equipment
CN117116289B (en) Medical intercom management system for ward and method thereof
CN117173294B (en) Method and system for automatically generating digital person
CN114007169B (en) Audio adjusting method and system for TWS Bluetooth headset and electronic equipment
CN117152317B (en) Optimization method for digital human interface control
CN112967728B (en) End-to-end speech synthesis method and device combined with acoustic transfer function

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant