CN114900779A - Audio compensation method and system and electronic equipment

Audio compensation method and system and electronic equipment

Info

Publication number: CN114900779A (granted as CN114900779B)
Application number: CN202210383817.6A
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: feature matrix, feature, difference, audio data, neural network
Inventors: 李怀子, 李建军, 袁德中
Applicant and assignee: Honsenn Technology Co ltd
Legal status: Active (granted)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 25/00 Deaf-aid sets, i.e. electro-acoustic or electro-mechanical hearing aids; electric tinnitus maskers providing an auditory perception
    • H04R 25/48 Using constructional means for obtaining a desired frequency response
    • H04R 25/50 Customised settings for obtaining desired overall acoustical characteristics
    • H04R 25/505 Customised settings using digital signal processing


Abstract

The application discloses an audio compensation method, an audio compensation system and an electronic device. The high-dimensional local feature distributions, in the time dimension, of the audio waveform diagrams of a first earphone and a second earphone are extracted by the convolutional neural network of a Taming converter, and their position-wise difference is taken to obtain a difference feature matrix. The global correlation features of a person's center of gravity change data within a preset time period are mined by an encoder to obtain a third feature matrix. The image semantics of the waveform are then reconstructed with time-series confidence from the context representation of the discrete data, and the similarity between the distributions of the feature representation describing the difference feature matrix and the feature representation of the third feature matrix, viewed from different dimensional perspectives in the high-dimensional feature space, is used to constrain the correlation of the local feature descriptions between the feature matrices through the geometric similarity of their distributions, thereby strengthening the dependency between the feature representations in an autoregressive manner during training of the Taming model.

Description

Audio compensation method and system and electronic equipment
Technical Field
The present application relates to the field of audio compensation headsets, and more particularly, to an audio compensation method, system and electronic device.
Background
A hearing aid is an instrument for improving hearing. It is essentially a small semiconductor amplifier that amplifies otherwise weak sounds and delivers them to an earphone, so that, through amplification, a wearer whose hearing has declined can hear those sounds again.
Currently, hearing aids need to measure the hearing curves of left and right ears through a hearing test and compensate the hearing at different frequency points according to the hearing curves. However, the current hearing test method is inconvenient, and requires a hospital or a professional institution to test a hearing curve to compensate for hearing, which causes difficulty to people using the hearing aid. In addition, the hearing compensation of the existing hearing aids is only for the conversation frequency band, and cannot compensate the full-frequency band audio, for example, due to the influence of the environmental noise during sports, people wearing the hearing aids have difficulty hearing stereo audio music during sports. Therefore, in order to better eliminate motion noise, an audio compensation method is desired.
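For orientation only, the following sketch illustrates the general idea of compensating audio at different frequency points according to a hearing curve, i.e. applying a per-band gain in the frequency domain. The band edges, gains and function names are illustrative assumptions, not the compensation scheme of this application.

```python
import numpy as np

def apply_compensation(audio: np.ndarray, sample_rate: int,
                       band_edges_hz: list, band_gains_db: list) -> np.ndarray:
    """Apply a piecewise-constant gain curve to an audio signal in the frequency domain."""
    spectrum = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sample_rate)
    gains = np.ones_like(freqs)
    for (lo, hi), gain_db in zip(zip(band_edges_hz[:-1], band_edges_hz[1:]), band_gains_db):
        gains[(freqs >= lo) & (freqs < hi)] = 10.0 ** (gain_db / 20.0)
    return np.fft.irfft(spectrum * gains, n=len(audio))

# Toy hearing-compensation curve: higher bands boosted more strongly (values are made up).
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 4000 * t)
compensated = apply_compensation(signal, sr,
                                 band_edges_hz=[0, 500, 2000, 8000],
                                 band_gains_db=[0.0, 3.0, 9.0])
```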
At present, deep learning and neural networks have been widely applied in the fields of computer vision, natural language processing, speech signal processing, and the like. In addition, deep learning and neural networks also exhibit a level close to or even exceeding that of humans in the fields of image classification, object detection, semantic segmentation, text translation, and the like.
Deep learning and the development of neural networks provide new solutions and schemes for audio compensation.
Disclosure of Invention
The present application is proposed to solve the above-mentioned technical problems. Embodiments of the application provide an audio compensation method, an audio compensation system and an electronic device. The high-dimensional local feature distributions, in the time dimension, of the audio waveform diagrams of a first earphone and a second earphone are extracted by the convolutional neural network of a Taming converter, and their position-wise difference is taken to obtain a difference feature matrix; the global correlation features of a person's center of gravity change data within a preset time period are mined by an encoder to obtain a third feature matrix; further, the image semantics of the waveform are reconstructed with time-series confidence from the context representation of the discrete data, and the similarity between the distributions of the feature representation describing the difference feature matrix and the feature representation of the third feature matrix, viewed from different dimensional perspectives in the high-dimensional feature space, is used to constrain the correlation of the local feature descriptions between the feature matrices through the geometric similarity of their distributions, thereby strengthening the dependency between the feature representations in an autoregressive manner during training of the Taming model.
According to an aspect of the present application, there is provided an audio compensation method, including:
a training phase comprising:
acquiring first audio data transmitted to a first earphone and second audio data transmitted to a second earphone from the first earphone within a preset time period;
acquiring the change data of the center of gravity of the person in the preset time period;
respectively passing the waveform diagram of the first audio data and the waveform diagram of the second audio data through a convolutional neural network of a Taming converter to obtain a first feature matrix and a second feature matrix;
calculating a difference by position between the first feature matrix and the second feature matrix to obtain a difference feature matrix;
passing the gravity center change data of the person within the preset time period through a context encoder, comprising an embedding layer, of the Taming converter to convert the data into a plurality of feature vectors, and two-dimensionally arranging the plurality of feature vectors to obtain a third feature matrix;
calculating a Euclidean distance between the difference feature matrix and the third feature matrix as a first loss item;
calculating manifold dimension distribution similarity between the difference feature matrix and the third feature matrix as a second loss term, the manifold dimension distribution similarity being determined based on a ratio between a cosine distance between the difference feature matrix and the third feature matrix and a two-norm of a difference matrix between the difference feature matrix and the third feature matrix; and
calculating a weighted sum of the first loss term and the second loss term as a loss function value to train the convolutional neural network of the Taming converter; and
an inference phase comprising:
obtaining second audio data propagated to a second headphone;
passing a waveform diagram of second audio data through the trained convolutional neural network to obtain a feature matrix to be decoded; and
passing the feature matrix to be decoded through a generator to generate a hearing compensation curve corresponding to the second ear.
According to another aspect of the present application, there is provided an audio compensation system comprising:
a training module comprising:
an audio data acquisition unit for acquiring, within a preset time period, first audio data transmitted to a first earphone and second audio data transmitted from the first earphone to a second earphone;
the gravity center data acquisition unit is used for acquiring the gravity center change data of the person in the preset time period;
a feature extraction unit for respectively passing the waveform diagram of the first audio data obtained by the audio data acquisition unit and the waveform diagram of the second audio data obtained by the audio data acquisition unit through a convolutional neural network of a Taming converter to obtain a first feature matrix and a second feature matrix;
a difference unit configured to calculate a difference by position between the first feature matrix obtained by the feature extraction unit and the second feature matrix obtained by the feature extraction unit to obtain a difference feature matrix;
an encoding unit for passing the gravity center change data of the person within the preset time period, obtained by the gravity center data acquisition unit, through a context encoder, comprising an embedding layer, of the Taming converter to convert the data into a plurality of feature vectors, and for two-dimensionally arranging the plurality of feature vectors to obtain a third feature matrix;
a first loss term calculation unit configured to calculate a euclidean distance between the difference feature matrix obtained by the difference unit and the third feature matrix obtained by the encoding unit as a first loss term;
a second loss term calculation unit configured to calculate, as a second loss term, manifold dimension distribution similarity between the difference feature matrix obtained by the difference unit and the third feature matrix obtained by the encoding unit, the manifold dimension distribution similarity being determined based on a ratio between a cosine distance between the difference feature matrix and the third feature matrix and a second norm of a difference matrix between the difference feature matrix and the third feature matrix; and
a training unit configured to calculate a weighted sum of the first loss term obtained by the first loss term calculation unit and the second loss term obtained by the second loss term calculation unit as a loss function value to train the convolutional neural network of the Taming converter; and
an inference module comprising:
an inferred audio data acquisition unit for acquiring second audio data propagated to the second headphone;
the decoding feature matrix generating unit is used for enabling the oscillogram of the second audio data obtained by the inferred audio data obtaining unit to pass through the trained convolutional neural network so as to obtain a feature matrix to be decoded; and
and the decoding unit is used for enabling the characteristic matrix to be decoded obtained by the decoding characteristic matrix generating unit to pass through a generator so as to generate a hearing compensation curve corresponding to a second ear.
According to yet another aspect of the present application, there is provided an electronic device including: a processor; and a memory having stored therein computer program instructions which, when executed by the processor, cause the processor to perform the audio compensation method as described above.
According to yet another aspect of the present application, there is provided a computer readable medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the audio compensation method as described above.
According to the audio compensation method, system and electronic device provided by the application, the high-dimensional local feature distributions, in the time dimension, of the audio waveform diagrams of the first earphone and the second earphone are extracted by the convolutional neural network of the Taming converter, and their position-wise difference is taken to obtain a difference feature matrix; the global correlation features of the person's center of gravity change data within the preset time period are mined by the encoder to obtain a third feature matrix; further, the image semantics of the waveform are reconstructed with time-series confidence from the context representation of the discrete data, and the similarity between the distributions of the feature representation describing the difference feature matrix and the feature representation of the third feature matrix, viewed from different dimensional perspectives in the high-dimensional feature space, is used to constrain the correlation of the local feature descriptions between the feature matrices through the geometric similarity of their distributions, thereby strengthening the dependency between the feature representations in an autoregressive manner during training of the Taming model.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.
Fig. 1 is a schematic view of a scene of an audio compensation method according to an embodiment of the present application.
Fig. 2A is a flowchart of a training phase in an audio compensation method according to an embodiment of the present application.
Fig. 2B is a flowchart of an inference phase in an audio compensation method according to an embodiment of the present application.
Fig. 3A is a schematic diagram illustrating an architecture of a training phase in an audio compensation method according to an embodiment of the present disclosure.
Fig. 3B is a schematic diagram illustrating an architecture of an inference stage in an audio compensation method according to an embodiment of the present application.
Fig. 4 is a block diagram of an audio compensation system according to an embodiment of the application.
Fig. 5 is a block diagram of an electronic device according to an embodiment of the application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein.
Overview of a scene
As mentioned above, currently, hearing aids need to measure the hearing curves of the left and right ears through a hearing test and compensate the hearing at different frequency points according to the hearing curves. However, the current hearing test method is inconvenient, and requires a hospital or a professional organization to test a hearing curve to compensate for hearing, which causes difficulty to people using the hearing aid. In addition, the hearing compensation of the existing hearing aids is only for the conversation frequency band, and cannot compensate the full-frequency band audio, for example, due to the influence of the environmental noise during sports, people wearing the hearing aids have difficulty hearing stereo audio music during sports. Therefore, in order to better eliminate motion noise, an audio compensation method is desired.
At present, deep learning and neural networks have been widely applied in the fields of computer vision, natural language processing, speech signal processing, and the like. In addition, deep learning and neural networks also exhibit a level close to or even exceeding that of humans in the fields of image classification, object detection, semantic segmentation, text translation, and the like.
Deep learning and the development of neural networks provide new solutions and schemes for audio compensation.
Accordingly, in the technical solution of the present application, audio data is first transmitted from an audio output device (e.g., a smartphone) to a first earphone and then transmitted from the first earphone to a second earphone. This causes a propagation offset between the audio data transmitted to the first earphone and the audio data transmitted to the second earphone, and the offset may produce different deviations for a person wearing the hearing aid during sports. It is therefore desirable to invoke an audio compensation curve in a music mode to eliminate motion noise, so that the left and right ears can continuously experience a stereo effect. This is essentially a regression problem: a hearing compensation curve is intelligently regressed from the audio data transmitted to the first earphone and the audio data transmitted to the second earphone to compensate the audio signal of the second ear, so that the effects of motion noise can be eliminated.
Specifically, in the technical solution of the application, first audio data transmitted to a first earphone within a preset time period and second audio data transmitted from the first earphone to a second earphone are first acquired from the earphones, and center of gravity change data of the human body within the preset time period is acquired through a sensor. Then, the waveform diagram of the first audio data and the waveform diagram of the second audio data are respectively processed by a convolutional neural network of a Taming converter to extract the high-dimensional local feature distributions of the two waveform diagrams, thereby obtaining a first feature matrix and a second feature matrix. A difference feature matrix is then obtained by calculating the position-wise difference between the first feature matrix and the second feature matrix; this captures the difference between the audio features of the first earphone and those of the second earphone, so that the hearing compensation curve of the second ear can be better extracted later.
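As an illustration only, the following PyTorch-style sketch shows this flow under simplifying assumptions; the toy backbone, tensor shapes and module names are not taken from the patent.

```python
import torch
import torch.nn as nn

class ToyBackbone(nn.Module):
    """Stand-in for the Taming converter's convolutional neural network."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        f = self.features(x)      # (B, 32, H, W)
        return f.mean(dim=1)      # collapse channels into one (H, W) feature matrix per sample

backbone = ToyBackbone()
wave_1 = torch.randn(1, 1, 64, 64)   # waveform diagram of the first audio data, rendered as an image
wave_2 = torch.randn(1, 1, 64, 64)   # waveform diagram of the second audio data
M1 = backbone(wave_1)                # first feature matrix
M2 = backbone(wave_2)                # second feature matrix
M_diff = M1 - M2                     # position-wise difference -> difference feature matrix
```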
The obtained center of gravity change data of the person within the preset time period is then encoded by a context encoder, comprising an embedding layer, of the Taming converter, so that the data is converted into a plurality of feature vectors carrying global center-of-gravity feature correlation information. The plurality of feature vectors are then two-dimensionally arranged to integrate the feature correlation information of the person's center of gravity change over the whole preset time period, thereby obtaining a third feature matrix.
It should be appreciated that the Taming model minimizes the Euclidean distance between the difference feature matrix $M_2$ and the third feature matrix $M_3$ in order to learn the context representation of the discrete data and to reconstruct the image semantics with time-series confidence. However, in order to improve the dependency of the feature representation of the difference feature matrix on the feature representation of the third feature matrix, the different pipelines of the Taming model further need to be constrained under different dimensional and local perspectives.
That is, specifically, for the difference feature matrix $M_2$ and the third feature matrix $M_3$ obtained by the Taming model, in addition to minimizing the Euclidean distance between the two, i.e.

$\operatorname{argmin}\ \|M_2 - M_3\|_2$,

an additional term is added that minimizes the manifold dimension distribution similarity between the two, expressed as:

$\operatorname{argmin}\ \dfrac{\cos(M_2, M_3)}{\|M_2 - M_3\|_2}$

where $\cos(M_2, M_3)$ denotes the cosine distance between the difference feature matrix $M_2$ and the third feature matrix $M_3$.
Thus, the overall constraint of the Taming model is expressed as:

$\operatorname{argmin}\ \left(\alpha\,\|M_2 - M_3\|_2 + \beta\,\dfrac{\cos(M_2, M_3)}{\|M_2 - M_3\|_2}\right)$

where α and β are weighting hyperparameters and are initially set such that α > β.
The weighted sum of these two loss terms can then be computed to train the convolutional neural network of the Taming converter.
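A hedged PyTorch sketch of this two-term constraint is given below; the weights, the epsilon term and the interpretation of $\cos(M_2, M_3)$ as the cosine similarity of the flattened matrices are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def taming_constraint(M2: torch.Tensor, M3: torch.Tensor,
                      alpha: float = 1.0, beta: float = 0.1) -> torch.Tensor:
    """Two-term constraint: alpha * ||M2 - M3||_2 + beta * cos(M2, M3) / ||M2 - M3||_2.
    M2 is the difference feature matrix, M3 the third feature matrix."""
    l2 = torch.norm(M2 - M3, p=2)                               # two-norm of the difference matrix
    # The formula writes the numerator as cos(M2, M3), the "cosine distance" between the two
    # matrices; it is computed here as the cosine similarity of the flattened matrices.
    cos_term = F.cosine_similarity(M2.flatten(), M3.flatten(), dim=0)
    return alpha * l2 + beta * cos_term / (l2 + 1e-8)           # eps avoids division by zero

loss = taming_constraint(torch.randn(8, 8), torch.randn(8, 8), alpha=1.0, beta=0.1)
```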
In this way, the corrected difference matrix obtained by the Taming model not only reconstructs the image semantics of the waveform with time-series confidence through the context representation of the discrete data, but also uses the similarity between the distributions of the feature representation describing the difference feature matrix and the feature representation of the third feature matrix, viewed from different dimensional perspectives in the high-dimensional feature space, to constrain the correlation of the local feature descriptions between the feature matrices through the geometric similarity of their distributions, thereby strengthening the dependency between the feature representations in an autoregressive manner during training of the Taming model.
After the training is finished, in the process of inference, the second audio data of the second earphone can be directly input into the trained convolutional neural network for feature extraction, so that a feature matrix to be decoded is obtained, and the feature matrix is decoded and regressed through a generator to obtain a hearing compensation curve corresponding to the second ear.
Based on this, the present application proposes an audio compensation method, which includes: a training phase and an inference phase. Wherein the training phase comprises the steps of: acquiring first audio data transmitted to a first earphone and second audio data transmitted to a second earphone from the first earphone within a preset time period; acquiring the gravity center change data of the person in the preset time period; respectively passing the oscillogram of the first audio data and the oscillogram of the second audio data through a convolutional neural network of a tag converter to obtain a first feature matrix and a second feature matrix; calculating a difference by position between the first feature matrix and the second feature matrix to obtain a difference feature matrix; the gravity center change data of the people in the preset time period are converted into a plurality of feature vectors through a context encoder containing an embedded layer of the tag converter, and the feature vectors are arranged in a two-dimensional mode to obtain a third feature matrix; calculating Euclidean distance between the differential feature matrix and the third feature matrix to serve as a first loss item; calculating manifold dimension distribution similarity between the difference feature matrix and the third feature matrix as a second loss term, the manifold dimension distribution similarity being determined based on a ratio between a cosine distance between the difference feature matrix and the third feature matrix and a two-norm of a difference matrix between the difference feature matrix and the third feature matrix; and calculating a weighted sum of the first loss term and the second loss term as a loss function value to train a convolutional neural network of the tag converter. Wherein the inference phase comprises the steps of: obtaining second audio data propagated to a second headphone; passing a waveform diagram of second audio data through the trained convolutional neural network to obtain a feature matrix to be decoded; and passing the characteristic matrix to be decoded through a generator to generate a hearing compensation curve corresponding to a second ear.
Fig. 1 illustrates a scene schematic diagram of an audio compensation method according to an embodiment of the present application. As shown in fig. 1, in the training phase of the application scenario, first audio data propagated from an audio output device (e.g., T as illustrated in fig. 1) to a first headphone (e.g., H1 as illustrated in fig. 1) and second audio data propagated from the first headphone to a second headphone (e.g., H2 as illustrated in fig. 1) within a preset period are acquired from headphones worn by a moving human body (e.g., P as illustrated in fig. 1), and barycentric variation data of the human body at this preset period are acquired by a sensor (e.g., R as illustrated in fig. 1) provided in the headphones. Here, the audio output device includes, but is not limited to, a smart phone, a smart band, and the like. Then, the first and second audio data and the gravity center change data of the human body for the preset time period are input into a server (e.g., S as illustrated in fig. 1) in which an audio compensation algorithm is deployed, wherein the server is capable of training a convolutional neural network of the audio compensated tag converter with the first and second audio data and the gravity center change data of the human body for the preset time period based on the audio compensation algorithm.
After training is completed, in an inference phase, first, second audio data propagated to a second headphone (e.g., H2 as illustrated in fig. 1) by the first headphone (e.g., H1 as illustrated in fig. 1) for a preset period of time is acquired from headphones worn by a moving human body (e.g., P as illustrated in fig. 1). The second audio data is then input into a server (e.g., S as illustrated in fig. 1) that is deployed with an audio compensation algorithm, wherein the server is capable of processing the second audio data with the audio compensation algorithm to generate a hearing compensation curve corresponding to a second ear.
Having described the basic principles of the present application, various non-limiting embodiments of the present application will now be described with reference to the accompanying drawings.
Exemplary method
Fig. 2A illustrates a flow chart of a training phase in an audio compensation method according to an embodiment of the application. As shown in fig. 2A, an audio compensation method according to an embodiment of the present application includes: a training phase comprising the steps of: s110, acquiring first audio data transmitted to a first earphone in a preset time period and second audio data transmitted to a second earphone from the first earphone; s120, acquiring the change data of the center of gravity of the person in the preset time period; s130, respectively passing the oscillogram of the first audio data and the oscillogram of the second audio data through a convolutional neural network of a tag converter to obtain a first characteristic matrix and a second characteristic matrix; s140, calculating the position-based difference between the first feature matrix and the second feature matrix to obtain a difference feature matrix; s150, the gravity center change data of the people in the preset time period is converted into a plurality of feature vectors through a context encoder containing an embedded layer of the tag converter, and the feature vectors are arranged in a two-dimensional mode to obtain a third feature matrix; s160, calculating a Euclidean distance between the difference feature matrix and the third feature matrix as a first loss item; s170, calculating manifold dimension distribution similarity between the difference feature matrix and the third feature matrix as a second loss item, wherein the manifold dimension distribution similarity is determined based on a ratio between a cosine distance between the difference feature matrix and the third feature matrix and a two-norm of a difference matrix between the difference feature matrix and the third feature matrix; and S180, calculating a weighted sum of the first loss term and the second loss term as a loss function value to train a convolutional neural network of the tag converter.
Fig. 2B illustrates a flow diagram of an inference phase in an audio compensation method according to an embodiment of the application. As shown in fig. 2B, the audio compensation method according to the embodiment of the present application further includes: an inference phase comprising the steps of: s210, acquiring second audio data transmitted to a second earphone; s220, passing the oscillogram of the second audio data through the trained convolutional neural network to obtain a feature matrix to be decoded; and S230, passing the characteristic matrix to be decoded through a generator to generate a hearing compensation curve corresponding to the second ear.
Fig. 3A illustrates an architecture diagram of a training phase in an audio compensation method according to an embodiment of the present application. As shown in fig. 3A, in the training phase, first, in the network architecture, a waveform diagram of the first audio data (e.g., P1 as illustrated in fig. 3A) and a waveform diagram of the second audio data (e.g., P2 as illustrated in fig. 3A) are respectively passed through a convolutional neural network of a Taming converter (e.g., CNN as illustrated in fig. 3A) to obtain a first feature matrix (e.g., MF1 as illustrated in fig. 3A) and a second feature matrix (e.g., MF2 as illustrated in fig. 3A); then, calculating a difference-by-position between the first feature matrix and the second feature matrix to obtain a difference feature matrix (e.g., MF as illustrated in fig. 3A); then, passing the barycentric change data of the person within the preset time period (e.g., Q as illustrated in fig. 3A) through a context encoder including an embedded layer of the tag converter (e.g., E as illustrated in fig. 3A) to convert the barycentric change data of the person within the preset time period into a plurality of feature vectors (e.g., V as illustrated in fig. 3A), and two-dimensionally arranging the plurality of feature vectors to obtain a third feature matrix (e.g., MF3 as illustrated in fig. 3A); next, a euclidean distance between the difference feature matrix and the third feature matrix is calculated as a first loss term (e.g., LI1 as illustrated in fig. 3A); then, manifold dimension distribution similarity between the difference feature matrix and the third feature matrix is calculated as a second loss term (e.g., LI2 as illustrated in fig. 3A); and, finally, calculating a weighted sum of the first loss term and the second loss term as a loss function value to train a convolutional neural network of the tag converter.
Fig. 3B illustrates an architecture diagram of an inference phase in an audio compensation method according to an embodiment of the present application. As shown in fig. 3B, in the inference phase, first, a waveform diagram (e.g., P as illustrated in fig. 3B) of the obtained second audio data is passed through the trained convolutional neural network (e.g., CN as illustrated in fig. 3B) to obtain a feature matrix to be decoded (e.g., MF as illustrated in fig. 3B); and then, passing the feature matrix to be decoded through a generator (e.g., GE as illustrated in fig. 3B) to generate a hearing compensation curve corresponding to the second ear.
More specifically, in the training phase, in step S110 and step S120, first audio data propagated to a first headphone and second audio data propagated from the first headphone to a second headphone within a preset time period are acquired, and center of gravity change data of a person within the preset time period is acquired. As mentioned above, since audio data is first transmitted from an audio output device (e.g. a smartphone) to a first earpiece and then from the first earpiece to a second earpiece, which results in a propagation offset between the audio data transmitted to the first earpiece and the audio data transmitted to the second earpiece, and this offset may cause different deviations for people wearing hearing aids during sports, it is desirable to invoke an audio compensation curve in music mode to eliminate motion noise so that the left and right ears can continuously experience the effect of stereo. This is essentially a regression problem, i.e. a hearing compensation curve is intelligently regressed based on the audio data transmitted to the first earpiece and the audio data transmitted to the second earpiece to compensate the audio signal of the second ear, so that the effects of motion noise can be eliminated.
Specifically, in the technical scheme of the application, first audio data transmitted to a first earphone by an audio output device in a preset time period and second audio data transmitted to a second earphone by the first earphone are firstly acquired from an earphone worn by a moving human body, and gravity center change data of the human body in the preset time period is acquired through a sensor arranged in the earphone. Here, the audio output device includes, but is not limited to, a smart phone, a smart band, and the like.
More specifically, in the training phase, in steps S130 and S140, the waveform diagram of the first audio data and the waveform diagram of the second audio data are respectively passed through a convolutional neural network of the Taming converter to obtain a first feature matrix and a second feature matrix, and the position-wise difference between the first feature matrix and the second feature matrix is calculated to obtain a difference feature matrix. It should be understood that, in order to extract the high-dimensional correlation features of the audio signal data of the first earphone and of the second earphone in the time-series dimension, the technical solution of the present application processes the waveform diagram of the first audio data and the waveform diagram of the second audio data respectively through the convolutional neural network of the Taming converter, extracting the high-dimensional local feature distribution of each waveform diagram and thereby obtaining the first feature matrix and the second feature matrix. A difference feature matrix is then obtained by calculating the position-wise difference between the two, which captures the difference between the audio features of the first earphone and those of the second earphone, so that the hearing compensation curve of the second ear can be better extracted later.
Specifically, in this embodiment of the present application, the process of respectively passing the waveform diagram of the first audio data and the waveform diagram of the second audio data through the convolutional neural network of the Taming converter to obtain the first feature matrix and the second feature matrix includes: in the forward pass, each layer of the convolutional neural network of the Taming converter performs convolution processing based on a two-dimensional convolution kernel, pooling processing along the channel dimension, and activation processing on its input data, so that the last layer of the convolutional neural network outputs the first feature matrix, the input of the first layer being the waveform diagram of the first audio data; and, likewise, each layer performs convolution processing based on a two-dimensional convolution kernel, pooling processing along the channel dimension, and activation processing on its input data, so that the last layer outputs the second feature matrix, the input of the first layer being the waveform diagram of the second audio data.
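A minimal sketch of one such layer follows; the channel counts, kernel size and the grouped channel-pooling factor are illustrative assumptions, not values given in the text.

```python
import torch
import torch.nn as nn

class ConvChannelPoolLayer(nn.Module):
    """One layer as described: 2D convolution, pooling along the channel dimension, activation."""
    def __init__(self, in_channels: int, out_channels: int, pooled_channels: int):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.pooled_channels = pooled_channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv(x)                                   # convolution with a 2D kernel
        b, c, h, w = x.shape
        # Pooling along the channel dimension: average groups of channels together.
        x = x.view(b, self.pooled_channels, c // self.pooled_channels, h, w).mean(dim=2)
        return torch.relu(x)                               # activation

layer = ConvChannelPoolLayer(in_channels=1, out_channels=16, pooled_channels=8)
feature = layer(torch.randn(1, 1, 64, 64))                # output shape (1, 8, 64, 64)
```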
More specifically, in the training phase, in step S150, the center of gravity change data of the person within the preset time period is passed through the context encoder, comprising an embedding layer, of the Taming converter to convert the data into a plurality of feature vectors, and the plurality of feature vectors are two-dimensionally arranged to obtain a third feature matrix. That is, in the technical solution of the present application, the obtained center of gravity change data of the person within the preset time period is encoded by a context encoder comprising an embedding layer of the Taming converter, so as to convert the data into a plurality of feature vectors carrying global center-of-gravity feature correlation information. The plurality of feature vectors are then two-dimensionally arranged to integrate the feature correlation information of the person's center of gravity change over the whole preset time period, thereby obtaining a third feature matrix.
Specifically, in this embodiment of the present application, the process of passing the center of gravity change data of the person within the preset time period through the context encoder, comprising an embedding layer, of the Taming converter to convert the data into a plurality of feature vectors includes: first, each item of the center of gravity change data within the preset time period is converted into an input vector by the embedding layer of the context encoder of the Taming converter, yielding a sequence of input vectors; then, the converter (transformer) of the context encoder model of the Taming converter performs global context-based semantic encoding on the sequence of input vectors to obtain the plurality of feature vectors.
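A hedged sketch of such an encoder is given below, treating the center of gravity samples as a sequence; the linear projection standing in for the embedding layer, the transformer dimensions and the sequence length are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Embedding (here a linear projection) followed by a transformer encoder
    that applies global context-based semantic encoding to the sequence."""
    def __init__(self, in_dim: int = 3, d_model: int = 32, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_model)             # stands in for the embedding layer
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)

    def forward(self, gravity_seq: torch.Tensor) -> torch.Tensor:
        tokens = self.embed(gravity_seq)                     # (B, T, d_model) sequence of input vectors
        feats = self.encoder(tokens)                         # (B, T, d_model) context-aware feature vectors
        return feats.squeeze(0)                              # two-dimensional arrangement -> third feature matrix (T, d_model)

encoder = ContextEncoder()
gravity_data = torch.randn(1, 16, 3)   # 16 center-of-gravity samples (x, y, z) in the preset time period
M3 = encoder(gravity_data)             # third feature matrix, shape (16, 32)
```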
More specifically, in the training phase, in step S160, the Euclidean distance between the difference feature matrix and the third feature matrix is calculated as a first loss term. That is, in order to eliminate noise generated during motion, the Euclidean distance between the difference feature matrix $M_2$ and the third feature matrix $M_3$ is minimized, so as to learn the context representation of the discrete data and to reconstruct the image semantics with time-series confidence.
Specifically, in the embodiment of the present application, the Euclidean distance between the difference feature matrix and the third feature matrix is calculated as the first loss term according to the following formula:

$\operatorname{argmin}\ \|M_2 - M_3\|_2$

where $M_2$ is the difference feature matrix and $M_3$ is the third feature matrix.
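As a point of reference only, a minimal PyTorch-style sketch of this first loss term is shown below; the tensor shapes are illustrative.

```python
import torch

def first_loss_term(M2: torch.Tensor, M3: torch.Tensor) -> torch.Tensor:
    """Euclidean distance ||M2 - M3||_2 between the difference feature matrix M2 and the
    third feature matrix M3 (the 2-norm of the flattened element-wise difference)."""
    return torch.norm(M2 - M3, p=2)

l1 = first_loss_term(torch.randn(8, 8), torch.randn(8, 8))
```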
More specifically, in the training phase, in step S170, the manifold dimension distribution similarity between the difference feature matrix and the third feature matrix is calculated as a second loss term, the manifold dimension distribution similarity being determined based on the ratio between the cosine distance between the difference feature matrix and the third feature matrix and the two-norm of the difference matrix between them. It should be understood that, while the Taming model learns the context representation of the discrete data and reconstructs the image semantics with time-series confidence by minimizing the Euclidean distance between the difference feature matrix $M_2$ and the third feature matrix $M_3$, in order to improve the dependency of the feature representation of the difference feature matrix on the feature representation of the third feature matrix, the different pipelines of the Taming model further need to be constrained under different dimensional and local perspectives.
That is, specifically, in the technical solution of the present application, for the difference feature matrix $M_2$ and the third feature matrix $M_3$, in addition to minimizing the Euclidean distance between the two, i.e. $\operatorname{argmin}\ \|M_2 - M_3\|_2$, an additional term is added that minimizes the manifold dimension distribution similarity between the two.
Specifically, in this embodiment of the present application, the manifold dimension distribution similarity between the difference feature matrix and the third feature matrix is calculated as the second loss term according to the following formula:

$\operatorname{argmin}\ \dfrac{\cos(M_2, M_3)}{\|M_2 - M_3\|_2}$

where $\cos(M_2, M_3)$ denotes the cosine distance between the difference feature matrix $M_2$ and the third feature matrix $M_3$.
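A corresponding sketch of the second loss term, under the same illustrative assumptions as above (cosine term computed over the flattened matrices, small epsilon added for numerical safety):

```python
import torch
import torch.nn.functional as F

def second_loss_term(M2: torch.Tensor, M3: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Manifold dimension distribution similarity: cos(M2, M3) / ||M2 - M3||_2,
    with M2 the difference feature matrix and M3 the third feature matrix."""
    cos_term = F.cosine_similarity(M2.flatten(), M3.flatten(), dim=0)
    return cos_term / (torch.norm(M2 - M3, p=2) + eps)

l2 = second_loss_term(torch.randn(8, 8), torch.randn(8, 8))
```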
More specifically, in the training phase, in step S180, a weighted sum of the first loss term and the second loss term is calculated as the loss function value to train the convolutional neural network of the Taming converter. That is, after obtaining the first loss term and the second loss term, the convolutional neural network of the Taming converter is trained with the weighted sum of the two as the loss function value. It should be understood that, in this way, the corrected difference matrix obtained by the Taming model not only reconstructs the image semantics of the waveform with time-series confidence through the context representation of the discrete data, but also uses the similarity between the distributions of the feature representation describing the difference feature matrix and the feature representation of the third feature matrix, viewed from different dimensional perspectives in the high-dimensional feature space, to constrain the correlation of the local feature descriptions between the feature matrices through the geometric similarity of their distributions, thereby strengthening the dependency between the feature representations in an autoregressive manner during training of the Taming model.
Specifically, in the embodiment of the present application, the weighted sum of the first loss term and the second loss term is calculated as the loss function value according to the following formula to train the convolutional neural network of the Taming converter:

$\operatorname{argmin}\ \left(\alpha\,\|M_2 - M_3\|_2 + \beta\,\dfrac{\cos(M_2, M_3)}{\|M_2 - M_3\|_2}\right)$

where α and β are weighting hyperparameters and are initially set such that α > β.
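Putting the pieces together, the following self-contained sketch shows one training step under this weighted-sum loss. The toy networks, shapes, optimizer and weights are assumptions made for illustration and are not the patent's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins producing feature matrices of matching shape.
cnn = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 1, 3, padding=1))
ctx_encoder = nn.Linear(3, 16)           # maps each of 16 gravity samples to a 16-dim vector

optimizer = torch.optim.Adam(list(cnn.parameters()) + list(ctx_encoder.parameters()), lr=1e-4)
alpha, beta = 1.0, 0.1                   # weighting hyperparameters, alpha > beta initially

def train_step(wave_1, wave_2, gravity_seq):
    M1 = cnn(wave_1).squeeze(1)          # first feature matrix  (B, 16, 16)
    M2 = cnn(wave_2).squeeze(1)          # second feature matrix (B, 16, 16)
    M_diff = M1 - M2                     # difference feature matrix
    M3 = ctx_encoder(gravity_seq)        # third feature matrix  (B, 16, 16)
    l2 = torch.norm(M_diff - M3, p=2)    # first loss term: Euclidean distance
    cos_term = F.cosine_similarity(M_diff.flatten(), M3.flatten(), dim=0)
    manifold = cos_term / (l2 + 1e-8)    # second loss term: manifold dimension distribution similarity
    loss = alpha * l2 + beta * manifold  # weighted sum as the loss function value
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

loss_val = train_step(torch.randn(1, 1, 16, 16), torch.randn(1, 1, 16, 16), torch.randn(1, 16, 3))
```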
After training is completed, the inference phase is entered. That is, in the inference process, the second audio data of the second earphone may be directly input into the trained convolutional neural network for feature extraction, so as to obtain a feature matrix to be decoded, and then the feature matrix is decoded and regressed by the generator to obtain the hearing compensation curve corresponding to the second ear.
Specifically, in the present embodiment, first, the second audio data propagated to the second headphone is acquired. Then, the waveform diagram of the second audio data is passed through the trained convolutional neural network to obtain a feature matrix to be decoded. Finally, the feature matrix to be decoded is passed through a generator to generate a hearing compensation curve corresponding to the second ear.
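A hedged sketch of this inference path follows, with a simple fully connected network standing in for the generator; the dimensions, layer choices and the interpretation of the output as per-frequency-point gains are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CompensationGenerator(nn.Module):
    """Illustrative generator: maps the flattened feature matrix to gains at N frequency points."""
    def __init__(self, feat_dim: int = 16 * 16, n_freq_points: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, n_freq_points))

    def forward(self, feature_matrix: torch.Tensor) -> torch.Tensor:
        return self.net(feature_matrix.flatten(1))          # hearing compensation curve (one value per frequency point)

trained_cnn = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 1, 3, padding=1))
generator = CompensationGenerator()

wave_2 = torch.randn(1, 1, 16, 16)                  # waveform diagram of the second audio data
feature_to_decode = trained_cnn(wave_2).squeeze(1)  # feature matrix to be decoded
compensation_curve = generator(feature_to_decode)   # hearing compensation curve for the second ear
```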
In summary, the audio compensation method according to the embodiment of the present application has been illustrated. It extracts the high-dimensional local feature distributions, in the time dimension, of the audio waveform diagrams of the first and second earphones through the convolutional neural network of the Taming converter and takes their position-wise difference to obtain a difference feature matrix, mines the global correlation features of the person's center of gravity change data within the preset time period through the encoder to obtain a third feature matrix, and further reconstructs the image semantics of the waveform with time-series confidence from the context representation of the discrete data, using the similarity between the distributions of the feature representation describing the difference feature matrix and the feature representation of the third feature matrix, viewed from different dimensional perspectives in the high-dimensional feature space, to constrain the correlation of the local feature descriptions between the feature matrices through the geometric similarity of their distributions, thereby strengthening the dependency between the feature representations in an autoregressive manner during training of the Taming model.
Exemplary System
FIG. 4 illustrates a block diagram of an audio compensation system according to an embodiment of the application. As shown in fig. 4, the audio compensation system 400 according to the embodiment of the present application includes: a training module 410 and an inference module 420.
As shown in fig. 4, the training module 410 includes: an audio data acquisition unit 411, configured to acquire first audio data that is transmitted to a first earphone and second audio data that is transmitted from the first earphone to a second earphone within a preset time period; a center of gravity data acquisition unit 412, configured to acquire center of gravity change data of the person within the preset time period; a feature extraction unit 413, configured to pass the waveform diagram of the first audio data obtained by the audio data acquisition unit 411 and the waveform diagram of the second audio data obtained by the audio data acquisition unit 411 respectively through a convolutional neural network of a Taming converter to obtain a first feature matrix and a second feature matrix; a difference unit 414, configured to calculate the position-wise difference between the first feature matrix obtained by the feature extraction unit 413 and the second feature matrix obtained by the feature extraction unit 413 to obtain a difference feature matrix; an encoding unit 415, configured to pass the center of gravity change data of the person within the preset time period, obtained by the center of gravity data acquisition unit 412, through a context encoder, comprising an embedding layer, of the Taming converter to convert the data into a plurality of feature vectors, and to two-dimensionally arrange the plurality of feature vectors to obtain a third feature matrix; a first loss term calculation unit 416, configured to calculate the Euclidean distance between the difference feature matrix obtained by the difference unit 414 and the third feature matrix obtained by the encoding unit 415 as a first loss term; a second loss term calculation unit 417, configured to calculate, as a second loss term, the manifold dimension distribution similarity between the difference feature matrix obtained by the difference unit 414 and the third feature matrix obtained by the encoding unit 415, the manifold dimension distribution similarity being determined based on the ratio between the cosine distance between the difference feature matrix and the third feature matrix and the two-norm of the difference matrix between them; and a training unit 418, configured to calculate a weighted sum of the first loss term obtained by the first loss term calculation unit 416 and the second loss term obtained by the second loss term calculation unit 417 as a loss function value to train the convolutional neural network of the Taming converter.
As shown in fig. 4, the inference module 420 includes: an inferred audio data acquisition unit 421 for acquiring second audio data propagated to the second headphone; a decoding feature matrix generating unit 422, configured to pass the oscillogram of the second audio data obtained by the inferred audio data obtaining unit 421 through the trained convolutional neural network to obtain a feature matrix to be decoded; and a decoding unit 423 for passing the feature matrix to be decoded obtained by the decoding feature matrix generating unit 422 through a generator to generate a hearing compensation curve corresponding to a second ear.
In an example, in the above audio compensation system 400, the feature extraction unit 413 is further configured to: in the forward pass, use each layer of the convolutional neural network of the Taming converter to perform convolution processing based on a two-dimensional convolution kernel, pooling processing along the channel dimension, and activation processing on its input data, so that the last layer of the convolutional neural network outputs the first feature matrix, the input of the first layer being the waveform diagram of the first audio data; and, likewise, use each layer of the convolutional neural network of the Taming converter to perform convolution processing based on a two-dimensional convolution kernel, pooling processing along the channel dimension, and activation processing on its input data, so that the last layer outputs the second feature matrix, the input of the first layer being the waveform diagram of the second audio data.
In an example, in the audio compensation system 400, the encoding unit 415 is further configured to: convert the center of gravity change data of the person within the preset time period into input vectors by using the embedding layer of the context encoder of the Taming converter, so as to obtain a sequence of input vectors; and perform global context-based semantic encoding on the sequence of input vectors using the converter (transformer) of the context encoder model of the Taming converter to obtain the plurality of feature vectors.
In one example, in the above audio compensation system 400, the first loss term calculating unit 416 is further configured to: calculating a Euclidean distance between the difference feature matrix and the third feature matrix as a first loss term according to the following formula;
wherein the formula is:
$\operatorname{argmin}\ \|M_2 - M_3\|_2$
where $M_2$ is the difference feature matrix and $M_3$ is the third feature matrix.
In one example, in the audio compensation system 400, the second loss term calculation unit 417 is further configured to: calculating manifold dimension distribution similarity between the difference feature matrix and the third feature matrix as a second loss term according to the following formula;
wherein the formula is:
$\operatorname{argmin}\ \dfrac{\cos(M_2, M_3)}{\|M_2 - M_3\|_2}$
where $\cos(M_2, M_3)$ denotes the cosine distance between the difference feature matrix $M_2$ and the third feature matrix $M_3$.
In one example, in the above audio compensation system 400, the training unit 418 is further configured to: train the convolutional neural network of the Taming converter by calculating a weighted sum of the first loss term and the second loss term as a loss function value according to the following formula;
wherein the formula is:
$\operatorname{argmin}\ \left(\alpha\,\|M_2 - M_3\|_2 + \beta\,\dfrac{\cos(M_2, M_3)}{\|M_2 - M_3\|_2}\right)$
where α and β are weighting hyperparameters and are initially set such that α > β.
Here, it can be understood by those skilled in the art that the specific functions and operations of the respective units and modules in the audio compensation system 400 have been described in detail in the above description of the audio compensation method with reference to fig. 1 to 3B, and thus, a repetitive description thereof will be omitted.
As described above, the audio compensation system 400 according to the embodiment of the present application may be implemented in various terminal devices, such as a server running the audio compensation algorithm. In one example, the audio compensation system 400 according to the embodiments of the present application may be integrated into the terminal device as one software module and/or hardware module. For example, the audio compensation system 400 may be a software module in the operating system of the terminal device, or may be an application developed for the terminal device; of course, the audio compensation system 400 may also be one of many hardware modules of the terminal device.
Alternatively, in another example, the audio compensation system 400 and the terminal device may be separate devices, and the audio compensation system 400 may be connected to the terminal device through a wired and/or wireless network and transmit the interactive information according to the agreed data format.
Exemplary electronic device
Next, an electronic apparatus according to an embodiment of the present application is described with reference to fig. 5. As shown in fig. 5, the electronic device 10 includes one or more processors 11 and memory 12. The processor 11 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 10 to perform desired functions.
Memory 12 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 11 to implement the functions of the audio compensation method of the various embodiments of the present application described above and/or other desired functions. Various contents such as the second feature matrix, the first loss item, etc. may also be stored in the computer-readable storage medium.
In one example, the electronic device 10 may further include: an input device 13 and an output device 14, which are interconnected by a bus device and/or other form of connection mechanism (not shown).
The input device 13 may include, for example, a keyboard, a mouse, and the like.
The output device 14 may output various information to the outside, including a hearing compensation curve corresponding to the second ear, and the like. The output devices 14 may include, for example, a display, speakers, a printer, and a communication network and its connected remote output devices, among others.
Of course, for simplicity, only some of the components of the electronic device 10 relevant to the present application are shown in fig. 5, and components such as buses, input/output interfaces, and the like are omitted. In addition, the electronic device 10 may include any other suitable components depending on the particular application.
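Since the electronic device 10 is ultimately what executes the inference phase, a compact sketch of that phase is given below; the function and argument names are assumptions for illustration, and both the trained convolutional neural network and the generator are treated as opaque trained modules.

```python
import torch

def infer_compensation_curve(waveform_image: torch.Tensor,
                             trained_cnn: torch.nn.Module,
                             generator: torch.nn.Module) -> torch.Tensor:
    """Inference phase: second-ear waveform diagram -> hearing compensation curve."""
    trained_cnn.eval()
    generator.eval()
    with torch.no_grad():
        feature_matrix = trained_cnn(waveform_image)   # feature matrix to be decoded
        curve = generator(feature_matrix)              # hearing compensation curve for the second ear
    return curve
```

The resulting curve is the kind of content the output device 14 may then output, for example to a display or onward to the second earphone.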
Exemplary computer program product and computer-readable storage medium
In addition to the above-described methods and apparatus, embodiments of the present application may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps of the audio compensation method according to various embodiments of the present application described in the "exemplary methods" section of this specification above.
The computer program product may be written with program code for performing the operations of embodiments of the present application in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the steps in the audio compensation method described in the "exemplary methods" section above in this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present application in conjunction with specific embodiments; however, it is noted that the advantages, effects, and the like mentioned in the present application are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present application. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description only, and is not intended to be exhaustive or to limit the application to the precise details disclosed.
The block diagrams of devices, apparatuses, and systems referred to in this application are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, the term "and/or," unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to."
It should also be noted that in the devices, apparatuses, and methods of the present application, the components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (10)

1. An audio compensation method, comprising:
a training phase comprising:
acquiring, within a preset time period, first audio data transmitted to a first earphone and second audio data transmitted from the first earphone to a second earphone;
acquiring center-of-gravity change data of a person within the preset time period;
passing the waveform diagram of the first audio data and the waveform diagram of the second audio data respectively through a convolutional neural network of a tag converter to obtain a first feature matrix and a second feature matrix;
calculating a difference by position between the first feature matrix and the second feature matrix to obtain a difference feature matrix;
passing the center-of-gravity change data of the person within the preset time period through a context encoder, containing an embedding layer, of the tag converter to convert the center-of-gravity change data into a plurality of feature vectors, and two-dimensionally arranging the plurality of feature vectors to obtain a third feature matrix;
calculating a Euclidean distance between the difference feature matrix and the third feature matrix as a first loss item;
calculating manifold dimension distribution similarity between the difference feature matrix and the third feature matrix as a second loss term, the manifold dimension distribution similarity being determined based on a ratio between a cosine distance between the difference feature matrix and the third feature matrix and a two-norm of a difference matrix between the difference feature matrix and the third feature matrix; and
computing a weighted sum of the first loss term and the second loss term as a loss function value to train the convolutional neural network of the tag converter; and
an inference phase comprising:
obtaining second audio data propagated to the second earphone;
passing a waveform diagram of the second audio data through the trained convolutional neural network to obtain a feature matrix to be decoded; and
passing the feature matrix to be decoded through a generator to generate a hearing compensation curve corresponding to a second ear.
2. The audio compensation method of claim 1, wherein passing the waveform diagram of the first audio data and the waveform diagram of the second audio data respectively through the convolutional neural network of the tag converter to obtain the first feature matrix and the second feature matrix comprises:
using each layer of the convolutional neural network of the tag converter to respectively perform, in a forward pass of the layers, convolution processing based on a two-dimensional convolution kernel, pooling processing along a channel dimension, and activation processing on input data, so that the last layer of the convolutional neural network outputs the first feature matrix, wherein the input of the first layer of the convolutional neural network is the waveform diagram of the first audio data; and
using each layer of the convolutional neural network of the tag converter to respectively perform, in a forward pass of the layers, convolution processing based on a two-dimensional convolution kernel, pooling processing along a channel dimension, and activation processing on input data, so that the last layer of the convolutional neural network outputs the second feature matrix, wherein the input of the first layer of the convolutional neural network is the waveform diagram of the second audio data.
3. The audio compensation method of claim 2, wherein passing the center-of-gravity change data of the person within the preset time period through the context encoder, containing the embedding layer, of the tag converter to convert the center-of-gravity change data into the plurality of feature vectors comprises:
converting the center-of-gravity change data of the person within the preset time period into input vectors by using the embedding layer of the context encoder of the tag converter to obtain a sequence of input vectors; and
performing globally context-based semantic encoding on the sequence of input vectors by using a converter of a context encoder model of the tag converter to obtain the plurality of feature vectors.
4. The audio compensation method of claim 3, wherein calculating a Euclidean distance between the difference feature matrix and the third feature matrix as a first loss term comprises:
calculating a Euclidean distance between the difference feature matrix and the third feature matrix as a first loss term according to the following formula;
wherein the formula is:
argmin ‖M₂ − M₃‖₂
wherein M₂ is the difference feature matrix and M₃ is the third feature matrix.
5. The audio compensation method of claim 4, wherein calculating manifold dimension distribution similarity between the difference feature matrix and the third feature matrix as a second loss term comprises:
calculating manifold dimension distribution similarity between the difference feature matrix and the third feature matrix as a second loss term according to the following formula;
wherein the formula is:
cos(M₂, M₃) / ‖M₂ − M₃‖₂
wherein cos(M₂, M₃) represents the cosine distance between the difference feature matrix M₂ and the third feature matrix M₃.
6. The audio compensation method of claim 5, wherein computing a weighted sum of the first loss term and the second loss term as a loss function value to train the convolutional neural network of the tag converter comprises:
training the convolutional neural network of the tag converter by calculating a weighted sum of the first loss term and the second loss term as a loss function value according to the following formula;
wherein the formula is:
loss = α‖M₂ − M₃‖₂ + β · cos(M₂, M₃)/‖M₂ − M₃‖₂
where α and β are weighting hyperparameters, initially set such that α > β.
7. An audio compensation system, comprising:
a training module comprising:
the earphone comprises an audio data acquisition unit, a first earphone and a second earphone, wherein the audio data acquisition unit is used for acquiring first audio data which are transmitted to a first earphone within a preset time period and second audio data which are transmitted to a second earphone from the first earphone;
the gravity center data acquisition unit is used for acquiring the gravity center change data of the person in the preset time period;
the feature extraction unit is used for enabling the oscillogram of the first audio data obtained by the audio data obtaining unit and the oscillogram of the second audio data obtained by the audio data obtaining unit to respectively pass through a convolutional neural network of a tag converter so as to obtain a first feature matrix and a second feature matrix;
a difference unit configured to calculate a difference by position between the first feature matrix obtained by the feature extraction unit and the second feature matrix obtained by the feature extraction unit to obtain a difference feature matrix;
the coding unit is used for enabling the gravity center change data of the person in the preset time period, which is obtained by the gravity center data obtaining unit, to pass through a context coder comprising an embedded layer of the tag converter so as to convert the gravity center change data of the person in the preset time period into a plurality of feature vectors, and two-dimensionally arranging the feature vectors so as to obtain a third feature matrix;
a first loss term calculation unit configured to calculate a euclidean distance between the difference feature matrix obtained by the difference unit and the third feature matrix obtained by the encoding unit as a first loss term;
a second loss term calculation unit configured to calculate, as a second loss term, manifold dimension distribution similarity between the difference feature matrix obtained by the difference unit and the third feature matrix obtained by the encoding unit, the manifold dimension distribution similarity being determined based on a ratio between a cosine distance between the difference feature matrix and the third feature matrix and a second norm of a difference matrix between the difference feature matrix and the third feature matrix; and
a training unit configured to calculate a weighted sum of the first loss term obtained by the first loss term calculation unit and the second loss term obtained by the second loss term calculation unit as a loss function value to train a convolutional neural network of the tag converter; and
an inference module comprising:
an inference audio data acquisition unit for acquiring second audio data propagated to the second earphone;
a decoding feature matrix generation unit for passing the waveform diagram of the second audio data obtained by the inference audio data acquisition unit through the trained convolutional neural network to obtain a feature matrix to be decoded; and
a decoding unit for passing the feature matrix to be decoded obtained by the decoding feature matrix generation unit through a generator to generate a hearing compensation curve corresponding to a second ear.
8. The audio compensation system of claim 7, wherein the feature extraction unit is further configured to:
using each layer of the convolutional neural network of the tag converter to respectively perform, in a forward pass of the layers, convolution processing based on a two-dimensional convolution kernel, pooling processing along a channel dimension, and activation processing on input data, so that the last layer of the convolutional neural network outputs the first feature matrix, wherein the input of the first layer of the convolutional neural network is the waveform diagram of the first audio data; and using each layer of the convolutional neural network of the tag converter to respectively perform, in a forward pass of the layers, convolution processing based on a two-dimensional convolution kernel, pooling processing along a channel dimension, and activation processing on input data, so that the last layer of the convolutional neural network outputs the second feature matrix, wherein the input of the first layer of the convolutional neural network is the waveform diagram of the second audio data.
9. The audio compensation system of claim 7, wherein the encoding unit is further configured to:
converting the center-of-gravity change data of the person within the preset time period into input vectors by using the embedding layer of the context encoder of the tag converter to obtain a sequence of input vectors; and performing globally context-based semantic encoding on the sequence of input vectors by using a converter of a context encoder model of the tag converter to obtain the plurality of feature vectors.
10. An electronic device, comprising:
a processor; and
a memory having stored therein computer program instructions which, when executed by the processor, cause the processor to perform the audio compensation method of any of claims 1-6.
CN202210383817.6A 2022-04-12 2022-04-12 Audio compensation method, system and electronic equipment Active CN114900779B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210383817.6A CN114900779B (en) 2022-04-12 2022-04-12 Audio compensation method, system and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210383817.6A CN114900779B (en) 2022-04-12 2022-04-12 Audio compensation method, system and electronic equipment

Publications (2)

Publication Number Publication Date
CN114900779A true CN114900779A (en) 2022-08-12
CN114900779B CN114900779B (en) 2023-06-06

Family

ID=82717933

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210383817.6A Active CN114900779B (en) 2022-04-12 2022-04-12 Audio compensation method, system and electronic equipment

Country Status (1)

Country Link
CN (1) CN114900779B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066559A (en) * 2017-03-30 2017-08-18 天津大学 A kind of method for searching three-dimension model based on deep learning
US20200074982A1 (en) * 2018-09-04 2020-03-05 Gracenote, Inc. Methods and apparatus to segment audio and determine audio segment similarities
CN109872730A (en) * 2019-03-14 2019-06-11 广州飞傲电子科技有限公司 Distortion compensating method, method for establishing model and the audio output apparatus of audio data
CN112767954A (en) * 2020-06-24 2021-05-07 腾讯科技(深圳)有限公司 Audio encoding and decoding method, device, medium and electronic equipment
WO2022027987A1 (en) * 2020-08-04 2022-02-10 杰创智能科技股份有限公司 Image recognition model training method, and image recognition method
CN113035205A (en) * 2020-12-28 2021-06-25 阿里巴巴(中国)有限公司 Audio packet loss compensation processing method and device and electronic equipment
CN112735454A (en) * 2020-12-30 2021-04-30 北京大米科技有限公司 Audio processing method and device, electronic equipment and readable storage medium
CN113593598A (en) * 2021-08-09 2021-11-02 深圳远虑科技有限公司 Noise reduction method and device of audio amplifier in standby state and electronic equipment
CN113612808A (en) * 2021-10-09 2021-11-05 腾讯科技(深圳)有限公司 Audio processing method, related device, storage medium, and program product
CN113851142A (en) * 2021-10-21 2021-12-28 深圳市美恩微电子有限公司 Noise reduction method and system for high-performance TWS Bluetooth audio chip and electronic equipment
CN113990334A (en) * 2021-10-28 2022-01-28 深圳市美恩微电子有限公司 Method, system and electronic device for transmitting voice-coded Bluetooth audio

Also Published As

Publication number Publication date
CN114900779B (en) 2023-06-06

Similar Documents

Publication Publication Date Title
US11482207B2 (en) Waveform generation using end-to-end text-to-waveform system
JP7337953B2 (en) Speech recognition method and device, neural network training method and device, and computer program
CN110298319B (en) Image synthesis method and device
US11355097B2 (en) Sample-efficient adaptive text-to-speech
US20190306451A1 (en) Generating spatial audio using a predictive model
CN112151030B (en) Multi-mode-based complex scene voice recognition method and device
CN111599343A (en) Method, apparatus, device and medium for generating audio
US10469968B2 (en) Rendering for computer-mediated reality systems
US20220399025A1 (en) Method and device for generating speech video using audio signal
US20220351348A1 (en) Learning device and method for generating image
CN113987269A (en) Digital human video generation method and device, electronic equipment and storage medium
CN113886643A (en) Digital human video generation method and device, electronic equipment and storage medium
JP2020027609A (en) Response inference method and apparatus
CN115376495A (en) Speech recognition model training method, speech recognition method and device
CN115602165A (en) Digital staff intelligent system based on financial system
CN111696520A (en) Intelligent dubbing method, device, medium and electronic equipment
US20230343338A1 (en) Method for automatic lip reading by means of a functional component and for providing said functional component
US11069259B2 (en) Transmodal translation of feature vectors to audio for assistive devices
KR102360840B1 (en) Method and apparatus for generating speech video of using a text
JP2023169230A (en) Computer program, server device, terminal device, learned model, program generation method, and method
CN114900779A (en) Audio compensation method and system and electronic equipment
CN115101075A (en) Voice recognition method and related device
CN114495935A (en) Voice control method and system of intelligent device and electronic device
CN112967728B (en) End-to-end speech synthesis method and device combined with acoustic transfer function
CN114615610B (en) Audio compensation method and system of audio compensation earphone and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant