CN117319886B - Method and system for reducing audio path time delay of Android system

Info

Publication number: CN117319886B
Application number: CN202311604791.4A
Authority: CN (China)
Prior art keywords: data, feature, input signal, sound input, identified
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN117319886A
Inventors: 章洪亮, 肖忠山, 龚利恒
Assignee: Shenzhen Zhangrui Electronic Co ltd
Application filed by Shenzhen Zhangrui Electronic Co ltd
Priority to CN202311604791.4A
Publication of CN117319886A and CN117319886B
Application granted

Classifications

    • H04R 3/00: Circuits for transducers, loudspeakers or microphones (within H: Electricity > H04: Electric communication technique > H04R: Loudspeakers, microphones, gramophone pick-ups or like acoustic electromechanical transducers; deaf-aid sets; public address systems)
    • H04R 3/04: Circuits for transducers, loudspeakers or microphones for correcting frequency response
    • H04R 2430/00: Signal processing covered by H04R, not provided for in its groups

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

A method and a system for reducing the audio path time delay of an Android system: a sound input signal is captured through a microphone; underlying data containing the sound input signal is extracted from the USB DONGLE and transmitted to a computing device, wherein the computing device is installed with a driver used for ensuring compatibility between the microphone and the USB DONGLE; in the computing device, the sound input signal is identified from the underlying data; digital signal processing is performed on the sound input signal to obtain a processed sound input signal; and the processed sound input signal is played through a speaker of the computing device. In this way, unnecessary data transmission can be avoided, thereby reducing the processing burden and delay.

Description

Method and system for reducing audio path time delay of Android system
Technical Field
The application relates to the technical field of intelligent audio, and in particular relates to a method and a system for reducing audio path time delay of an Android system.
Background
The delay problem encountered in karaoke (K song) applications is mainly caused by the long overall audio transmission link. Typically, commercial products implement the K song function with a combination of a hand-held microphone, a USB DONGLE and a host system. Sound is transmitted from the microphone to the host system via the USB DONGLE, and is finally output to the speaker after passing through a series of components: a driver, an audio hardware abstraction layer (Audio HAL), a framework (Framework), the Audio HAL again, a driver and a DSP (digital signal processor).
However, due to limitations of the Android system, using system-layer framework APIs such as AudioTrack to implement the audio loopback (monitoring) function can result in a delay of 200-300 ms or even longer, which makes the K song function unusable. Even after optimization, the delay exceeds 50 milliseconds, so the K song experience is still not ideal.
Therefore, a scheme for reducing the audio path delay of the Android system is desired.
Disclosure of Invention
The present application has been made in order to solve the above technical problems. The embodiments of the application provide a method and a system for reducing the audio path time delay of an Android system: a sound input signal is captured through a microphone; underlying data containing the sound input signal is extracted from the USB DONGLE and transmitted to a computing device, wherein the computing device is installed with a driver used for ensuring compatibility between the microphone and the USB DONGLE; in the computing device, the sound input signal is identified from the underlying data; digital signal processing is performed on the sound input signal to obtain a processed sound input signal; and the processed sound input signal is played through a speaker of the computing device. In this way, unnecessary data transmission can be avoided, thereby reducing the processing burden and delay.
In a first aspect, a method for reducing audio path delay of an Android system is provided, which includes:
capturing a sound input signal by a microphone;
extracting underlying data containing the sound input signal from the USB DONGLE and transmitting the underlying data to a computing device, wherein the computing device is installed with a driver, and the driver is used for ensuring compatibility between the microphone and the USB DONGLE;
identifying, in the computing device, the sound input signal from the underlying data;
performing digital signal processing on the sound input signal to obtain a processed sound input signal; and
playing the processed sound input signal through a speaker of the computing device.
In a second aspect, a system for reducing audio path delay of an Android system is provided, which includes:
an input signal capturing module for capturing a sound input signal through a microphone;
an underlying data extraction module for extracting underlying data containing the sound input signal from the USB DONGLE and transmitting the underlying data to a computing device, wherein the computing device is installed with a driver, and the driver is used for ensuring compatibility between the microphone and the USB DONGLE;
a signal recognition module for identifying, in the computing device, the sound input signal from the underlying data;
a digital signal processing module for performing digital signal processing on the sound input signal to obtain a processed sound input signal; and
a playing module for playing the processed sound input signal through a speaker of the computing device.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments or the description of the prior art will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a diagram of a K song audio link according to an embodiment of the present application.
Fig. 2 is a flowchart of a method for reducing an audio path delay of an Android system according to an embodiment of the present application.
Fig. 3 is a schematic architecture diagram of a method for reducing audio path delay of an Android system according to an embodiment of the present application.
Fig. 4 is a flowchart of the substeps of step 130 in the method for reducing the audio path delay of the Android system according to the embodiment of the application.
Fig. 5 is a block diagram of a system for reducing Android system audio path delay according to an embodiment of the present application.
Fig. 6 is a schematic view of a scenario of a method for reducing an audio path delay of an Android system according to an embodiment of the present application.
Detailed Description
The following description of the technical solutions in the embodiments of the present application will be made with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
Unless defined otherwise, all technical and scientific terms used in the examples of this application have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to limit the scope of the present application.
In the description of the embodiments of the present application, unless otherwise indicated and defined, the term "connected" should be construed broadly, and for example, may be an electrical connection, may be a communication between two elements, may be a direct connection, or may be an indirect connection via an intermediary, and it will be understood by those skilled in the art that the specific meaning of the term may be understood according to the specific circumstances.
It should be noted that the term "first/second/third" in the embodiments of the present application merely distinguishes similar objects and does not imply a specific order; where allowed, "first/second/third" may be interchanged in a specific order or sequence. It is to be understood that objects distinguished by "first/second/third" may be interchanged where appropriate, so that the embodiments described herein may be implemented in sequences other than those illustrated or described.
It should be appreciated that the original audio path link is: microphone -> USB DONGLE -> driver -> audio hardware abstraction layer (Audio HAL) -> Framework -> Audio HAL -> driver -> digital signal processor (DSP) -> speaker.
Wherein:
1. Microphone: a device used to capture the sound input signal, converting sound into an electrical signal for subsequent processing.
2. USB DONGLE: a device that connects the microphone and the computing device; it is responsible for extracting the underlying data containing the sound input signal and transmitting it to the computing device.
3. Driver: software installed on the computing device to ensure compatibility between the microphone and the USB DONGLE; it is responsible for managing data transmission and device communication.
4. Audio hardware abstraction layer (Audio HAL): a software layer that provides an interface for interacting with the audio hardware; it acts as a bridge between the audio hardware and the operating system, handling the input and output of audio data.
5. Framework: a part of the Android system that provides audio processing and management functions; it interacts with applications and system components and controls audio stream routing, volume adjustment, audio effects, etc.
6. Digital signal processor (DSP): dedicated hardware or a software component for digital signal processing of the audio signal; it executes algorithms such as audio effect processing, filtering and noise reduction to improve audio quality.
7. Speaker: a device for playing the processed sound input signal; it converts the digital signal into a sound signal that the user can hear.
Through this link, the audio signal is processed and transmitted from the microphone through a series of stages and finally played out through the speaker. Each stage has a specific function and role; from hardware devices to software layers, they cooperate to realize audio communication and media playback.
Further, the audio path delay refers to the delay that the audio signal experiences from input to output. In a K song application, the audio signal needs to be processed and transmitted through multiple stages, which can cause a long audio path delay and degrade the user experience.
First, there may be some delay in the transmission from the microphone to the USB DONGLE, because the audio signal needs to be converted and amplified by the microphone and then digitized and transmitted over the USB DONGLE; both the conversion and the transmission take a certain time. Second, the transmission from the USB DONGLE to the host system may introduce further delay: the USB DONGLE needs to transmit the digitized audio signal to the host system, and this transmission may be limited by the USB interface. Then, the driver, the audio hardware abstraction layer (Audio HAL) and the framework (Framework) in the host system process and transmit the audio signal, and these stages also introduce delay. Finally, the audio signal is output to the speaker after passing through the driver and the DSP (digital signal processor); this output stage may also add delay, and the problem is more pronounced if the speaker device responds slowly.
The audio path delay problem in K song applications is thus mainly caused by the processing and transmission across multiple stages. To mitigate it, measures such as optimizing hardware devices, updating drivers and frameworks, and shortening signal transmission distances can be taken to reduce delay and improve the user experience.
Aiming at the above technical problem, the technical concept of the present application is to remove redundant links by analyzing the K song audio link. Fig. 1 shows a K song audio link according to an embodiment of the present application. As shown in fig. 1, the original link is reduced from "microphone -> USB DONGLE -> driver -> Audio HAL -> Framework -> Audio HAL -> driver -> DSP -> speaker" to "microphone -> USB DONGLE -> driver -> DSP -> speaker". A Loopback algorithm in the driver layer parses the USB underlying data, separates out the needed audio data, and sends it directly to the DSP for output, removing the Audio HAL -> Framework -> Audio HAL links and their corresponding delay. The total time delay can reach about 35 ms, which is imperceptible to the human ear.
That is, the original audio link contains multiple redundant stages, namely the Audio HAL -> Framework -> Audio HAL chain, which introduce additional processing and transmission delays; by eliminating these stages, the latency of the system can be reduced. A Loopback algorithm placed in the driver layer can parse the USB underlying data immediately and separate out the needed audio data, which avoids transferring unnecessary data to the DSP and thereby reduces the processing burden and delay. The improved audio link, "microphone -> USB DONGLE -> driver -> DSP -> speaker", is more concise and clear: the audio data is transmitted from the microphone through the USB DONGLE to the driver, sent directly to the DSP for processing, and finally output to the speaker. Such a link structure greatly reduces transmission and processing latency; the total delay can reach around 35 ms, which is imperceptible to the human ear and provides a better audio experience. In other words, the traditional scheme reads the mic node without separating the audio data and uses the ready-made Android loopback interface, which is simple to implement; the present method instead starts from the driver layer, parses the USB protocol, separates and extracts the needed audio data, implements a new loopback interface, and shortens the data transmission link.
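To make the driver-layer separation concrete, the following is a minimal sketch of the Loopback idea, written here in Python for readability. The packet layout it assumes (a type byte, a pad byte and a 16-bit little-endian length field) is purely illustrative and is not specified by this application; a real implementation would run inside a kernel-mode USB driver and follow the descriptors of the actual DONGLE.

```python
# Hypothetical sketch of the driver-layer Loopback idea: parse raw USB
# "underlying data", keep only the audio payloads, and hand them straight
# to the DSP output, bypassing Audio HAL and Framework entirely.
import struct

AUDIO_PACKET = 0x01  # assumed type tag marking audio payload packets

def separate_audio(usb_buffer: bytes):
    """Yield the raw PCM audio payloads found in the USB underlying data."""
    offset = 0
    while offset + 4 <= len(usb_buffer):
        # assumed header: 1 type byte, 1 pad byte, 2-byte payload length
        ptype, length = struct.unpack_from("<BxH", usb_buffer, offset)
        payload = usb_buffer[offset + 4 : offset + 4 + length]
        if ptype == AUDIO_PACKET:
            yield payload          # audio data goes straight to the DSP
        # non-audio packets (control, status, ...) are simply skipped
        offset += 4 + length

def loopback(usb_buffer: bytes, dsp_write):
    for pcm in separate_audio(usb_buffer):
        dsp_write(pcm)             # e.g. write to the DSP output node
```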
Accordingly, in the technical solution of the present application, fig. 2 is a flowchart of a method for reducing the audio path delay of an Android system according to an embodiment of the present application. As shown in fig. 2, the method includes: 110, capturing a sound input signal by a microphone; 120, extracting underlying data containing the sound input signal from the USB DONGLE and transmitting the underlying data to a computing device, wherein the computing device is installed with a driver, and the driver is used for ensuring compatibility between the microphone and the USB DONGLE; 130, identifying, in the computing device, the sound input signal from the underlying data; 140, performing digital signal processing on the sound input signal to obtain a processed sound input signal; and 150, playing the processed sound input signal through a speaker of the computing device.
Wherein in said step 110 a high quality microphone is selected to ensure accurate capture of the sound input signal. The quality of the microphone will directly affect the accuracy and clarity of the sound input signal in subsequent steps.
In step 120, the compatibility of the USB DONGLE with the computing device is ensured, and a corresponding driver is installed. By using a USB DONGLE, the sound input signal can be transmitted to the computing device, thereby enabling digitization of the audio path.
In step 130, appropriate signal processing and algorithm design are performed to ensure accurate recognition of the sound input signal. By identifying the sound input signal from the underlying data, real-time signal recognition and processing can be realized, improving the user experience.
In said step 140, the appropriate digital signal processing algorithms and parameters are selected to meet the requirements of the audio processing. Through digital signal processing, the sound input signal can be subjected to filtering, enhancement, noise reduction and other processing, so that the quality and definition of sound are improved.
In said step 150, a high quality speaker is selected and an appropriate audio output setting is made. By using high quality speakers for playback, it is ensured that the processed sound input signal is presented to the user in a high quality audio form, improving the user experience.
Through the above steps, a high quality microphone and a USB DONGLE with low-delay characteristics are selected, compatibility between the microphone and the USB DONGLE is ensured, and a USB DONGLE supporting low-delay transmission is chosen. The latest drivers are installed and verified to be compatible with the Android system, and an audio framework that supports low-latency audio processing, such as OpenSL ES or AAudio, is used. The physical distance between the microphone and the USB DONGLE is shortened as much as possible to reduce signal transmission delay, avoiding overlong connecting wires or extension cords. A professional audio interface, such as a USB audio interface or an external audio device, is used to provide more stable, lower-latency audio transmission. When performing digital signal processing on the audio input signal, optimization algorithms such as noise reduction, echo cancellation and audio compression may be used to reduce processing delay.
Therefore, the time delay of an audio channel of the Android system can be reduced, and the quality and the instantaneity of a sound input signal can be improved. This will help improve the performance and user experience of applications such as audio communication, speech recognition and media playback.
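As a reading aid, the five steps 110-150 can be pictured as the following user-space skeleton; the helper callables are hypothetical stand-ins, since the path described in this application actually runs at the driver layer rather than in application code.

```python
# Hypothetical skeleton of steps 110-150; capture (110) happens in the
# microphone/DONGLE hardware, so this loop starts from the USB read.
def low_latency_path(usb_read, identify, process, dsp_play):
    while True:
        underlying = usb_read()       # 120: underlying data from the USB DONGLE
        frame = identify(underlying)  # 130: identify the sound input signal
        if frame is None:             # not a sound input signal: skip it
            continue
        processed = process(frame)    # 140: digital signal processing
        dsp_play(processed)           # 150: straight to DSP, then speaker
```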
In the above method for reducing the audio path delay of the Android system, it is critical to rapidly and accurately identify the sound input signal from the underlying data. It should be noted that, in the technical solution of the present application, a pattern difference exists between the sound input signal and other signals or data. Therefore, after the underlying data is obtained, pattern features of the data to be identified can be extracted, and a data type determination can then be performed based on those pattern features to decide whether the data to be identified is the sound input signal.
Fig. 3 is a schematic architecture diagram of the method for reducing the audio path delay of an Android system according to an embodiment of the present application. Fig. 4 is a flowchart of the substeps of step 130 in the method according to the embodiment of the application. As shown in fig. 3 and 4, identifying, in the computing device, the sound input signal from the underlying data includes: 131, extracting data to be identified from the underlying data; 132, performing frequency domain analysis on the data to be identified to obtain a plurality of frequency domain feature statistics; 133, passing the plurality of frequency domain feature statistics and the data to be identified through a cross-modal feature encoder comprising a sequence encoder and an image encoder to obtain a multi-modal fusion feature matrix; and 134, determining, based on the multi-modal fusion feature matrix, whether the data to be identified is a sound input signal.
In step 131, a data segment containing the audio input signal is extracted according to the format and coding of the underlying data. This may be accomplished by parsing the data format, reading data at a specific location, or applying a specific data extraction algorithm; it should be ensured that the extracted data segment contains the complete sound input signal and is not damaged or lost.
In step 132, the data to be identified is subjected to frequency domain analysis, for example by applying a fast Fourier transform (FFT), to obtain characteristic information in the frequency domain. A plurality of frequency domain feature statistics may be calculated, such as spectrogram statistics, the frequency domain energy distribution and frequency features. An appropriate frequency domain analysis method and parameters should be selected to obtain accurate and representative frequency domain feature statistics.
In step 133, the cross-modal feature encoder performs feature fusion on the frequency domain feature statistics and the data to be identified to generate a multi-modal fusion feature matrix. The cross-modal feature encoder may include a sequence encoder and an image encoder for extracting and fusing different types of features; an appropriate encoder should be chosen so that the features are effectively fused into a multi-modal feature matrix with discriminative and characterization capabilities.
In step 134, the multi-modal fusion feature matrix is analyzed with a machine learning algorithm, a pattern recognition method or a classifier to determine whether the data to be identified is a sound input signal. A suitable classification algorithm or model is selected, trained and verified to ensure an accurate determination of the presence or absence of the sound input signal.
By utilizing a plurality of frequency domain feature statistics together with a cross-modal feature encoder, the recognition accuracy for sound input signals can be improved, thereby improving the effect of audio processing and analysis.
Specifically, in the technical solution of the present application, the process of identifying, in the computing device, the sound input signal from the underlying data includes: first, extracting data to be identified from the underlying data; then, performing frequency domain analysis on the data to be identified to obtain a plurality of frequency domain feature statistics, for example using a frequency domain feature extractor based on the Fourier transform.
The Fourier transform is an important signal processing technique for decomposing a signal into a series of sine and cosine functions of different frequencies; in audio processing, it is commonly used to analyze and process audio signals.
The basic idea of the Fourier transform is to convert a signal from the time domain into the frequency domain. In the time domain the signal varies with time, while in the frequency domain it is described by its frequency content; the transform therefore makes it possible to analyze the components of a signal at different frequencies. It may be realized as the continuous-time Fourier transform (CTFT) or the discrete Fourier transform (DFT); the CTFT is suitable for spectral analysis of continuous signals, and the DFT for spectral analysis of discrete signals.
In audio processing, an audio signal is typically analyzed using the DFT, which converts a discrete time domain signal into a discrete frequency domain signal whose frequencies range from 0 to half the sampling rate. The output of the DFT is called the spectrum and represents the intensity of the different frequency components of the signal. By Fourier transforming the audio signal, spectral information including the frequency components and their intensities can be obtained, which is very useful for many audio processing tasks such as audio coding, noise reduction and filtering.
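As an illustration of step 132, the following sketch computes a handful of frequency domain feature statistics with a discrete Fourier transform; the particular statistics chosen (energy, spectral centroid, spectral flatness, peak frequency) are assumptions for demonstration, not the specific set fixed by this application.

```python
import numpy as np

def frequency_domain_statistics(frame: np.ndarray, sample_rate: int) -> np.ndarray:
    """Return a small vector of frequency domain feature statistics."""
    spectrum = np.abs(np.fft.rfft(frame))                 # magnitude spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    energy = float(np.sum(spectrum ** 2))                 # total spectral energy
    centroid = float(np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12))
    flatness = float(np.exp(np.mean(np.log(spectrum + 1e-12)))
                     / (np.mean(spectrum) + 1e-12))       # geometric/arithmetic mean
    peak_freq = float(freqs[np.argmax(spectrum)])         # dominant frequency
    return np.array([energy, centroid, flatness, peak_freq])

# Example: one 20 ms frame at a 48 kHz sampling rate
stats = frequency_domain_statistics(np.random.randn(960), 48000)
```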
In short, the Fourier transform converts a signal from the time domain to the frequency domain; it is widely used in audio processing to analyze the spectral characteristics of audio signals. Next, the plurality of frequency domain feature statistics and the data to be identified are passed through a cross-modal feature encoder comprising a sequence encoder and an image encoder to obtain a multi-modal fusion feature matrix. This comprises: inputting the plurality of frequency domain feature statistics into the sequence encoder of the cross-modal feature encoder to obtain a frequency domain feature vector; passing the data to be identified through the image encoder of the cross-modal feature encoder to obtain a data graph feature vector to be identified; and performing cross-modal encoding on the frequency domain feature vector and the data graph feature vector to be identified to obtain the multi-modal fusion feature matrix.
It should be appreciated that if data type identification is performed from only features of a single modality, accuracy of data type identification may be degraded due to insufficient information or insufficient significance of modality features. Therefore, in the technical solution of the present application, the cross-modal feature encoder including the sequence encoder and the image encoder is used to perform multi-modal association encoding on the plurality of frequency domain feature statistics and the data to be identified so as to extract multi-modal feature representations of the data to be identified.
In particular, in the technical scheme of the present application, the sequence encoder is a one-dimensional convolutional neural network model and the image encoder is a two-dimensional convolutional neural network model. Therefore, when the data to be identified and the plurality of frequency domain feature statistics are encoded by the cross-modal feature encoder, the correlation pattern features between the frequency domain statistical features and the sound waveform features of the data to be identified can be extracted, and multi-modal feature fusion of the two can be performed to obtain the multi-modal fusion feature matrix.
In a specific example, the process of passing the plurality of frequency domain feature statistics and the data to be identified through the cross-modal feature encoder to obtain the multi-modal fusion feature matrix includes: first, inputting the plurality of frequency domain feature statistics into the sequence encoder of the cross-modal feature encoder to obtain a frequency domain feature vector; then, passing the data to be identified through the image encoder of the cross-modal feature encoder to obtain a data graph feature vector to be identified; and further, performing cross-modal encoding on the frequency domain feature vector and the data graph feature vector to be identified to obtain the multi-modal fusion feature matrix.
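A hedged sketch of this cross-modal feature encoder is given below, assuming PyTorch, illustrative layer sizes, and that the data to be identified is rendered as a two-dimensional map (for example, a spectrogram-like image); none of these concrete choices are fixed by this application. The final outer product anticipates the fusion step described further below.

```python
import torch
import torch.nn as nn

class CrossModalFeatureEncoder(nn.Module):
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        # sequence encoder: a one-dimensional convolutional neural network
        self.seq_encoder = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, feat_dim),
        )
        # image encoder: a two-dimensional convolutional neural network
        self.img_encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim),
        )

    def forward(self, stats: torch.Tensor, data_map: torch.Tensor) -> torch.Tensor:
        v_freq = self.seq_encoder(stats.unsqueeze(1))     # frequency domain feature vector [B, D]
        v_img = self.img_encoder(data_map.unsqueeze(1))   # data graph feature vector [B, D]
        # cross-modal encoding: outer product of the two feature vectors
        return torch.einsum("bi,bj->bij", v_img, v_freq)  # multi-modal fusion matrix [B, D, D]

# Example: batch of 4 samples, 16 frequency domain statistics, 64x64 data maps
encoder = CrossModalFeatureEncoder()
M = encoder(torch.randn(4, 16), torch.randn(4, 64, 64))
```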
Further, the frequency domain feature statistics provide information about the frequency characteristics of the signal, while the data to be identified contains more raw information. By cross-modal coding, the two different types of information can be fused, multiple characteristics are integrated, and the identification performance is improved.
The frequency domain feature statistics and the data to be identified are encoded by the sequence encoder and the image encoder, respectively, converting them into more characteristic representation forms. Cross-modal encoding can capture the correlation and complementarity between the different types of data, providing a richer, more comprehensive feature representation.
The cross-modal feature encoder maps the high-dimensional frequency domain feature statistics and the data graph features to be identified into a lower-dimensional multi-modal fusion feature matrix. This reduces the dimensionality of the features, increases computational efficiency, and lowers the complexity of model training and inference.
The multi-modal fusion feature matrix contains rich information and can better express the features of the data to be identified. By fusing multi-modal information, the robustness and generalization capability of the model can be improved, as can the recognition accuracy and performance. The multi-modal fusion feature matrix is obtained by cross-modal encoding of the frequency domain feature statistics and the data to be identified, which yields the beneficial effects of improved recognition performance and generalization capability.
Further, performing cross-modal encoding on the frequency domain feature vector and the data graph feature vector to be identified to obtain the multi-modal fusion feature matrix includes: calculating the vector product between the transposed data graph feature vector to be identified and the frequency domain feature vector to obtain the multi-modal fusion feature matrix.
Further, determining, based on the multi-modal fusion feature matrix, whether the data to be identified is a sound input signal includes: optimizing each feature value of the multi-modal fusion feature matrix to obtain an optimized multi-modal fusion feature matrix; and passing the optimized multi-modal fusion feature matrix through a classifier to obtain a classification result, wherein the classification result is used for indicating whether the data to be identified is a sound input signal.
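For the classifier stage, a minimal sketch under the assumption of a simple linear head follows; the application does not fix the classifier's structure, and the optimization of the feature values (discussed next) would be applied to the fusion matrix before this step.

```python
# Flatten the (optimized) multi-modal fusion feature matrix and regress
# two class probabilities: "is a sound input signal" vs. "is not".
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.head = nn.Linear(feat_dim * feat_dim, 2)

    def forward(self, fusion_matrix: torch.Tensor) -> torch.Tensor:
        # fusion_matrix: [B, feat_dim, feat_dim] -> class probabilities [B, 2]
        return self.head(fusion_matrix.flatten(1)).softmax(dim=-1)

# Example with a stand-in fusion matrix (e.g. the output M of the
# encoder sketch above):
probs = FusionClassifier()(torch.randn(4, 64, 64))
is_sound_input = probs[:, 1] > 0.5       # the classification result
```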
In particular, in the technical solution of the present application, the frequency domain feature vector and the data graph feature vector to be identified express, respectively, the frequency domain statistical features and the sound waveform features of the data to be identified. If differences in source data distribution and in the encoding manner of the feature encoders cause a significant feature distribution difference between the frequency domain feature vector and the data graph feature vector to be identified, then, when the position-wise association between them is computed for cross-modal fusion encoding, a feature distribution domain-transfer difference also arises, namely a cross-semantic distribution difference of the multi-modal fusion feature representation.
Therefore, considering that the multi-modal fusion feature matrix carries this domain-transfer difference, that is, the feature values at the various positions of the matrix differ across cross-modal semantic features, the probability density distribution of the regression probabilities of its feature values converges poorly when the matrix is mapped through class probability regression of the classifier, which affects the accuracy of the classification result obtained by the classifier.
Therefore, the optimization is preferably performed on each feature value of the multi-modal fusion feature matrix, specifically expressed as: optimizing each feature value of the multi-modal fusion feature matrix by the following formula to obtain the optimized multi-modal fusion feature matrix:
where $M$ is the multi-modal fusion feature matrix, $m_{i,j}$ is the $(i,j)$-th feature value of the multi-modal fusion feature matrix $M$, $\bar{m}$ is the global feature mean of the multi-modal fusion feature matrix $M$, and $m'_{i,j}$ is the $(i,j)$-th feature value of the optimized multi-modal fusion feature matrix.
Specifically, the sparse distribution of the multi-modal fusion feature matrix $M$ in the high-dimensional feature space causes a local probability density mismatch of its probability density distribution in probability space. The regularized global self-consistent class coding imitates the global self-consistent relation of the coding behaviors of the high-dimensional features of $M$ in probability space, so as to adjust the error landscape of the feature manifold in the high-dimensional open space domain and realize self-consistent matching type coding of $M$ embedded in the explicit probability space, thereby improving the convergence of the probability density distribution of the regression probabilities of $M$ and improving the accuracy of the classification result obtained by the classifier.
After the optimized multi-modal fusion feature matrix is obtained, it is passed through a classifier to obtain a classification result, wherein the classification result is used for indicating whether the data to be identified is the sound input signal. Further, by fusing the features of multiple modalities, the information of each modality can be comprehensively utilized, improving the performance and accuracy of the classifier. Features of different modalities may carry complementary information, and the characteristics of the data to be identified can be better captured by fusion.
Multi-modal fusion also enhances the robustness of the system to noise, variation and interference: when the features of one modality are corrupted or unreliable, the features of the other modalities can compensate and still provide a reliable classification result. Features of a single modality may not fully characterize the data to be identified; for example, in sound recognition, audio features alone may not distinguish certain similar sounds, while fusing features of other modalities, such as visual or textual features, provides more comprehensive information to better distinguish them.
Multi-modal fusion likewise makes the system more adaptive, able to handle different types of data to be identified: by fusing features of different modalities, the system can better adapt to the characteristics of different data types and provide more accurate classification results. Passing the optimized multi-modal fusion feature matrix through the classifier can thus remarkably improve classification accuracy, enhance the robustness of the system, overcome the limitations of a single modality and improve the adaptability of the system. These beneficial effects are very important for the task of recognizing the sound input signal.
In summary, the method 100 for reducing the audio path delay of the Android system according to the embodiment of the present application has been illustrated; it removes redundant links by analyzing the K song audio link. The original link is reduced from "microphone -> USB DONGLE -> driver -> Audio HAL -> Framework -> Audio HAL -> driver -> DSP -> speaker" to "microphone -> USB DONGLE -> driver -> DSP -> speaker". A Loopback algorithm in the driver layer parses the USB underlying data, separates out the needed audio data and sends it directly to the DSP for output, removing the Audio HAL -> Framework -> Audio HAL links and their corresponding delay. The total time delay can reach about 35 ms, which is imperceptible to the human ear.
In one embodiment of the present application, fig. 5 is a block diagram of a system for reducing the audio path delay of an Android system according to an embodiment of the present application. As shown in fig. 5, the system 200 includes: an input signal capturing module 210 for capturing a sound input signal through a microphone; an underlying data extraction module 220 for extracting underlying data containing the sound input signal from the USB DONGLE and transmitting the underlying data to a computing device, wherein the computing device is installed with a driver, and the driver is configured to ensure compatibility between the microphone and the USB DONGLE; a signal recognition module 230 for identifying, in the computing device, the sound input signal from the underlying data; a digital signal processing module 240 for performing digital signal processing on the sound input signal to obtain a processed sound input signal; and a playing module 250 for playing the processed sound input signal through a speaker of the computing device.
In the above system, the signal recognition module comprises: an identification data extraction unit for extracting data to be identified from the underlying data; a frequency domain analysis unit for performing frequency domain analysis on the data to be identified to obtain a plurality of frequency domain feature statistics; a feature encoding unit for passing the plurality of frequency domain feature statistics and the data to be identified through a cross-modal feature encoder comprising a sequence encoder and an image encoder to obtain a multi-modal fusion feature matrix; and a sound input signal determining unit for determining, based on the multi-modal fusion feature matrix, whether the data to be identified is a sound input signal.
Here, it will be understood by those skilled in the art that the specific functions and operations of the respective units and modules in the above system for reducing the audio path delay of the Android system have been described in detail in the above description of the method for reducing the audio path delay of the Android system with reference to fig. 1 to 4, and thus, repetitive descriptions thereof will be omitted.
As described above, the system 200 for reducing the audio path delay of the Android system according to the embodiment of the application may be implemented in various terminal devices, for example, a server for reducing the audio path delay of the Android system. In one example, the system 200 for reducing audio path delay of the Android system according to the embodiments of the present application may be integrated into a terminal device as a software module and/or a hardware module. For example, the system 200 for reducing the audio path delay of the Android system may be a software module in the operating system of the terminal device, or may be an application program developed for the terminal device; of course, the system 200 for reducing the audio path delay of the Android system may also be one of a plurality of hardware modules of the terminal device.
Alternatively, in another example, the system 200 for reducing the audio path delay of the Android system and the terminal device may be separate devices, and the system 200 for reducing the audio path delay of the Android system may be connected to the terminal device through a wired and/or wireless network and transmit the interaction information according to the agreed data format.
Fig. 6 is a schematic view of a scenario of the method for reducing the audio path delay of an Android system according to an embodiment of the present application. As shown in fig. 6, in this application scenario, first, data to be identified is extracted from the underlying data (e.g., C as illustrated in fig. 6); the data to be identified is then input to a server (e.g., S as illustrated in fig. 6) on which the algorithm for reducing the audio path delay of the Android system is deployed, wherein the server processes the data to be identified with this algorithm to determine whether it is a sound input signal.
It is also noted that in the apparatus, devices and methods of the present application, the components or steps may be disassembled and/or assembled. Such decomposition and/or recombination should be considered as equivalent to the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of the application to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.

Claims (8)

1. A method for reducing the audio path time delay of an Android system, characterized by comprising the following steps:
capturing a sound input signal by a microphone;
extracting underlying data containing the sound input signal from the USB DONGLE and transmitting the underlying data to a computing device, wherein the computing device is installed with a driver, and the driver is used for ensuring compatibility between the microphone and the USB DONGLE;
identifying, in the computing device, the sound input signal from the underlying data;
performing digital signal processing on the sound input signal to obtain a processed sound input signal; and
playing the processed sound input signal through a speaker of the computing device;
wherein identifying, in the computing device, the sound input signal from the underlying data comprises:
extracting data to be identified from the underlying data;
performing frequency domain analysis on the data to be identified to obtain a plurality of frequency domain feature statistics;
passing the plurality of frequency domain feature statistics and the data to be identified through a cross-modal feature encoder comprising a sequence encoder and an image encoder to obtain a multi-modal fusion feature matrix; and
determining, based on the multi-modal fusion feature matrix, whether the data to be identified is a sound input signal.
2. The method for reducing the audio path time delay of an Android system according to claim 1, wherein passing the plurality of frequency domain feature statistics and the data to be identified through a cross-modal feature encoder comprising a sequence encoder and an image encoder to obtain a multi-modal fusion feature matrix comprises:
inputting the plurality of frequency domain feature statistics into a sequence encoder of the cross-modal feature encoder to obtain a frequency domain feature vector;
passing the data to be identified through an image encoder of the cross-modal feature encoder to obtain a data graph feature vector to be identified; and
performing cross-modal encoding on the frequency domain feature vector and the data graph feature vector to be identified to obtain the multi-modal fusion feature matrix.
3. The method for reducing the audio path time delay of an Android system according to claim 2, wherein the sequence encoder is a one-dimensional convolutional neural network model.
4. The method for reducing the audio path time delay of an Android system according to claim 3, wherein the image encoder is a two-dimensional convolutional neural network model.
5. The method for reducing the audio path time delay of an Android system according to claim 4, wherein performing cross-modal encoding on the frequency domain feature vector and the data graph feature vector to be identified to obtain the multi-modal fusion feature matrix comprises:
calculating the vector product between the transposed data graph feature vector to be identified and the frequency domain feature vector to obtain the multi-modal fusion feature matrix.
6. The method for reducing the audio path time delay of an Android system according to claim 5, wherein determining, based on the multi-modal fusion feature matrix, whether the data to be identified is a sound input signal comprises:
optimizing each feature value of the multi-modal fusion feature matrix to obtain an optimized multi-modal fusion feature matrix; and
passing the optimized multi-modal fusion feature matrix through a classifier to obtain a classification result, wherein the classification result is used for indicating whether the data to be identified is a sound input signal.
7. The method for reducing the audio path time delay of an Android system according to claim 6, wherein optimizing each feature value of the multi-modal fusion feature matrix to obtain an optimized multi-modal fusion feature matrix comprises: optimizing each feature value of the multi-modal fusion feature matrix by the following formula to obtain the optimized multi-modal fusion feature matrix:
where $M$ is the multi-modal fusion feature matrix, $m_{i,j}$ is the $(i,j)$-th feature value of the multi-modal fusion feature matrix $M$, $\bar{m}$ is the global feature mean of the multi-modal fusion feature matrix $M$, and $m'_{i,j}$ is the $(i,j)$-th feature value of the optimized multi-modal fusion feature matrix.
8. A system for reducing the audio path time delay of an Android system, characterized by comprising:
an input signal capturing module for capturing a sound input signal through a microphone;
an underlying data extraction module for extracting underlying data containing the sound input signal from the USB DONGLE and transmitting the underlying data to a computing device, wherein the computing device is installed with a driver, and the driver is used for ensuring compatibility between the microphone and the USB DONGLE;
a signal recognition module for identifying, in the computing device, the sound input signal from the underlying data;
a digital signal processing module for performing digital signal processing on the sound input signal to obtain a processed sound input signal; and
a playing module for playing the processed sound input signal through a speaker of the computing device;
wherein the signal recognition module comprises:
an identification data extraction unit for extracting data to be identified from the underlying data;
a frequency domain analysis unit for performing frequency domain analysis on the data to be identified to obtain a plurality of frequency domain feature statistics;
a feature encoding unit for passing the plurality of frequency domain feature statistics and the data to be identified through a cross-modal feature encoder comprising a sequence encoder and an image encoder to obtain a multi-modal fusion feature matrix; and
a sound input signal determining unit for determining, based on the multi-modal fusion feature matrix, whether the data to be identified is a sound input signal.
CN202311604791.4A 2023-11-29 2023-11-29 Method and system for reducing audio path time delay of Android system Active CN117319886B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311604791.4A CN117319886B (en) 2023-11-29 2023-11-29 Method and system for reducing audio path time delay of Android system

Publications (2)

Publication Number Publication Date
CN117319886A CN117319886A (en) 2023-12-29
CN117319886B true CN117319886B (en) 2024-03-12

Family

ID=89250219

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311604791.4A Active CN117319886B (en) 2023-11-29 2023-11-29 Method and system for reducing audio path time delay of Android system

Country Status (1)

Country Link
CN (1) CN117319886B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060093104A1 (en) * 2004-10-06 2006-05-04 Smartlink Ltd. Telephone adapter with advanced features
US8073697B2 (en) * 2006-09-12 2011-12-06 International Business Machines Corporation Establishing a multimodal personality for a multimodal application
US8175636B2 (en) * 2006-09-29 2012-05-08 Polycom Inc. Desktop phone with interchangeable wireless handset

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110401889A (en) * 2019-08-05 2019-11-01 深圳市小瑞科技股份有限公司 Multiple path blue-tooth microphone system and application method based on USB control
CN112770307A (en) * 2020-12-31 2021-05-07 重庆百瑞互联电子技术有限公司 Multi-mode Bluetooth adapter and working method thereof
JP2023046127A (en) * 2021-09-22 2023-04-03 株式会社リコー Utterance recognition system, communication system, utterance recognition device, moving body control system, and utterance recognition method and program
CN116795753A (en) * 2022-03-14 2023-09-22 华为技术有限公司 Audio data transmission processing method and electronic equipment
CN114900208A (en) * 2022-05-09 2022-08-12 北京悦米科技有限公司 Pairing-free technical method for U-section wireless microphone
CN115866267A (en) * 2022-11-25 2023-03-28 南京邮电大学 Multimedia data coding and decoding method for low-delay and high-reliability communication

Similar Documents

Publication Publication Date Title
WO2021082941A1 (en) Video figure recognition method and apparatus, and storage medium and electronic device
CN107799126B (en) Voice endpoint detection method and device based on supervised machine learning
US8989395B2 (en) Audio fingerprint differences for end-to-end quality of experience measurement
CN106486130B (en) Noise elimination and voice recognition method and device
CN108962229B (en) Single-channel and unsupervised target speaker voice extraction method
CN1494712A (en) Distributed voice recognition system using acoustic feature vector modification
CN106098078B (en) Voice recognition method and system capable of filtering loudspeaker noise
CN112289338B (en) Signal processing method and device, computer equipment and readable storage medium
CN113823273B (en) Audio signal processing method, device, electronic equipment and storage medium
CN112242149B (en) Audio data processing method and device, earphone and computer readable storage medium
CN111145746A (en) Man-machine interaction method based on artificial intelligence voice
CN115602165A (en) Digital staff intelligent system based on financial system
CN107592600B (en) Pickup screening method and pickup device based on distributed microphones
CN117319886B (en) Method and system for reducing audio path time delay of Android system
WO2022121182A1 (en) Voice activity detection method and apparatus, and device and computer-readable storage medium
WO2019119552A1 (en) Method for translating continuous long speech file, and translation machine
CN114495909B (en) End-to-end bone-qi guiding voice joint recognition method
CN106488197A (en) A kind of intelligent person recognition robot
CN111009262A (en) Voice gender identification method and system
CN109451254A (en) A kind of smart television digital receiver
CN114283791A (en) Speech recognition method based on high-dimensional acoustic features and model training method
CN115116458A (en) Voice data conversion method and device, computer equipment and storage medium
CN111028857A (en) Method and system for reducing noise of multi-channel audio and video conference based on deep learning
CN115148190A (en) Speech recognition method based on neural network algorithm
CN110085231A (en) More MIC voice assistant modules based on USB port

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant