CN113794963B - Speech enhancement system based on low-cost wearable sensor - Google Patents

Speech enhancement system based on low-cost wearable sensor

Info

Publication number
CN113794963B
Authority
CN
China
Prior art keywords
signal, sampling rate, sound, low, wearable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111075171.7A
Other languages
Chinese (zh)
Other versions
CN113794963A (en)
Inventor
邹永攀
洪史聪
郑楚育
伍楷舜
Current Assignee
Shenzhen University
Original Assignee
Shenzhen University
Priority date
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN202111075171.7A priority Critical patent/CN113794963B/en
Publication of CN113794963A publication Critical patent/CN113794963A/en
Application granted granted Critical
Publication of CN113794963B publication Critical patent/CN113794963B/en

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04R: LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00: Details of transducers, loudspeakers or microphones
    • H04R1/10: Earpieces; Attachments therefor; Earphones; Monophonic headphones
    • H04R1/1016: Earpieces of the intra-aural type
    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01H: MEASUREMENT OF MECHANICAL VIBRATIONS OR ULTRASONIC, SONIC OR INFRASONIC WAVES
    • G01H11/00: Measuring mechanical vibrations or ultrasonic, sonic or infrasonic waves by detecting changes in electric or magnetic properties
    • G01H11/06: Measuring mechanical vibrations or ultrasonic, sonic or infrasonic waves by detecting changes in electric or magnetic properties by electric means
    • G01H11/08: Measuring mechanical vibrations or ultrasonic, sonic or infrasonic waves by detecting changes in electric or magnetic properties by electric means using piezoelectric devices
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L21/0224: Processing in the time domain
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04R: LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00: Details of transducers, loudspeakers or microphones
    • H04R1/10: Earpieces; Attachments therefor; Earphones; Monophonic headphones
    • H04R1/1083: Reduction of ambient noise
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04R: LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R17/00: Piezoelectric transducers; Electrostrictive transducers
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W4/00: Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/80: Services using short range communication, e.g. near-field communication [NFC], radio-frequency identification [RFID] or low energy communication
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04R: LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2201/00: Details of transducers, loudspeakers or microphones covered by H04R1/00 but not provided for in any of its subgroups
    • H04R2201/10: Details of earpieces, attachments therefor, earphones or monophonic headphones covered by H04R1/10 but not provided for in any of its subgroups

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Telephone Function (AREA)

Abstract

The invention discloses a speech enhancement system based on a low-cost wearable sensor. The system comprises a wearable device and a smart device. The wearable device comprises a micro control unit, an audio processing unit, a filter, an in-ear earphone, and a piezoelectric ceramic piece; under the control of the micro control unit, the in-ear earphone and the piezoelectric ceramic piece respectively collect the sound signal in the ear canal and the neck vibration signal while the user speaks, and transmit them to the smart device. The smart device aligns the received neck vibration signal and ear-canal sound signal, extracts the corresponding time-frequency spectrogram or time series, and inputs it into a pre-trained deep learning model to obtain a target-quality speech signal whose resolution and perceived audio quality are superior to those of the raw signals acquired by the wearable device. The invention can convert low-cost wearable sensor signals into high-quality signals while protecting user privacy, and is suitable for daily use.

Description

Speech enhancement system based on low-cost wearable sensor
Technical Field
The invention relates to the technical field of wearable equipment, in particular to a voice enhancement system based on a low-cost wearable sensor.
Background
Human-computer interaction is one of the most important functions of a device, yet it is constrained by the form factor of wearable devices, which affects user experience. Among interaction modes for wearable devices, voice interaction is natural and has a low learning cost. While smart earphones hold most of the wearable market share, neck-worn devices (smart necklaces, neck-band earphones) are an emerging category that users may also accept. A wearable device needs only a single microphone to support voice-input interaction. However, microphones are inherently sensitive to environmental noise, so the recorded data contains substantial noise and is of poor quality. Furthermore, because wearable devices must be miniaturized, their processing power is limited and they cannot sample at high rates; at the same time, daily recording scenarios place high demands on real-time transmission. As a result, low-cost wearable devices, constrained by their hardware, can typically only acquire the user's voice at a low sampling rate and low quality, and users can clearly perceive the difference between audio recorded by high-cost and low-cost microphones. Improving the speech quality delivered to the receiving end when the user wears low-cost equipment, while preserving as much of the original fidelity as possible, therefore improves the user experience in both local recording and call scenarios.
With the continuous development of deep learning, related applications have appeared in many fields. However, converting a low-quality sensor signal into a high-quality sensor signal has not been specifically explored. Existing related research includes:
1) Audio super-resolution: upsampling a low-sampling-rate sound signal into a high-sampling-rate one by deep learning to improve audio quality.
2) Cross-sensor or cross-modality mapping: for example, restoring an audio signal from an accelerometer signal (cross-sensor), or synthesizing speech from text and text from video (cross-modality).
3) Speech enhancement: for example, speech denoising and multi-modal speech enhancement.
4) Wearable interaction: for example, gesture recognition and voice input.
However, existing voice interaction schemes are difficult to deploy because of the complexity of their deep learning models, or are impractical for smart wearable devices because of inconvenient interaction or susceptibility to external noise.
Disclosure of Invention
It is an object of the present invention to overcome the above-mentioned drawbacks of the prior art and to provide a low-cost wearable sensor based speech enhancement system.
According to a first aspect of the invention, a low-cost wearable sensor based speech enhancement system is provided. The system comprises a wearable device and a smart device. The wearable device comprises a micro control unit, an audio processing unit, a filter, an in-ear earphone, and a piezoelectric ceramic piece; under the control of the micro control unit, the in-ear earphone and the piezoelectric ceramic piece respectively collect the sound signal in the ear canal and the neck vibration signal while the user speaks, and transmit them to the smart device. The smart device aligns the received neck vibration signal and ear-canal sound signal, extracts the corresponding time-frequency spectrogram or time series, and inputs it into a pre-trained deep learning model to obtain a target-quality speech signal whose resolution and perceived audio quality are superior to those of the raw signals acquired by the wearable device.
According to a second aspect of the invention, a low-cost wearable sensor based speech enhancement method is provided. The method comprises the following steps:
collecting neck vibration signals and sound signals in an ear canal when a user produces sound;
after aligning the neck vibration signal and the ear-canal sound signal, extracting the corresponding time-frequency spectrogram or time series and inputting it into a pre-trained deep learning model to obtain a target-quality speech signal, whose resolution and perceived audio quality are superior to those of the signals collected by the wearable device.
Compared with the prior art, the proposed speech enhancement system not only fulfills the functions of an ordinary earphone, but also collects the sound propagated in the ear canal and the signal of the piezoelectric ceramic piece; thanks to the acquisition positions and sensor characteristics, better raw signals can be collected. The invention converts low-cost wearable sensor signals into high-quality signals while protecting user privacy, achieves speech quality matching or exceeding that of high-cost sensors, and is suitable for daily use.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a block diagram of a low cost wearable sensor based speech enhancement system according to one embodiment of the present invention;
FIG. 2 is a schematic diagram of a user wearing a speech enhancement system according to one embodiment of the present invention;
FIG. 3 is a flow diagram of a low cost wearable sensor based speech enhancement method according to one embodiment of the present invention;
FIG. 4 is a schematic diagram of a low cost wearable sensor based speech enhancement process according to one embodiment of the present invention;
FIG. 5 is a flow diagram of detecting a voice event and running a deep learning model at a terminal according to one embodiment of the invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
The speech enhancement system based on low-cost wearable sensors can enhance the signals acquired by low-cost sensors and restore them into high-quality signals that are superior to the originals. Referring to fig. 1 and fig. 2, fig. 1 (a) is a front view of the system and fig. 1 (b) is a rear view. The system as a whole comprises a wearable device and a smart device, wherein the wearable device includes a micro control unit 1, an audio processing unit 2, a filter 3, a storage unit 4, an in-ear earphone 5, a piezoelectric ceramic piece 6, and a power supply module 7. The wearable device may be a smart necklace, a neck-worn earphone, a neck strap, etc. The smart device may be a smart terminal, another wearable device, or another type of electronic device such as a smartphone, a tablet, a desktop computer, or a vehicle-mounted device; fig. 1 takes a smartphone as an example.
The micro control unit (MCU) 1 may use a high-performance chip as the central controller, coordinating the other modules and units and handling communication between the wearable device and the smartphone. For example, the MCU 1 may use an ESP32, which can run applications as a standalone system or act as a slave to a host MCU, and which integrates communication functions such as Wi-Fi and Bluetooth accessible through SPI/SDIO or I2C/UART interfaces.
The audio processing unit 2 may employ an integrated chip, such as VS1053, which is an audio decoding module using SPI communication, and supports decoding playing and encoding saving of audio files.
The filter 3 provides filtering and amplification; for example, an LM358 dual operational amplifier circuit can be used.
The storage unit 4 employs, for example, an SD card for storing the acquired signal or audio file.
The microphone inside the in-ear earphone 5 is used for collecting sound signals in the ear canal when a user produces sound, and the environmental noise can be shielded to a certain extent by collecting the sound signals in the ear canal.
The piezoelectric ceramic piece 6 collects the neck vibration signal when the user speaks. Piezoelectric ceramics are very sensitive to vibration and generate voltage changes whose amplitude follows the vibration amplitude. To make the acquired neck vibration signal more accurate and pure, a filtering and amplifying circuit can moderately amplify and filter the effective signal. When the user wears the provided wearable device, the piezoelectric ceramic piece rests near the user's vocal cords, which helps obtain an accurate neck vibration signal and keeps the device convenient to carry.
The power module 7 is used to provide power to the wearable device, and may be of a common battery type, such as LiPO.
Hereinafter, with reference to fig. 3 and fig. 4, a speech enhancement process is described by taking an android phone as an example, which specifically includes the following steps:
and step S1, respectively collecting neck vibration signals and sound signals in the auditory canal when the user vocalizes through the piezoelectric ceramic piece and the microphone on the in-ear earphone.
For example, besides a power key and the microphone and piezoelectric ceramic piece used in normal operation, the wearable device housing provides a push button for starting the speech enhancement function, which the user can turn on manually. In one embodiment, this button is located on the side of the housing and the piezoelectric ceramic piece on the back of the device.
In step S1, the filter 3 filters out noise, including mains interference, and amplifies and retains the effective signal. The ESP32 collects the neck vibration signal through its high-speed ADC at a sampling rate of 10 kHz and temporarily stores the data on the SD card, which receives the data over the SDIO (Secure Digital Input and Output) bus. When the user wears the wearable device, the piezoelectric ceramic sensor is attached to the side of the vocal cord position to ensure as large a contact area as possible.
Preferably, the integrated audio processing chip VS1053 is used to collect the sound signal. Compared with collecting sound directly through an audio amplifier, the VS1053 contains a built-in digital filter and a series of software filtering algorithms, which effectively reduce the processing load for the ear-canal signal. The sound signal is likewise sampled at 10 kHz; the VS1053 and the ESP32 exchange data over the SPI protocol, and the collected sound is converted to WAV format and stored on the SD card.
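As an illustration of the storage format described above, the sketch below writes 10 kHz mono samples as a 16-bit WAV stream using Python's standard `wave` module. The helper name and test tone are hypothetical; this is not firmware code, only a model of the file layout:

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 10_000  # 10 kHz, as used for both recording channels

def write_wav(buf, samples):
    """Write floats in [-1, 1] as 16-bit mono PCM WAV to a file-like object."""
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)           # mono
        w.setsampwidth(2)           # 16-bit PCM
        w.setframerate(SAMPLE_RATE)
        pcm = struct.pack("<%dh" % len(samples),
                          *(int(max(-1.0, min(1.0, s)) * 32767) for s in samples))
        w.writeframes(pcm)

# Example: one second of a 200 Hz tone (roughly vocal-fold range)
tone = [0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE)]
buf = io.BytesIO()
write_wav(buf, tone)

buf.seek(0)
with wave.open(buf, "rb") as r:
    assert r.getframerate() == SAMPLE_RATE
    assert r.getnframes() == SAMPLE_RATE  # exactly one second of audio
```

Reading the header back confirms the sampling rate and duration are preserved.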
And step S2, storing and transmitting the collected data to the intelligent terminal for data processing.
For example, in step S2, data transmission to the mobile phone uses the common Bluetooth protocol. To reduce the transmission volume, the piezoelectric ceramic data and the ear-canal data are compressed, for example with a Huffman algorithm. All data are transmitted as raw binary, with a check code plus a frame header and frame tail added, so the data can be reliably delivered to the phone for processing.
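The framing scheme can be sketched as follows. The header/tail byte values and the additive checksum are illustrative assumptions (the patent specifies neither), and the Huffman compression step is elided:

```python
import struct

FRAME_HEAD = b"\xAA\x55"   # illustrative delimiter values
FRAME_TAIL = b"\x55\xAA"

def checksum(payload: bytes) -> int:
    """Simple additive checksum over the payload (one illustrative choice)."""
    return sum(payload) & 0xFF

def pack_frame(payload: bytes) -> bytes:
    """Wrap raw sensor bytes with header, length, checksum, and tail."""
    return (FRAME_HEAD
            + struct.pack("<H", len(payload))   # little-endian 16-bit length
            + payload
            + bytes([checksum(payload)])
            + FRAME_TAIL)

def unpack_frame(frame: bytes) -> bytes:
    """Validate delimiters and checksum; return the payload or raise."""
    if frame[:2] != FRAME_HEAD or frame[-2:] != FRAME_TAIL:
        raise ValueError("bad frame delimiters")
    (length,) = struct.unpack("<H", frame[2:4])
    payload = frame[4:4 + length]
    if frame[4 + length] != checksum(payload):
        raise ValueError("checksum mismatch")
    return payload

data = bytes(range(16))
assert unpack_frame(pack_frame(data)) == data
```

The receiver can thus detect truncated or corrupted frames before handing the payload to the processing pipeline.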
Step S3: filter noise from the signals and use voice activity detection to detect voice events.
Referring to fig. 5, the step of detecting a sound event includes:
s31, recording a segment of data without voice after the user wears the equipment; and performing data framing on the original signal, and performing noise filtering processing.
Specifically, data recorded while the current user is silent serves as the reference source for the noise signal. The time-frequency spectrum of the noise is computed with a time-frequency transform such as the short-time Fourier transform (STFT), and a noise threshold is derived from the mean and variance of that spectrum. The time-frequency spectrum of the original signal is computed in the same way, and components below the noise threshold are removed as noise, yielding denoised data. This denoising can be applied to both the ear-canal sound signal and the neck vibration signal.
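A compact numpy sketch of this spectral gating follows; the frame sizes and the mean-plus-k-standard-deviations threshold are illustrative choices, not values taken from the patent:

```python
import numpy as np

def frame_signal(x, n=256, hop=128):
    """Slice x into overlapping frames of length n."""
    starts = np.arange(0, len(x) - n + 1, hop)
    return np.stack([x[s:s + n] for s in starts])

def spectral_gate(signal, noise_ref, n=256, hop=128, k=2.0):
    """Zero time-frequency bins of `signal` below a per-bin threshold
    (mean + k * std) estimated from a noise-only recording."""
    win = np.hanning(n)
    noise_mag = np.abs(np.fft.rfft(frame_signal(noise_ref, n, hop) * win, axis=1))
    thresh = noise_mag.mean(axis=0) + k * noise_mag.std(axis=0)

    spec = np.fft.rfft(frame_signal(signal, n, hop) * win, axis=1)
    spec[np.abs(spec) < thresh] = 0           # gate sub-threshold bins
    return np.fft.irfft(spec, n=n, axis=1)    # denoised frames, time domain

rng = np.random.default_rng(0)
t = np.arange(4096) / 10_000                  # 10 kHz, matching the system
noise_only = 0.05 * rng.standard_normal(4096)
noisy = np.sin(2 * np.pi * 200 * t) + 0.05 * rng.standard_normal(4096)

denoised = spectral_gate(noisy, noise_only)
assert denoised.shape == (31, 256)
assert np.sum(denoised ** 2) > 0              # the tone survives the gate
```

The strong 200 Hz component sits far above the per-bin noise threshold and passes through, while low-level broadband bins are zeroed.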
S32, for each frame of signal, it is detected whether there is voice activity.
Specifically, Voice Activity Detection (VAD) is applied to each frame processed in S31 to determine whether a voice event is present; when one is, the neck vibration signal and the sound signal (collectively, the voice data) are sent to the smart terminal.
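A simple per-frame energy detector can stand in for the VAD; the noise floor and threshold ratio below are illustrative assumptions, not parameters from the patent:

```python
def energy_vad(frames, noise_floor=1e-4, ratio=2.0):
    """Flag frames whose mean energy exceeds `ratio` times the noise floor.

    `frames` is a list of sample lists; returns a parallel list of booleans.
    """
    flags = []
    for frame in frames:
        energy = sum(s * s for s in frame) / len(frame)
        flags.append(energy > ratio * noise_floor)
    return flags

# 16 ms frames at the system's 10 kHz sampling rate -> 160 samples each
silence = [0.001] * 160
speech = [0.2, -0.3, 0.25] * 53 + [0.2]
assert energy_vad([silence, speech]) == [False, True]
```

Only frames flagged `True` would be transmitted, which keeps Bluetooth traffic low during silence.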
In step S4, the terminal processes the received sound data and feeds back the audio signal with improved quality to the user.
Still referring to fig. 5, step S4 includes:
s41, the terminal synchronizes the time of the received data of the two sensors and aligns the signals.
First, for the voice activity detected in step S32, the terminal time-synchronizes and aligns the data received from the two sensors, specifically: 1) The data of both sensors are converted into sequences of window energies with window size n; the cross-correlation of these sequences gives a coarse-grained lag, which is used for coarse time synchronization. 2) A segment of raw data from both sensors around the coarse alignment position is taken, and its cross-correlation gives a fine-grained lag; performing fine synchronization on top of the coarse result yields two well-synchronized sensor signals.
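The coarse-to-fine synchronization can be sketched with numpy as follows; the window size, search radius, and test signal are illustrative:

```python
import numpy as np

def best_lag(a, b):
    """Lag k maximizing sum_n a[n + k] * b[n], via full cross-correlation."""
    corr = np.correlate(a, b, mode="full")
    return int(np.argmax(corr)) - (len(b) - 1)

def align_two_stage(a, b, win=64):
    """Coarse lag from per-window energy sequences, then a fine search over
    raw samples within one window of the coarse estimate."""
    ea = np.add.reduceat(a * a, np.arange(0, len(a), win))
    eb = np.add.reduceat(b * b, np.arange(0, len(b), win))
    coarse = best_lag(ea, eb) * win           # coarse lag in samples

    def score(lag):                            # correlation at a single lag
        x = a[lag:] if lag >= 0 else a[:lag]
        y = b if lag >= 0 else b[-lag:]
        m = min(len(x), len(y))
        return float(np.dot(x[:m], y[:m]))

    return max(range(coarse - win, coarse + win + 1), key=score)

# Synthetic check: sensor b starts 100 samples later than sensor a.
a = np.zeros(2000)
a[500:700] = np.hanning(200) * np.sin(2 * np.pi * 200 * np.arange(200) / 10_000)
b = a[100:]
assert align_two_stage(a, b) == 100
```

The energy-sequence pass narrows the search to one window, so the fine pass only evaluates a small lag range instead of the whole signal.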
And S42, running a deep learning model on the terminal, enhancing the signal, and feeding back the signal to the user.
Specifically, a first deep learning model converts the low-cost, low-sampling-rate sensor signal into the equivalent of a high-cost, low-sampling-rate sensor signal. A second deep learning model then converts that into the equivalent of a high-cost, high-sampling-rate sensor signal, and the speech-enhanced signal is fed back to the user, improving the subjective experience. The deep learning models embedded in the terminal are pre-trained on a data set; since the terminal's computing power is relatively limited, training can be performed offline in the cloud or on a server.
Splitting the task across two deep learning models reduces model complexity and improves training efficiency. The first model converts the signal of a low-cost sensor into the signal of a high-cost sensor (possibly at the same resolution), improving the sensor response and content quality and hence the perceived audio quality. The second model reconstructs a high-sampling-rate signal from the low-sampling-rate signal of the same sensor, i.e., a super-resolution technique that mainly raises the sampling rate. Reconstructing in two steps also makes the first model's effect observable, providing an interpretable intermediate result that helps identify which model should be improved later. It should be understood that a single model could instead reconstruct the low-cost sensor signal directly to the target resolution.
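The two-stage pipeline can be sketched as a simple composition with an inspectable intermediate. The stand-in stages below are illustrative assumptions, not the patent's trained networks: an identity map for the cross-sensor stage and linear interpolation for the super-resolution stage.

```python
import numpy as np

def upsample_2x(x):
    """Linear-interpolation 2x upsampling, a toy stand-in for the learned
    super-resolution model (a real system would use a trained network)."""
    out = np.empty(2 * len(x) - 1)
    out[0::2] = x                           # keep original samples
    out[1::2] = 0.5 * (x[:-1] + x[1:])      # interpolate midpoints
    return out

def enhance(signal, stage1, stage2):
    """Chain the two models; the intermediate result stays inspectable."""
    intermediate = stage1(signal)
    return intermediate, stage2(intermediate)

x = np.array([0.0, 1.0, 0.0, -1.0])
mid, out = enhance(x, stage1=lambda s: s * 1.0, stage2=upsample_2x)
assert len(out) == 7
assert out[1] == 0.5   # interpolated between 0.0 and 1.0
```

Exposing `intermediate` mirrors the interpretability argument above: one can check whether quality loss comes from the cross-sensor stage or the super-resolution stage.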
In one embodiment, the data set of the pre-trained deep learning model is constructed according to the following steps:
in step S51, the subject (user) utters a normal speech after using the wearable device.
In order to take various practical situations into consideration, various sounding scenes can be set, such as Chinese corpus reading and English corpus reading.
And step S52, collecting the sound signal in the auditory canal and the neck vibration signal.
Specifically, while the low-cost sensors record low-sampling-rate signals, a high-cost, high-quality microphone records simultaneously to obtain high-quality reference signals. If the two recordings are misaligned, the method of step S41 can be used for time synchronization, and the synchronized data form the training data set.
The deep learning model includes, but is not limited to, convolutional neural networks and recurrent neural networks, with the final aim of enhancing the low-sampling-rate signal of the low-cost sensor. By operating domain, the models divide into time-domain models and time-frequency-domain models.
The time-domain model operates directly in the time domain; for example, a network combining a one-dimensional convolutional neural network with a recurrent neural network can reconstruct the low-quality signal into a high-quality signal.
The time-frequency-domain model operates in the time-frequency domain: the input signal is first converted into a time-frequency representation by the short-time Fourier transform (STFT), that representation is fed into the model (for example, a two-dimensional convolutional neural network), and the model's output is converted back into a time-series signal by the inverse short-time Fourier transform (iSTFT).
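A minimal numpy sketch of the STFT/iSTFT wrapping is shown below; the model step between the two transforms is omitted, and the window and hop sizes are illustrative. With a periodic Hann window at 50% overlap, interior samples reconstruct exactly:

```python
import numpy as np

def stft(x, n=256, hop=128):
    """Windowed short-time Fourier transform (periodic Hann, 50% overlap)."""
    win = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n) / n)  # periodic Hann
    starts = np.arange(0, len(x) - n + 1, hop)
    return np.stack([np.fft.rfft(x[s:s + n] * win) for s in starts])

def istft(spec, n=256, hop=128):
    """Inverse STFT by overlap-add; the periodic Hann window sums to 1 at
    50% overlap, so interior samples are recovered exactly."""
    out = np.zeros(hop * (len(spec) - 1) + n)
    for i, frame in enumerate(spec):
        out[i * hop:i * hop + n] += np.fft.irfft(frame, n=n)
    return out

x = np.sin(2 * np.pi * 200 * np.arange(2048) / 10_000)  # 200 Hz at 10 kHz
y = istft(stft(x))   # a model would transform the spectrogram in between
# Interior samples (away from the half-window edges) match the input.
assert np.allclose(y[128:-128], x[128:len(y) - 128], atol=1e-8)
```

In the full system, the model's output spectrogram would replace `stft(x)` before the `istft` call; the round trip above verifies that the transform pair itself introduces no distortion.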
In conclusion, the invention collects the ear-canal sound signal and the neck vibration signal with low-cost sensors mounted on a wearable device and transmits them to an everyday smart device for processing and enhancement, obtaining an enhanced speech signal. Compared with the signal obtained directly from the earphone or the piezoelectric ceramic piece, the enhanced signal is markedly better in resolution (sampling rate), perceived audio quality, and other aspects, and can match or exceed a signal acquired directly with a high-cost, high-resolution sensor. Because the ceramic piece, earphone, and other components mounted on the wearable device can be inexpensive, the hardware cost of the wearable device stays low, the design suits edge devices, and low-quality signals can be restored to high quality.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, Python, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of computer-readable program instructions, which can execute the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or by combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementations in hardware, in software, and in a combination of software and hardware are equivalent.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (9)

1. A speech enhancement system based on a low-cost wearable sensor, comprising a wearable device and a smart device, wherein the wearable device comprises a micro control unit, an audio processing unit, a filter, an in-ear earphone, and a piezoelectric ceramic plate; under control of the micro control unit, the in-ear earphone and the piezoelectric ceramic plate respectively collect the sound signal inside the ear canal and the neck vibration signal while the user vocalizes, and transmit both signals to the smart device;
the smart device is configured to align the received neck vibration signal and in-ear-canal sound signal, extract a corresponding time-frequency diagram or time sequence, and input it into a pre-trained deep learning model to obtain a target-quality speech signal, wherein the resolution and perceived audibility of the target-quality speech signal are superior to those of the signals collected by the wearable device;
the smart device obtains the target-quality speech signal according to the following steps:
inputting the neck vibration signal and the in-ear-canal sound signal into a first deep learning model to obtain a first target-quality signal, wherein the first deep learning model captures the correspondence between low-cost, low-sampling-rate sensor signals and high-cost, low-sampling-rate sensor signals, thereby converting the former into the latter;
and inputting the first target-quality signal into a second deep learning model to obtain the final target-quality speech signal, wherein the second deep learning model captures the correspondence between high-cost, low-sampling-rate sensor signals and high-cost, high-sampling-rate sensor signals, thereby converting the former into the latter.
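The two-stage conversion recited in claim 1 can be sketched as a cascade of two models. The sketch below is illustrative only: it uses NumPy, and the fusion and upsampling callables are trivial hypothetical stand-ins for the trained deep learning models, not the patented networks.

```python
import numpy as np

def cascade_enhance(neck_vibration, in_ear_sound, stage1, stage2):
    """Two-stage enhancement as in claim 1."""
    # Stage 1: fuse the two low-cost, low-sampling-rate channels into an
    # approximation of a high-quality microphone signal at the same rate.
    fused = stage1(np.stack([neck_vibration, in_ear_sound]))  # (2, T) -> (T,)
    # Stage 2: convert the low-sampling-rate estimate to the high sampling rate.
    return stage2(fused)

# Trivial stand-ins for the trained deep models (hypothetical placeholders):
stage1 = lambda x: x.mean(axis=0)   # channel fusion
stage2 = lambda x: np.repeat(x, 2)  # 2x bandwidth extension

vib = np.array([0.1, 0.2, 0.3])
ear = np.array([0.3, 0.2, 0.1])
out = cascade_enhance(vib, ear, stage1, stage2)  # 6 samples, all 0.2
```

In the actual system both stages would be learned mappings trained on paired recordings from the cheap and the reference sensors; only the cascading structure is shown here.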
2. The system of claim 1, wherein the wearable device housing is provided with an activation key for starting detection of vocalization events upon a user's activation operation, the collected in-ear-canal sound signal and neck vibration signal being transmitted to the smart device when a vocalization event is detected.
3. The system of claim 1, wherein, when the wearable device is worn, the piezoelectric ceramic plate is positioned close to the user's vocal cords.
4. The system of claim 1, wherein the micro control unit is an ESP32, the audio processing unit is a VS1053 audio processing chip, and the filter is an LM358; the ESP32 transfers data over the SPI communication protocol, and the collected sound signal is converted into the WAV audio format and stored on an SD card.
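Claim 4's WAV conversion can be illustrated on the host side. The following is a minimal Python sketch of the WAV container the firmware would write to the SD card, not the ESP32 firmware itself; the file name and sample rate are hypothetical.

```python
import struct
import wave

def save_wav(path, samples, sample_rate=8000):
    """Write 16-bit mono PCM samples as a WAV file (the format claim 4 stores on SD)."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)            # mono capture
        w.setsampwidth(2)            # 16-bit PCM
        w.setframerate(sample_rate)  # assumed capture rate
        w.writeframes(struct.pack("<%dh" % len(samples), *samples))

save_wav("capture.wav", [0, 1000, -1000, 0])
```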
5. The system of claim 1, wherein the wearable device compresses the neck vibration signal and the in-ear-canal sound signal with the Huffman algorithm and transmits the compressed data to the smart device.
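Claim 5 names the Huffman algorithm for compressing the sensor streams before transmission. A self-contained sketch of building the code table and encoding a byte buffer follows; the sample payload is illustrative.

```python
import heapq
from collections import Counter

def huffman_code(data):
    """Build a Huffman code table (symbol -> bit string) for a byte sequence."""
    freq = Counter(data)
    # Heap entries: (frequency, tiebreak id, tree); a tree is a symbol or a pair.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate case: only one distinct symbol
        return {heap[0][2]: "0"}
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)  # two least-frequent subtrees
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, (t1, t2)))
        next_id += 1
    codes = {}
    def walk(tree, prefix=""):
        if isinstance(tree, tuple):      # internal node: descend both branches
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                            # leaf: record the accumulated code
            codes[tree] = prefix
    walk(heap[0][2])
    return codes

def compress(data):
    codes = huffman_code(data)
    return "".join(codes[b] for b in data), codes

samples = b"aaaabbc"                 # hypothetical sensor payload
bits, codes = compress(samples)      # most frequent byte gets the shortest code
```

For real sensor frames the wearable would transmit the packed bits plus the code table (or a pre-agreed table) so the smart device can decode.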
6. The system of claim 1, wherein the smart device aligns the received neck vibration signal and in-ear-canal sound signal according to the following steps:
converting the neck vibration signal and the in-ear-canal sound signal into energy sequences with a predetermined window size;
computing the cross-correlation of the two energy sequences to obtain a coarse-grained correlation peak and performing coarse-grained time synchronization;
taking, for both the neck vibration signal and the in-ear-canal sound signal, a segment of raw data before and after the coarse-grained alignment position, computing the cross-correlation of the two raw segments to obtain a fine-grained correlation peak, and performing fine-grained time synchronization.
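The coarse-to-fine alignment of claim 6 can be sketched as follows. The window size, search radius, and synthetic burst signal are illustrative assumptions, and `np.roll` makes the fine-step correlation circular for brevity; a production version would handle edges explicitly.

```python
import numpy as np

def energy_sequence(signal, win):
    """Non-overlapping short-time energy: the coarse-grained representation."""
    n = len(signal) // win
    return np.array([np.sum(signal[i * win:(i + 1) * win] ** 2) for i in range(n)])

def align_offset(ref, sig, win=160, search=4):
    """Return the lag (in samples) that best aligns `sig` to `ref`."""
    # Coarse step: cross-correlate the energy envelopes, resolution = one window.
    e_ref, e_sig = energy_sequence(ref, win), energy_sequence(sig, win)
    xcorr = np.correlate(e_ref, e_sig, mode="full")
    coarse = (int(np.argmax(xcorr)) - (len(e_sig) - 1)) * win
    # Fine step: cross-correlate raw samples in a window around the coarse lag.
    best_lag, best_val = coarse, -np.inf
    for lag in range(coarse - search * win, coarse + search * win + 1):
        val = float(np.dot(ref, np.roll(sig, lag)))
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag

# Synthetic check: a vibration burst delayed by 320 samples is recovered.
ref = np.zeros(1600)
ref[800:810] = 1.0
sig = np.roll(ref, 320)        # delayed copy of the reference
off = align_offset(ref, sig)   # -320: shift sig back by 320 samples to align
```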
7. The system of claim 1, wherein the smart device is a smartphone, a tablet electronic device, a desktop computer, or an in-vehicle device, and the wearable device is a neck-worn device.
8. A speech enhancement method based on a low-cost wearable sensor, comprising the following steps:
collecting the neck vibration signal and the sound signal inside the ear canal while the user vocalizes;
aligning the neck vibration signal and the in-ear-canal sound signal, extracting a corresponding time-frequency diagram or time sequence, and inputting it into a pre-trained deep learning model to obtain a target-quality speech signal, wherein the resolution and perceived audibility of the target-quality speech signal are superior to those of the signals collected by the wearable device;
wherein the target-quality speech signal is obtained according to the following steps:
inputting the neck vibration signal and the in-ear-canal sound signal into a first deep learning model to obtain a first target-quality signal, wherein the first deep learning model captures the correspondence between low-cost, low-sampling-rate sensor signals and high-cost, low-sampling-rate sensor signals, thereby converting the former into the latter;
and inputting the first target-quality signal into a second deep learning model to obtain the final target-quality speech signal, wherein the second deep learning model captures the correspondence between high-cost, low-sampling-rate sensor signals and high-cost, high-sampling-rate sensor signals, thereby converting the former into the latter.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method as claimed in claim 8.
CN202111075171.7A 2021-09-14 2021-09-14 Speech enhancement system based on low-cost wearable sensor Active CN113794963B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111075171.7A CN113794963B (en) 2021-09-14 2021-09-14 Speech enhancement system based on low-cost wearable sensor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111075171.7A CN113794963B (en) 2021-09-14 2021-09-14 Speech enhancement system based on low-cost wearable sensor

Publications (2)

Publication Number Publication Date
CN113794963A CN113794963A (en) 2021-12-14
CN113794963B true CN113794963B (en) 2022-08-05

Family

ID=78880183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111075171.7A Active CN113794963B (en) 2021-09-14 2021-09-14 Speech enhancement system based on low-cost wearable sensor

Country Status (1)

Country Link
CN (1) CN113794963B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN204190755U (en) * 2014-07-28 2015-03-04 胡健生 Neck-worn throat-microphone intercom
CN204498328U (en) * 2015-03-20 2015-07-22 捷音特科技股份有限公司 Piezoelectric-ceramic dual-frequency bass-enhancement earphone
CN105476152A (en) * 2015-11-23 2016-04-13 陈昊 Cycling helmet with throat microphone function
CN106601227A (en) * 2016-11-18 2017-04-26 北京金锐德路科技有限公司 Audio acquisition method and audio acquisition device
CN109410976A (en) * 2018-11-01 2019-03-01 北京工业大学 Speech enhancement method based on binaural sound source localization and deep learning in binaural hearing aids
CN109729448A (en) * 2017-10-27 2019-05-07 北京金锐德路科技有限公司 Voice-control optimization method and device for a neck-worn interactive voice earphone
CN110044472A (en) * 2019-03-22 2019-07-23 武汉源海博创科技有限公司 Intelligent online detection system for abnormal product sounds
CN111883161A (en) * 2020-07-08 2020-11-03 东方通信股份有限公司 Method and device for audio acquisition and position identification
CN112420063A (en) * 2019-08-21 2021-02-26 华为技术有限公司 Voice enhancement method and device

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10813559B2 (en) * 2015-06-14 2020-10-27 Facense Ltd. Detecting respiratory tract infection based on changes in coughing sounds
US9749766B2 (en) * 2015-12-27 2017-08-29 Philip Scott Lyren Switching binaural sound
US20180084341A1 (en) * 2016-09-22 2018-03-22 Intel Corporation Audio signal emulation method and apparatus
US10382092B2 (en) * 2017-11-27 2019-08-13 Verizon Patent And Licensing Inc. Method and system for full duplex enhanced audio
DK201970509A1 (en) * 2019-05-06 2021-01-15 Apple Inc Spoken notifications
US12075210B2 (en) * 2019-10-04 2024-08-27 Soundskrit Inc. Sound source localization with co-located sensor elements
US20210280322A1 (en) * 2019-10-31 2021-09-09 Facense Ltd. Wearable-based certification of a premises as contagion-safe
CN112235679B (en) * 2020-10-29 2022-10-14 北京声加科技有限公司 Signal equalization method and processor suitable for earphone and earphone

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN204190755U (en) * 2014-07-28 2015-03-04 胡健生 Neck-worn throat-microphone intercom
CN204498328U (en) * 2015-03-20 2015-07-22 捷音特科技股份有限公司 Piezoelectric-ceramic dual-frequency bass-enhancement earphone
CN105476152A (en) * 2015-11-23 2016-04-13 陈昊 Cycling helmet with throat microphone function
CN106601227A (en) * 2016-11-18 2017-04-26 北京金锐德路科技有限公司 Audio acquisition method and audio acquisition device
CN109729448A (en) * 2017-10-27 2019-05-07 北京金锐德路科技有限公司 Voice-control optimization method and device for a neck-worn interactive voice earphone
CN109410976A (en) * 2018-11-01 2019-03-01 北京工业大学 Speech enhancement method based on binaural sound source localization and deep learning in binaural hearing aids
CN110044472A (en) * 2019-03-22 2019-07-23 武汉源海博创科技有限公司 Intelligent online detection system for abnormal product sounds
CN112420063A (en) * 2019-08-21 2021-02-26 华为技术有限公司 Voice enhancement method and device
CN111883161A (en) * 2020-07-08 2020-11-03 东方通信股份有限公司 Method and device for audio acquisition and position identification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Hong Jin et al. Vocal-cord vibration speech recognition system using piezoelectric ceramics. Microcontrollers & Embedded Systems, 2020, No. 07, pp. 56-64. *

Also Published As

Publication number Publication date
CN113794963A (en) 2021-12-14

Similar Documents

Publication Publication Date Title
US10433075B2 (en) Low latency audio enhancement
US9536540B2 (en) Speech signal separation and synthesis based on auditory scene analysis and speech modeling
CN106486130B (en) Noise elimination and voice recognition method and device
CN109493877B (en) Voice enhancement method and device of hearing aid device
CN111009257B (en) Audio signal processing method, device, terminal and storage medium
US11605372B2 (en) Time-based frequency tuning of analog-to-information feature extraction
CN105338459A (en) MEMS (Micro-Electro-Mechanical System) microphone and signal processing method thereof
CN110708625A (en) Intelligent terminal-based environment sound suppression and enhancement adjustable earphone system and method
CN112992169A (en) Voice signal acquisition method and device, electronic equipment and storage medium
WO2022121182A1 (en) Voice activity detection method and apparatus, and device and computer-readable storage medium
CN110992967A (en) Voice signal processing method and device, hearing aid and storage medium
WO2017000772A1 (en) Front-end audio processing system
WO2022199405A1 (en) Voice control method and apparatus
CN112383855A (en) Bluetooth headset charging box, recording method and computer readable storage medium
CN111831116A (en) Intelligent equipment interaction method based on PPG information
Schilk et al. In-ear-voice: Towards milli-watt audio enhancement with bone-conduction microphones for in-ear sensing platforms
Sui et al. TRAMBA: A Hybrid Transformer and Mamba Architecture for Practical Audio and Bone Conduction Speech Super Resolution and Enhancement on Mobile and Wearable Platforms
CN113794963B (en) Speech enhancement system based on low-cost wearable sensor
CN112735382A (en) Audio data processing method and device, electronic equipment and readable storage medium
CN113039601B (en) Voice control method, device, chip, earphone and system
Luo et al. Audio-visual speech separation using i-vectors
CN207518801U Remote music playing device for a neck-worn interactive voice earphone
CN207518804U Telecommunication device for a neck-worn interactive voice earphone
CN112908334A (en) Hearing aid method, device and equipment based on directional pickup
CN115148177B (en) Method and device for reducing wind noise, intelligent head-mounted equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant