CN116312590A - Auxiliary communication method for intelligent glasses, equipment and medium - Google Patents

Auxiliary communication method for intelligent glasses, equipment and medium

Info

Publication number
CN116312590A
CN116312590A (application number CN202310134010.3A)
Authority
CN
China
Prior art keywords
voice
signal
voice signal
auxiliary communication
communication mode
Prior art date
Legal status
Pending
Application number
CN202310134010.3A
Other languages
Chinese (zh)
Inventor
Su Yue
Zhang Xinke
Cui Xiaoxiao
Lu Yong
Current Assignee
Beijing Intengine Technology Co Ltd
Original Assignee
Beijing Intengine Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Intengine Technology Co Ltd filed Critical Beijing Intengine Technology Co Ltd
Priority to CN202310134010.3A
Publication of CN116312590A
Legal status: Pending

Classifications

    • G02C 11/10 — Spectacles; non-optical adjuncts: electronic devices other than hearing aids
    • G10L 15/26 — Speech recognition: speech-to-text systems
    • G10L 17/02 — Speaker identification or verification: preprocessing, pattern representation or modelling, feature selection or extraction
    • G10L 21/0208 — Speech enhancement: noise filtering
    • G10L 21/0272 — Speech enhancement: voice signal separating
    • G02B 27/017 — Head-up displays: head-mounted (G02B 2027/0178 — eyeglass type)
    • Y02D 30/70 — Climate change mitigation in ICT: reducing energy consumption in wireless communication networks


Abstract

The invention discloses an auxiliary communication method for intelligent glasses, a device, and a medium, comprising the following steps: in response to an opening signal for the auxiliary communication mode of the intelligent glasses, opening the auxiliary communication mode; after the mode is opened, acquiring a voice signal, and performing enhancement processing and voice separation processing on it to obtain at least one human voice signal; performing similarity judgment on the voiceprint features of the different human voice signals to determine the correspondence between the human voice signals and the speakers; and converting the human voice signal corresponding to at least one speaker into text information and displaying each speaker's text information on the intelligent glasses. The method lets the intelligent glasses run the downstream signal processing and speech recognition modules only while the auxiliary communication mode is open, reducing energy consumption; it also suppresses environmental noise and the user's own voice, separates each human voice in a multi-person conversation scene, improves the accuracy of speech recognition, and improves the user experience.

Description

Auxiliary communication method for intelligent glasses, equipment and medium
Technical Field
The invention relates to the technical field of intelligent wearable devices, and in particular to an auxiliary communication method for intelligent glasses, a device, and a medium.
Background
A wearable device is worn directly on the body or integrated into the user's clothing or accessories as a portable device. A wearable device is not merely a piece of hardware: through software support, data interaction, and cloud interaction it realizes a variety of functions and brings great convenience to daily life. Intelligent glasses are one such wearable device. They can realize various functions through voice or gesture control, and can display a speaker's audio as text on the lenses. However, in scenes with noise interference or multi-person conversation, environmental noise and overlapping voices from multiple parties lower the recognition accuracy of the audio and degrade the user experience.
Disclosure of Invention
In view of the above, embodiments of the present invention provide an auxiliary communication method for intelligent glasses, a device, and a medium, so as to solve the problem of low speech recognition accuracy.
According to a first aspect, an embodiment of the present invention provides an auxiliary communication method for smart glasses, including:
responding to an opening signal of the auxiliary communication mode of the intelligent glasses, and opening the auxiliary communication mode of the intelligent glasses;
after the auxiliary communication mode is opened, acquiring a voice signal, and performing enhancement processing and voice separation processing on the voice signal to obtain at least one human voice signal;
similarity judgment is carried out on voiceprint characteristics of different voice signals so as to determine the corresponding relation between the voice signals and speakers;
and converting the voice signal corresponding to at least one speaker into text information, and displaying the text information corresponding to each speaker on the intelligent glasses.
With the auxiliary communication method for intelligent glasses provided above, after the auxiliary communication mode is opened, the acquired voice signal undergoes enhancement processing and voice separation processing to obtain at least one human voice signal, and similarity judgment on the voiceprint features of the different human voice signals associates each human voice signal with a speaker before the signals are converted into text. The method lets the intelligent glasses start speech recognition only when the auxiliary communication mode is opened, reducing energy consumption; it also suppresses environmental noise and the user's own voice, separates each human voice, improves speech recognition efficiency, distinguishes the individual voices, and improves the user experience.
In some embodiments, the enabling the auxiliary communication mode of the smart glasses in response to the enabling signal of the auxiliary communication mode of the smart glasses includes:
acquiring a current voice signal;
and determining an opening signal of the auxiliary communication mode of the intelligent glasses based on a preset keyword in the current voice signal so as to open the auxiliary communication mode of the intelligent glasses.
In some embodiments, the enabling the auxiliary communication mode of the smart glasses in response to the enabling signal of the auxiliary communication mode of the smart glasses includes:
and in response to an opening signal of the auxiliary opening button, opening an auxiliary communication mode of the intelligent glasses.
In some embodiments, the performing the enhancement processing and the voice separation processing on the voice signal to obtain at least one human voice signal includes:
framing, windowing and Fourier transforming the voice signal to obtain the frequency domain characteristics of the voice signal;
performing voice enhancement processing based on the frequency domain characteristics to remove noise in the voice signal and obtain a noise-reduced voice signal;
and in a preset time period, performing voice separation processing on the noise reduction voice signal to obtain at least one human voice signal.
In some embodiments, the performing a voice separation process on the noise-reduced voice signal in a preset time period to obtain at least one human voice signal includes:
storing the noise-reduced voice signal to a cache area;
and in a preset time period, performing voice separation processing on the noise reduction voice signal based on a preset voice separation model to obtain at least one human voice signal.
In some embodiments, the performing similarity determination on the voiceprint features of different voice signals to determine a correspondence between the voice signals and a speaker includes:
detecting a mute signal in the voice signal and removing the mute signal;
performing similarity judgment on the voice signal from which the mute signal is removed to identify at least one speaker corresponding to the voice signal;
and numbering the speaker to determine the corresponding relation between the voice signal and the speaker.
In some embodiments, the voice signal with the mute signal removed includes at least one channel, the number of channels being consistent with the number of speakers, and the performing the similarity judgment on the voice signal with the mute signal removed to identify at least one speaker corresponding to the voice signal includes:
extracting voiceprint characteristics of the human voice signals of all channels;
comparing the similarity of the voiceprint characteristics with the currently stored voiceprint characteristics, and determining that the voiceprint characteristics are consistent with speakers corresponding to the currently stored voiceprint characteristics when the voiceprint characteristics are judged to be consistent with the currently stored voiceprint characteristics;
and storing the voiceprint features when the voiceprint features are judged not to be stored.
According to a second aspect, an embodiment of the present invention provides smart glasses, including:
the auxiliary opening module is used for responding to an opening signal of the auxiliary communication mode of the intelligent glasses and opening the auxiliary communication mode of the intelligent glasses;
the first signal processing module is used for acquiring an initial voice signal after the auxiliary communication mode is started, and performing enhancement processing and voice separation processing on the initial voice signal to obtain at least one voice signal;
the signal identification module is used for carrying out similarity judgment on voiceprint characteristics of different voice signals so as to determine the corresponding relation between the voice signals and a speaker;
and the signal conversion module is used for converting the voice signal corresponding to at least one speaker into text information and displaying the text information corresponding to each speaker on the intelligent glasses.
According to a third aspect, an embodiment of the present invention provides an electronic device, comprising a memory and a processor communicatively connected to each other, wherein the memory stores computer instructions and the processor executes the computer instructions so as to perform the auxiliary communication method for intelligent glasses of the first aspect or of any implementation of the first aspect.
According to a fourth aspect, an embodiment of the present invention provides a computer readable storage medium, where the computer readable storage medium stores computer instructions for causing the computer to perform the method for auxiliary communication of smart glasses according to the first aspect or any implementation manner of the first aspect.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method of auxiliary communication for smart glasses according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a separable designated number of models in accordance with an embodiment of the present invention;
FIG. 3 is a block diagram of the structure of smart glasses according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
According to an embodiment of the present invention, there is provided an embodiment of a method of assisted communication for smart glasses, it being noted that the steps shown in the flowchart of the figures may be performed in a computer system such as a set of computer executable instructions, and although a logical sequence is shown in the flowchart, in some cases, the steps shown or described may be performed in a different order than what is shown or described herein.
In this embodiment, an auxiliary communication method for smart glasses is provided, which may be used for smart glasses, and fig. 1 is a flowchart of an auxiliary communication method for smart glasses according to an embodiment of the present invention, as shown in fig. 1, where the flowchart includes the following steps:
s11, in response to an opening signal of the auxiliary communication mode of the intelligent glasses, the auxiliary communication mode of the intelligent glasses is started.
The intelligent glasses are provided with a microphone and continuously receive external voice signals through it; these signals may include the voice of the glasses user, the voices of other speakers, and so on. The glasses may obtain an opening signal for the auxiliary communication mode from an external voice signal, or receive an opening signal initiated by the user. The auxiliary communication mode is opened according to the received opening signal. Depending on the source of the opening signal, the mode is opened either automatically or actively by the glasses user. To save energy in the normal state, the glasses stay in a low-power mode and open the auxiliary communication mode only after an opening signal is received, avoiding unnecessary energy consumption. If no opening signal is received within a certain time, the glasses may automatically exit the auxiliary communication mode, and the user may also close it actively. After the mode is opened, the words spoken by a speaker are displayed as text on the lenses for the glasses user's reference. The user may be a hearing-impaired person, and the speaker is typically a person talking with the user.
S12, after the auxiliary communication mode is started, a voice signal is obtained, and enhancement processing and voice separation processing are carried out on the voice signal to obtain at least one voice signal.
After the auxiliary communication mode is opened, speech enhancement is performed on the voice signal collected by the microphone. The enhanced voice signal has the glasses user's own voice and the environmental noise removed, keeping only the voices of the people talking with the user. Voice separation is then performed on the enhanced signal: because more than one person may be conversing with the user, the voiceprints in the voice signal need to be distinguished to obtain at least one human voice signal. When multiple speakers are present, the resulting human voice signal is also called a multichannel speech signal. A speaker separation model may be pre-trained and used to process the speech signal, i.e., to perform the speech separation processing.
s13, similarity judgment is carried out on voiceprint features of different voice signals so as to determine the corresponding relation between the voice signals and the speaker.
The voiceprint features of the obtained human voice signals are identified, and it is judged whether each belongs to a speaker whose voiceprint has already been stored since the auxiliary communication mode was opened. If not, the speaker's voiceprint is added to storage and the speaker is numbered, so that every human voice signal corresponds to a numbered speaker.
S14, converting the voice signal corresponding to at least one speaker into text information, and displaying the text information corresponding to each speaker on the intelligent glasses.
The human voice signal of each identified speaker is recognized and analyzed so that the voice information is converted into text information. The speech recognition result is displayed together with the speaker's number; as far as possible, the recognition results of the same speaker are displayed on one row, labelled with that speaker's number.
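As a loose illustration of this display step (not part of the disclosed method), the Python sketch below groups recognition results by speaker number and keeps each speaker's text on one labelled row; the function name and the (speaker, text) data layout are assumptions:

from collections import OrderedDict

def format_for_display(recognized):
    """recognized: list of (speaker_number, text) pairs in utterance order."""
    rows = OrderedDict()
    for speaker, text in recognized:
        rows.setdefault(speaker, []).append(text)  # same speaker -> same row
    return ["Speaker {}: {}".format(s, " ".join(parts))
            for s, parts in rows.items()]

print("\n".join(format_for_display(
    [(1, "Hello."), (2, "Hi there."), (1, "How are you?")])))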
With the auxiliary communication method for intelligent glasses provided above, after the auxiliary communication mode is opened, the acquired voice signal undergoes enhancement processing and voice separation processing to obtain at least one human voice signal, and similarity judgment on the voiceprint features of the different human voice signals associates each human voice signal with a speaker before the signals are converted into text. The method lets the intelligent glasses start speech recognition only when the auxiliary communication mode is opened, reducing energy consumption; it also suppresses environmental noise and the user's own voice, separates each human voice, improves speech recognition efficiency, distinguishes the individual voices, and improves the user experience.
In some embodiments, S11 in fig. 1 includes the steps of:
s21, acquiring a current voice signal.
S22, determining an opening signal of the auxiliary communication mode of the intelligent glasses based on a preset keyword in the current voice signal so as to open the auxiliary communication mode of the intelligent glasses.
The intelligent glasses continuously receive signals through the microphone. When the sound made by the glasses user, monitored through a bone-conduction microphone, or another external environmental sound is found to contain a preset keyword, the opening signal of the auxiliary communication mode can be determined, the mode is opened, the glasses automatically enter the noise-reduction working mode, and a status prompt for entering the auxiliary communication mode is shown on the lenses. In this embodiment the intelligent glasses enter the auxiliary communication mode automatically, without the user doing anything. Keyword-based opening from the voice signal can be realized by a voiceprint recognition module and a keyword detection module built into the glasses.
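A minimal sketch of this keyword-triggered gate, assuming a keyword-spotting model that scores each audio frame; kws_score and the 0.8 threshold are placeholders, not the patent's actual modules:

def await_opening_signal(frames, kws_score, threshold=0.8):
    """Scan incoming audio frames; return True once the preset keyword is detected.

    kws_score stands in for the built-in keyword detection module's per-frame
    score in [0, 1]; the threshold is an assumed value.
    """
    return any(kws_score(frame) >= threshold for frame in frames)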
In some embodiments, S11 in fig. 1 may further include: and in response to an opening signal of the auxiliary opening button, opening an auxiliary communication mode of the intelligent glasses.
If the user finds during a conversation that the auxiliary communication mode has not been opened, the user can also actively speak the set keyword, or press a button, to open it.
In some embodiments, S12 in fig. 1 includes the following steps:
s31, framing, windowing and Fourier transforming the voice signal to obtain the frequency domain characteristics of the voice signal.
The obtained voice signal is firstly subjected to framing, windowing and Fourier transformation to obtain the frequency domain characteristics of the voice signal.
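A minimal numpy sketch of this framing, windowing, and Fourier transform step; the frame length, hop size, FFT size, and Hamming window are assumed values that the patent does not fix:

import numpy as np

def stft_features(signal, frame_len=400, hop=160, n_fft=512):
    """Framing, Hamming windowing, and per-frame FFT of a 1-D voice signal."""
    assert len(signal) >= frame_len, "signal shorter than one frame"
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # One-sided spectrum per frame: the frequency-domain features.
    return np.fft.rfft(frames, n=n_fft, axis=1)  # shape (frames, n_fft // 2 + 1)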
S32, performing voice enhancement processing based on the frequency domain characteristics to remove noise in the voice signal and obtain a noise-reduced voice signal.
The obtained frequency-domain features, together with the voiceprint features of the glasses user, are passed through a noise reduction model for speech enhancement. The resulting enhanced, noise-reduced voice signal has the user's own voice and the environmental noise removed, keeping only the speaker's voice. For the noise reduction model, the glasses user who provides the voiceprint features is treated as an interfering speaker whose speech is to be eliminated as a noise reduction target.
When the voiceprint feature of the user is acquired, firstly, the voice of the user recorded in the quiet environment is required to be acquired, and framing, windowing, fourier transformation, mel filtering and logarithm taking are carried out on the voice to obtain the FBank feature. And then, sending the obtained FBank characteristics into a pre-trained voiceprint characteristic extractor to obtain voiceprint characteristics. Voiceprint feature extractor refers to a multi-level neural network trained using multiple speaker data sets that can map FBank features of different speakers to more differentiated feature spaces.
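Continuing the sketch above, FBank extraction adds mel filtering and a logarithm on top of the per-frame magnitude spectrum. The sketch reuses stft_features from the previous sketch and uses librosa.filters.mel only for brevity; the 40-filter count and 16 kHz sample rate are assumptions:

import numpy as np
import librosa

def fbank_features(signal, sr=16000, n_fft=512, n_mels=40):
    """Log mel-filterbank (FBank) features: |FFT| -> mel filtering -> log."""
    spectra = np.abs(stft_features(signal, n_fft=n_fft))             # (frames, bins)
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # (mels, bins)
    return np.log(spectra @ mel_fb.T + 1e-8)                         # (frames, mels)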
The personalized noise reduction model is built from a multi-level neural network, and training it requires a multi-speaker data set. Clean audio from different speakers is superimposed in the time domain, and various noises are added on top, to obtain augmented speech. The audio to be preserved during noise reduction is called the target speaker, and the audio to be removed is called the interfering speaker. A pre-trained voiceprint feature extractor produces the interfering speaker's voiceprint features; the frequency-domain features of the augmented speech are extracted and combined with those voiceprint features, and training uses the frequency-domain features of the target speaker's clean audio as the label. The resulting personalized noise reduction model learns to suppress interfering audio with the specified voiceprint features as well as various environmental noises.
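The inference side of such a personalized noise reduction model might look like the following mask-based sketch; the network is a placeholder callable, and fusing the voiceprint embedding by per-frame concatenation is an assumption, since the patent does not specify the fusion mechanism:

import numpy as np

def personalized_denoise(spectra, interferer_voiceprint, mask_model):
    """Mask-based sketch: suppress the speaker whose voiceprint is supplied.

    spectra               : complex STFT frames, shape (frames, bins)
    interferer_voiceprint : embedding of the glasses user (to be removed)
    mask_model            : placeholder callable returning a (frames, bins)
                            mask with values in [0, 1]
    """
    magnitude = np.abs(spectra)
    # Tile the voiceprint so every frame sees the same speaker embedding.
    cond = np.tile(interferer_voiceprint, (magnitude.shape[0], 1))
    features = np.concatenate([magnitude, cond], axis=1)  # assumed fusion: concat
    mask = mask_model(features)
    return spectra * mask  # apply an inverse STFT afterwards to recover audio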
S33, performing voice separation processing on the noise reduction voice signal in a preset time period to obtain at least one human voice signal.
A time period is set, and the voice separation processing starts once the number of collected voice signal frames reaches a specified count.
Specifically, S33 includes the steps of:
s331, storing the noise reduction voice signal in the buffer area.
And storing the noise reduction voice signals into a buffer area, and performing voice separation after the buffer area is full of the designated number of voice frames.
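A minimal sketch of this buffering behaviour, assuming a fixed frame budget that triggers separation when reached:

class SeparationBuffer:
    """Cache noise-reduced frames; trigger separation when the buffer is full."""

    def __init__(self, capacity, separate):
        self.capacity = capacity  # designated number of voice frames
        self.separate = separate  # e.g. the preset voice separation model
        self.frames = []

    def push(self, frame):
        self.frames.append(frame)
        if len(self.frames) >= self.capacity:
            batch, self.frames = self.frames, []
            return self.separate(batch)  # list of per-speaker human voice signals
        return None  # still filling the buffer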
S332, performing voice separation processing on the noise reduction voice signal based on a preset voice separation model in a preset time period to obtain at least one human voice signal.
More than one human voice signal, i.e., multichannel speaker speech, may be obtained. Voice separation can be performed with a blind source separation method based on conventional signal processing theory, a deep-learning speaker separation method, and so on.
In this embodiment, taking the deep-learning speaker separation method as an example, a data set containing multiple speakers is prepared first, and N different speakers are randomly drawn from it and superimposed at a random signal-to-noise ratio. N may take any value, and each value of N trains a speech separation model for that number of speakers. The speech separation model comprises multiple nonlinear layers; convolutional neural networks, recurrent neural networks, and the like can be used to build it. A model structure that can separate a designated number of speakers is shown in FIG. 2.
In a multi-person conversation scenario the actual number of speakers is unknown, so the case of an unknown speaker count must be handled. In this embodiment, speech separation models for several speaker counts are built, and decoding yields separated audio with two, three, four, and five channels respectively, to cope with separation scenes with different numbers of speakers. In practical use a voice activity detector may be applied to the output separated audio, starting from the model with the largest number of speakers: if no speech is detected in some channel's separated audio, the separation model with one fewer speaker is used, and this process repeats until every output channel contains speech. Alternatively, a classifier can be trained, or a clustering method used, to first determine the number of speakers in the current audio and then select the corresponding model to separate the different speakers' audio.
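The largest-model-first cascade described above can be sketched as follows; models (a mapping from speaker count to separation model) and has_speech (standing in for the voice activity detector) are placeholders:

def separate_unknown_count(audio, models, has_speech, max_speakers=5):
    """Try separation models from the largest speaker count downwards."""
    for n in range(max_speakers, 1, -1):
        channels = models[n](audio)  # model trained for n speakers
        if all(has_speech(ch) for ch in channels):
            return channels  # every output channel contains speech
    return [audio]  # fall back: treat as a single speaker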
Taking the speech separation model trained for two speakers as an example, audio clips of two different speakers are randomly drawn from the training set and superimposed at a designated signal-to-noise ratio to obtain mixed audio. The mixed audio is fed into the constructed speech separation model, and two-channel separated audio is obtained by model prediction. The uPIT (utterance-level Permutation Invariant Training) loss of the two-channel separated audio is then computed, using as training labels the unmixed audio from before superposition, with amplitudes adjusted according to the signal-to-noise ratio. The loss function is as follows:
$$\mathcal{L} = \min_{\pi \in \mathcal{P}_C} \frac{1}{C} \sum_{i=1}^{C} -\,\mathrm{SI\text{-}SNR}\big(s_i, \hat{s}_{\pi(i)}\big)$$

The function takes the loss of the optimal permutation $\pi$ over the $C$ different output channels, where $C$ denotes the number of speakers in the mixed audio and $\mathcal{P}_C$ is the set of all possible permutations of the $C$ output channels. SI-SNR is a scale-invariant signal-to-noise loss, defined as follows:

$$s_{\mathrm{target}} = \frac{\langle \hat{s}_i, s_i \rangle\, s_i}{\lVert s_i \rVert^2}, \qquad e_{\mathrm{noise}} = \hat{s}_i - s_{\mathrm{target}}$$

$$\mathrm{SI\text{-}SNR} = 10 \log_{10} \frac{\lVert s_{\mathrm{target}} \rVert^2}{\lVert e_{\mathrm{noise}} \rVert^2}$$

where $s_i$ denotes the label speech signal of channel $i$ and $\hat{s}_i$ denotes the predicted speech signal of channel $i$. Back-propagation training based on the computed loss yields the speech separation model.
The noise-reduced voice signal is processed with the obtained speech separation model to obtain the human voice signal of at least one channel. The speech separation models for other numbers of speakers follow the same method and are not described again here.
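A numpy rendering of the SI-SNR and uPIT loss defined above; the mean removal and the epsilon guard are standard numerical conventions added here, not taken from the patent:

import itertools
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio between an estimate and its label."""
    est, ref = est - est.mean(), ref - ref.mean()
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10 * np.log10(np.dot(s_target, s_target)
                         / (np.dot(e_noise, e_noise) + eps))

def upit_loss(est_channels, ref_channels):
    """Negative mean SI-SNR under the best channel permutation (uPIT)."""
    count = len(ref_channels)
    best = max(
        np.mean([si_snr(est_channels[p], ref_channels[i])
                 for i, p in enumerate(perm)])
        for perm in itertools.permutations(range(count))
    )
    return -best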
In some embodiments, S13 in fig. 1 includes the following steps:
s41, detecting a mute signal in the voice signal and removing the mute signal.
And removing a mute channel in the voice signal through the voice activity detection module.
S42, similarity judgment is carried out on the voice signal from which the mute signal is removed, so that at least one speaker corresponding to the voice signal is identified.
S43, numbering the speakers to determine the corresponding relation between the voice signals and the speakers.
Wherein, the voice signal from which the mute signal is removed includes at least one channel, the number of channels is consistent with the number of speakers, and S42 specifically includes:
s51, extracting voiceprint features of the human voice signals of all channels.
S52, comparing the similarity of the voiceprint characteristics with the currently stored voiceprint characteristics, and determining that the voiceprint characteristics are consistent with the speaker corresponding to the currently stored voiceprint characteristics when the voiceprint characteristics are judged to be consistent with the currently stored voiceprint characteristics.
And S53, storing the voiceprint features when the voiceprint features are judged not to be stored.
In a multi-person conversation scenario the speaking order and duration of the different speakers are not fixed, and in the multichannel voice signal obtained by separation, a given channel does not always belong to the same speaker. Consequently the speech recognition result obtained from each channel does not necessarily belong to one speaker, and if the recognition results were displayed on the lenses merely by channel number, the wearer could not tell which speaker each piece of speech belongs to, nor connect the speech segments that belong to the same speaker.
In this embodiment, the voiceprint features of the remaining channels are extracted with the voiceprint feature extractor and stored, and the voiceprint features of different speakers are numbered as speaker 1, speaker 2, and so on. For each channel of the voice signal with the mute signal removed, the speaker's voiceprint features are extracted and compared for similarity with the stored voiceprint features; cosine similarity may be used for this judgment. If a stored voiceprint feature meets the specified judgment threshold, the signals are considered to belong to the same speaker: the separated voice segment is numbered with that speaker's number, and the new voiceprint feature is either discarded or fused with the speaker's stored feature to replace it in storage. If no stored feature meets the threshold, the speaker of this voice segment is considered not to have appeared before; the speaker's voiceprint feature is stored and numbered as speaker k+1, where k is the largest speaker number at the previous moment. When the multi-person conversation ends and before the glasses close the auxiliary communication mode, the cached voiceprint features and speaker numbers are cleared.
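This bookkeeping can be sketched as a small registry keyed by speaker number, using the cosine similarity mentioned above; the 0.7 judgment threshold is an assumed value, and the simple store-on-miss policy stands in for the discard-or-fuse options described:

import numpy as np

class SpeakerRegistry:
    """Cosine-similarity voiceprint matching with k+1 numbering of new speakers."""

    def __init__(self, threshold=0.7):  # assumed judgment threshold
        self.threshold = threshold
        self.voiceprints = {}  # speaker number -> stored unit-norm voiceprint

    def identify(self, voiceprint):
        v = voiceprint / (np.linalg.norm(voiceprint) + 1e-8)
        for number, stored in self.voiceprints.items():
            if float(np.dot(v, stored)) >= self.threshold:  # cosine similarity
                return number  # existing speaker
        new_number = max(self.voiceprints, default=0) + 1  # numbered as k + 1
        self.voiceprints[new_number] = v
        return new_number

    def clear(self):
        """Empty cached voiceprints when the auxiliary mode is closed."""
        self.voiceprints.clear()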
Each segment of recognized audio is classified according to its speaker number; the voices belonging to the same speaker are kept in one channel and output to the recognition network.
In this embodiment, an intelligent glasses is further provided, and the intelligent glasses are used to implement the foregoing embodiments and implementations, and are not described in detail. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
The embodiment provides an intelligent glasses, as shown in fig. 3, including:
the auxiliary opening module is used for responding to an opening signal of the auxiliary communication mode of the intelligent glasses and opening the auxiliary communication mode of the intelligent glasses;
the first signal processing module is used for acquiring an initial voice signal after the auxiliary communication mode is started, and performing enhancement processing and voice separation processing on the initial voice signal to obtain at least one voice signal;
the signal identification module is used for carrying out similarity judgment on voiceprint characteristics of different voice signals so as to determine the corresponding relation between the voice signals and a speaker;
and the signal conversion module is used for converting the voice signal corresponding to at least one speaker into text information and displaying the text information corresponding to each speaker on the intelligent glasses.
In some embodiments, the auxiliary opening module includes:
the signal acquisition unit is used for acquiring a current voice signal;
an automatic opening unit, configured to determine an opening signal of the auxiliary communication mode of the smart glasses based on a preset keyword in the current voice signal, so as to open the auxiliary communication mode of the smart glasses.
In some embodiments, the auxiliary opening module includes:
and the passive opening unit is used for responding to the opening signal of the auxiliary opening button and opening the auxiliary communication mode of the intelligent glasses.
In some embodiments, the first signal processing module comprises:
the frequency domain conversion unit is used for framing, windowing and Fourier transforming the voice signal to obtain the frequency domain characteristics of the voice signal;
the voice enhancement unit is used for performing voice enhancement processing based on the frequency domain characteristics so as to remove noise in the voice signal and obtain a noise-reduction voice signal;
and the voice separation unit is used for carrying out voice separation processing on the noise reduction voice signal in a preset time period to obtain at least one human voice signal.
In some embodiments, the speech separation unit comprises:
a signal storage subunit, configured to store the noise-reduced speech signal into a buffer area;
and the voice separation subunit is used for carrying out voice separation processing on the noise reduction voice signal based on a preset voice separation model in a preset time period to obtain at least one human voice signal.
In some embodiments, the signal identification module comprises:
the signal removing unit is used for detecting a mute signal in the voice signal and removing the mute signal;
the signal detection unit is used for carrying out similarity judgment on the voice signal from which the mute signal is removed so as to identify at least one speaker corresponding to the voice signal;
and the voice numbering unit is used for numbering the speaker so as to determine the corresponding relation between the voice signal and the speaker.
In some embodiments, the silence signal-removed voice signal includes at least one channel, the channel corresponding to the number of speakers, and the signal detection unit includes:
the characteristic extraction subunit is used for extracting voiceprint characteristics of the human voice signals of all the channels;
the similarity comparison subunit is used for comparing the similarity of the voiceprint feature with the currently stored voiceprint feature, and determining that the voiceprint feature is consistent with a speaker corresponding to the currently stored voiceprint feature when judging that the voiceprint feature is consistent with the currently stored voiceprint feature;
and the feature storage subunit is used for storing the voiceprint features when judging that the voiceprint features are not stored.
The smart glasses in this embodiment are presented in the form of functional units, where the units refer to ASIC circuits, processors and memories executing one or more software or fixed programs, and/or other devices that can provide the above-described functionality.
Further functional descriptions of the above respective modules are the same as those of the above corresponding embodiments, and are not repeated here.
The embodiment of the invention also provides electronic equipment, which is provided with the intelligent glasses.
Referring to fig. 4, fig. 4 is a schematic structural diagram of an electronic device according to an alternative embodiment of the present invention. As shown in fig. 4, the electronic device may include: at least one processor 601, such as a CPU (central processing unit); at least one communication interface 603; a memory 604; and at least one communication bus 602. The communication bus 602 is used to enable connected communication between these components. The communication interface 603 may include a display and a keyboard, and optionally may further include a standard wired interface and a wireless interface. The memory 604 may be a high-speed volatile random-access memory (RAM) or a non-volatile memory, such as at least one disk memory. The memory 604 may also optionally be at least one storage device located remotely from the processor 601. The processor 601 may incorporate the smart glasses described above; an application program is stored in the memory 604, and the processor 601 invokes the program code stored in the memory 604 to perform any of the method steps described above.
The communication bus 602 may be a peripheral component interconnect standard (peripheral component interconnect, PCI) bus or an extended industry standard architecture (extended industry standard architecture, EISA) bus, among others. The communication bus 602 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 4, but not only one bus or one type of bus.
The memory 604 may include volatile memory, for example random-access memory (RAM); it may also include non-volatile memory, for example flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 604 may also include a combination of the above types of memory.
The processor 601 may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP.
The processor 601 may further comprise a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
Optionally, the memory 604 is also used for storing program instructions. The processor 601 may invoke program instructions to implement the auxiliary communication method of the smart glasses as shown in the embodiments of the present application.
The embodiment of the invention also provides a non-transitory computer storage medium, which stores computer executable instructions, and the computer executable instructions can execute the auxiliary communication method of the intelligent glasses in any method embodiment. Wherein the storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a Flash Memory (Flash Memory), a Hard Disk (HDD), or a Solid State Drive (SSD); the storage medium may also comprise a combination of memories of the kind described above.
Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.

Claims (10)

1. An auxiliary communication method for intelligent glasses is characterized by comprising the following steps:
responding to an opening signal of the auxiliary communication mode of the intelligent glasses, and opening the auxiliary communication mode of the intelligent glasses;
after the auxiliary communication mode is started, a voice signal is obtained, and enhancement processing and voice separation processing are carried out on the voice signal to obtain at least one voice signal;
similarity judgment is carried out on voiceprint characteristics of different voice signals so as to determine the corresponding relation between the voice signals and speakers;
and converting the voice signal corresponding to at least one speaker into text information, and displaying the text information corresponding to each speaker on the intelligent glasses.
2. The method of claim 1, wherein the enabling the auxiliary communication mode of the smart glasses in response to the enabling signal of the auxiliary communication mode of the smart glasses comprises:
acquiring a current voice signal;
and determining an opening signal of the auxiliary communication mode of the intelligent glasses based on a preset keyword in the current voice signal so as to open the auxiliary communication mode of the intelligent glasses.
3. The method of claim 1, wherein the enabling the auxiliary communication mode of the smart glasses in response to the enabling signal of the auxiliary communication mode of the smart glasses comprises:
and in response to an opening signal of the auxiliary opening button, opening an auxiliary communication mode of the intelligent glasses.
4. The method of claim 1, wherein the performing the enhancement processing and the speech separation processing on the speech signal to obtain at least one human voice signal comprises:
framing, windowing and Fourier transforming the voice signal to obtain the frequency domain characteristics of the voice signal;
performing voice enhancement processing based on the frequency domain characteristics to remove noise in the voice signal and obtain a noise-reduced voice signal;
and in a preset time period, performing voice separation processing on the noise reduction voice signal to obtain at least one human voice signal.
5. The method of claim 4, wherein the performing a speech separation process on the noise-reduced speech signal during a predetermined period of time to obtain at least one human voice signal comprises:
storing the noise-reduced voice signal to a cache area;
and in a preset time period, performing voice separation processing on the noise reduction voice signal based on a preset voice separation model to obtain at least one human voice signal.
6. The method of claim 1, wherein the performing similarity decisions on voiceprint features of different voice signals to determine correspondence between the voice signals and speakers comprises:
detecting a mute signal in the voice signal and removing the mute signal;
performing similarity judgment on the voice signal from which the mute signal is removed to identify at least one speaker corresponding to the voice signal;
and numbering the speaker to determine the corresponding relation between the voice signal and the speaker.
7. The method of claim 6, wherein the voice signal from which the mute signal is removed includes at least one channel, the channel corresponding to a number of speakers, wherein the performing a similarity decision on the voice signal from which the mute signal is removed to identify at least one speaker to which the voice signal corresponds comprises:
extracting voiceprint characteristics of the human voice signals of all channels;
comparing the similarity of the voiceprint characteristics with the currently stored voiceprint characteristics, and determining that the voiceprint characteristics are consistent with speakers corresponding to the currently stored voiceprint characteristics when the voiceprint characteristics are judged to be consistent with the currently stored voiceprint characteristics;
and storing the voiceprint features when the voiceprint features are judged not to be stored.
8. An intelligent eyeglass, comprising:
the auxiliary opening module is used for responding to an opening signal of the auxiliary communication mode of the intelligent glasses and opening the auxiliary communication mode of the intelligent glasses;
the first signal processing module is used for acquiring an initial voice signal after the auxiliary communication mode is started, and performing enhancement processing and voice separation processing on the initial voice signal to obtain at least one voice signal;
the signal identification module is used for carrying out similarity judgment on voiceprint characteristics of different voice signals so as to determine the corresponding relation between the voice signals and a speaker;
and the signal conversion module is used for converting the voice signal corresponding to at least one speaker into text information and displaying the text information corresponding to each speaker on the intelligent glasses.
9. An electronic device, comprising:
a memory and a processor, the memory and the processor being communicatively connected to each other, the memory having stored therein computer instructions, the processor executing the computer instructions to perform the method of assisted communication for smart glasses according to any one of claims 1-7.
10. A computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions for causing a computer to perform the method of assisted communication of smart glasses according to any one of claims 1-7.
CN202310134010.3A (filed 2023-02-07; priority 2023-02-07) — Auxiliary communication method for intelligent glasses, equipment and medium — CN116312590A (pending)

Priority Applications (1)

CN202310134010.3A (priority date 2023-02-07, filing date 2023-02-07) — Auxiliary communication method for intelligent glasses, equipment and medium

Applications Claiming Priority (1)

CN202310134010.3A (priority date 2023-02-07, filing date 2023-02-07) — Auxiliary communication method for intelligent glasses, equipment and medium

Publications (1)

CN116312590A — published 2023-06-23

Family

ID=86789849

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310134010.3A Pending CN116312590A (en) 2023-02-07 2023-02-07 Auxiliary communication method for intelligent glasses, equipment and medium

Country Status (1)

Country Link
CN (1) CN116312590A (en)


Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination