CN112885345A - Special garment voice interaction system and method - Google Patents
- Publication number
- CN112885345A (application number CN202110040219.4A)
- Authority
- CN
- China
- Prior art keywords
- voice
- sound source
- external sound
- signal
- source signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B64—AIRCRAFT; AVIATION; COSMONAUTICS
- B64G—COSMONAUTICS; VEHICLES OR EQUIPMENT THEREFOR
- B64G6/00—Space suits
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
- G10L15/144—Training of HMMs
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R1/00—Details of transducers, loudspeakers or microphones
- H04R1/20—Arrangements for obtaining desired frequency or directional characteristics
- H04R1/32—Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only
- H04R1/323—Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only for loudspeakers
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R1/00—Details of transducers, loudspeakers or microphones
- H04R1/20—Arrangements for obtaining desired frequency or directional characteristics
- H04R1/32—Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only
- H04R1/326—Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only for microphones
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R5/00—Stereophonic arrangements
- H04R5/02—Spatial or constructional arrangements of loudspeakers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0631—Creating reference templates; Clustering
- G10L2015/0633—Creating reference templates; Clustering using lexical or orthographic knowledge sources
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
Abstract
The invention discloses a voice interaction system for a special garment, comprising: a microphone array, arranged inside the special garment, for acquiring voice signals; a voice recognition module for recognizing the voice signal and outputting a recognition result; a voice alarm module for acquiring an external sound source signal and positioning information for that signal; a control module for outputting a control signal according to the recognition result so as to control equipment operation, and for outputting a first warning voice according to the external sound source signal, or a second warning voice according to the external sound source signal together with its positioning information; a voice synthesis module for receiving the control signal and outputting synthesized voice; and a stereo broadcasting device, arranged inside the special garment, for receiving and playing the first warning voice, the second warning voice, or the synthesized voice. The system effectively suppresses the noise generated in the special environment of an extravehicular suit and accurately recognizes voice commands issued by the operator.
Description
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a voice interaction system and method for special garments.
Background
Special garments are protective clothing designed for special operations. To achieve protection, an operator wearing such a garment cannot flexibly move the joints to operate equipment, has a narrowed field of view, and cannot efficiently survey surrounding objects; in some special environments the garment is fully sealed, which increases these restrictions further. Taking the extravehicular space suit (hereinafter the "extravehicular suit") as an example, its main function is to support a wide range of extravehicular activity tasks, from simple to complex, such as extravehicular scientific experiments, payload maintenance, and space station assembly and maintenance.
Combining intelligent voice interaction technology with the extravehicular suit system improves the suit's human-machine efficiency and is the current development trend. However, the human-machine interaction scenario of the extravehicular suit differs greatly from the conventional one, mainly in the following respects:
1. The sound-transmission environment is special: the extravehicular suit is a closed cavity formed of flexible materials and aluminium, the air pressure and humidity inside the cavity differ markedly from the ground environment, and sound is reflected and absorbed to varying degrees as it propagates in the cavity.
2. Noise is abundant: the noise produced by the fan, pump, and other equipment carried by the extravehicular suit severely degrades the quality of communication speech.
3. Sound source localization is needed: during extravehicular work, the wearer must localize sound sources to improve work efficiency.
In view of the above problems, it is therefore necessary to propose a solution that addresses at least one of them.
Disclosure of Invention
The object of the invention is to provide a voice interaction system and method for special garments that overcome the defects of the prior art.
In order to solve the technical problems, the technical scheme of the invention is as follows:
a special garment voice interaction system is characterized by comprising:
the microphone array is arranged in the special garment and used for acquiring voice signals;
the voice recognition module is used for recognizing the voice signal and outputting a recognition result;
the voice alarm module is used for acquiring an external sound source signal and positioning information of the external sound source signal;
the control module is used for outputting a control signal according to the recognition result so as to control equipment operation, and outputting a first alarm voice according to the external sound source signal or outputting a second alarm voice according to the external sound source signal and the positioning information of the external sound source signal;
the voice synthesis module is used for receiving the control signal and outputting synthesized voice;
and the stereo broadcasting device is arranged in the special garment and is used for receiving and playing the first warning voice, the second warning voice, or the synthesized voice.
In a preferred embodiment of the present invention, the speech recognition module comprises:
the acoustic model training unit is used for building a speech model from a corpus using joint deep neural network and HMM (hidden Markov model) modeling combined with discriminative training;
the signal processing unit is used for extracting voice characteristic parameters from the received voice signals;
and the voice recognition unit is used for matching the voice characteristic parameters of the voice signals with the voice model, the language model and the dictionary and outputting a recognition result.
In a preferred embodiment of the present invention, the speech synthesis module comprises:
the modeling unit is used for establishing a speech synthesis model according to the sound library and based on HMM training;
the text analysis unit is used for extracting context-dependent HMM sequence decision information according to the recognition result and the speech synthesis model and generating prosodic parameters;
and the voice synthesis unit is used for generating synthesized speech from the HTS + STRAIGHT algorithm and the prosodic parameters.
In a preferred embodiment of the present invention, the voice alarm module includes:
a sound source acquisition unit for acquiring an external sound source signal;
an azimuth selecting unit for acquiring azimuth information of the external sound source according to the external sound source signal;
a distance detection unit for acquiring distance information of the external sound source according to the external sound source signal;
the convolution calculation unit is used for processing the azimuth information and the distance information in a segmented mode and generating a positioning signal;
and the voice alarm generating unit is used for generating a first alarm voice according to the external sound source signal and generating a second alarm voice according to the external sound source signal and the positioning signal.
The invention also provides a special garment voice interaction method, which comprises the following steps:
S1, acquiring a voice signal;
s2 recognizing the voice signal and outputting a recognition result, the recognition result including at least text information;
s3, acquiring an external sound source signal and positioning information of the external sound source signal;
s4 outputting a control signal according to the recognition result to control the operation of the device, and outputting a first warning voice according to the external sound source signal or outputting a second warning voice according to the external sound source signal and the location information of the external sound source signal;
s5 receiving the control signal and outputting a synthesized voice;
s6 plays the first warning voice, the second warning voice or the synthesized voice.
In a preferred embodiment of the present invention, the recognizing the voice signal and outputting the recognition result in step S2 includes:
A speech model is built from a corpus using joint deep neural network and HMM (hidden Markov model) modeling combined with discriminative training, and the voice signal is matched against the speech model to output a recognition result.
In a preferred embodiment of the present invention, the step S3 of acquiring the external sound source signal and the positioning information of the external sound source signal includes:
HRTF (head-related transfer function) technology is adopted: based on the spatial localization capability of the human auditory system, the sound source is localized in both direction and distance, and segmented processing is used in the convolution calculation stage.
In a preferred embodiment of the present invention, the step S5 of receiving the control signal and outputting a synthesized voice includes:
The synthesized voice is generated from the HTS + STRAIGHT algorithm and the prosodic parameters of the voice signal.
In a preferred embodiment of the present invention, the step S1 of acquiring the voice signal includes:
A microphone array is used to acquire the voice signals of the person inside the special garment.
In a preferred embodiment of the present invention, the step S6 playing the first warning voice, the second warning voice or the synthesized voice includes:
The first warning voice, the second warning voice, or the synthesized voice is played by a stereo broadcasting device arranged in the special garment.
Compared with the prior art, the invention has the beneficial effects that:
(1) The invention is suitable for speech recognition in special environments: by using a microphone array to effectively suppress environmental noise, the recognition rate reaches over 98% whether or not the fan or air-conditioning valve in the special garment is running.
(2) The invention conveys azimuth through sound and is particularly suited to the space environment: under complex motion conditions in which several spacecraft revolve and rotate simultaneously, or when astronauts perform extravehicular operations, the azimuth of the target sound source is rendered so that the person inside the suit can subjectively perceive the relative position of the target or of fellow operators, improving work efficiency.
Drawings
To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a schematic diagram of a speech recognition module according to the present invention;
FIG. 3 is a schematic diagram of a speech synthesis module according to the present invention;
FIG. 4 is a diagram of a voice alarm module according to the present invention.
Reference numerals: 100, microphone array; 200, voice recognition module; 300, voice alarm module; 400, control module; 500, speech synthesis module; 600, stereo broadcasting device.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.
In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc. indicate orientations or positional relationships based on those shown in the drawings, and are used merely for convenience in describing the present invention and for simplicity in description, and do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and therefore, should not be taken as limiting the scope of the present invention. Furthermore, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first," "second," etc. may explicitly or implicitly include one or more of that feature. In the description of the invention, the meaning of "a plurality" is two or more unless otherwise specified.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art through specific situations.
As shown in fig. 1, a special garment voice interaction system includes a microphone array 100, a voice recognition module 200, a voice alarm module 300, a control module 400, a voice synthesis module 500, and a stereo broadcasting device 600.
Specifically, the microphone array 100 is disposed in a special garment and is used to acquire a voice signal. The speech recognition module 200 is used for recognizing the speech signal and outputting a recognition result. The voice warning module 300 is used for acquiring an external sound source signal and positioning information of the external sound source signal. The control module 400 is configured to output a control signal according to the recognition result to control the operation of the device, and output a first warning voice according to the external sound source signal, or output a second warning voice according to the external sound source signal and the positioning information of the external sound source signal. The speech synthesis module 500 is configured to receive the control signal and output a synthesized speech. The stereo broadcasting device 600 is disposed in the special garment and is configured to receive and play the first warning voice, the second warning voice or the synthesized voice.
As shown in fig. 1, communication voice a2 from far-end voice communication is sent by the control module 400 to the speech synthesis module 500 for communication voice broadcast a3. The microphone array 100 picks up voice b1, which the voice recognition module 200 denoises and recognizes; the denoised voice b21 is passed to the control module 400 for transmission, and the recognition result b22 is reported to the control module 400. Recognition can be enabled either by voice wake-up or by an issued command b01. The recognition result may or may not be sent to the speech synthesis module 500 for broadcast. The control module 400 issues a speech synthesis instruction c2 to the speech synthesis module 500, which broadcasts the synthesized voice c3 and reports back to the upper-layer application whether synthesis succeeded. The control module 400 issues a warning instruction d21, or sends audio d22 containing azimuth information, to the voice alarm module 300: the sound source acquisition unit in the voice alarm module 300 first sends the acquired external sound source signal to the control module 400; the control module 400 decides whether to output the first or the second warning voice and returns the decision together with the external sound source signal to the voice alarm module 300; the voice alarm module 300 then either plays fixed-azimuth audio (the first warning voice) as instructed, or analyzes the azimuth data and audio and synthesizes and broadcasts audio with a sense of direction (the second warning voice) in real time.
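The first-warning/second-warning decision attributed to the control module 400 above can be sketched as follows. The class and method names are hypothetical illustrations; the patent does not disclose the control module at code level, only the routing rule that a source signal alone yields the first warning voice while a source signal plus localization yields the second.

```python
class ControlModule:
    """Sketch of the routing rules described for the control module:
    a recognition result yields a device-control signal; an external
    sound source alone yields the first warning voice; a source plus
    localization information yields the second (spatialized) warning."""

    def route_alarm(self, source_signal, localization=None):
        # first warning: source signal only; second: signal + localization
        if localization is None:
            return ("first_warning", source_signal)
        return ("second_warning", source_signal, localization)

    def route_recognition(self, recognition_result):
        # map recognized text to a device-control signal (the concrete
        # mapping is application-specific and not disclosed in the patent)
        return {"command": recognition_result}
```

The point of the sketch is only the branch on whether localization data accompanies the source signal, matching the two warning-voice paths in the flow description.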
By adopting the microphone array 100 and the in-garment stereo broadcasting device 600, the system keeps the pickup and playback devices out of contact with the wearer, avoiding the discomfort of collision, friction, and stuffiness.
The system combines microphone array 100 technology, voice recognition, voice synthesis, and virtual 3D sound warning technology. It effectively suppresses the noise generated in the special environment of the extravehicular suit, accurately recognizes voice instructions issued by the operator, gives the synthesized voice directional information, spares the operator frequent joint movement, reduces workload, and greatly improves human-machine cooperation efficiency. Specifically, the system takes the microphone array 100 as input, converts the input into an instruction through the voice recognition module 200, and reports the instruction to the control module 400. On receiving the recognized control instruction, the control module 400 can operate equipment inside the extravehicular suit (for example, display adjustment and queries); it also issues warning and synthesis instructions, and the three output modes (speech synthesis, ordinary warning, 3D voice warning) can be selected and broadcast in different forms through the stereo broadcasting device 600.
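The patent does not disclose which array-processing algorithm the microphone array 100 uses for noise suppression; delay-and-sum beamforming is a common baseline for this task, and the minimal numpy sketch below (all signals synthetic, delays assumed known) illustrates why summing time-aligned channels attenuates uncorrelated noise while preserving the coherent speech:

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Align each microphone channel by its (known) integer sample delay
    and average: coherent speech adds in phase, while independent noise
    averages down by roughly sqrt(number of microphones)."""
    n = min(len(c) - d for c, d in zip(channels, delays))
    return np.mean([c[d:d + n] for c, d in zip(channels, delays)], axis=0)

# Toy demo: one tone reaches 4 mics with different delays, each channel
# corrupted by independent noise.
rng = np.random.default_rng(0)
t = np.arange(0, 0.1, 1 / 16000)
clean = np.sin(2 * np.pi * 440 * t)
delays = [0, 3, 5, 8]
mics = [np.concatenate([np.zeros(d), clean])
        + 0.3 * rng.standard_normal(len(clean) + d) for d in delays]
out = delay_and_sum(mics, delays)
err_beam = np.std(out - clean)        # residual noise after beamforming
err_single = np.std(mics[0] - clean)  # residual noise of one raw channel
```

In practice the per-channel delays would themselves be estimated (for example by cross-correlation), and production systems typically use adaptive beamformers rather than this fixed version.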
As shown in fig. 2, the speech recognition module 200 includes an acoustic model training unit, a signal processing unit, and a speech recognition unit. Specifically, the acoustic model training unit builds a speech model from a corpus using joint deep neural network and HMM modeling combined with discriminative training. The signal processing unit extracts voice characteristic parameters from the received voice signal. The speech recognition unit matches the voice characteristic parameters of the voice signal against the speech model, the language model, and the dictionary, and outputs a recognition result.
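The patent does not specify which "voice characteristic parameters" the signal processing unit extracts; MFCCs are the conventional choice for HMM-based recognizers, and a self-contained numpy sketch of the standard pipeline (framing, Hamming window, power spectrum, mel filterbank, log, DCT) might look as follows. Frame sizes and filter counts are common defaults, not values from the patent:

```python
import numpy as np

def mfcc(signal, fs=16000, frame_len=400, hop=160, n_mels=26, n_ceps=13):
    """Minimal MFCC extractor; returns an (n_frames, n_ceps) array."""
    # slice the signal into overlapping windowed frames
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # power spectrum of each frame
    nfft = 512
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft
    # triangular mel filterbank
    def hz2mel(f): return 2595 * np.log10(1 + f / 700)
    def mel2hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz2mel(0), hz2mel(fs / 2), n_mels + 2)
    bins = np.floor((nfft + 1) * mel2hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_mels, nfft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    feat = np.log(power @ fbank.T + 1e-10)   # log mel energies
    # DCT-II decorrelates the log energies into cepstral coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return feat @ dct.T

sig = np.sin(2 * np.pi * 300 * np.arange(16000) / 16000)
feats = mfcc(sig)
```

A deployed front end would add delta features and cepstral mean normalization; the sketch shows only the core transform.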
As shown in fig. 3, the speech synthesis module 500 includes a modeling unit, a text analysis unit, and a speech synthesis unit. Specifically, the modeling unit builds a speech synthesis model from a sound library based on HMM training. The text analysis unit extracts context-dependent HMM sequence decision information from the recognition result and the speech synthesis model and generates prosodic parameters. The speech synthesis unit generates synthesized speech from the HTS + STRAIGHT algorithm and the prosodic parameters.
As shown in fig. 4, the voice alarm module 300 includes a sound source acquisition unit, an azimuth selection unit, a distance detection unit, a convolution calculation unit, and a voice alarm generating unit. Specifically, the sound source acquisition unit acquires an external sound source signal, i.e. the target. The azimuth selection unit obtains azimuth information of the external sound source from the external sound source signal. The distance detection unit obtains distance information of the external sound source from the external sound source signal. The convolution calculation unit processes the azimuth and distance information in segments and generates a positioning signal. The voice alarm generating unit generates a first warning voice from the external sound source signal, and a second warning voice from the external sound source signal and the positioning signal.
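As a rough illustration of how azimuth information can be turned into audio with a sense of direction, the sketch below renders a mono warning to stereo from azimuth alone, using an interaural time difference (Woodworth approximation) plus a simple level difference. A real implementation would convolve with measured HRTFs, as the convolution calculation unit suggests; the head radius and the -6 dB level-difference cap here are assumptions, not values from the patent:

```python
import numpy as np

def spatialize(mono, azimuth_deg, fs=16000, head_radius=0.0875, c=343.0):
    """Very rough binaural rendering from azimuth alone: delay and
    attenuate the far-ear channel. 0 deg = front, +90 deg = right."""
    az = np.radians(azimuth_deg)
    itd = head_radius / c * (az + np.sin(az))   # Woodworth ITD, seconds
    lag = int(round(abs(itd) * fs))
    # crude interaural level difference: up to -6 dB at 90 degrees
    gain_far = 10 ** (-abs(azimuth_deg) / 90 * 6 / 20)
    near = mono
    far = np.concatenate([np.zeros(lag), mono])[:len(mono)] * gain_far
    if azimuth_deg >= 0:           # source on the right: right ear is near
        left, right = far, near
    else:
        left, right = near, far
    return np.stack([left, right], axis=1)

imp = np.zeros(256)
imp[0] = 1.0
st = spatialize(imp, 90.0)   # impulse arriving from the right
```

For a source at +90 degrees the right channel leads and the left channel arrives about 10 samples later and quieter, which is the cue pattern the second warning voice relies on.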
A special garment voice interaction method comprises the following steps:
s1 acquires a voice signal.
Preferably, a microphone array 100 is used to acquire the speech signal of the person inside the specialty garment.
S2 recognizes the speech signal and outputs a recognition result, the recognition result including at least text information.
Preferably, a speech model is built from the corpus using joint deep neural network and HMM modeling combined with discriminative training, and the voice signal is matched against the speech model to output a recognition result. The corpus consists of a large amount of general corpora plus scenario-simulation corpora collected for the usage environment. The speech recognition module 200 thus adapts to the specific acoustic environment while still covering linguistic phenomena in the broad-spectrum sense.
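The HMM half of a joint DNN-HMM model is typically decoded with the Viterbi algorithm; the minimal numpy sketch below recovers the most likely state sequence from frame-level state log-likelihoods (which, in a hybrid system, would come from the neural network). The 3-state toy model and its scores are purely illustrative:

```python
import numpy as np

def viterbi(log_emit, log_trans, log_init):
    """Most likely HMM state sequence given per-frame state log-likelihoods.
    log_emit: (T, S); log_trans: (S, S); log_init: (S,)."""
    T, S = log_emit.shape
    delta = log_init + log_emit[0]          # best score ending in each state
    back = np.zeros((T, S), dtype=int)      # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans            # (prev, next)
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + log_emit[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):           # trace backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# 3-state left-to-right HMM; emissions strongly favor 0,0,1,1,2,2.
log_trans = np.log(np.array([[0.5, 0.5, 0.0],
                             [0.0, 0.5, 0.5],
                             [0.0, 0.0, 1.0]]) + 1e-300)
log_init = np.log(np.array([1.0, 1e-300, 1e-300]))
true_states = [0, 0, 1, 1, 2, 2]
log_emit = np.full((6, 3), -10.0)
log_emit[np.arange(6), true_states] = 0.0
decoded = viterbi(log_emit, log_trans, log_init)
```

A full recognizer decodes over a composed graph of phone HMMs, lexicon, and language model rather than a single small HMM, but the dynamic-programming recursion is the same.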
S3 acquires the external sound source signal and the localization information of the external sound source signal.
Preferably, HRTF technology is adopted: based on the spatial localization capability of the human auditory system, the sound source is localized in both direction and distance, and the auditory rendering supports localization for a single sound source as well as for multiple sources. Segmented processing is used in the convolution calculation stage to reduce the computational load.
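The segmented convolution mentioned here can be illustrated with overlap-add block convolution: the long input is filtered block by block against the impulse response, producing exactly the result of one full linear convolution while bounding the per-block FFT size (and hence per-update cost and latency). The block size and the random stand-in for a measured HRIR are arbitrary:

```python
import numpy as np

def overlap_add_conv(x, h, block=256):
    """Segmented (overlap-add) convolution of input x with filter h.
    Each block needs only an FFT of about block+len(h) points instead of
    one transform over the whole signal."""
    n = block + len(h) - 1
    nfft = 1 << (n - 1).bit_length()        # next power of two >= n
    H = np.fft.rfft(h, nfft)
    y = np.zeros(len(x) + len(h) - 1)
    for start in range(0, len(x), block):
        seg = x[start:start + block]
        yseg = np.fft.irfft(np.fft.rfft(seg, nfft) * H, nfft)
        yseg = yseg[:len(seg) + len(h) - 1]
        y[start:start + len(yseg)] += yseg  # overlap-add the block tails
    return y

rng = np.random.default_rng(1)
x = rng.standard_normal(2000)
h = rng.standard_normal(128)               # stand-in for a measured HRIR
assert np.allclose(overlap_add_conv(x, h), np.convolve(x, h))
```

Real-time renderers often go one step further (partitioned convolution with a uniformly partitioned filter) so that even very long HRIR/reverb filters meet the latency budget.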
S4 outputs a control signal for controlling the operation of the device according to the recognition result, and outputs a first warning voice according to the external sound source signal or outputs a second warning voice according to the external sound source signal and the location information of the external sound source signal.
S5 receives the control signal and outputs a synthesized speech.
Preferably, the synthesized speech is generated from the HTS + STRAIGHT algorithm and the prosodic parameters of the voice signal. The HTS + STRAIGHT algorithm is suitable for efficient parametric synthesis in low-power scenarios.

S6 plays the first warning voice, the second warning voice, or the synthesized voice.
Preferably, the first warning voice, the second warning voice, or the synthesized voice is played by the stereo broadcasting device 600 arranged in the special garment.
In summary, the system combines microphone array technology, voice recognition, voice synthesis, and virtual 3D sound warning technology; it effectively suppresses noise generated in the special environment of the extravehicular suit, accurately recognizes voice instructions issued by operators, gives the synthesized voice directional information, spares operators frequent joint movement, reduces workload, and greatly improves human-machine cooperation efficiency.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although this description refers to embodiments, not every embodiment contains only a single technical solution; this manner of description is adopted for clarity only. Those skilled in the art should take the description as a whole, and the embodiments may be combined as appropriate to form other embodiments understood by those skilled in the art.
Claims (10)
1. A special garment voice interaction system is characterized by comprising:
the microphone array is arranged in the special garment and used for acquiring voice signals;
the voice recognition module is used for recognizing the voice signal and outputting a recognition result;
the voice alarm module is used for acquiring an external sound source signal and positioning information of the external sound source signal;
the control module is used for outputting a control signal according to the recognition result so as to control equipment operation, and outputting a first warning voice according to the external sound source signal or outputting a second warning voice according to the external sound source signal and the positioning information of the external sound source signal;
the voice synthesis module is used for receiving the control signal and outputting synthesized voice;
and the stereo broadcasting device is arranged in the special clothing and is used for receiving and playing the first warning voice, the second warning voice or the synthesized voice.
2. The special garment voice interaction system of claim 1, wherein the voice recognition module comprises:
the acoustic model training unit is used for establishing a voice model according to a corpus, based on joint modeling with a deep neural network and a hidden Markov model (HMM), combined with discriminative training technology;
the signal processing unit is used for extracting voice characteristic parameters from the received voice signals;
and the voice recognition unit is used for matching the voice characteristic parameters of the voice signals with the voice model, the language model and the dictionary and outputting a recognition result.
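As a hedged illustration of the signal processing unit's feature-extraction step, the sketch below computes MFCC vectors — a common choice of "voice characteristic parameters" — using only NumPy. The frame length, hop, filterbank size and coefficient count are illustrative assumptions, not values disclosed in the patent.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Return an (n_frames, n_ceps) matrix of MFCC feature vectors."""
    # Pre-emphasis boosts high frequencies before spectral analysis.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Frame the signal and apply a Hamming window.
    n_frames = 1 + (len(emphasized) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(n_fft)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank spanning 0 .. sr/2.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log mel energies, then a DCT-II to decorrelate into cepstra.
    feats = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return feats @ dct.T
```

The resulting frame-level vectors are what the voice recognition unit would match against the acoustic model.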
3. The special garment voice interaction system of claim 1, wherein the voice synthesis module comprises:
the modeling unit is used for establishing a speech synthesis model according to the sound library and based on HMM training;
the text analysis unit is used for extracting context-dependent HMM sequence decision information according to the recognition result and the speech synthesis model and generating prosodic parameters;
and the voice synthesis unit is used for generating synthesized voice according to the HTS + STRAIGHT algorithm and the prosodic parameters.
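The patent does not disclose implementation details of the HTS + STRAIGHT pipeline. As a toy sketch only, the function below turns a frame-level F0 (prosody) contour into a voiced excitation signal by integrating frequency into phase — one small piece of what a STRAIGHT-style parametric vocoder does before spectral-envelope filtering. The function name, sample rate and hop size are illustrative assumptions.

```python
import numpy as np

def excitation(f0_contour, sr=16000, hop=80):
    """Voiced excitation from a frame-level F0 contour (Hz per frame).

    Each frame's F0 is held constant for `hop` samples; integrating the
    per-sample frequency gives the instantaneous phase of a sinusoidal
    excitation (a stand-in for a pulse train in a real vocoder).
    """
    f0 = np.repeat(np.asarray(f0_contour, dtype=float), hop)  # per-sample F0
    phase = 2 * np.pi * np.cumsum(f0 / sr)
    return np.sin(phase)
```

A real HTS back end would additionally generate spectral and aperiodicity parameters from the context-dependent HMMs and filter this excitation accordingly.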
4. The special garment voice interaction system as claimed in claim 1, wherein the voice alarm module comprises:
a sound source acquisition unit for acquiring an external sound source signal;
an azimuth selecting unit for acquiring azimuth information of the external sound source according to the external sound source signal;
a distance detection unit for acquiring distance information of the external sound source according to the external sound source signal;
the convolution calculation unit is used for processing the azimuth information and the distance information in a segmented mode and generating a positioning signal;
and the voice alarm generating unit is used for generating a first warning voice according to the external sound source signal and generating a second warning voice according to the external sound source signal and the positioning signal.
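One conventional way the azimuth selecting unit could derive direction from an external sound source signal is the time difference of arrival between two microphones. The sketch below uses GCC-PHAT, a standard technique, and is an assumption rather than the patent's disclosed method; the microphone spacing `d` and sample rate are illustrative values.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau):
    """Delay of `sig` relative to `ref`, in seconds, via GCC-PHAT."""
    n = len(sig) + len(ref)
    # Phase transform: keep only the phase of the cross-spectrum,
    # which sharpens the correlation peak in reverberant conditions.
    S = np.fft.rfft(sig, n) * np.conj(np.fft.rfft(ref, n))
    cc = np.fft.irfft(S / (np.abs(S) + 1e-12), n)
    max_shift = int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def azimuth_deg(sig_l, sig_r, fs=16000, d=0.2, c=343.0):
    """Far-field source azimuth in degrees for two mics `d` metres apart."""
    tau = gcc_phat(sig_l, sig_r, fs, d / c)
    return np.degrees(np.arcsin(np.clip(tau * c / d, -1.0, 1.0)))
```

Distance estimation (the distance detection unit) would need additional cues such as level differences or direct-to-reverberant ratio, which the sketch does not attempt.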
5. A special garment voice interaction method is characterized by comprising the following steps:
S1, acquiring a voice signal;
S2, recognizing the voice signal and outputting a recognition result, the recognition result including at least text information;
S3, acquiring an external sound source signal and positioning information of the external sound source signal;
S4, outputting a control signal according to the recognition result to control device operation, and outputting a first warning voice according to the external sound source signal or outputting a second warning voice according to the external sound source signal and the positioning information of the external sound source signal;
S5, receiving the control signal and outputting synthesized voice;
S6, playing the first warning voice, the second warning voice or the synthesized voice.
6. The special garment voice interaction method according to claim 5, wherein the step S2 of recognizing the voice signal and outputting the recognition result comprises:
and establishing a voice model according to a corpus, based on joint modeling with a deep neural network and a hidden Markov model (HMM) combined with discriminative training technology, and matching the voice signal with the voice model to output a recognition result.
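In a hybrid DNN-HMM recognizer of the kind this claim describes, the neural network maps each acoustic frame to a posterior over HMM states, and dividing by the state priors yields scaled likelihoods for Viterbi decoding. The sketch below shows only that scoring step; the layer sizes are arbitrary and the weights are random placeholders standing in for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy network: 13-dim MFCC frame -> hidden layer -> posterior over HMM states.
n_feat, n_hidden, n_states = 13, 64, 120
W1 = rng.standard_normal((n_feat, n_hidden)) * 0.1   # placeholder weights
W2 = rng.standard_normal((n_hidden, n_states)) * 0.1

priors = np.full(n_states, 1.0 / n_states)  # uniform prior for illustration

def state_loglikes(frames):
    """Scaled log-likelihoods log p(x|s) ∝ log(p(s|x) / p(s)) per frame."""
    h = np.tanh(frames @ W1)
    post = softmax(h @ W2)
    return np.log(post / priors + 1e-12)
```

A trained system would learn `W1`/`W2` discriminatively (as the claim's "discriminative training technology" suggests) and estimate `priors` from state alignments.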
7. The special garment voice interaction method according to claim 5, wherein the step S3 of acquiring the external sound source signal and the positioning information of the external sound source signal comprises:
adopting HRTF (head-related transfer function) technology: based on the spatial localization capability of the human auditory system, the sound source is localized in both azimuth and distance, and segmented processing is adopted in the convolution calculation stage.
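"Segmented processing in the convolution calculation stage" is commonly realized as overlap-add FFT convolution, which lets a warning sound be streamed through an HRIR filter block by block instead of convolving the whole signal at once. The sketch below is one standard way to do this (the block size and the filter itself are placeholders, not values from the patent); its output matches direct convolution.

```python
import numpy as np

def overlap_add_convolve(x, h, block=1024):
    """Segmented (overlap-add) FIR convolution of signal `x` with filter `h`."""
    # FFT size: next power of two covering one block's linear convolution.
    n_fft = 1 << int(np.ceil(np.log2(block + len(h) - 1)))
    H = np.fft.rfft(h, n_fft)
    y = np.zeros(len(x) + len(h) - 1)
    for start in range(0, len(x), block):
        seg = x[start:start + block]
        # Linear convolution of this block via zero-padded FFT multiply.
        yseg = np.fft.irfft(np.fft.rfft(seg, n_fft) * H, n_fft)
        # Overlap-add the block's tail into the output buffer.
        y[start:start + len(seg) + len(h) - 1] += yseg[:len(seg) + len(h) - 1]
    return y
```

For virtual 3D warning, the same routine would be run twice per source, once with the left-ear and once with the right-ear HRIR for the estimated direction.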
8. The special garment voice interaction method according to claim 5, wherein the step S5 of receiving the control signal and outputting synthesized voice comprises:
generating the synthesized voice according to the HTS + STRAIGHT algorithm and the prosodic parameters of the voice signal.
9. The special garment voice interaction method according to claim 5, wherein the step S1 of acquiring a voice signal comprises:
and acquiring voice signals of personnel in the special clothes by adopting a microphone array.
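The patent does not specify how the microphone array's channels are combined. A minimal sketch, assuming a simple delay-and-sum beamformer with known integer sample delays per microphone: aligning the channels toward the talker and averaging reinforces the speech while averaging down uncorrelated suit noise.

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Delay-and-sum beamformer with integer sample delays.

    `channels` is a list of 1-D arrays (one per microphone) and `delays`
    the per-channel steering delay in samples; each channel is shifted
    into alignment and the channels are averaged.
    """
    n = min(len(c) - d for c, d in zip(channels, delays))
    return sum(c[d:d + n] for c, d in zip(channels, delays)) / len(channels)
```

A practical in-suit system would estimate fractional delays from the array geometry and talker position rather than assume integers, but the summation structure is the same.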
10. The special garment voice interaction method according to claim 5, wherein the step S6 playing the first warning voice, the second warning voice or the synthesized voice comprises:
and playing the first warning voice, the second warning voice or the synthesized voice by adopting a stereo broadcasting device arranged in the special clothes.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110040219.4A CN112885345A (en) | 2021-01-13 | 2021-01-13 | Special garment voice interaction system and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112885345A true CN112885345A (en) | 2021-06-01 |
Family
ID=76045158
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110040219.4A Pending CN112885345A (en) | 2021-01-13 | 2021-01-13 | Special garment voice interaction system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112885345A (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2015076797A (en) * | 2013-10-10 | 2015-04-20 | 富士通株式会社 | Spatial information presentation device, spatial information presentation method, and spatial information presentation computer |
US20160001193A1 (en) * | 2014-07-01 | 2016-01-07 | Disney Enterprises, Inc. | Full-duplex, wireless control system for interactive costumed characters |
WO2016033269A1 (en) * | 2014-08-28 | 2016-03-03 | Analog Devices, Inc. | Audio processing using an intelligent microphone |
CN106128478A (en) * | 2016-06-28 | 2016-11-16 | 北京小米移动软件有限公司 | Voice broadcast method and device |
CN206079071U (en) * | 2016-10-17 | 2017-04-12 | 福州领头虎软件有限公司 | Intelligent clothing |
US20170303052A1 (en) * | 2016-04-18 | 2017-10-19 | Olive Devices LLC | Wearable auditory feedback device |
CN207054840U (en) * | 2017-07-06 | 2018-03-02 | 劲霸男装(上海)有限公司 | Intelligent clothing |
CN107925816A (en) * | 2015-10-30 | 2018-04-17 | 谷歌有限责任公司 | Method and apparatus for re-creating direction prompting in the audio of beam forming |
CN111176607A (en) * | 2019-12-27 | 2020-05-19 | 国网山东省电力公司临沂供电公司 | Voice interaction system and method based on power business |
Non-Patent Citations (1)
Title |
---|
"Annual table of contents of Volume 32 (2011)", Chinese Journal of Scientific Instrument (《仪器仪表学报》) *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107464564B (en) | Voice interaction method, device and equipment | |
CN108447479B (en) | Robot voice control system in noisy working condition environment | |
Grondin et al. | The ManyEars open framework: Microphone array open software and open hardware system for robotic applications | |
CN110517705B (en) | Binaural sound source positioning method and system based on deep neural network and convolutional neural network | |
WO2020168727A1 (en) | Voice recognition method and device, storage medium, and air conditioner | |
CN109286875A (en) | For orienting method, apparatus, electronic equipment and the storage medium of pickup | |
US11854566B2 (en) | Wearable system speech processing | |
CN102298443A (en) | Smart home voice control system combined with video channel and control method thereof | |
JP3627058B2 (en) | Robot audio-visual system | |
CN108297108B (en) | Spherical following robot and following control method thereof | |
CN107526437A (en) | A kind of gesture identification method based on Audio Doppler characteristic quantification | |
CN109147787A (en) | A kind of smart television acoustic control identifying system and its recognition methods | |
TWI222622B (en) | Robotic vision-audition system | |
CN107390175A (en) | A kind of auditory localization guider with the artificial carrier of machine | |
CN113053368A (en) | Speech enhancement method, electronic device, and storage medium | |
CN110444189A (en) | One kind is kept silent communication means, system and storage medium | |
CN110517702A (en) | The method of signal generation, audio recognition method and device based on artificial intelligence | |
CN112885345A (en) | Special garment voice interaction system and method | |
CN110164443A (en) | Method of speech processing, device and electronic equipment for electronic equipment | |
CN108680902A (en) | A kind of sonic location system based on multi-microphone array | |
CN111412587B (en) | Voice processing method and device of air conditioner, air conditioner and storage medium | |
Zhao et al. | A robust real-time sound source localization system for olivia robot | |
Nakadai et al. | Auditory fovea based speech separation and its application to dialog system | |
CN109188559A (en) | Safety inspection method, device, equipment and storage medium | |
JP3843741B2 (en) | Robot audio-visual system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20210601 |