CN112164389A - Multi-mode speech recognition calling device and control method thereof - Google Patents

Multi-mode speech recognition calling device and control method thereof

Info

Publication number
CN112164389A
Authority
CN
China
Prior art keywords
mode
module
voice
recognition
audio
Prior art date
Legal status
Granted
Application number
CN202010984329.1A
Other languages
Chinese (zh)
Other versions
CN112164389B (en)
Inventor
吴传贵
阚艳
徐贵力
周勇军
李珊珊
胡伟
韩梁
张小辉
Current Assignee
State Run Wuhu Machinery Factory
Original Assignee
State Run Wuhu Machinery Factory
Priority date
Filing date
Publication date
Application filed by State Run Wuhu Machinery Factory
Priority to CN202010984329.1A
Publication of CN112164389A
Application granted
Publication of CN112164389B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/01 Assessment or evaluation of speech recognition systems
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/24 Speech recognition using non-acoustical features
    • G10L 15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Selective Calling Equipment (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention relates to the field of multi-mode speech recognition and transmission, and in particular to a multi-mode speech recognition transmitting device and a control method thereof. The device comprises a power module together with an FPGA central processing module, a dual-DSP operation processing module, an audio/video input/output module, a man-machine communication control module and a software program module. The control method comprises the following steps: step 8.1: initialization and self-detection; step 8.2: judging whether the device is normal; step 8.3: judging whether the device is to be updated; step 8.4: judging whether the setting mode is automatic; step 8.5: environmental noise and brightness acquisition and processing; step 8.6: setting the working mode; step 8.7: judging which mode "X" is active, X taking a value from 1 to 5; step 8.8: executing the voice transmission mode; step 8.9: outputting the voice information; step 8.10: judging whether to interrupt; step 8.11: judging whether to quit; step 8.12: exiting. The invention realizes voice transmission with multi-mode speech recognition and improves the real-time performance and accuracy of voice transmission.

Description

Multi-mode speech recognition calling device and control method thereof
Technical Field
The invention relates to the field of multi-mode voice recognition and transmission, in particular to a multi-mode voice recognition transmission device and a control method thereof.
Background
In the field of speech recognition, most conventional solutions rely on direct audio processing, throat-microphone transmission, or lip-reading technology. After noise reduction and enhancement processing, conventional earphones and bone-conduction earphones can meet voice communication requirements in many situations under ordinary noise interference. However, with the growing use of general-purpose aircraft and engines, the aero-engine is the main source of field noise during test flights and test runs; a large aero-engine in particular produces noise over a wide frequency range and at high decibel levels, seriously affecting the normal work communication of field personnel. At present, noise-reduction earphones greatly reduce the interference of noise on workers, but still cannot meet operators' need for mutual communication: they can only communicate through gestures or other means and cannot express and transmit more information in time.
During inspection of an aircraft or aero-engine, operators need effective information exchange. With traditional voice communication methods and devices, it is difficult to meet the usage requirements through noise-reduction earphones, gestures, semaphores and similar means alone. A new technical method is therefore needed to improve the effectiveness and scientific soundness of speech recognition and voice transmission and to promote efficient and safe inspection of aircraft and aero-engines.
For example, Chinese patent application No. 201910032244.0 discloses an intelligent headphone and earphone system that can perform the corresponding function according to a voice instruction, requires no manual key pressing, is convenient to operate and can improve the user experience; the system comprises a microphone, a voice processing module, a central processing module, an audio processing module and a speaker. Its disadvantage is that the system is used only for voice command control and does not involve multi-mode speech recognition or effective communication under loud noise.
For example, Chinese patent application No. 201910012835.1 discloses a hybrid-structure active noise reduction headphone, a noise reduction method and a storage medium, which can select the most suitable noise-reduction system coefficients and track changes in the noise signal more quickly and accurately, greatly improving the noise-reduction effect. The hybrid-structure active noise reduction headphone comprises an active noise control system, a reference microphone and an error microphone. Its disadvantage is that the system only iterates to select the noise-reduction coefficients and does not involve multi-mode speech recognition or effective communication under loud noise.
For example, Chinese patent application No. 201810422275.2 discloses a lip detection and reading method based on cascaded feature extraction, which can improve the speed and accuracy of lip reading; the method includes lip region detection, lip region extraction, feature dimension reduction, lip reading and the like. Its disadvantage is that it only performs multi-stage extraction and dimension reduction of lip-region image features and does not involve multi-mode speech recognition or effective communication under loud noise.
For example, Chinese patent No. 201611086527.6 discloses an audio enhancement processing module for a laryngeal (throat) microphone, comprising a breathing-sound signal processing board, a power supply and an audio output switch. Its disadvantage is that the module only improves the clarity and intelligibility of the laryngeal microphone and does not involve multi-mode speech recognition or effective communication under loud noise.
Signal Processing, 2019, No. 2, pp. 293-299, discloses a land-air communication speech recognition method based on a BiLSTM/CTC model. Aimed mainly at the characteristics of civil aviation land-air communication language, a BiLSTM network is trained to obtain the BiLSTM/CTC model, and speech recognition of civil aviation land-air communication is realized using an acoustic model, a language model and a land-air communication dictionary. Its disadvantage is that it only shows that applying the enhanced acoustic model reduces the word recognition error of land-air conversation speech recognition to 5.53%, and does not involve multi-mode speech recognition or effective communication under loud noise.
High Technology Communication, 2019, No. 3, pp. 287-294, discloses a cerebral palsy rehabilitation training system with expression and voice interaction designed on the basis of human-computer interaction (HCI). It combines an upper computer with a lower computer: the lower computer uses a 51-series single-chip microcomputer as the main driver, the voice acquisition module uses an LD3320 voice chip, and the lower computer, programmed in C, is connected through serial communication with the upper computer, written in LabVIEW, to realize the recognition and matching of voice semantic rules and to complete test evaluation and statistics. Its disadvantage is that the system only provides expression and voice human-computer interaction and does not involve multi-mode speech recognition or effective communication under loud noise.
In summary, existing speech recognition research focuses mainly on headphones and earphone systems, lip detection and reading methods, laryngeal microphones and related improvements in speech recognition quality, and functionally on cerebral palsy rehabilitation training systems, civil aviation land-air communication, voice instruction control and the like; little research addresses methods and devices for multi-mode speech recognition and effective voice transmission under high noise. There is therefore a need to develop a multi-mode speech recognition transmitting device and a control method thereof for high-noise conditions.
Disclosure of Invention
In order to solve the above problems, the present invention provides a multi-mode speech recognition transmitting device and a control method thereof.
A multi-mode speech recognition transmitting device comprises a power module, which supplies power and converts it to the working voltages of the device, and further comprises:
the FPGA central processing module is connected with the power supply module and is used for realizing central processing;
the dual-DSP operation processing module, connected with the FPGA central processing module and the power module, for realizing the lip segmentation, feature extraction, lip-speech recognition and fusion recognition processing of the video digital signal;
the audio and video input and output module is connected with the FPGA central processing module and the power supply module and outputs the processed and fused audio signals through the synthesized audio output circuit;
the man-machine communication control module is connected with the FPGA central processing module and the power supply module and is used for finishing power supply switching, mode selection, working state display, light induction and LED light-emitting control;
and the software program module is connected with the FPGA central processing module and is used for finishing the fusion recognition and decision output of the audio and the video.
The FPGA central processing module comprises: an SAA7111 digital decoder for digitizing the video signal; a FIFO unit, connected with the SAA7111 digital decoder and the audio/video input/output module, for buffering front-stage input data and back-stage output data; a DSP unit, connected mainly with the audio/video input/output module, for providing the external audio signal input/output functions of the FPGA central processing module; a CPLD unit, connected through GPIO with the man-machine communication control module, for realizing signal control between the internal function modules; an SRIO communication and data cache module, serving as the communication and data-buffering part of the FPGA central processing module, for providing its high-speed data processing function; and a signal configuration and integration module, serving as one of the external interface circuits of the FPGA central processing module, for realizing signal configuration and integration.
The dual-DSP operation processing module comprises a DSP1 unit and a DSP2 unit, each connected with the FPGA central processing module through an SRIO 1X interface.
The dual-DSP operation processing module adopts two TMS320C6455 processors, which meet the real-time and recognition-rate requirements of the device, optimize the system's image information processing capacity and expandability, and realize the transmitter's speech recognition, lip-language recognition and fusion decision.
The audio/video input/output module comprises: a video collector, connected through a video signal line with the SAA7111 digital decoder in the FPGA central processing module, providing the original video signal source for lip-speech recognition; a TLV320AIC23B sound acquisition chip, connected through IIC and McASP interfaces with the DSP unit in the FPGA central processing module, receiving audio signals and realizing chip control and data transmission; a bone-conduction sensor and a sound sensor, which provide conventional audio and bone-conduction audio signals to the TLV320AIC23B chip through audio signal lines; and an SDRAM1 unit providing extended external data storage space for the DSP unit. The TLV320AIC23B sound acquisition chip outputs the processed and fused audio signal through a synthesized audio output circuit.
The man-machine communication control module comprises a Cy7C68013A communication controller, key switches, a light-sensing control circuit and an LED lighting control circuit. The Cy7C68013A communication controller is connected through GPIO with the signal configuration and integration module in the FPGA central processing module to provide the USB communication function, communicating with an external training control computer to download data after training and return status replies; it is also connected through GPIO with the CPLD unit in the FPGA central processing module to perform power switching, mode selection, working-state display, light sensing and LED lighting control.
The key switches comprise a power switch, control keys, a light knob, a liquid crystal display screen, a mouse and a numeric keyboard.
The software program module comprises an upper-computer training control software module for training the recognition algorithm and downloading and uploading data, an embedded-system main flow module, interacting with the upper-computer training control software module, for initialization, self-detection, storage and prompting of fault states, data updating and USB communication, and an embedded-system algorithm module for the fusion recognition and decision output of audio and video.
The audio recognition of the embedded system algorithm module consists of audio acquisition, preprocessing, vector quantization, voice synthesis and voice recognition, and the video recognition consists of video acquisition, preprocessing, lip segmentation, lip feature extraction and visual recognition.
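The audio and video recognition pipelines above must be combined into a single decision output. The patent does not specify the fusion rule, so the Python sketch below assumes a simple late-fusion scheme in which each recognizer returns per-candidate confidence scores and a weighted vote picks the output; the function name, the weights and the score format are illustrative assumptions, not the patented algorithm.

```python
def fuse_decisions(audio_hyps, video_hyps, w_audio=0.6, w_video=0.4):
    """Late fusion of audio and lip (video) recognition hypotheses.

    Each argument maps a candidate word/command to a confidence in [0, 1].
    The weights are illustrative; the patent does not state the rule.
    """
    scores = {}
    for word, conf in audio_hyps.items():
        scores[word] = scores.get(word, 0.0) + w_audio * conf
    for word, conf in video_hyps.items():
        scores[word] = scores.get(word, 0.0) + w_video * conf
    # Decision output: the candidate with the highest combined score wins.
    return max(scores, key=scores.get)
```

For example, an audio hypothesis of "start" at 0.9 outweighs a video hypothesis of "stop" at 0.8 under these weights, while a confident video reading can override an ambiguous audio result.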
A control method of a multimode speech recognition transmitter comprises the following steps:
step 8.1: initialization and self-detection: initialize the multi-mode speech recognition transmitting device and the control program, perform hardware self-detection of the device and acquire the working state of each module; when finished, execute the next step 8.2;
step 8.2: judging whether the device is normal: based on the device's self-detection, the data returned by each module are comprehensively compared to determine whether the device is normal; on "fault", a fault prompt is given and the method jumps to step 8.11 (judging whether to quit); on "normal", the next step 8.3 is executed;
step 8.3: judging whether the device is to be updated: when the device is connected with the training control computer through USB, data updating can be carried out, mainly covering the recognition algorithm and system optimization; when updating is needed, the update program is executed, otherwise the next step 8.4 is executed;
step 8.4: judging whether the setting mode is automatic: the system selects between automatic and manual setting through the manual/automatic key; the default is automatic, which proceeds directly to the next step 8.5; in manual mode, the working mode is selected by hand and the method jumps to step 8.6 (setting the working mode);
step 8.5: environmental noise and brightness acquisition and processing: the working mode is set automatically according to the noise and light conditions collected by the device; when the noise is less than reference threshold 1, mode "1" is selected; when the noise is greater than or equal to reference threshold 1 and less than reference threshold 2, mode "2" or "3" is selected; when the noise is greater than reference threshold 2, mode "4" or "5" is selected. The brightness measurement takes effect only in working modes "3", "4" and "5": when the brightness is below the reference brightness threshold, the LED illuminator is turned on, otherwise it is turned off. After processing is finished, the next step 8.6 is executed;
step 8.6: setting the working mode: the automatic working mode is set from the environmental noise and brightness processing of step 8.5; the manual working mode is selected mainly through the working-mode selection key of the man-machine communication control module. The initial working-mode state of the system is "1"; after the system has run, the last working mode becomes the initial state. Each key press cycles to the next working mode in sequence; 3 seconds after the last key press, the working-mode setting is completed automatically and the next step 8.7 is executed. In addition, the working state of the LED illuminator can be set through the LED switch key;
step 8.7: judging which mode "X" is active, X taking a value from 1 to 5: for "1", the conventional audio voice transmission mode is executed; for "2", the conventional audio combined with throat-microphone voice transmission mode; for "3", the conventional audio combined with lip-reading voice transmission mode; for "4", the throat-microphone combined with lip-reading voice transmission mode; for "5", the combined voice transmission mode of all three. According to the mode selected, the corresponding voice transmission branch of step 8.8 is executed;
step 8.8: executing the voice transmission mode: according to the current working mode, the corresponding voice transmission mode is executed, specifically:
firstly, in the conventional audio voice transmission mode, only the sound sensor works; the bone-conduction sensor and video capture do not participate in speech recognition;
secondly, in the conventional audio combined with throat-microphone mode, the sound sensor and the bone-conduction sensor work; video capture does not participate in speech recognition;
thirdly, in the conventional audio combined with lip-reading mode, the sound sensor and video capture work; the bone-conduction sensor does not participate in speech recognition;
fourthly, in the throat-microphone combined with lip-reading mode, the bone-conduction sensor and video capture work; the sound sensor does not participate in speech recognition;
fifthly, in the combined mode of all three, the sound sensor, the bone-conduction sensor and video capture all work simultaneously and comprehensive fusion recognition is performed. The next step 8.9 is then executed;
step 8.9: voice information output: after the fused voice information is output, the next step 8.10 is executed;
step 8.10: judging whether an interrupt occurs: check whether an external interrupt has occurred; if not, jump back to step 8.7 (judging which mode "X" is active), otherwise execute the next step 8.11;
step 8.11: judging whether to quit: check whether an exit signal is present; if not, jump back to step 8.3 (judging whether the device is to be updated), otherwise execute the next step 8.12;
step 8.12: exiting: exit the program and end the control program.
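Steps 8.5 to 8.8 amount to a small decision procedure: pick a working mode from the measured noise, gate the LED illuminator by brightness, and enable the corresponding sensor set. The Python sketch below models that logic on a host machine; the `use_video` flag used to choose between modes "2"/"3" and "4"/"5" (the patent leaves this choice to light conditions) and the behavior at exactly reference threshold 2 are illustrative assumptions.

```python
# Sensor activation per working mode "X" (steps 8.7-8.8):
# (sound sensor, bone-conduction sensor, video capture)
MODE_SENSORS = {
    1: (True, False, False),   # conventional audio only
    2: (True, True, False),    # audio + throat microphone
    3: (True, False, True),    # audio + lip reading
    4: (False, True, True),    # throat microphone + lip reading
    5: (True, True, True),     # fusion of all three
}

def select_mode(noise, threshold1, threshold2, use_video=True):
    """Automatic working-mode selection of step 8.5.

    `use_video` is an assumed stand-in for the light-condition check that
    picks the lip-reading variant; noise exactly equal to threshold2 is
    treated as "loud" here, which the patent text leaves ambiguous.
    """
    if noise < threshold1:
        return 1
    if noise < threshold2:
        return 3 if use_video else 2
    return 5 if use_video else 4

def led_state(mode, brightness, brightness_threshold):
    """Step 8.5 LED control: brightness only matters in modes 3, 4 and 5."""
    return mode in (3, 4, 5) and brightness < brightness_threshold
```

With thresholds of, say, 50 and 90 arbitrary noise units, a reading of 10 yields mode 1, 60 yields mode 2 or 3, and 95 yields mode 4 or 5, matching the step 8.5 branching.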
The invention has the beneficial effects that: it realizes multi-mode speech recognition and voice transmission with mode selection, conventional audio and throat-microphone transmission, lip-speech recognition and fusion transmission capabilities. It allows operators, during inspection of an aircraft or aero-engine under high noise, to transmit voice through multi-mode switching and combination, better ensuring the effectiveness and scientific soundness of communication information capacity and communication quality, in particular through speech recognition, lip-speech recognition and the fusion of audio and video signals; and through the development and design of the embedded software, the real-time performance, accuracy and practicability of voice transmission are improved.
Drawings
The invention is further illustrated with reference to the following figures and examples.
FIG. 1 is a schematic structural view of the present invention;
FIG. 2 is a block diagram of a software program module according to the present invention;
fig. 3 is a schematic view of the flow structure of the control method of the present invention.
Detailed Description
In order to make the technical means, the creation characteristics, the achievement purposes and the effects of the invention easy to understand, the invention is further explained below.
As shown in fig. 1 to fig. 3, a multi-mode speech recognition transmitting device includes a power module, which supplies power and converts it to the working voltages of the device, and further includes:
the FPGA central processing module is connected with the power supply module and is used for realizing central processing;
the dual-DSP operation processing module, connected with the FPGA central processing module and the power module, for realizing the lip segmentation, feature extraction, lip-speech recognition and fusion recognition processing of the video digital signal;
the audio and video input and output module is connected with the FPGA central processing module and the power supply module and outputs the processed and fused audio signals through the synthesized audio output circuit;
the man-machine communication control module is connected with the FPGA central processing module and the power supply module and is used for finishing power supply switching, mode selection, working state display, light induction and LED light-emitting control;
and the software program module is connected with the FPGA central processing module and is used for finishing the fusion recognition and decision output of the audio and the video.
The invention realizes multi-mode speech recognition and voice transmission with mode selection, conventional audio and throat-microphone transmission, lip-speech recognition and fusion transmission capabilities. It allows operators, during inspection of an aircraft or aero-engine under high noise, to transmit voice through multi-mode switching and combination, better ensuring the effectiveness and scientific soundness of communication information capacity and communication quality, in particular through speech recognition, lip-speech recognition and the fusion of audio and video signals; and through the development and design of the embedded software, the real-time performance, accuracy and practicability of voice transmission are improved.
The FPGA central processing module comprises: an SAA7111 digital decoder for digitizing the video signal; a FIFO unit, connected with the SAA7111 digital decoder and the audio/video input/output module, for buffering front-stage input data and back-stage output data; a DSP unit, connected mainly with the audio/video input/output module, for providing the external audio signal input/output functions of the FPGA central processing module; a CPLD unit, connected through GPIO with the man-machine communication control module, for realizing signal control between the internal function modules; an SRIO communication and data cache module, serving as the communication and data-buffering part of the FPGA central processing module, for providing its high-speed data processing function; and a signal configuration and integration module, serving as one of the external interface circuits of the FPGA central processing module, for realizing signal configuration and integration.
By combining a high-performance FPGA with SRIO communication and data caching, the FPGA central processing module integrates the SAA7111 digital decoder, FIFO unit, DSP unit, CPLD unit, SRIO communication and data cache module and signal configuration and integration module, realizing the transmitter's overall control, audio/video data processing, high-speed data transmission and signal configuration and integration capabilities. This makes effective use of the FPGA's flexibility, improves the working efficiency and reliability of the circuit, simplifies the device and its circuitry, and reduces hardware cost.
The FPGA central processing module is connected with the dual-DSP operation processing module, the audio/video input/output module, the man-machine communication control module and the power module. It exchanges control interactively with the dual-DSP module through two SRIO 1X interfaces, connects with the audio/video input/output module through the video signal, IIC and McASP interfaces, and controls the man-machine communication control module mainly through the GPIO interface; the various supply voltages required for operation are provided by the power module.
The FIFO unit is used as a data buffer and is mainly connected with the SAA7111 digital decoder and the DSP unit.
The SAA7111 digital decoder is used as a video acquisition processing unit and is respectively connected with a video acquisition device, a CPLD unit and an FIFO unit.
The dual-DSP operation processing module comprises a DSP1 unit and a DSP2 unit, each connected with the FPGA central processing module through an SRIO 1X interface.
Outwardly, the DSP unit connects mainly with the audio/video input/output module: it controls the TLV320AIC23B sound acquisition chip through the IIC interface, receives data through the McASP interface, and connects with the SDRAM1 unit through the EMIF interface for audio data acquisition and processing. Internally, it connects mainly with the FIFO unit and the CPLD unit, providing the external audio signal input/output functions of the FPGA central processing module.
The CPLD unit is used as one of external interface connecting circuits of the FPGA central processing module, is connected with the human-computer communication control module through the GPIO, provides human-computer interaction control and light induction of the FPGA central processing module and LED light-emitting control functions, and is internally connected with the SAA7111 digital decoder, the DSP unit and the signal configuration integration module.
The dual-DSP operation processing module adopts two TMS320C6455 processors, which meet the real-time and recognition-rate requirements of the device, optimize the system's image information processing capacity and expandability, and realize the transmitter's speech recognition, lip-language recognition and fusion decision.
The dual-DSP operation processing module realizes the transmitter's speech recognition, lip-language recognition and fusion decision capabilities. Two TMS320C6455 processors are selected, exchanging data at high speed with the FPGA central processing module over SRIO 1X; they can process in parallel and complete computation-heavy operations quickly, meeting the real-time and recognition-rate requirements of the device while optimizing the system's image information processing capacity and expandability.
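The parallel division of labor described above can be modeled on a host machine as two concurrent workers. In the actual device the two pipelines run on two TMS320C6455 processors exchanging data over SRIO 1X; the Python thread pool below only illustrates the concurrent split, and the two worker functions (with their fixed return values) are placeholders, not the real recognition algorithms.

```python
from concurrent.futures import ThreadPoolExecutor

def recognize_audio(frames):
    # Placeholder for the audio pipeline (preprocessing, vector
    # quantization, speech recognition) running on one DSP.
    return {"word": "check", "confidence": 0.8}

def recognize_lips(frames):
    # Placeholder for the video pipeline (lip segmentation, feature
    # extraction, lip-speech recognition) running on the other DSP.
    return {"word": "check", "confidence": 0.6}

def parallel_recognize(audio_frames, video_frames):
    """Run both recognizers concurrently, as the two DSPs do in hardware."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        a = pool.submit(recognize_audio, audio_frames)
        v = pool.submit(recognize_lips, video_frames)
        return a.result(), v.result()
```

The two results would then feed the fusion decision stage; the thread pool stands in for the SRIO data exchange between the DSPs and the FPGA.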
The SRIO communication and data cache module serves as the communication and data-buffering part of the FPGA central processing module; it exchanges control interactively with the dual-DSP operation processing module through two SRIO 1X interfaces and connects internally with the signal configuration and integration module.
The signal configuration and integration module serves as one of the external interface circuits of the FPGA central processing module; it connects to the human-computer communication control module through GPIO to provide the module's USB communication function.
The audio/video input/output module comprises: a video collector, connected through a video signal line to the SAA7111 digital decoder in the FPGA central processing module to provide the raw video source for lip-reading recognition; a TLV320AIC23B sound-acquisition chip, connected through the IIC and McASP interfaces to the DSP unit in the FPGA central processing module to receive audio signals and implement chip control and data transfer; a bone-conduction sensor and a sound sensor, which supply bone-conduction and conventional audio signals to the TLV320AIC23B chip through audio signal lines; and an SDRAM1 unit, which provides expanded external data storage space for the DSP unit. The TLV320AIC23B chip outputs the processed, fused audio signal through the synthesized-audio output circuit.
The microphone adopts the audio/video input/output module to provide its audio/video acquisition, sound preprocessing, and synthesized-audio output capabilities. The TLV320AIC23B sound-acquisition chip handles input and output of multiple audio channels with programmable gain adjustment, meeting the device's high-audio-performance I/O requirements at very low power consumption and improving its operating energy efficiency.
The human-computer communication control module comprises a Cy7C68013A communication controller, key switches, a light-sensing control circuit, and an LED lighting control circuit. The Cy7C68013A controller connects through GPIO to the signal configuration and integration module in the FPGA central processing module to provide the USB communication function, communicating with an external training control computer to download post-training data and return status replies; it also connects through GPIO to the CPLD unit in the FPGA central processing module to perform power switching, mode selection, working-state display, light sensing, and LED lighting control.
The human-computer communication control module provides the microphone's human-machine control and USB communication with the host computer. The Cy7C68013A controller, with its embedded-microprocessor interface, supports the USB 2.0 protocol; USB data transfer can be completed simply by configuring a few registers and memories, which simplifies program design, raises the transfer rate, and improves reliability.
The key switches comprise a power switch, control keys, a light knob, a liquid-crystal display, a mouse, and a numeric keypad.
The software program module comprises: a host-computer training control software module, which trains the recognition algorithms and downloads and uploads data; an embedded-system main-flow module, which interacts with the host-computer training control software module and performs initialization, self-test, storage and prompting of fault states, data updating, and USB communication; and an embedded-system algorithm module, which performs audio/video fusion recognition and decision output.
The software program module provides the device's software control and computing capability. The design splits work between an embedded system and a PC system: the embedded system performs voice recognition, fusion decision-making, and related tasks, while the PC system performs codebook training and voice-template training. This clear division of labor meets the recognition-rate and recognition-time requirements while simplifying the system and reducing hardware cost.
The audio recognition of the embedded-system algorithm module consists of audio acquisition, preprocessing, vector quantization, voice synthesis, and voice recognition; the video recognition consists of video acquisition, preprocessing, lip segmentation, lip-feature extraction, and visual recognition.
The embedded system algorithm module is the core of the software program module.
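Both recognition chains are linear pipelines, so they can be sketched as ordered stage lists threaded through one driver. The stage functions below are identity placeholders standing in for the real DSP routines; only the stage names and their order come from the description:

```python
def run_pipeline(stages, data):
    """Apply each named processing stage to the data in order,
    returning the final result and the trace of stages executed."""
    trace = []
    for name, fn in stages:
        data = fn(data)
        trace.append(name)
    return data, trace

identity = lambda x: x  # placeholder for a real DSP routine

# Audio chain of the algorithm module, in the order given in the text.
audio_pipeline = [
    ("acquire", identity),
    ("preprocess", identity),
    ("vector_quantize", identity),
    ("synthesize", identity),
    ("recognize", identity),
]

result, trace = run_pipeline(audio_pipeline, b"\x00\x01")
```

The video chain would use the same driver with its own stage list (acquisition, preprocessing, lip segmentation, lip-feature extraction, visual recognition).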
A control method of the multi-mode speech recognition microphone comprises the following steps:
step 8.1: initialization and self-test: initialize the multi-mode speech recognition microphone and its control program, perform a hardware self-test of the device, and acquire the working state of each module; when finished, go to step 8.2;
step 8.2: judge whether the device is normal: the device's self-test compares the data returned by each module and reports "normal" or "fault"; on "fault", prompt the fault and jump to the exit check in step 8.11; on "normal", go to step 8.3;
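A minimal sketch of this aggregation, assuming each module's self-test reduces to a pass/fail flag (the module names below are illustrative, not from the patent):

```python
def judge_device(module_status):
    """Step 8.2 sketch: the device is 'normal' only if every module's
    self-test returned OK; any single failure yields 'fault'."""
    return "normal" if all(module_status.values()) else "fault"
```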
step 8.3: judge whether to update the device: when the device is connected to the training control computer over USB, its data can be updated, the updates mainly covering the recognition algorithms and system optimization; when an update is needed, run the update program, otherwise go to step 8.4;
step 8.4: judge whether the setting mode is automatic: the manual/automatic key selects between automatic and manual setting; the default is automatic, which proceeds directly to step 8.5; in manual mode the working mode is selected by hand, jumping to the working-mode setting in step 8.6;
step 8.5: environmental noise and brightness collection and processing: the working mode is set automatically from the noise and light conditions collected by the device: when the noise is below reference threshold 1, mode "1" is selected; when the noise is at or above reference threshold 1 but below reference threshold 2, mode "2" or "3" is selected; when the noise exceeds reference threshold 2, mode "4" or "5" is selected; brightness takes effect only in modes "3", "4", and "5": when the brightness is below the reference brightness threshold the LED illuminator is turned on, otherwise it is turned off; when processing is complete, go to step 8.6;
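The threshold logic of step 8.5 can be sketched as follows. The numeric threshold values and the `prefer_video` tie-break between modes "2"/"3" and "4"/"5" are illustrative assumptions; the patent fixes only the ordering of the two noise thresholds, not their values or how the tie is resolved:

```python
def select_mode(noise, brightness, noise_thresh_1=50.0, noise_thresh_2=80.0,
                brightness_thresh=100.0, prefer_video=False):
    """Automatic working-mode selection sketch for step 8.5.

    Returns (mode, led_on). Threshold values and the prefer_video
    tie-break are assumptions for illustration.
    """
    if noise < noise_thresh_1:
        mode = 1                            # quiet: conventional audio only
    elif noise < noise_thresh_2:
        mode = 3 if prefer_video else 2     # moderate noise: add throat or lips
    else:
        mode = 5 if prefer_video else 4     # heavy noise: rely on throat/lips
    # Brightness is considered only in the lip-reading modes 3, 4 and 5.
    led_on = mode in (3, 4, 5) and brightness < brightness_thresh
    return mode, led_on
```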
step 8.6: set the working mode: the automatic working mode is set by the environmental noise and brightness processing of step 8.5; the manual working mode is selected with the working-mode selection key of the human-computer communication control module, the initial working mode being "1", with the last used mode becoming the initial state thereafter; each key press advances the working mode cyclically, and 3 seconds after the last press the setting is committed automatically and step 8.7 is executed; in addition, the LED switch key sets the working state of the LED illuminator;
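The cyclic advance of the working mode on each key press reduces to modular arithmetic, sketched here:

```python
def next_mode(current, n_modes=5):
    """Advance the working mode cyclically on a key press:
    1 -> 2 -> ... -> 5 -> 1."""
    if not 1 <= current <= n_modes:
        raise ValueError("working mode must be between 1 and n_modes")
    return current % n_modes + 1
```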
step 8.7: judge the value X of mode "X", 1 to 5: "1" executes the conventional audio voice-transmission mode; "2" the conventional combined throat-microphone voice mode; "3" the conventional combined lip-reading voice mode; "4" the throat-plus-lip-reading voice mode; "5" the voice mode combining all three; according to the selection, execute the corresponding voice-transmission mode in step 8.8;
step 8.8: execute the voice-transmission mode: execute the voice-transmission mode corresponding to the current working mode, specifically:
(1) conventional audio voice-transmission mode: only the sound sensor is active; the bone-conduction sensor and video acquisition do not take part in speech recognition;
(2) conventional combined throat-microphone voice mode: the sound sensor and bone-conduction sensor are active; video acquisition does not take part in speech recognition;
(3) conventional combined lip-reading voice mode: the sound sensor and video acquisition are active; the bone-conduction sensor does not take part in speech recognition;
(4) throat-plus-lip-reading voice mode: the bone-conduction sensor and video acquisition are active; the sound sensor does not take part in speech recognition;
(5) voice mode combining all three: the sound sensor, bone-conduction sensor, and video acquisition are all active simultaneously and comprehensive fusion recognition is performed; then go to step 8.9;
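The five modes differ only in which inputs participate in recognition, which can be captured in a small lookup table (the sensor labels are illustrative):

```python
# Inputs that take part in recognition for each working mode (step 8.8).
MODE_SENSORS = {
    1: {"sound"},                    # conventional audio only
    2: {"sound", "bone"},            # audio + throat (bone conduction)
    3: {"sound", "video"},           # audio + lip reading
    4: {"bone", "video"},            # throat + lip reading, no air microphone
    5: {"sound", "bone", "video"},   # full fusion of all three
}

def active_sensors(mode):
    """Return the set of inputs participating in recognition for a mode."""
    try:
        return MODE_SENSORS[mode]
    except KeyError:
        raise ValueError(f"working mode must be 1..5, got {mode}")
```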
step 8.9: speech information output: after outputting the fused speech information, go to step 8.10;
step 8.10: judge whether an interrupt occurred: check for an external interrupt; if none, jump back to the mode judgment in step 8.7; otherwise go to step 8.11;
step 8.11: judge whether to exit: check for an exit signal; if none, jump back to step 8.3; otherwise go to step 8.12;
step 8.12: exit: quit and end the control program.
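The branch structure of steps 8.1 through 8.12 can be sketched as two nested loops. The `dev` driver object and all of its method names below are illustrative assumptions; only the control flow mirrors the method:

```python
def control_loop(dev):
    """Sketch of the control method's flow. `dev` is a hypothetical
    device driver exposing the checks and actions named in the steps."""
    dev.initialize_and_self_test()                    # step 8.1
    while True:
        if not dev.is_normal():                       # step 8.2
            dev.prompt_fault()
            if dev.exit_requested():                  # step 8.11
                break
            continue                                  # back to the update check
        if dev.update_requested():                    # step 8.3
            dev.run_update()
        mode = (dev.auto_select_mode()                # steps 8.4-8.6
                if dev.auto_mode_set()
                else dev.manual_select_mode())
        while True:
            dev.transmit(mode)                        # steps 8.7-8.9
            if dev.interrupted():                     # step 8.10
                break                                 # fall through to step 8.11
        if dev.exit_requested():                      # step 8.11
            break
    dev.shutdown()                                    # step 8.12
```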
The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are merely illustrative of the principles of the invention, but that various changes and modifications may be made without departing from the spirit and scope of the invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (10)

1. A multi-mode speech recognition microphone, comprising a power module for supplying power and converting the microphone's working voltages, characterized by further comprising:
an FPGA central processing module, connected with the power module and used for central processing;
a 2-DSP operation processing module, connected with the FPGA central processing module and the power module and used for performing lip segmentation, feature extraction, lip-reading recognition, and fusion-recognition processing of the video digital signal;
an audio/video input/output module, connected with the FPGA central processing module and the power module, which outputs the processed, fused audio signal through the synthesized-audio output circuit;
a human-computer communication control module, connected with the FPGA central processing module and the power module and used for power switching, mode selection, working-state display, light sensing, and LED lighting control;
and a software program module, connected with the FPGA central processing module and used for audio/video fusion recognition and decision output.
2. The multi-mode speech recognition microphone of claim 1, wherein the FPGA central processing module comprises: an SAA7111 digital decoder for digital processing of the video signal; a FIFO unit, connected with the SAA7111 digital decoder and the audio/video input/output module, for buffering front-stage input data and rear-stage output data; a DSP unit, connected as a virtual DSP mainly with the audio/video input/output module, for providing the module's external audio input and output functions; a CPLD unit, connected through GPIO with the human-computer communication control module, for signal control between the internal function modules; an SRIO communication and data cache module, serving as the module's communication and data-caching part and providing its high-speed data processing function; and a signal configuration and integration module, serving as one of the module's external interface circuits and performing signal configuration and integration.
3. The multi-mode speech recognition microphone of claim 2, wherein the 2-DSP operation processing module comprises a DSP1 unit and a DSP2 unit, each connected with the FPGA central processing module through an SRIO 1X interface.
4. The multi-mode speech recognition microphone of claim 3, wherein the 2-DSP operation processing module adopts two TMS320C6455 processors, which meet the device's real-time and recognition-rate requirements, improve the system's image-information processing capacity and expandability, and implement the microphone's voice recognition, lip-reading recognition, and fusion decision-making.
5. The multi-mode speech recognition microphone of claim 2, wherein the audio/video input/output module comprises: a video collector, connected through a video signal line to the SAA7111 digital decoder in the FPGA central processing module to provide the raw video source for lip-reading recognition; a TLV320AIC23B sound-acquisition chip, connected through the IIC and McASP interfaces to the DSP unit in the FPGA central processing module to receive audio signals and implement chip control and data transfer; a bone-conduction sensor and a sound sensor, which supply bone-conduction and conventional audio signals to the TLV320AIC23B chip through audio signal lines; and an SDRAM1 unit, which provides expanded external data storage space for the DSP unit; the TLV320AIC23B chip outputting the processed, fused audio signal through the synthesized-audio output circuit.
6. The multi-mode speech recognition microphone of claim 2, wherein the human-computer communication control module comprises a Cy7C68013A communication controller, key switches, a light-sensing control circuit, and an LED lighting control circuit, the Cy7C68013A controller being connected through GPIO to the signal configuration and integration module in the FPGA central processing module to provide the USB communication function, communicating with an external training control computer to download post-training data and return status replies, and being connected through GPIO to the CPLD unit in the FPGA central processing module to perform power switching, mode selection, working-state display, light sensing, and LED lighting control.
7. The multi-mode speech recognition microphone of claim 6, wherein the key switches comprise a power switch, control keys, a light knob, a liquid-crystal display, a mouse, and a numeric keypad.
8. The multi-mode speech recognition microphone of claim 1, wherein the software program module comprises: a host-computer training control software module, which trains the recognition algorithms and downloads and uploads data; an embedded-system main-flow module, which interacts with the host-computer training control software module and performs initialization, self-test, storage and prompting of fault states, data updating, and USB communication; and an embedded-system algorithm module, which performs audio/video fusion recognition and decision output.
9. The multi-mode speech recognition microphone of claim 8, wherein the audio recognition of the embedded-system algorithm module consists of audio acquisition, preprocessing, vector quantization, voice synthesis, and voice recognition, and the video recognition consists of video acquisition, preprocessing, lip segmentation, lip-feature extraction, and visual recognition.
10. A control method of the multi-mode speech recognition microphone according to any one of claims 1 to 9, characterized by the following steps:
step 8.1: initialization and self-test: initialize the multi-mode speech recognition microphone and its control program, perform a hardware self-test of the device, and acquire the working state of each module; when finished, go to step 8.2;
step 8.2: judge whether the device is normal: the device's self-test compares the data returned by each module and reports "normal" or "fault"; on "fault", prompt the fault and jump to the exit check in step 8.11; on "normal", go to step 8.3;
step 8.3: judge whether to update the device: when the device is connected to the training control computer over USB, its data can be updated, the updates mainly covering the recognition algorithms and system optimization; when an update is needed, run the update program, otherwise go to step 8.4;
step 8.4: judge whether the setting mode is automatic: the manual/automatic key selects between automatic and manual setting; the default is automatic, which proceeds directly to step 8.5; in manual mode the working mode is selected by hand, jumping to the working-mode setting in step 8.6;
step 8.5: environmental noise and brightness collection and processing: the working mode is set automatically from the noise and light conditions collected by the device: when the noise is below reference threshold 1, mode "1" is selected; when the noise is at or above reference threshold 1 but below reference threshold 2, mode "2" or "3" is selected; when the noise exceeds reference threshold 2, mode "4" or "5" is selected; brightness takes effect only in modes "3", "4", and "5": when the brightness is below the reference brightness threshold the LED illuminator is turned on, otherwise it is turned off; when processing is complete, go to step 8.6;
step 8.6: set the working mode: the automatic working mode is set by the environmental noise and brightness processing; the manual working mode is selected with the working-mode selection key of the human-computer communication control module, the initial working mode being "1", with the last used mode becoming the initial state thereafter; each key press advances the working mode cyclically, and 3 seconds after the last press the setting is committed automatically and step 8.7 is executed; in addition, the LED switch key sets the working state of the LED illuminator;
step 8.7: judge the value X of mode "X", 1 to 5: "1" executes the conventional audio voice-transmission mode; "2" the conventional combined throat-microphone voice mode; "3" the conventional combined lip-reading voice mode; "4" the throat-plus-lip-reading voice mode; "5" the voice mode combining all three; according to the selection, execute the corresponding voice-transmission mode in step 8.8;
step 8.8: execute the voice-transmission mode: execute the voice-transmission mode corresponding to the current working mode, specifically:
(1) conventional audio voice-transmission mode: only the sound sensor is active; the bone-conduction sensor and video acquisition do not take part in speech recognition;
(2) conventional combined throat-microphone voice mode: the sound sensor and bone-conduction sensor are active; video acquisition does not take part in speech recognition;
(3) conventional combined lip-reading voice mode: the sound sensor and video acquisition are active; the bone-conduction sensor does not take part in speech recognition;
(4) throat-plus-lip-reading voice mode: the bone-conduction sensor and video acquisition are active; the sound sensor does not take part in speech recognition;
(5) voice mode combining all three: the sound sensor, bone-conduction sensor, and video acquisition are all active simultaneously and comprehensive fusion recognition is performed; then go to step 8.9;
step 8.9: speech information output: after outputting the fused speech information, go to step 8.10;
step 8.10: judge whether an interrupt occurred: check for an external interrupt; if none, jump back to the mode judgment in step 8.7; otherwise go to step 8.11;
step 8.11: judge whether to exit: check for an exit signal; if none, jump back to step 8.3; otherwise go to step 8.12;
step 8.12: exit: quit and end the control program.
CN202010984329.1A 2020-09-18 2020-09-18 Multi-mode voice recognition speech transmitting device and control method thereof Active CN112164389B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010984329.1A CN112164389B (en) 2020-09-18 2020-09-18 Multi-mode voice recognition speech transmitting device and control method thereof


Publications (2)

Publication Number Publication Date
CN112164389A true CN112164389A (en) 2021-01-01
CN112164389B CN112164389B (en) 2023-06-02

Family

ID=73859129

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010984329.1A Active CN112164389B (en) 2020-09-18 2020-09-18 Multi-mode voice recognition speech transmitting device and control method thereof

Country Status (1)

Country Link
CN (1) CN112164389B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7162426B1 (en) * 2000-10-02 2007-01-09 Xybernaut Corporation Computer motherboard architecture with integrated DSP for continuous and command and control speech processing
CN101025860A (en) * 2006-02-24 2007-08-29 环达电脑(上海)有限公司 Digital media adaptor with voice control function and its voice control method
CN101101752A (en) * 2007-07-19 2008-01-09 华中科技大学 Monosyllabic language lip-reading recognition system based on vision character
CN102360187A (en) * 2011-05-25 2012-02-22 吉林大学 Chinese speech control system and method with mutually interrelated spectrograms for driver
CN203070756U (en) * 2012-12-13 2013-07-17 合肥寰景信息技术有限公司 Motion recognition and voice synthesis technology-based sign language-lip language intertranslation system
CN104570835A (en) * 2014-12-02 2015-04-29 苏州长风航空电子有限公司 Cockpit voice command control system and operating method thereof
CA3092795A1 (en) * 2020-09-10 2022-03-10 Holland Bloorview Kids Rehabilitation Hospital Customizable user input recognition systems
CN115691498A (en) * 2021-07-29 2023-02-03 华为技术有限公司 Voice interaction method, electronic device and medium


Non-Patent Citations (3)

Title
ALDAHOUD A等: "Robust automatic speech recognition system implemented in a hybrid design DSP-FPGA", 《INTERNATIONAL JOURNAL OF SIGNAL PROCESSING, IMAGE PROCESSING AND PATTERN RECOGNITION》 *
LIANG Tao et al.: "Design and Implementation of a Speaker Recognition System Based on FPGA and DSP", Application of Electronic Technique *
WANG Mengjun: "Detection and Processing of Visual Information in a Lip-Reading Voice Generator", China Doctoral Dissertations Full-text Database, Information Science and Technology *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant