CN112164389B - Multi-mode voice recognition speech transmitting device and control method thereof - Google Patents


Info

Publication number
CN112164389B
CN112164389B (Application CN202010984329.1A)
Authority
CN
China
Prior art keywords
mode
module
voice
audio
processing module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010984329.1A
Other languages
Chinese (zh)
Other versions
CN112164389A (en)
Inventor
吴传贵
阚艳
徐贵力
周勇军
李珊珊
胡伟
韩梁
张小辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Run Wuhu Machinery Factory
Original Assignee
State Run Wuhu Machinery Factory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Run Wuhu Machinery Factory filed Critical State Run Wuhu Machinery Factory
Priority application: CN202010984329.1A
Publication of application: CN112164389A
Application granted
Publication of grant: CN112164389B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/01: Assessment or evaluation of speech recognition systems
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/24: Speech recognition using non-acoustical features
    • G10L 15/25: Speech recognition using position of the lips, movement of the lips or face analysis
    • G10L 15/28: Constructional details of speech recognition systems
    • G10L 2015/223: Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Selective Calling Equipment (AREA)

Abstract

The invention relates to the field of multi-mode voice recognition and speech transmission, and in particular to a multi-mode voice recognition speech transmitting device and a control method thereof. The device comprises a power supply module and further comprises an FPGA central processing module, a 2DSP operation processing module, an audio/video input/output module, a man-machine communication control module and a software program module. The control method comprises the following steps: step 8.1: initialization and self-detection; step 8.2: judging whether the device is normal; step 8.3: judging whether the device is to be updated; step 8.4: judging whether the automatic setting mode is used; step 8.5: environmental noise and brightness acquisition and processing; step 8.6: setting the working mode; step 8.7: judging whether the mode is "X", where X ranges from 1 to 5; step 8.8: executing the speech transmission mode; step 8.9: outputting voice information; step 8.10: judging whether to interrupt; step 8.11: judging whether to exit; step 8.12: exiting. The invention realizes multi-mode speech recognition and transmission and improves the real-time performance and accuracy of speech transmission.

Description

Multi-mode voice recognition speech transmitting device and control method thereof
Technical Field
The invention relates to the field of multi-mode voice recognition and speech transmission, in particular to a multi-mode voice recognition speech transmission device and a control method thereof.
Background
In the field of speech recognition, direct speech processing, throat-microphone speech transmission or lip-reading technology has mostly been adopted in the past, and conventional earphones, noise-reduction enhancement processing and bone-conduction earphones based on bone-conduction technology can meet the needs of speech communication in many situations under ordinary noise interference. With the widespread use of general aircraft and their engines, the engines become the main source of field noise during test flights and test runs; large aircraft engines in particular produce noise over a wide frequency range and at high decibel levels, seriously affecting the normal work communication of field staff. At present noise-reduction headphones are used, which greatly reduce the interference of noise on staff, but they still cannot meet the need of operators to communicate with each other: communication can only be carried out through gestures or other means, and much information cannot be expressed and transmitted in time.
During the inspection of an aircraft or an aero-engine, operators need to communicate information effectively, and traditional voice communication methods and devices relying only on noise-reduction headphones, gestures or semaphores can hardly meet the operational requirements. New technologies and methods are therefore required to improve the effectiveness and rigor of voice recognition and speech transmission and to promote the efficient and safe inspection of aircraft and aero-engines.
Chinese patent application No. 201910032244.0 discloses an intelligent headset and earphone system that can perform the corresponding functional operations according to voice instructions, without manual key presses, improving the user experience; the system includes a microphone, a voice processing module, a central processing module, an audio processing module and a speaker. Its disadvantage is that the system is only used for voice command control and does not involve multi-mode voice recognition and effective communication under loud noise.
Chinese patent application No. 201910012835.1 discloses a hybrid active noise-reduction earphone, a noise-reduction method and a storage medium, which can select the most suitable noise-reduction system coefficients and track changes in the noise signal more quickly and accurately, greatly improving the noise-reduction effect. The hybrid active noise-reduction earphone comprises an active noise control system, a reference microphone and a noise-cancellation microphone. The system only performs loop iteration to select the noise-reduction system coefficients, and multi-mode voice recognition and effective communication under loud noise are not involved.
Chinese patent application No. 201810422275.2 discloses a lip detection and reading method based on cascade feature extraction, which can improve the speed and accuracy of lip reading. Its shortcoming is that it only performs multi-stage extraction and dimensionality reduction of lip-region image features, and multi-mode voice recognition and effective communication under loud noise are not involved.
Chinese patent No. 201611086527.6 provides a throat-microphone audio enhancement processing module; the apparatus comprises a de-breathing-sound signal processing board, a power supply and an audio output switch. Its disadvantage is that the module only improves the clarity and intelligibility of the throat microphone, without involving multi-mode voice recognition and effective communication under loud noise.
Signal Processing, 2019, No. 2, pp. 293-299, discloses a land-air communication voice recognition method based on a BiLSTM/CTC model. Aimed at the characteristics of civil-aviation land-air communication language, a BiLSTM network is trained to obtain the BiLSTM/CTC model, and the acoustic model, a language model and a land-air communication dictionary are used to recognize civil-aviation land-air communication speech. Its disadvantage is that the enhanced acoustic model only reduces the word recognition error rate of land-air communication speech to 5.53%, and multi-mode voice recognition and effective communication under high noise are not involved.
High Technology Letters, 2019, No. 3, pp. 287-294, discloses a cerebral palsy rehabilitation training system based on human-computer interaction (HCI) design with expression and voice interaction. The system combines an upper computer with a lower computer: the lower computer uses a 51-series single-chip microcomputer as the main driver and an LD3320 voice chip for voice acquisition, and the C-language lower computer is connected to the LabVIEW upper computer by serial-port communication, so that voice semantic rules are recognized and matched and test-completion judgment and statistics are realized. The system only provides expression and voice human-computer interaction, and multi-mode voice recognition and effective communication under loud noise are not involved.
In summary, existing speech recognition design research mainly concerns headset and earphone systems, lip detection and reading methods, throat microphones and related improvements in speech recognition quality, applied to functions such as cerebral palsy rehabilitation training systems, civil-aviation land-air communication and voice command control; little research addresses methods and devices for multi-mode voice recognition and effective speech transmission under loud noise. It is therefore necessary to develop a multi-mode speech recognition speech transmitting device for loud-noise environments and a control method thereof.
Disclosure of Invention
In order to solve the above problems, the present invention provides a multi-mode speech recognition speech transmitting device and a control method thereof.
A multi-mode speech recognition speech transmitting device comprises a power supply module, which converts the supplied power into the working voltages of the device, and further comprises:
the FPGA central processing module is connected with the power supply module and is used for realizing central processing;
the 2DSP operation processing module is connected with the FPGA central processing module and the power module and is used for realizing lip segmentation, feature extraction, lip speech recognition and fusion recognition type operation processing functions of video digital signals;
the audio/video input/output module is connected with the FPGA central processing module and the power supply module and outputs the audio signals after processing and fusion through the synthesized audio output circuit;
the man-machine communication control module is connected with the FPGA central processing module and the power module and used for completing power switch, mode selection, working state display, light induction and LED luminous control;
and the software program module is connected with the FPGA central processing module and used for completing fusion identification and decision output of the audio and the video.
The FPGA central processing module comprises: an SAA7111 digital decoder for digital processing of video signals; a FIFO unit, connected with the SAA7111 digital decoder and the audio/video input/output module, which buffers the input of front-stage data and the output of rear-stage data; a DSP unit, externally connected with the audio/video input/output module as a virtual DSP, which provides the input/output functions for the external audio signals of the FPGA central processing module; a CPLD unit, connected with the man-machine communication control module through GPIO, which realizes signal control between the internal functional modules; an SRIO communication and data cache module, serving as the communication and data cache part of the FPGA central processing module, which provides its high-speed data processing functions; and a signal configuration integration module, serving as one of the external interface connection circuits of the FPGA central processing module, which realizes the configuration and integration of signals.
The 2DSP operation processing module comprises a DSP1 unit and a DSP2 unit which are respectively connected with the FPGA central processing module through SRIO 1X interfaces.
The 2DSP operation processing module adopts two TMS320C6455 processors, which meet the real-time and recognition-rate requirements of the device, optimize the image-information processing capability and expandability of the system, and realize the voice recognition, lip recognition and fusion decision of the speech transmitting device.
The audio/video input/output module comprises: a video collector, connected through a video signal wire with the SAA7111 digital decoder in the FPGA central processing module, which provides the original video signal source for lip recognition; a TLV320AIC23B sound acquisition chip, connected through IIC and McASP interfaces with the DSP unit of the FPGA central processing module, which receives audio signals, realizes chip control and data transmission, and outputs the processed and fused audio signals through a synthesized audio output circuit; a bone sensor and a sound sensor, which provide conventional audio and bone-conduction audio signals to the TLV320AIC23B sound acquisition chip through audio signal wires; and an SDRAM1 unit, which provides extended external data storage space for the DSP unit.
The man-machine communication control module comprises: a Cy7C68013A communication controller, connected through GPIO with the signal configuration integration module in the FPGA central processing module, which provides the USB communication function, communicates with an external training control computer, and completes data downloading and the receive-status reply after training; key switches, connected through GPIO with the CPLD unit of the FPGA central processing module, which respectively perform power switching, mode selection and working-state display; and a light-induction control circuit and an LED lighting control circuit for light induction and LED lighting control.
The key switches comprise a power switch, control keys, a brightness knob, a liquid crystal display screen, a mouse and a numeric keypad.
The software program module comprises: an upper computer training control software module, which trains the recognition algorithm and downloads and uploads data; an embedded system main-flow module, which interacts with the upper computer training control software module and performs initialization, self-detection, fault-state storage and prompting, data updating and USB communication; and an embedded system algorithm module, which performs the fusion recognition and decision output of audio and video.
The audio recognition of the embedded system algorithm module consists of audio collection, preprocessing, vector quantization, speech synthesis and speech recognition; the video recognition consists of video collection, preprocessing, lip segmentation, lip feature extraction and visual recognition.
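The stage names above define two recognition pipelines whose internal algorithms the patent does not specify. The following minimal Python sketch only illustrates the ordering of the stages with placeholder functions; the stage names come from the text, while the pipeline mechanism itself is an illustrative assumption:

```python
# Hedged sketch: the patent names the stages of the embedded algorithm module
# (audio: collection -> preprocessing -> vector quantization -> speech synthesis
#  -> speech recognition; video: collection -> preprocessing -> lip segmentation
#  -> lip feature extraction -> visual recognition) but not their implementations.
# The stage functions below are placeholders illustrating the data flow only.

def run_pipeline(sample, stages):
    """Pass a sample through an ordered list of stage functions."""
    for stage in stages:
        sample = stage(sample)
    return sample

def make_stage(name):
    # Placeholder stage: records the processing step it stands in for.
    def stage(history):
        return history + [name]
    return stage

AUDIO_STAGES = [make_stage(s) for s in
                ["collect", "preprocess", "vector_quantize",
                 "synthesize", "recognize_speech"]]
VIDEO_STAGES = [make_stage(s) for s in
                ["collect", "preprocess", "segment_lips",
                 "extract_lip_features", "recognize_visually"]]

audio_result = run_pipeline([], AUDIO_STAGES)
video_result = run_pipeline([], VIDEO_STAGES)
```

In the device itself these stages would run on the FPGA/DSP hardware; the sketch only fixes the order in which the patent lists them.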
A control method of a multimode voice recognition speech transmitting device comprises the following specific steps:
step 8.1: initialization and self-detection: the multi-mode voice recognition speech transmitting device and its control program are initialized, device hardware self-detection is performed, and the working states of all modules of the device are acquired; after completion, the next step 8.2 is executed;
step 8.2: judging whether the device is normal: according to the device self-detection, the data returned from each module are comprehensively compared to determine whether the device is normal; on a fault, a fault prompt is given and the method jumps to step 8.11 (judging whether to exit); when normal, the next step 8.3 is executed;
step 8.3: judging whether the device is to be updated: when the device is connected to the training control computer through USB, its data can be updated; the update content is mainly the recognition algorithm and system optimization. When an update is needed, the update program is executed; otherwise the next step 8.4 is executed;
step 8.4: judging whether the automatic setting mode is used: the system provides an automatic setting mode and a manual setting mode via a manual/automatic key and defaults to the automatic setting mode, proceeding directly to the next step 8.5; when the manual mode is selected, the method jumps to the working-mode setting of step 8.6, where the working mode is set manually;
step 8.5: environmental noise and brightness acquisition and processing: according to the noise and brightness collected by the device, the working mode is set automatically. When the noise is less than reference threshold 1, mode "1" is selected; when the noise is greater than or equal to reference threshold 1 and less than reference threshold 2, mode "2" or "3" is selected; when the noise is greater than or equal to reference threshold 2, mode "4" or "5" is selected. Brightness is effective only when working in modes "3", "4" and "5": when the brightness is less than the reference brightness threshold, the LED illuminator is turned on, otherwise it is turned off. After processing is finished, the next step 8.6 is executed;
step 8.6: setting the working mode: the automatic working mode is set by the environmental noise and brightness acquisition and processing of step 8.5; the manual working mode is mainly selected with the working-mode selection key of the man-machine communication control module. The initial working-mode state of the system is "1", and after operation the last working mode becomes the initial state. Each key press cycles the working mode in turn, and holding the key for 3 seconds switches to automatic working-mode setting; the next step 8.7 is then executed. In addition, the working state of the LED illuminator can be set with the LED switch key;
step 8.7: judging whether the mode is "X", where X ranges from 1 to 5: when "1", the conventional audio speech transmission mode is executed; when "2", the conventional-plus-throat speech transmission mode; when "3", the conventional-plus-lip-reading speech transmission mode; when "4", the throat-plus-lip-reading speech transmission mode; and when "5", the three-way combined speech transmission mode. According to the mode selected, the corresponding speech transmission mode of step 8.8 is executed;
step 8.8: executing the speech transmission mode: according to the current working mode, the corresponding speech transmission mode is executed, specifically:
1. in the conventional audio speech transmission mode, only the sound sensor works effectively; the bone sensor and video acquisition do not participate in voice recognition;
2. in the conventional-plus-throat speech transmission mode, mainly the sound sensor and the bone sensor work effectively; video acquisition does not participate in voice recognition;
3. in the conventional-plus-lip-reading speech transmission mode, mainly the sound sensor and video acquisition work effectively; the bone sensor does not participate in voice recognition;
4. in the throat-plus-lip-reading speech transmission mode, mainly the bone sensor and video acquisition work effectively; the sound sensor does not participate in voice recognition;
5. in the three-way combined speech transmission mode, the sound sensor, the bone sensor and video acquisition all work effectively at the same time, and comprehensive fusion recognition is carried out; the next step 8.9 is then executed;
step 8.9: voice information output: after the fused voice information is output, the next step 8.10 is executed;
step 8.10: judging whether to interrupt: check whether there is an external interruption; if not, jump to step 8.7 (judging whether the mode is "X"), otherwise execute the next step 8.11;
step 8.11: judging whether to exit: check whether there is an exit signal; if not, jump to step 8.3 (judging whether the device is to be updated), otherwise execute the next step 8.12;
step 8.12: exiting: the program exits and the control program ends.
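The mode-selection and LED logic of steps 8.5 to 8.8 can be sketched as follows. The patent gives no numeric values for reference thresholds 1 and 2 or the brightness threshold, and no rule for choosing between modes "2"/"3" and "4"/"5"; the constants and the lip-reading-availability tie-breaker below are illustrative assumptions only:

```python
# Hedged sketch of the automatic mode selection (step 8.5), LED control and
# per-mode sensor participation (step 8.8). All threshold values are assumed.

NOISE_THRESHOLD_1 = 60.0      # dB, assumed reference threshold 1
NOISE_THRESHOLD_2 = 90.0      # dB, assumed reference threshold 2
BRIGHTNESS_THRESHOLD = 50.0   # assumed reference brightness threshold

def select_mode(noise, lip_reading_available=True):
    """Automatic working-mode selection from the measured noise level."""
    if noise < NOISE_THRESHOLD_1:
        return 1                                     # conventional audio only
    if noise < NOISE_THRESHOLD_2:
        return 3 if lip_reading_available else 2     # audio + lip, or audio + throat
    return 5 if lip_reading_available else 4         # full fusion, or throat + lip

def led_state(mode, brightness):
    """LED illuminator control: brightness only matters in modes 3, 4 and 5."""
    if mode in (3, 4, 5) and brightness < BRIGHTNESS_THRESHOLD:
        return "on"
    return "off"

# Which sensors participate in recognition for each working mode (step 8.8).
MODE_SENSORS = {
    1: {"sound"},
    2: {"sound", "bone"},
    3: {"sound", "video"},
    4: {"bone", "video"},
    5: {"sound", "bone", "video"},
}
```

In the device this decision runs on the embedded hardware; the sketch only makes the threshold comparisons described in the steps explicit.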
The beneficial effects of the invention are as follows: the invention realizes multi-mode speech recognition and transmission, with mode selection, conventional audio and throat speech transmission, lip recognition, and fused speech transmission capability. It is used for speech transmission under the loud-noise conditions of aircraft or aero-engine inspection: operators can transmit speech through multi-mode switching and combination, better guaranteeing the effectiveness and reliability of the communication information capacity and conversation quality, in particular through the recognition and fusion of audio and video (speech and lip) signals; and the embedded software design improves the real-time performance, accuracy and practicability of speech transmission.
Drawings
The invention will be further described with reference to the drawings and examples.
FIG. 1 is a schematic diagram of the structure of the present invention;
FIG. 2 is a schematic diagram of a software program module structure according to the present invention;
FIG. 3 is a schematic flow chart of the control method of the present invention.
Detailed Description
The technical means, creative features, objectives and effects of the present invention are further described below so that they are easy to understand.
As shown in figs. 1 to 3, a multi-mode speech recognition speech transmitting device includes a power supply module, which converts the supplied power into the working voltages of the device, and further includes:
the FPGA central processing module is connected with the power supply module and is used for realizing central processing;
the 2DSP operation processing module is connected with the FPGA central processing module and the power module and is used for realizing lip segmentation, feature extraction, lip speech recognition and fusion recognition type operation processing functions of video digital signals;
the audio/video input/output module is connected with the FPGA central processing module and the power supply module and outputs the audio signals after processing and fusion through the synthesized audio output circuit;
the man-machine communication control module is connected with the FPGA central processing module and the power module and used for completing power switch, mode selection, working state display, light induction and LED luminous control;
and the software program module is connected with the FPGA central processing module and used for completing fusion identification and decision output of the audio and the video.
The invention realizes multi-mode speech recognition and transmission, with mode selection, conventional audio and throat speech transmission, lip recognition, and fused speech transmission capability. It is used for speech transmission under the loud-noise conditions of aircraft or aero-engine inspection: operators can transmit speech through multi-mode switching and combination, better guaranteeing the effectiveness and reliability of the communication information capacity and conversation quality, in particular through the recognition and fusion of audio and video (speech and lip) signals; and the embedded software design improves the real-time performance, accuracy and practicability of speech transmission.
The FPGA central processing module comprises: an SAA7111 digital decoder for digital processing of video signals; a FIFO unit, connected with the SAA7111 digital decoder and the audio/video input/output module, which buffers the input of front-stage data and the output of rear-stage data; a DSP unit, externally connected with the audio/video input/output module as a virtual DSP, which provides the input/output functions for the external audio signals of the FPGA central processing module; a CPLD unit, connected with the man-machine communication control module through GPIO, which realizes signal control between the internal functional modules; an SRIO communication and data cache module, serving as the communication and data cache part of the FPGA central processing module, which provides its high-speed data processing functions; and a signal configuration integration module, serving as one of the external interface connection circuits of the FPGA central processing module, which realizes the configuration and integration of signals.
The FPGA central processing module combines a high-performance FPGA with SRIO communication and a data caching function, so that the SAA7111 digital decoder, FIFO unit, DSP unit, CPLD unit, SRIO communication and data cache module and signal configuration integration module work together to realize the comprehensive control, audio/video data processing, high-speed data transmission and signal configuration integration capability of the transmitter. The flexibility advantages of the FPGA are effectively exploited, improving the working efficiency and reliability of the circuit; the complexity of the device is also reduced, simplifying the circuit and lowering the hardware cost.
The FPGA central processing module is respectively connected with the 2DSP operation processing module, the audio/video input/output module, the man-machine communication control module and the power supply module: it performs interactive control with the 2DSP operation processing module through two SRIO 1X interfaces, is connected with the audio/video input/output module through video-signal, IIC and McASP interfaces respectively, performs control mainly through GPIO interfaces, and is supplied by the power supply module with the various voltages required for operation.
The FIFO unit is used as a data buffer and is mainly connected with the SAA7111 digital decoder and the DSP unit.
The SAA7111 digital decoder is used as a video acquisition processing unit and is respectively connected with the video acquisition unit, the CPLD unit and the FIFO unit.
The 2DSP operation processing module comprises a DSP1 unit and a DSP2 unit which are respectively connected with the FPGA central processing module through SRIO 1X interfaces.
The DSP unit is a virtual DSP externally connected with the audio/video input/output module: it controls the TLV320AIC23B sound acquisition chip through the IIC interface, receives data through the McASP interface, and is connected with the SDRAM1 unit through the EMIF interface to realize audio data acquisition and processing; it is mainly connected with the FIFO unit and the CPLD unit, providing the input/output functions for the external audio signals of the FPGA central processing module.
The CPLD unit, as one of the external interface connection circuits of the FPGA central processing module, is connected with the man-machine communication control module through GPIO, provides the man-machine interaction control, light-induction and LED lighting control functions of the FPGA central processing module, and is connected with the SAA7111 digital decoder, the DSP unit and the signal configuration integration module.
The 2DSP operation processing module adopts two TMS320C6455 processors, which meet the real-time and recognition-rate requirements of the device, optimize the image-information processing capability and expandability of the system, and realize the voice recognition, lip recognition and fusion decision of the speech transmitting device.
The 2DSP operation processing module realizes the voice recognition, lip recognition and fusion decision capability of the speech transmitting device. Two TMS320C6455 processors are selected for high-speed data interaction with the FPGA central processing module through SRIO 1X; they can process in parallel and rapidly complete computation-intensive operations, meeting the real-time and recognition-rate requirements of the device and optimizing the image-information processing capability and expandability of the system.
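The patent states that the two DSPs fuse the audio-recognition and lip-recognition results into a decision output but does not disclose the fusion rule. A weighted late fusion over per-candidate confidence scores, sketched below, is one common way such a fusion decision can be made; the weights and scores are illustrative assumptions, not values from the patent:

```python
# Hedged sketch of a late-fusion decision: combine per-candidate confidence
# scores from the audio recognizer and the lip-reading recognizer with fixed
# weights, then pick the highest fused score. The weighting scheme and the
# 0.7/0.3 split are assumptions for illustration only.

def fuse_decisions(audio_scores, video_scores, audio_weight=0.7):
    """Combine per-candidate confidence scores from the two recognizers."""
    video_weight = 1.0 - audio_weight
    candidates = set(audio_scores) | set(video_scores)
    fused = {
        w: audio_weight * audio_scores.get(w, 0.0)
           + video_weight * video_scores.get(w, 0.0)
        for w in candidates
    }
    # Decision output: the candidate with the highest fused score wins.
    return max(fused, key=fused.get), fused

word, scores = fuse_decisions(
    {"start": 0.4, "stop": 0.6},   # audio recognizer confidences (assumed)
    {"start": 0.9, "stop": 0.1},   # lip-reading recognizer confidences (assumed)
)
```

Here the lip reading overrules the noisier audio channel, which is exactly the situation the high-noise modes are meant to handle.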
The SRIO communication and data cache module, as the communication and data cache part of the FPGA central processing module, performs interactive control with the 2DSP operation processing module through two SRIO 1X interfaces and is internally connected with the signal configuration integration module.
The signal configuration integration module, as one of the external interface connection circuits of the FPGA central processing module, is connected with the man-machine communication control module through GPIO and provides the USB communication function of the FPGA central processing module.
The audio/video input/output module comprises: a video collector, connected through a video signal wire with the SAA7111 digital decoder in the FPGA central processing module, which provides the original video signal source for lip recognition; a TLV320AIC23B sound acquisition chip, connected through IIC and McASP interfaces with the DSP unit of the FPGA central processing module, which receives audio signals, realizes chip control and data transmission, and outputs the processed and fused audio signals through a synthesized audio output circuit; a bone sensor and a sound sensor, which provide conventional audio and bone-conduction audio signals to the TLV320AIC23B sound acquisition chip through audio signal wires; and an SDRAM1 unit, which provides extended external data storage space for the DSP unit.
The audio/video input/output module realizes the audio/video acquisition, voice information preprocessing and synthesized audio output capabilities of the speech transmitter. The TLV320AIC23B sound acquisition chip handles the input and output of multiple audio channels and provides programmable gain adjustment, satisfying the high-performance audio input/output requirements of the transmitter at low power consumption and improving the energy efficiency of the device.
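The TLV320AIC23B's control registers are programmed with 16-bit words (a 7-bit register address followed by 9 bits of data). A minimal sketch of building such a word follows; the register address and the default gain value are taken from the TI datasheet, not from this patent, and should be verified against it:

```python
def aic23_control_word(reg_addr: int, data9: int) -> int:
    """Build a 16-bit control word for the TLV320AIC23B codec:
    upper 7 bits select a control register, lower 9 bits carry the data."""
    assert 0 <= reg_addr < 0x10, "documented register addresses are 0x00-0x0F"
    assert 0 <= data9 < 0x200, "9-bit data field"
    return (reg_addr << 9) | data9

# Example: the left line-input volume register (address 0x00 on this codec);
# its low bits set the programmable input gain mentioned above.
LEFT_LINE_IN = 0x00
word = aic23_control_word(LEFT_LINE_IN, 0b000010111)  # datasheet default level
```

The same helper applies to the other codec registers (headphone volume, analog/digital audio paths, power-down control), each of which is one such 16-bit write.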
The man-machine communication control module comprises: a Cy7C68013A communication controller, connected through GPIO with the signal configuration integration module in the FPGA central processing module, which provides the USB communication function, communicates with the external training control computer, and completes data downloading and status replies after training; a CPLD unit, connected with the FPGA central processing module through GPIO, which handles the key switches (power switch, mode selection, working state display, light sensing) and LED lighting control; a light sensing control circuit; and an LED lighting control circuit.
The man-machine communication control module realizes human-machine control of the transmitter and USB communication with the host computer. The Cy7C68013A controller embeds a microprocessor and supports the USB 2.0 protocol, so data transmission to the USB port is completed simply by configuring a few registers and memories; this simplifies program design, raises the transmission rate and increases reliability.
The key switches comprise a power switch, control keys, a brightness knob, a liquid crystal display screen, a mouse and a numeric keypad.
The software program module comprises: an upper computer training control software module, which realizes training of the recognition algorithm and data downloading and uploading; an embedded system main flow module, which interacts with the upper computer training control software module and completes initialization, self-detection, fault state storage and prompting, data updating and USB communication; and an embedded system algorithm module, which completes the fusion recognition and decision output of audio and video.
The software program module realizes the software function control and operation capability of the speech transmitter. The embedded system performs speech recognition and fusion decision, while the PC system completes the codebook training and speech template training. This clear division of work between the embedded system and the PC meets the recognition rate and recognition time requirements, simplifies the system and reduces hardware cost.
The audio recognition of the embedded system algorithm module consists of audio collection, preprocessing, vector quantization, voice synthesis and voice recognition; the video recognition consists of video collection, preprocessing, lip segmentation, lip feature extraction and visual recognition.
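The two recognition pipelines above can be pictured as ordered stage lists. The sketch below is purely illustrative: the stage names come from the text, but the handler functions are hypothetical placeholders, not the patent's algorithms:

```python
# Stage sequences of the embedded system algorithm module, as listed above.
AUDIO_STAGES = ["audio collection", "preprocessing", "vector quantization",
                "voice synthesis", "voice recognition"]
VIDEO_STAGES = ["video collection", "preprocessing", "lip segmentation",
                "lip feature extraction", "visual recognition"]

def run_pipeline(stages, handlers, data):
    """Pass `data` through one handler per stage, in order.
    The handlers stand in for the real signal-processing steps."""
    for name in stages:
        data = handlers[name](data)
    return data
```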
The embedded system algorithm module is the core of the software program module.
A control method of the multi-mode voice recognition speech transmitting device comprises the following steps:
step 8.1: initialization and self-detection: initialize the multi-mode voice recognition speech transmitting device and the control program, perform the device hardware self-detection and acquire the working states of all modules of the device, then execute step 8.2;
step 8.2: judging whether the device is normal: the device self-detection module comprehensively compares the data returned by each module and determines whether the device is normal; on a fault, a fault prompt is given and the flow jumps to step 8.11 (judging whether to exit); when normal, execute step 8.3;
step 8.3: judging whether to update the device: when the device is connected with the training control computer through USB, data updating can be performed; the update content mainly covers the recognition algorithm and system optimization; when an update is required, the update program is executed, otherwise execute step 8.4;
step 8.4: judging whether the mode setting is automatic: the system selects between the automatic and manual setting modes through the manual/automatic key and defaults to the automatic setting mode, in which case the flow proceeds directly to step 8.5; when the manual mode is selected, the flow jumps to the working mode setting of step 8.6;
step 8.5: environmental noise and brightness acquisition and processing: the working mode is set automatically according to the noise and brightness collected by the device: when the noise is below reference threshold 1, mode '1' is selected; when the noise is at or above reference threshold 1 but below reference threshold 2, mode '2' or '3' is selected; when the noise is at or above reference threshold 2, mode '4' or '5' is selected. The brightness is only considered in modes '3', '4' and '5': when the brightness is below the reference brightness threshold the LED illuminator is turned on, otherwise it is turned off. After processing, execute step 8.6;
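The threshold logic of step 8.5 can be sketched as follows. The patent leaves open how the device picks between modes '2'/'3' and between '4'/'5', so the `prefer_bone` tie-breaker below is an assumption, not the described behaviour:

```python
def auto_select_mode(noise, brightness, noise_thr1, noise_thr2, bright_thr,
                     prefer_bone=True):
    """Automatic working-mode selection of step 8.5.

    Returns (mode, led_on). `prefer_bone` is an assumed tie-breaker for the
    '2'/'3' and '4'/'5' choices, which the text leaves open.
    """
    if noise < noise_thr1:
        mode = 1                       # quiet environment: audio only
    elif noise < noise_thr2:
        mode = 2 if prefer_bone else 3
    else:
        mode = 4 if prefer_bone else 5
    # Brightness only matters in modes 3, 4 and 5, where lip reading is used.
    led_on = mode in (3, 4, 5) and brightness < bright_thr
    return mode, led_on
```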
step 8.6: setting the working mode: the automatic working mode is set by the environmental noise and brightness acquisition and processing of step 8.5; the manual working mode is mainly selected through the working mode selection key of the man-machine communication control module. The initial working mode of the system is '1', and after operation the last used working mode becomes the initial state. Each press of the key cycles through the working modes in turn; holding the key for 3 seconds confirms the working mode automatically, after which step 8.7 is executed. In addition, the working state of the LED illuminator can be set through the LED switch key;
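The key behaviour of step 8.6 (a short press cycles the mode, a 3-second hold confirms it) can be modelled as a small state holder. The class itself is illustrative; only the cycling rule and the 3-second timing come from the text:

```python
class ModeSelector:
    """Manual working-mode key of step 8.6: a short press cycles the mode
    1 -> 2 -> 3 -> 4 -> 5 -> 1; holding the key for 3 s confirms the mode."""
    HOLD_CONFIRM_S = 3.0

    def __init__(self, initial_mode=1):    # system starts in mode 1
        self.mode = initial_mode
        self.confirmed = False

    def press(self, held_seconds=0.0):
        if held_seconds >= self.HOLD_CONFIRM_S:
            self.confirmed = True          # mode set; proceed to step 8.7
        else:
            self.mode = self.mode % 5 + 1  # cycle through the modes in turn
        return self.mode
```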
step 8.7: judging whether the mode is 'X', where X ranges from 1 to 5: when '1', the conventional audio speech transmission mode is executed; when '2', the conventional-plus-throat speech transmission mode; when '3', the conventional-plus-lip-reading speech transmission mode; when '4', the throat-plus-lip-reading speech transmission mode; when '5', the three-way combined speech transmission mode. According to the selected mode, the corresponding transmission mode of step 8.8 is executed;
step 8.8: executing the speech transmission mode: according to the current working mode, the corresponding transmission mode is executed, specifically:
in the conventional audio speech transmission mode, only the sound sensor works effectively; the bone sensor and video acquisition do not participate in voice recognition;
in the conventional-plus-throat speech transmission mode, mainly the sound sensor and the bone sensor work effectively; video acquisition does not participate in voice recognition;
in the conventional-plus-lip-reading speech transmission mode, mainly the sound sensor and video acquisition work effectively; the bone sensor does not participate in voice recognition;
in the throat-plus-lip-reading speech transmission mode, mainly the bone sensor and video acquisition work effectively; the sound sensor does not participate in voice recognition;
in the three-way combined speech transmission mode, the sound sensor, the bone sensor and video acquisition all work effectively at the same time and comprehensive fusion recognition is performed; then execute step 8.9;
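The five cases of step 8.8 amount to a fixed mapping from working mode to active signal sources, which can be tabulated directly:

```python
# Active signal sources per working mode, tabulating the five cases of step 8.8.
MODE_SOURCES = {
    1: {"sound"},                   # conventional audio only
    2: {"sound", "bone"},           # conventional + throat (bone conduction)
    3: {"sound", "video"},          # conventional + lip reading
    4: {"bone", "video"},           # throat + lip reading
    5: {"sound", "bone", "video"},  # three-way fusion recognition
}

def active_sources(mode):
    """Return the sensors that participate in recognition for `mode`."""
    return MODE_SOURCES[mode]
```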
step 8.9: voice information output: after the fused voice information is output, execute step 8.10;
step 8.10: judging whether there is an interrupt: check whether an external interrupt exists; if not, jump back to step 8.7 (judging whether the mode is 'X'), otherwise execute step 8.11;
step 8.11: judging whether to exit: check whether an exit signal exists; if not, jump back to step 8.3 (judging whether to update the device), otherwise execute step 8.12;
step 8.12: exiting: exit and end the control program.
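The overall flow of steps 8.1 through 8.12 can be sketched as a control loop. Here `dev` is a hypothetical device object exposing the checks and actions named in the steps, not an interface defined by the patent, and the fault path is simplified to the jump-to-step-8.11 behaviour described in step 8.2:

```python
def control_loop(dev):
    """Control flow of steps 8.1-8.12. `dev` is a hypothetical device object
    exposing the checks and actions named in the steps."""
    dev.initialize()                         # step 8.1: init and self-detection
    if not dev.self_test_ok():               # step 8.2: device normal?
        dev.show_fault()
        if dev.exit_requested():             # step 8.11 on fault
            dev.shutdown()                   # step 8.12
            return
    while True:
        if dev.update_requested():           # step 8.3: update via USB?
            dev.run_update()
        mode = (dev.auto_mode()              # steps 8.4/8.5: automatic setting
                if dev.auto_setting()
                else dev.manual_mode())      # step 8.6: manual key selection
        while True:
            dev.transmit(mode)               # steps 8.7/8.8: run the mode
            dev.output_voice()               # step 8.9: fused voice output
            if dev.interrupted():            # step 8.10: interrupt?
                break
        if dev.exit_requested():             # step 8.11: exit signal?
            break
    dev.shutdown()                           # step 8.12: end control program
```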
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the invention is not limited to the embodiments described above; the embodiments and descriptions merely illustrate its principles, and various changes and modifications may be made without departing from its spirit and scope. The scope of the invention is defined by the appended claims and their equivalents.

Claims (8)

1. A multi-mode voice recognition speech transmitting device, comprising a power module that realizes the operating voltage conversion functions of the device from the power supply, characterized by further comprising:
the FPGA central processing module is connected with the power supply module and is used for realizing central processing;
the 2DSP operation processing module, connected with the FPGA central processing module and the power module, for realizing the lip segmentation, feature extraction, lip-reading recognition and fusion recognition operation processing functions on the video digital signal;
the audio/video input/output module is connected with the FPGA central processing module and the power supply module and outputs the audio signals after processing and fusion through the synthesized audio output circuit;
the man-machine communication control module is connected with the FPGA central processing module and the power module and used for completing power switch, mode selection, working state display, light induction and LED luminous control;
the software program module is connected with the FPGA central processing module and used for completing fusion identification and decision output of the audio and the video;
the FPGA central processing module comprises: an SAA7111 digital decoder for realizing digital processing of the video signal; a FIFO unit, connected with the SAA7111 digital decoder and the audio/video input/output module, for completing the input buffering of front-stage data and the output buffering of rear-stage data; a DSP unit, externally connected as a virtual DSP with the audio/video input/output module, for providing the input/output functions for external audio signals of the FPGA central processing module; a CPLD unit, connected through GPIO with the man-machine communication control module, for realizing signal control between the internal functional modules; an SRIO communication and data caching module, serving as the communication and data caching part of the FPGA central processing module, for providing its high-speed data processing function; and a signal configuration integration module, serving as one of the external interface connection circuits of the FPGA central processing module, for realizing the configuration and integration of signals;
the audio/video input/output module comprises: a video collector, connected through a video signal line with the SAA7111 digital decoder in the FPGA central processing module, which provides the original video signal source for lip-reading recognition; a DSP unit, connected with the FPGA central processing module through the IIC and McASP interfaces, which receives audio signals, performs chip control and data transmission, and outputs the processed and fused audio signal from the TLV320AIC23B sound acquisition chip through the synthesized audio output circuit; a bone sensor and a sound sensor, which supply bone-conduction and conventional audio signals to the TLV320AIC23B sound acquisition chip through audio signal lines; and an SDRAM1 unit, which provides extended external data storage space for the DSP unit.
2. The multi-mode voice recognition speech transmitting device according to claim 1, wherein: the 2DSP operation processing module comprises a DSP1 unit and a DSP2 unit, each connected with the FPGA central processing module through an SRIO 1X interface.
3. The multi-mode voice recognition speech transmitting device according to claim 2, wherein: the 2DSP operation processing module adopts two TMS320C6455 processors, which realize the voice recognition, lip-reading recognition and fusion decision of the speech transmitting device, meet the real-time and recognition-rate requirements of the device, and improve the image information processing capability and expandability of the system.
4. The multi-mode voice recognition speech transmitting device according to claim 1, wherein: the man-machine communication control module comprises: a Cy7C68013A communication controller, connected through GPIO with the signal configuration integration module in the FPGA central processing module, which provides the USB communication function, communicates with the external training control computer, and completes data downloading and status replies after training; a CPLD unit, connected with the FPGA central processing module through GPIO, which handles the key switches (power switch, mode selection, working state display, light sensing) and LED lighting control; a light sensing control circuit; and an LED lighting control circuit.
5. The multi-mode voice recognition speech transmitting device according to claim 4, wherein: the key switches comprise a power switch, control keys, a brightness knob, a liquid crystal display screen, a mouse and a numeric keypad.
6. The multi-mode voice recognition speech transmitting device according to claim 1, wherein: the software program module comprises: an upper computer training control software module, which realizes training of the recognition algorithm and data downloading and uploading; an embedded system main flow module, which interacts with the upper computer training control software module and completes initialization, self-detection, fault state storage and prompting, data updating and USB communication; and an embedded system algorithm module, which completes the fusion recognition and decision output of audio and video.
7. The multi-mode voice recognition speech transmitting device according to claim 6, wherein: the audio recognition of the embedded system algorithm module consists of audio collection, preprocessing, vector quantization, voice synthesis and voice recognition, and the video recognition consists of video collection, preprocessing, lip segmentation, lip feature extraction and visual recognition.
8. A control method using the multi-mode voice recognition speech transmitting device according to any one of claims 1 to 7, characterized by comprising the following steps:
step 8.1: initialization and self-detection: initialize the multi-mode voice recognition speech transmitting device and the control program, perform the device hardware self-detection and acquire the working states of all modules of the device, then execute step 8.2;
step 8.2: judging whether the device is normal: the device self-detection module comprehensively compares the data returned by each module and determines whether the device is normal; on a fault, a fault prompt is given and the flow jumps to step 8.11 (judging whether to exit); when normal, execute step 8.3;
step 8.3: judging whether to update the device: when the device is connected with the training control computer through USB, data updating is performed; the update content mainly covers the recognition algorithm and system optimization; when an update is required, the update program is executed, otherwise execute step 8.4;
step 8.4: judging whether the mode setting is automatic: the system selects between the automatic and manual setting modes through the manual/automatic key and defaults to the automatic setting mode, in which case the flow proceeds directly to step 8.5; when the manual mode is selected, the flow jumps to the working mode setting of step 8.6;
step 8.5: environmental noise and brightness acquisition and processing: the working mode is set automatically according to the noise and brightness collected by the device: when the noise is below reference threshold 1, mode '1' is selected; when the noise is at or above reference threshold 1 but below reference threshold 2, mode '2' or '3' is selected; when the noise is at or above reference threshold 2, mode '4' or '5' is selected. The brightness is only considered in modes '3', '4' and '5': when the brightness is below the reference brightness threshold the LED illuminator is turned on, otherwise it is turned off. After processing, execute step 8.6;
step 8.6: setting the working mode: the automatic working mode is set by the environmental noise and brightness acquisition and processing of step 8.5; the manual working mode is mainly selected through the working mode selection key of the man-machine communication control module. The initial working mode of the system is '1', and after operation the last used working mode becomes the initial state. Each press of the key cycles through the working modes in turn; holding the key for 3 seconds confirms the working mode automatically, after which step 8.7 is executed. In addition, the working state of the LED illuminator is set through the LED switch key;
step 8.7: judging whether the mode is 'X', where X ranges from 1 to 5: when '1', the conventional audio speech transmission mode is executed; when '2', the conventional-plus-throat speech transmission mode; when '3', the conventional-plus-lip-reading speech transmission mode; when '4', the throat-plus-lip-reading speech transmission mode; when '5', the three-way combined speech transmission mode. According to the selected mode, the corresponding transmission mode of step 8.8 is executed;
step 8.8: executing the speech transmission mode: according to the current working mode, the corresponding transmission mode is executed, specifically:
in the conventional audio speech transmission mode, only the sound sensor works effectively; the bone sensor and video acquisition do not participate in voice recognition;
in the conventional-plus-throat speech transmission mode, mainly the sound sensor and the bone sensor work effectively; video acquisition does not participate in voice recognition;
in the conventional-plus-lip-reading speech transmission mode, mainly the sound sensor and video acquisition work effectively; the bone sensor does not participate in voice recognition;
in the throat-plus-lip-reading speech transmission mode, mainly the bone sensor and video acquisition work effectively; the sound sensor does not participate in voice recognition;
in the three-way combined speech transmission mode, the sound sensor, the bone sensor and video acquisition all work effectively at the same time and comprehensive fusion recognition is performed; then execute step 8.9;
step 8.9: voice information output: after the fused voice information is output, execute step 8.10;
step 8.10: judging whether there is an interrupt: check whether an external interrupt exists; if not, jump back to step 8.7 (judging whether the mode is 'X'), otherwise execute step 8.11;
step 8.11: judging whether to exit: check whether an exit signal exists; if not, jump back to step 8.3 (judging whether to update the device), otherwise execute step 8.12;
step 8.12: exiting: exit and end the control program.
CN202010984329.1A 2020-09-18 2020-09-18 Multi-mode voice recognition speech transmitting device and control method thereof Active CN112164389B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010984329.1A CN112164389B (en) 2020-09-18 2020-09-18 Multi-mode voice recognition speech transmitting device and control method thereof

Publications (2)

Publication Number Publication Date
CN112164389A CN112164389A (en) 2021-01-01
CN112164389B true CN112164389B (en) 2023-06-02

Family

ID=73859129

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010984329.1A Active CN112164389B (en) 2020-09-18 2020-09-18 Multi-mode voice recognition speech transmitting device and control method thereof

Country Status (1)

Country Link
CN (1) CN112164389B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7162426B1 (en) * 2000-10-02 2007-01-09 Xybernaut Corporation Computer motherboard architecture with integrated DSP for continuous and command and control speech processing
CN101025860A (en) * 2006-02-24 2007-08-29 环达电脑(上海)有限公司 Digital media adaptor with voice control function and its voice control method
CN101101752A (en) * 2007-07-19 2008-01-09 华中科技大学 Monosyllabic language lip-reading recognition system based on vision character
CN102360187A (en) * 2011-05-25 2012-02-22 吉林大学 Chinese speech control system and method with mutually interrelated spectrograms for driver
CN203070756U (en) * 2012-12-13 2013-07-17 合肥寰景信息技术有限公司 Motion recognition and voice synthesis technology-based sign language-lip language intertranslation system
CN104570835A (en) * 2014-12-02 2015-04-29 苏州长风航空电子有限公司 Cockpit voice command control system and operating method thereof
CA3092795A1 (en) * 2020-09-10 2022-03-10 Holland Bloorview Kids Rehabilitation Hospital Customizable user input recognition systems
CN115691498A (en) * 2021-07-29 2023-02-03 华为技术有限公司 Voice interaction method, electronic device and medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Robust automatic speech recognition system implemented in a hybrid design DSP-FPGA; Aldahoud A et al.; International Journal of Signal Processing, Image Processing and Pattern Recognition; Vol. 6, No. 5; pp. 333-342 *
Detection and processing of visual information in a lip-reading voice generator; Wang Mengjun; China Doctoral Dissertations Full-text Database, Information Science and Technology; No. 8; pp. I138-22 *
Design and implementation of a speaker recognition system based on FPGA and DSP; Liang Tao et al.; Application of Electronic Technique; Vol. 34, No. 9; pp. 43-46 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant