CN208335743U - An intelligent robot semantic interaction system based on white-light communication and brain-like cognition - Google Patents
- Publication number: CN208335743U (application CN201820632770.1U)
- Authority: CN (China)
- Prior art keywords: white light, circuit, semantic, communication, cloud
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
- Classification: User Interface Of Digital Computer (AREA)
Abstract
The utility model discloses an intelligent robot semantic interaction system based on white-light communication and brain-like cognition. White-light communication provides physical positioning of the robot, allowing it to switch context modes across different scenes. The system combines offline and online cloud-based intelligent semantic interaction, realizing brain-like robot semantic interaction that works both offline and through the cloud. The online, cloud-based part consists of a general-purpose brain-like speech recognition cognitive model, a brain-like semantic interaction model, and a speech synthesis platform. It can greatly extend the applications of service robots, improve user experience, and provide personalized services tailored to different households.
Description
Technical field
The utility model relates to the field of intelligent robot voice interaction, and in particular to an intelligent robot semantic interaction system based on white-light communication and brain-like cognition.
Background art
With the continuous development of modern science and computer technology, people no longer want to be confined, when exchanging information with machines, to keyboard operation; a more convenient and natural mode of interaction is needed. Language is humanity's most important and most effective information channel, and enabling robots to understand human speech has long been a dream. The development of speech recognition technology has brought this ideal within reach.
The auditory system has always been an important component of an intelligent robot's perception system; its purpose is to better support information exchange between humans and robots. Unlike traditional data interaction through keyboard, mouse, and display, data transmission through hearing makes robots more anthropomorphic and intelligent. Auditory interaction involves advanced artificial intelligence technologies such as speech recognition, human-like knowledge-base construction, semantic retrieval, and speech synthesis, and it has broad application prospects and substantial practical value.
In current technical solutions for robot speech recognition, the traditional approach uses a speech chip or a microcontroller system to realize offline speech recognition. Its recognition rate is not high, and it is generally only capable of recognizing simple words and commands.
Another approach uses a communication module to realize remote speech recognition: the robot's voice control terminal collects speech and transmits it over the network to a remote computer for recognition.
With the emergence of platforms such as cloud computing and cloud storage, performing robot speech recognition on a cloud platform largely alleviates the low accuracy and small vocabulary of offline recognition.
Traditional intelligent interaction technology is often implemented on the service robot's own platform, for example simple speech recognition algorithms, video acquisition, and basic processing. Implementing more complex algorithms there is difficult, because they place very high demands on the computing speed of the robot control system, while problems such as the massive data storage required by pattern recognition likewise limit the further development of offline service robots.
Schemes that perform speech recognition on a remote computer have limited practicality and poor extensibility; their effect is similar to recognition on a local computer alone.
Current cloud-platform speech recognition schemes mostly analyze and recognize speech against a universal speech library and cannot reflect personalized features. They only analyze and recognize the speech signal transmitted to the cloud platform; they cannot support natural human-machine chat or operations with semantic content (for example, telling the robot to play a specific piece of music and having it download and play it), nor can they realize context-specific semantic interaction in different scenarios or make full use of contextual semantic information. In addition, cloud-platform speech recognition requires the robot system to maintain a network connection, so it cannot support offline intelligent robot interaction well.
Utility model content
To overcome the limitations of current speech recognition, the utility model provides an intelligent robot semantic interaction system and method, based on white-light communication and brain-like cognition, that can automatically adapt its recognition and interaction to the scene in which the speech occurs.
To achieve the above technical purpose, the technical solution of the utility model is as follows:
An intelligent robot semantic interaction system based on white-light communication and brain-like cognition comprises an offline voice acquisition and recognition hardware system, a brain-like semantic recognition and cognition hardware system, and a white-light communication and indoor context positioning system. The offline voice acquisition and recognition hardware system is communicatively connected to the brain-like semantic recognition and cognition hardware system and to the white-light communication and indoor context positioning system.
The offline voice acquisition and recognition hardware system includes an embedded control system, a speech recognition module, and an audio processing circuit; the embedded control system is communicatively connected to the speech recognition module and the audio processing circuit. Each place where scene recognition is required is provided with one speech recognition module and one audio processing circuit.
The brain-like semantic recognition and cognition hardware system includes an embedded control device, a remote communication module, and a remote speech and semantic recognition device. The embedded control device is communicatively connected to the remote speech and semantic recognition device through the remote communication module, and is also communicatively connected to the offline voice acquisition and recognition hardware system.
The white-light communication and indoor context positioning system includes multiple LED white-light circuits and an equal number of white-light identification circuits. Each place where scene recognition is required is provided with one LED white-light circuit and one white-light identification circuit for detecting the light emitted by the LED white-light circuit. Each white-light identification circuit is communicatively connected to the offline voice acquisition and recognition hardware system.
In the intelligent robot semantic interaction system based on white-light communication and brain-like cognition, the embedded control system of the offline voice acquisition and recognition hardware system includes an STM32 embedded system, the speech recognition module includes an LD3320 speech recognition module, and the audio processing circuit includes an audio filter circuit, an audio amplifier circuit, multiple microphone arrays, and multiple audio playback circuits. Each place where scene recognition is required is equipped with one microphone array, connected to the STM32 embedded system through the audio amplifier circuit and the audio filter circuit. The LD3320 speech recognition module and the multiple audio playback circuits are each connected to the STM32 embedded system, and each place where scene recognition is required is equipped with one audio playback circuit.
In the intelligent robot semantic interaction system based on white-light communication and brain-like cognition, the brain-like semantic cognition hardware system includes an embedded control device, a remote communication module, and a remote speech and semantic recognition device. The embedded control device includes an ARM11 embedded system; the remote communication module includes a WiFi communication module, a 4G mobile communication module, and a WLAN router; the remote semantic recognition device includes a cloud speech and semantic recognition platform, a cloud intelligent robot brain-like semantic interaction platform, and a cloud speech synthesis platform. The ARM11 embedded system is connected to the WLAN router through the WiFi communication module or the 4G mobile communication module; the cloud speech and semantic recognition platform is connected in sequence to the cloud intelligent robot brain-like semantic interaction platform and the cloud speech synthesis platform; the cloud semantic interaction platform and the cloud speech synthesis platform each communicate with the WLAN router; and the ARM11 embedded system is connected to the embedded control system of the offline voice acquisition and recognition hardware system.
In the intelligent robot semantic interaction system based on white-light communication and brain-like cognition, each LED white-light circuit of the white-light communication and indoor context positioning system includes a white-light LED array, an LED array driver circuit, an LED white-light communication signal modulation and demodulation circuit, and an STM32 controller for white-light driving and communication. The white-light LED array is installed at the corresponding place where scene recognition is required, and the STM32 controller is communicatively connected to the white-light LED array through the LED array driver circuit and the LED white-light communication signal modulation and demodulation circuit. Each white-light identification circuit includes a high-speed photodiode sensor array and an LED white-light demodulation circuit; the high-speed photodiode sensor array is installed at the corresponding place where scene recognition is required and is illuminated by the white-light LED array. The input of the LED white-light demodulation circuit is communicatively connected to the high-speed photodiode sensor array, and its output is communicatively connected to the offline voice acquisition and recognition hardware system.
An intelligent robot semantic interaction method based on white-light communication and brain-like cognition, using the above intelligent robot semantic interaction system, comprises the following steps:
Step 1: Simulate the hierarchical structure of the human brain using a cortical learning algorithm to build a brain-like speech recognition cognitive model. At the voice input end, use a generative adversarial network to expand the speech training data by varying the length of the raw voice data, adding interfering noise, and artificially manufacturing missing data, thereby enhancing the robustness of the speech recognition cognitive model.
Step 2: Using corpora for the different contexts associated with different locations, combined with brain-like sparse word-vector coding and a hierarchical temporal memory model, train a question-answering system to construct the brain-like semantic interaction system.
Step 3: Using an embedded system with an STM32 at its core, receive through photoelectric sensors the position codes and context information sent by the LED white-light arrays installed at the places where scene recognition is required; the decoded position and context data guide the speech recognition and brain-like semantic interaction system in selecting the corresponding online semantic base.
Step 4: The offline voice acquisition and recognition system acquires and pre-processes the speech and determines whether the system is online. When the system is offline, offline speech recognition and output are performed. When the system is online, the voice data is transmitted to the cloud brain-like speech and semantic recognition platform; the recognized semantic text is sent to the brain-like semantic interaction platform for analysis, which predicts the optimal answer using the knowledge base for the corresponding context; the result is returned to the speech synthesis platform for voice synthesis, and finally the synthesized voice is played to complete the intelligent human-machine interaction.
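The online/offline routing in Step 4 can be sketched as below. This is a minimal illustration, not the patent's implementation: `ping_cloud` and the two recognizer stubs are hypothetical placeholders for the cloud heartbeat, the LD3320 offline keyword path, and the cloud pipeline.

```python
CLOUD_TIMEOUT_S = 1.0   # how long to wait for a heartbeat reply (assumed value)
HEARTBEAT_PERIOD_S = 6  # the detailed description polls the server every 6 s

def ping_cloud():
    """Placeholder for the real heartbeat; returns True when the cloud answers."""
    return False  # stubbed offline in this sketch

def recognize_offline(audio):
    """Stand-in for the LD3320 keyword recognizer (offline path)."""
    return "offline-keyword"

def recognize_online(audio):
    """Stand-in for the cloud speech/semantic/synthesis pipeline (online path)."""
    return "cloud-text"

def interact_once(audio):
    # Step 4: check connectivity first, then route the utterance accordingly.
    if ping_cloud():
        return ("online", recognize_online(audio))
    return ("offline", recognize_offline(audio))

print(interact_once(b"\x00\x01"))  # takes the offline branch with this stub
```

The point of the branch is that the robot degrades gracefully: keyword-level recognition remains available even when the 6-second heartbeat to the cloud server fails.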
In the method, step 1 comprises the following steps:
1) Choose the hierarchical temporal memory cortical learning algorithm as the basis of the speech and semantic recognition model.
2) On the basis of the cortical learning algorithm, build a multi-layer brain-like speech recognition cognitive model that mimics the structure of the human brain, realizing brain-like deep learning of speech semantic sequences. The hierarchy includes a raw voice data sensing layer, an intermediate cortical learning layer, a semantic feature space layer, and a temporal layer. The raw voice data sensing layer takes digital audio data as input and outputs the audio data after voice endpoint detection to the cortical learning layer. The intermediate cortical learning layer recognizes the input voice data, whether real or virtually synthesized, and outputs binary word vectors. The semantic feature space layer takes the single word vectors output by the intermediate cortical learning layer and outputs sets of word vectors. The temporal layer assembles the word-vector sets of the semantic feature space layer into sentences and text data with temporal features, and uses context information to predict and recognize the voice data.
3) At the raw voice data sensing layer, attach a generative adversarial network to synthesize virtual data and expand the training samples. The generative adversarial network includes a generative model and a discriminative model used to train the generative model. The generative model captures the distribution of the sample data; the discriminative model is a binary classifier that judges whether its input is a real sample or a generated one. During training, one model is held fixed while the parameters of the other are updated so as to maximize the other side's error; through alternating iterations, the generative model finally estimates the distribution of the sample data, so that the virtual data it synthesizes approaches real sample data, completing the training of the generative model.
4) Using the trained generative model, generate K groups of virtual synthetic samples {Y_v^1, ..., Y_v^K} and add them to the speech training data to participate in training.
5) After the speech and semantic recognition model is built, the system is trained with recorded audio data as follows:
First, collect a public Mandarin corpus and voice dialogue text fragments for the different contexts, containing Mandarin recordings of speakers of different origins and genders; the total number of collected voice samples is N.
Then, segment the recorded corpus into words sentence by sentence, i.e., split each sentence into individual words; after all sentences are segmented, there are M words in total.
Train the brain-like speech semantic learning model on the N raw voice recordings and the M segmented words. During training, voice data is fed into the raw voice data sensing layer and the corresponding binary semantic text corpus data is generated from the temporal layer. At the same time, the generative adversarial network described above is applied to the raw corpus data at the raw voice data sensing layer to synthesize virtual samples, and the I virtually synthesized voice samples are trained together with the real data.
6) The training input of the speech and semantic recognition model is the voice data s_in; the predicted output is the semantic text sequence T_predict, and the corresponding true speech semantic text sequence is T_true, both represented in the temporal layer as word-vector text sequences. The residual between the two is δ = ||T_predict − T_true||_2. Let all parameters of the model be denoted W; an optimization method iterates the model parameters to minimize the residual δ, and iteration stops when δ falls below a small preset threshold ε, completing the training of the brain-like speech recognition cognitive model.
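The residual-driven training loop of item 6) can be sketched numerically. As an illustrative assumption, a plain linear model and gradient descent stand in for the cortical model and its optimizer; only the residual δ = ||T_predict − T_true||_2 and the stopping condition δ ≤ ε come from the description.

```python
import numpy as np

rng = np.random.default_rng(0)
s_in = rng.normal(size=(8, 4))     # voice-feature inputs (toy data)
W_target = rng.normal(size=(4, 3))
T_true = s_in @ W_target           # target semantic word-vector sequence

W = np.zeros((4, 3))               # model parameters W, iterated below
epsilon, lr = 1e-3, 0.05
for _ in range(5000):
    T_predict = s_in @ W
    delta = np.linalg.norm(T_predict - T_true)   # residual delta
    if delta <= epsilon:                         # stopping condition
        break
    W -= lr * s_in.T @ (T_predict - T_true)      # gradient step on 0.5*delta**2

print(delta <= epsilon)
```

The same skeleton (forward pass, residual, parameter update, threshold test) applies whatever model replaces the linear map.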
In the method, step 3) includes the following process:
1) The generative model is realized with a multi-layer perceptron. Let the voice data to be trained be S = [s_1, ..., s_n, ..., s_N], where N is the total number of voice samples and s_n is the n-th normalized binary voice feature vector, of dimension l (an integer with 0 < l ≤ L). By shifting the timing of the raw voice data, adding interfering noise, and artificially manufacturing missing voice data, three groups of virtually generated voice data sets are obtained: one whose n-th element is the virtual binary voice feature vector generated by shifting the voice data in time, one whose n-th element is generated by adding interfering noise, and one whose n-th element is generated by artificially removing data. Let S_v denote the total collection of the three virtual synthetic data sets.
2) With the parameters of the generative model fixed, every virtually generated voice sample is judged by the discriminative model, realized as a convolutional neural network with two convolutional layers, two max-pooling (sub-sampling) layers, and one output layer. The first layer is a convolutional layer with an i × i kernel; the second is a j × j max-pooling layer; the third is a convolutional layer with a k × k kernel; the fourth is a p × q max-pooling layer; and the last layer outputs the discrimination probability. The convolution at matrix position (u, v) is (s_v ∗ Z)(u, v), where s_v ∈ S_v is a virtually generated voice sample of dimension l and Z is the two-dimensional convolution kernel matrix. The j × j max-pooling reduces a matrix from l × l to (l/j) × (l/j), keeping the maximum convolution value in each j × j region, so the number of matrix elements is reduced to 1/j² of the original. After the third convolutional layer and the fourth p × q max-pooling layer, s_v has been projected through these nonlinear transformations into a two-dimensional feature space and reaches the output probability layer. Let D(s_v) denote the probability that the discriminator correctly judges a generated sample s_v to be a "generated sample", and 1 − D(s) the probability that it wrongly judges real data s. The cumulative probability of correct discrimination over all samples is taken as the objective function, and the parameters of the discriminative model are updated iteratively to maximize it.
3) With the parameters of the discriminative model fixed, the parameters of the generative model are updated iteratively and the virtual samples S_v are regenerated so as to minimize the same objective, i.e., to make the generated samples harder to distinguish from real data.
4) The alternating iterations continue until the objective converges and the discriminator can no longer reliably distinguish generated samples from real ones, at which point iteration stops.
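The three virtual-sample constructions in item 1) above (timing shift, added noise, artificial missing data) can be sketched as follows. The shift amount, bit-flip rate, and mask count are illustrative assumptions, not values from the patent; a binary vector stands in for one normalized voice feature s_n.

```python
import numpy as np

rng = np.random.default_rng(1)
l = 16
s = rng.integers(0, 2, size=l)                     # one binary voice feature s_n

shifted = np.roll(s, 2)                            # timing-shift variant
noise_mask = rng.random(l) < 0.1                   # flip ~10% of bits as "noise"
noisy = np.where(noise_mask, 1 - s, s)
missing = s.copy()
missing[rng.choice(l, size=3, replace=False)] = 0  # artificially missing data

S_v = [shifted, noisy, missing]                    # union of the three variants
print(len(S_v), all(v.shape == s.shape for v in S_v))
```

Each variant keeps the original feature dimension l, so all three sets can be pooled with the real samples when training the discriminator.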
In the method, step 2 comprises the following steps:
1) Collect text corpora for different contexts, including a living-room leisure corpus, a bedroom sleep corpus, a study-room learning corpus, a park or square exercise corpus, an online-shopping customer-service corpus, a health and medical corpus, an elderly-care corpus, a child-care corpus, and an information-inquiry corpus; generate the corpus for each context, and segment all corpora into words to generate word-level question-answer patterns.
2) Combine the brain-like sparse word-vector coding method with the hierarchical temporal memory model, and train the question-answering system to construct the brain-like semantic interaction system for the different context corpora. The brain-like sparse word-vector coding represents the words in text as binary sparse vectors; the coding method is as follows:
Let x = [a_1, ..., a_n] be an n-dimensional binary sparse word vector, where each element a_n takes the value 0 or 1; the representation is sparse when the number of 0s far exceeds the number of 1s.
Define the overlap of two binary sparse word vectors x_1 and x_2 as overlap(x_1, x_2) = x_1 · x_2, and use it to judge how close two words are. Given a threshold λ, two words match when the overlap reaches the threshold: match(x_1, x_2) = overlap(x_1, x_2) ≥ λ.
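The overlap and match tests just defined are direct to implement. A minimal sketch, where the example vectors and the threshold λ = 2 are assumptions chosen for illustration:

```python
import numpy as np

def overlap(x1, x2):
    """overlap(x1, x2) = x1 . x2 for binary sparse word vectors."""
    return int(np.dot(x1, x2))

def match(x1, x2, lam):
    """Two words match when their overlap reaches the threshold lambda."""
    return overlap(x1, x2) >= lam

x1 = np.array([1, 0, 1, 0, 1, 0, 0, 0])
x2 = np.array([1, 0, 1, 0, 0, 0, 0, 0])  # shares two active bits with x1
x3 = np.array([0, 1, 0, 0, 0, 1, 0, 0])  # shares no active bits with x1

print(overlap(x1, x2), match(x1, x2, lam=2), match(x1, x3, lam=2))
# → 2 True False
```

Because the vectors are sparse, the dot product counts shared active bits, so similarity is cheap to evaluate even for high-dimensional codes.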
3) The hierarchical temporal memory model in step 2) is trained as follows:
The semantic words obtained by segmenting the question-answer corpus are encoded as brain-like sparse word vectors to form semantic text with temporal features. Let the text be represented as y = [x_1, ..., x_t, ..., x_T], where x_t is the n-dimensional binary sparse word vector at time t.
Following the temporal order, the binary sparse word vector at each time is used as the model's training input, input_t = x_t, and the binary sparse word vector at time t+1 as the training output, output_t = x_{t+1}. Inputting in temporal order, completing one question-answer pair completes the question-answer knowledge training for one text sequence; the final result is a trained model with semantic prediction capability.
4) When testing and using the trained model, the corpus-trained model of the corresponding context mode is first selected according to the specific scene location information, which is determined by directly reading the scene location information transmitted by white-light communication. If the scene location information transmitted by white-light communication cannot be obtained, the corpus models of all scenes are used in turn to analyze and predict the current input speech text, and the context mode and final output are determined by the prediction with the highest output probability, i.e., the context of the model with the largest prediction output probability is taken as the current context mode. The text recognized by the brain-like speech recognition cognitive model is then segmented into words, the segmented semantic words are encoded as brain-like sparse word vectors, and these are fed in temporal order into the trained hierarchical temporal memory model. When the last word of the question has been input, input_N = x_N, the corresponding prediction output is the first semantic word of the answer, output_N = z_1, where z_1 is the predicted n-dimensional binary sparse word vector at time N+1. The word vector z_1 is then fed back to the input as the input at time N+1, input_{N+1} = z_1; after cyclic feedback, the predicted text answer to the question is obtained with probability r%, where r is the confidence of the prediction, 0 ≤ r ≤ 100.
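The cyclic feedback of item 4) can be sketched as follows: after the last question word is fed in, each predicted word is fed back as the next input until an end marker appears. The lookup table `transitions` is a toy stand-in for the trained hierarchical temporal memory model, and the vocabulary is assumed.

```python
transitions = {            # toy "memory": current word -> predicted next word
    "play": "music",
    "music": "<end>",
}

def predict_next(word):
    return transitions.get(word, "<end>")

def answer(last_question_word, max_len=10):
    """Generate the answer by cyclic feedback: input_{N+1} = z_1, etc."""
    out, word = [], predict_next(last_question_word)  # output_N = z_1
    while word != "<end>" and len(out) < max_len:
        out.append(word)
        word = predict_next(word)                     # feed prediction back in
    return out

print(answer("play"))  # → ['music']
```

The `max_len` guard is an assumption added so the feedback loop always terminates even if the model never predicts an end marker.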
In the method, step 3 comprises the following steps:
1) The LED white-light array serving as the transmitter is modulated with binary frequency-shift keying: a digital 1 is transmitted as a modulated optical signal at 200 kHz, and a digital 0 as an unmodulated (0 Hz) optical signal. Digital data transmission between transmitter and receiver is realized through frequency-shift keying using the NEC infrared communication protocol.
2) The optical signal received by the photoelectric sensor at the receiver is converted into an electrical signal, which is decoded by a decoder composed of a phase discriminator, a low-pass filter, and an A/D converter. When the receiver receives the 200 kHz modulated signal, a band-pass filter removes other interfering signals and the 200 kHz signal is coherently demodulated; the demodulated quantity is then obtained through a low-pass filter and compared against 0 V. When a 200 kHz optical signal is received, the demodulator outputs level 1; when no modulated optical signal is received, it outputs level 0.
3) For the indoor spaces of different contexts, the white-light LEDs mounted on the ceiling each carry independent position and context marker information, and continuously send white light carrying the context marker data into their region. When the receiver enters the corresponding white light, it decodes the position and context information, realizing indoor positioning and context-data extraction.
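The on/off keying of items 1) and 2) can be sketched in a few lines: a bit 1 is sent as a 200 kHz tone, a bit 0 as no tone, and the receiver decides per bit by measuring signal energy. The energy detector is a simple stand-in for the coherent demodulator plus low-pass filter of the description, and the sample rate and bit duration are assumptions.

```python
import numpy as np

FS = 2_000_000          # samples per second (assumed)
F_MOD = 200_000         # 200 kHz modulation frequency from the description
SPB = 100               # samples per bit (assumed), i.e. 10 tone periods

def modulate(bits):
    t = np.arange(SPB) / FS
    tone = np.sin(2 * np.pi * F_MOD * t)
    return np.concatenate([tone if b else np.zeros(SPB) for b in bits])

def demodulate(signal):
    bits = []
    for k in range(len(signal) // SPB):
        chunk = signal[k * SPB:(k + 1) * SPB]
        bits.append(1 if np.mean(chunk ** 2) > 0.1 else 0)  # energy detector
    return bits

data = [1, 0, 1, 1, 0]
print(demodulate(modulate(data)) == data)  # → True
```

A tone segment has mean energy 0.5 while a silent segment has energy 0, so the 0.1 threshold cleanly separates the two symbols in this noiseless sketch.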
In the method, step 4 comprises the following steps:
1) The ARM11 embedded system 14 communicates with the server every 6 seconds; if a response from the cloud server is received, the system is online, otherwise it is offline and an audible and visual alarm is raised.
2) In the offline state, speech recognition is realized by the LD3320 module. Before offline recognition, the voice data to be recognized is first downloaded into the LD3320 speech recognition module through serial communication, completing the construction of the keyword list.
3) During offline recognition, the audio data stream is fed in and the speech recognition chip detects, by endpoint detection, when the user stops speaking; after analyzing the voice data between the moments the user starts and stops speaking, it produces the recognition result.
4) In the online state, the ARM11-based robot control system performs endpoint detection on the acquired voice data and sends the raw audio file of the voice data to be recognized, sentence by sentence, to the speech recognition platform.
5) After the cloud brain-like speech and semantic recognition system receives the voice data, it decodes it and performs speech pattern recognition to obtain the optimal recognition result, which it sends in text form to the brain-like semantic interaction platform, together with the location information and context mode received through white-light communication.
6) The cloud intelligent robot brain-like semantic interaction platform performs brain-like semantic analysis according to the received context mode and context information by selecting the corresponding context semantic base, matches the optimal feedback semantic data from it, and sends the result in text form to the cloud speech synthesis platform.
7) The cloud speech synthesis platform synthesizes speech from the received text, generates a voice file, and returns it to the ARM11-based robot control system; after receiving the voice, the control system plays it through the external audio output circuit, then continues to acquire and receive the next voice signal, completing continuous brain-like intelligent semantic interaction.
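The endpoint detection used in items 3) and 4) above can be sketched with short-time energy thresholding: the frames where the user starts and stops speaking are those whose energy rises above a noise floor. The frame size and threshold are assumptions, not values from the patent.

```python
import numpy as np

def detect_endpoints(samples, frame=160, threshold=0.01):
    """Return (start, end) sample indices of the speech segment, or None."""
    energies = [np.mean(samples[i:i + frame] ** 2)
                for i in range(0, len(samples) - frame + 1, frame)]
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    return active[0] * frame, (active[-1] + 1) * frame

rng = np.random.default_rng(2)
silence = rng.normal(0, 0.01, 1600)   # low-level background noise
speech = rng.normal(0, 0.5, 3200)     # loud "speech" burst
signal = np.concatenate([silence, speech, silence])

print(detect_endpoints(signal))  # → (1600, 4800)
```

Only the samples between the detected start and end need to be analyzed offline or uploaded to the cloud, which is exactly why both recognition paths run endpoint detection first.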
The technical effect of the utility model is to solve problems of current voice interaction robots such as weak semantic analysis ability, poor personalization, lack of context recognition, poor user experience, and dependence on the network. It can be applied to elderly-care robots, household robots, elderly monitoring, and related fields, with good economic and social benefits.
The utility model is described in further detail below with reference to the accompanying drawings.
Detailed description of the invention
Fig. 1 is the system structure diagram;
Fig. 2 is a schematic diagram of the white-light communication transmitter circuit;
Fig. 3 is a schematic diagram of the white-light communication receiver circuit;
Fig. 4 is the implementation flowchart;
Fig. 5 is a schematic diagram of offline speech recognition;
Fig. 6 is a schematic diagram of the brain-like speech and semantic recognition system;
Fig. 7 is a schematic diagram of brain-like semantic interaction system training;
Fig. 8 is a schematic diagram of brain-like semantic interaction system use.
In the figures: 1, STM32 embedded system; 2, audio filter circuit; 3, audio amplifier circuit; 4, microphone array; 5, LD3320 speech recognition module; 6, LED white-light demodulation circuit; 7, high-speed photodiode sensor array; 8, different context spaces; 9, white-light LED array; 10, LED array driver circuit; 11, LED white-light communication signal modulation and demodulation circuit; 12, white-light driving and communication STM32 controller; 13, audio playback circuit; 14, ARM11 embedded system; 15, WiFi communication module; 16, 4G mobile communication module; 17, WLAN router; 18, cloud speech and semantic recognition platform; 19, cloud intelligent robot brain-like semantic interaction platform; 20, cloud speech synthesis platform.
Specific embodiment
This embodiment includes an offline voice acquisition and recognition hardware system, a brain-like semantic recognition and cognition hardware system, and a white-light communication and indoor context positioning system; the offline voice acquisition and recognition hardware system is communicatively connected to the brain-like semantic recognition and cognition hardware system and to the white-light communication and indoor context positioning system.
The offline voice acquisition and recognition hardware system includes an embedded control system, a speech recognition module, and an audio processing circuit; the embedded control system is communicatively connected to the speech recognition module and the audio processing circuit, and each place where scene recognition is required is provided with one speech recognition module and one audio processing circuit.
The brain-like semantic recognition and cognition hardware system includes an embedded control device, a remote communication module, and a remote semantic recognition device; the embedded control device is communicatively connected to the remote speech and semantic recognition device through the remote communication module, and is also communicatively connected to the offline voice acquisition and recognition hardware system.
The white light communication and indoor situation positioning system includes multiple LED white light circuits and an equal number of white light identification circuits. Each place where scene recognition is required is provided with one LED white light circuit and one white light identification circuit for identifying the light emitted by the LED white light circuit; each white light identification circuit is communicatively connected to the offline voice collection and recognition hardware system.
To further expand the usage modes, the arrangement in which each scene-recognition place is provided with one speech recognition module, one audio processing circuit and one white light identification circuit for identifying the light of the LED white light circuit can also be understood as mounting these three circuits on the same movable intelligent robot. The robot is moved to different scenes as needed, identifies the light emitted by the LED white light circuit in the scene, and then receives voice and finally broadcasts the voice feedback.
The present embodiment uses an STM32-cored embedded system, an LD3320 speaker-independent speech recognition module, a microphone array, a speech front-end processing circuit and a voice playing module to construct the offline voice collection and recognition system; uses an ARM embedded system loaded with a Linux operating system, a wireless WiFi module, a 4G mobile communication module, a cloud speech recognition platform, a cloud speech synthesis platform and an intelligent robot brain-like semantic interaction platform to construct the online speech recognition, semantic analysis and interaction system; and uses an LED white light sensor array, an LED drive circuit and an LED communication control circuit to construct the white light communication and indoor situation positioning system. First, the ARM embedded system determines whether the network is connected, and thus whether to use the offline speech recognition mode or the online cloud speech recognition and semantic analysis mode. Then, the STM32-cored embedded system receives, through a photoelectric receiving sensor, the coded position and situation information sent by the LED white light sensor array on the indoor ceiling; the decoded position and situation data guide the selection of the corresponding semantic library for the online speech recognition and brain-like semantic interaction system. The offline voice collection and recognition system realizes voice acquisition and front-end processing; when the system is offline, it realizes offline speech recognition and output. When the system is online, the voice data is transmitted to the cloud brain-like speech recognition and cognition platform for recognition; the recognized voice semantic text information is sent to the intelligent robot brain-like semantic interaction platform for analysis, an optimum answer is obtained from the knowledge base of the corresponding situation, and the answer is returned to the cloud speech synthesis platform for voice data synthesis. Finally, the intelligent robot plays the synthesized voice by speaking, completing the intelligent human-machine interaction.
The embedded control system of the offline voice collection and recognition hardware system includes an STM32 embedded system; the speech recognition module includes an LD3320 speech recognition module; and the audio processing circuit includes an audio filter circuit, an audio amplifier circuit, multiple microphone arrays and multiple audio playing circuits. Each place where scene recognition is required is mounted with one microphone array, which is connected to the STM32 embedded system through the audio amplifier circuit and the audio filter circuit. The LD3320 speech recognition module and the multiple audio playing circuits are each connected to the STM32 embedded system, and each place where scene recognition is required is mounted with one audio playing circuit.
Referring to Figs. 1-8, the offline voice collection and recognition hardware system constructed in the present embodiment includes:
1) the offline voice collection and recognition hardware system is composed of the STM32 embedded system 1, the audio filter circuit 2, the audio amplifier circuit 3, the microphone array 4 and the LD3320 speech recognition module;
2) the audio filter circuit is composed of a sixth-order analog low-pass filter circuit and a 64-order FIR digital band-pass filter circuit.
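The 64-order FIR band-pass stage in 2) can be sketched in a few lines of Python. This is a minimal illustration only: the 8 kHz sample rate and the 300-3400 Hz speech passband are assumptions chosen for a voice front end, since the embodiment specifies only the filter order.

```python
import numpy as np
from scipy.signal import firwin, lfilter

# Sketch of the 64-order FIR digital band-pass filter stage described above.
# The sample rate and passband edges are assumptions for illustration.
FS = 8000          # sampling rate (Hz), assumed
NUM_TAPS = 64      # 64-order FIR filter as stated in the embodiment

taps = firwin(NUM_TAPS, [300, 3400], fs=FS, pass_zero=False)

def bandpass(samples):
    """Apply the FIR band-pass to a block of audio samples."""
    return lfilter(taps, 1.0, samples)

# Example: filter one second of white noise.
noise = np.random.randn(FS)
filtered = bandpass(noise)
```

In the actual device this filtering runs on the STM32 after the sixth-order analog anti-aliasing stage; the sketch only shows the digital half of the chain.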
The brain-like semantic cognition software and hardware system, composed of the ARM embedded system, the wireless WiFi module, the 4G mobile communication module, and the cloud online semantic recognition, semantic interaction and speech synthesis systems, is constructed as follows:
1) the online speech recognition and interaction system is composed of the ARM11 embedded system 14, the WiFi communication module 15, the 4G mobile communication module 16, the WLan router 17, the cloud speech recognition platform 18, the cloud intelligent robot brain-like semantic interaction platform 19 and the cloud speech synthesis platform 20.
2) The ARM11 runs a Linux operating system, and the terminal App is programmed in Python. In the Python program, the PyAudio component is used for voice-related operations (mp3 file generation, mp3 file playback, etc.), and data communication with the STM32 controller of the offline speech collection system is carried out through a serial port;
3) the cloud semantic recognition and interaction system hardware uses a server with a GPU (graphics processor) capable of parallel accelerated computing, and provides a Python development platform;
4) the cloud speech synthesis platform uses the Baidu cloud online voice synthesis interface. The platform uses a REST API interface with HTTP requests and is applicable to speech synthesis on any platform; in the Python environment, the urllib, urllib2 and pycurl components are used to complete HTTP protocol data transmission and parsing.
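The REST call pattern in 4) can be sketched as below. Note this is a hedged illustration, not Baidu's actual API: the endpoint URL, parameter names (`tex`, `tok`, `lan`, `aue`) and token are hypothetical placeholders standing in for whatever the real synthesis interface requires.

```python
import urllib.parse
import urllib.request

# Minimal sketch of calling a cloud TTS REST interface over HTTP, in the spirit
# of step 4) above. The endpoint and parameter names are hypothetical.
TTS_ENDPOINT = "https://tts.example.com/v1/synthesize"  # placeholder URL

def build_tts_request(text, token, lang="zh"):
    """Build the form-encoded body for a TTS request (returns bytes)."""
    params = {"tex": text, "tok": token, "lan": lang, "aue": "mp3"}
    return urllib.parse.urlencode(params).encode("utf-8")

def synthesize(text, token):
    """POST the request and return the mp3 bytes (network call, sketch only)."""
    req = urllib.request.Request(TTS_ENDPOINT, data=build_tts_request(text, token))
    with urllib.request.urlopen(req) as resp:
        return resp.read()

body = build_tts_request("hello", "demo-token")
```

The returned mp3 bytes would then be written to a file and handed to the PyAudio playback path described in step 2).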
The white light communication and indoor situation positioning system is constructed as follows:
1) the white light communication and positioning system is composed of the white light LED array 9, the LED drive circuit 10, the LED communication control circuit 11 and the STM32 controller 12;
2) the white light LED array uses 36 diffusing LEDs of 3 W power (160-180 lm), connected in parallel; the drive circuit uses an IRFP4468 power MOSFET as the switching element;
3) the digital communication of the white light LEDs is modulated by PWM, with a PWM frequency of 200 kHz and a duty ratio of 25%, generated by a timer of the STM32.
4) complex analog signals such as audio are modulated onto a carrier (200 kHz) using carrier modulation; the drive circuit controls the white light LEDs to emit light, and the signal is finally sent as an optical signal. The modulation chip used here is the CD4046.
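The timer arithmetic behind the 200 kHz / 25% duty PWM carrier of step 3) can be checked with a short sketch. The 72 MHz timer clock is an assumption (typical for an STM32F1-class part); the embodiment specifies only the frequency and duty cycle.

```python
# Sketch of the STM32 timer register arithmetic for the 200 kHz, 25% duty PWM
# carrier. The 72 MHz timer clock is an assumption for illustration.
TIMER_CLK_HZ = 72_000_000   # assumed STM32 timer clock
PWM_FREQ_HZ = 200_000       # 200 kHz carrier from the embodiment
DUTY = 0.25                 # 25% duty ratio

def pwm_timer_settings(timer_clk, pwm_freq, duty):
    """Return (auto-reload, compare) register values for the requested PWM."""
    period = timer_clk // pwm_freq      # timer ticks per PWM period
    arr = period - 1                    # auto-reload register (counts 0..ARR)
    ccr = int(period * duty)            # compare register sets the duty cycle
    return arr, ccr

arr, ccr = pwm_timer_settings(TIMER_CLK_HZ, PWM_FREQ_HZ, DUTY)
# With a 72 MHz clock: 360 ticks per period -> ARR = 359, CCR = 90.
```

On the device these two values would be written to the timer's ARR and CCRx registers in PWM mode; the sketch only verifies the arithmetic.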
The cloud brain-like speech recognition and cognition system is constructed as follows:
1) the hierarchical temporal memory (HTM) cortical learning algorithm is chosen as the basis of the voice semantic recognition system model;
2) on the basis of the cortical learning algorithm, a multi-layer brain-like speech recognition cognitive model structure is constructed by mimicking the human brain, realizing brain-like deep learning from raw voice data to voice semantic sequences. The layers include: a raw voice data sensing layer, an intermediate cortical learning layer, a semantic feature space layer and a timing layer. The raw voice data sensing layer takes digital audio data as input and outputs the audio data after speech endpoint detection to the cortical learning layer; the intermediate cortical learning layer recognizes the input real or virtually synthesized voice data and outputs binary word vectors; the semantic feature space layer takes the single word vectors output by the intermediate cortical learning layer as input and outputs word vector sets; the timing layer assembles the word vector sets of the semantic feature space layer into sentence and text data with temporal features, and predicts and recognizes the voice data with contextual information.
3) at the raw voice data sensing layer end, a generative adversarial network is connected for synthesizing virtual data to expand the training samples. The generative adversarial network includes a generative model and a discriminative model for training the generative model; the two form a game relationship, in which the discriminative model serves to improve the generative model so that the data it generates is closer to the real samples. The generative model captures the distribution of the sample data; the discriminative model is a binary classifier that judges whether its input is real data or a generated sample. During training, one model is fixed while the parameters of the other are updated, and the two alternate iteratively so that each maximizes the other's error; the distribution of the sample data is finally estimated, so that the virtual data synthesized by the generative model approaches the real sample data, completing the training of the generative model.
4) the generative model is realized using a multi-layer perceptron. From the voice data to be trained S = [s_1, ..., s_n, ..., s_N], where N is the total number of voice samples and s_n is the n-th normalized binary voice feature vector (of dimension l = 43681), three groups of virtually generated voice data sets are obtained by reversing the temporal order of the raw voice data, adding interfering noise, and artificially manufacturing missing voice data: S^time = [s^time_1, ..., s^time_N], where s^time_n is the n-th virtually synthesized binary feature vector generated by reversing the temporal order of the voice data; S^noise = [s^noise_1, ..., s^noise_N], where s^noise_n is the n-th virtually synthesized binary feature vector generated by adding interfering noise; and S^loss = [s^loss_1, ..., s^loss_N], where s^loss_n is the n-th virtually synthesized binary feature vector generated by artificially manufacturing missing voice data. S_v denotes the total collection of the three virtually synthesized data sets.
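The three augmentations in 4) can be sketched as follows. The bit-flip noise rate and zero-masking segment are illustrative assumptions; the embodiment names the three operations (temporal reversal, interfering noise, manufactured missing data) but not their exact parameters.

```python
import numpy as np

# Sketch of the three virtual-data augmentations in step 4): temporal
# reversal, interfering noise, and artificially manufactured missing data.
# The noise and masking parameters are assumptions for illustration.
rng = np.random.default_rng(0)
L = 43681  # feature dimension from the embodiment

def augment(s, flip_rate=0.01):
    """Return (time-reversed, noisy, partially-missing) copies of one sample."""
    s_time = s[::-1].copy()                       # reverse the temporal order
    flips = rng.random(L) < flip_rate             # flip a few random bits
    s_noise = np.where(flips, 1 - s, s)
    s_loss = s.copy()                             # zero out a random segment
    start = rng.integers(0, L - L // 20)
    s_loss[start:start + L // 20] = 0
    return s_time, s_noise, s_loss

s = (rng.random(L) < 0.1).astype(np.int8)         # one binary feature vector
s_time, s_noise, s_loss = augment(s)
```

Applied to all N real samples, this yields the three virtual sets S^time, S^noise and S^loss whose union is S_v.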
5) the generative model parameters are fixed, and each voice sample of the three virtually generated groups is discriminated. The discriminative model is realized using a convolutional neural network containing two convolutional layers, two max sub-sampling layers and an output discrimination layer. The convolution kernel of the first convolutional layer is of dimension i×i = 10×10; the second layer is a max sub-sampling layer of j×j = 20×20; the third layer is a convolutional layer with kernels of dimension k×k = 5×5; the fourth layer is a max sub-sampling layer of p×q = 6×3; and the last layer is the output discrimination probability layer. Since voice is one-dimensional data, each l = 43681-dimensional virtually generated vector s_v is first reshaped into a 209×209 matrix (209×209 = 43681); Z denotes the two-dimensional convolution kernel matrix. After the first convolutional layer the matrix becomes 200×200; the j×j = 20×20 max sub-sampling keeps only the maximum value of each 20×20 region, so the number of matrix pixels is reduced to 1/400 of the original, giving a 10×10 matrix; after the third-layer convolution with the k×k = 5×5 kernel it becomes 6×6; and the fourth-layer p×q = 6×3 max sub-sampling reduces it to 1×2. After these nonlinear transformations, s_v is finally projected to a two-dimensional feature, which passes through the output discrimination probability layer: the output D(s_v) is the probability that the generated sample s_v is discriminated as a "generated sample" (correct discrimination), and its complement is the probability that it is discriminated as "original data" (incorrect discrimination). The accumulated probability of correct discrimination, taken as the maximization objective function, is used to iteratively update the parameters of the discriminative model so that the value of this objective function is maximal.
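The layer dimensions of the discriminator CNN in step 5) can be verified with a short sketch (valid convolution with stride 1 and non-overlapping pooling are assumed, as they are the only choices consistent with the stated sizes):

```python
# Sketch verifying the layer dimensions of the discriminator CNN in step 5):
# 43681 -> 209x209 -> conv 10x10 -> 200x200 -> pool 20x20 -> 10x10
# -> conv 5x5 -> 6x6 -> pool 6x3 -> 1x2.
def conv_valid(h, w, kh, kw):
    """Output size of a 'valid' convolution (no padding, stride 1)."""
    return h - kh + 1, w - kw + 1

def max_pool(h, w, ph, pw):
    """Output size of non-overlapping max sub-sampling."""
    return h // ph, w // pw

shape = (209, 209)                  # 209 * 209 == 43681 reshaped input
shape = conv_valid(*shape, 10, 10)  # first convolutional layer
assert shape == (200, 200)
shape = max_pool(*shape, 20, 20)    # first max sub-sampling layer
assert shape == (10, 10)
shape = conv_valid(*shape, 5, 5)    # second convolutional layer
assert shape == (6, 6)
shape = max_pool(*shape, 6, 3)      # second max sub-sampling layer
assert shape == (1, 2)
```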
6) the parameters of the discriminative model are fixed, and the parameters of the generative model are iteratively updated to regenerate virtual samples, likewise optimizing the value of the objective function;
7) the alternating iteration continues until the iteration stopping criterion on the objective function is met;
8) using the generative model trained above, K = 2 groups of virtually synthesized samples are generated and extended into the voice training data to participate in training.
9) after the model is constructed, the system is trained using recorded audio data. The process is as follows:
First, a public Mandarin corpus is collected: a Mandarin Chinese mobile-phone speech database of 2600 speakers, containing recordings of speakers of different ethnicities and genders; the total number of voice samples collected is N_1 = 800409.
Then, the Mandarin recording corpus is segmented into words sentence by sentence, i.e., the words in each sentence are split out individually; after all sentences are segmented, they are classified into M_1 words in total.
Voice segments are collected under X = 1000 × Y = 10 classes of different situations (the voice quantity is N_2 = 200000). The 10 classes of situation modes mainly include: living-room leisure, bedroom sleep, study-room learning, park/square exercise, online-shopping interaction, health and medical care, elderly care, child care, information inquiry, and a general situation. These are likewise segmented into words sentence by sentence and classified into M_2 words in total.
The N = N_1 + N_2 raw voice samples and the M words produced by segmentation are trained with the brain-like voice semantic learning model. During training, the voice data is input from the raw voice data sensing layer, and the corresponding binary semantic text corpus data is generated from the timing layer; meanwhile, for the original corpus data, the above generative adversarial network is used at the raw voice data sensing layer to synthesize I = 2 × 3 × N = 6002454 virtually synthesized voice samples, which are trained together.
10) the model training input is the voice data (audio data) s_in; the trained prediction output is the voice semantic text sequence T_predict (at the timing layer, expressed in word-vector form), and the corresponding real voice semantic text sequence is T_true (at the timing layer, expressed in word-vector form). The residual between the two is δ = ||T_predict − T_true||. All parameters in the model are denoted W; the model parameters are iterated with an optimization method to minimize the residual δ until the iteration stopping condition is reached. After the brain-like speech recognition cognitive model training is completed, the corresponding language text can be recognized for any input audio data.
The cloud semantic interaction system is constructed as follows:
1) using a Python web crawler, text corpora under different situations are collected online (living-room leisure corpus, bedroom sleep corpus, study-room learning corpus, park/square exercise corpus, online-shopping customer-service corpus, health and medical corpus, elderly-care corpus, child-care corpus, information-inquiry corpus, etc.); the corpora under the different situations are generated, all corpora are segmented into words, and word question-answer patterns are generated;
2) the brain-like sparse word vector coding method and the hierarchical temporal memory model are combined, and the brain-like semantic interaction systems under the different situations are trained and constructed from the question-answering systems;
3) the brain-like sparse word vector coding in 2) above represents the words in text as binary sparse vectors. The specific coding method is as follows:
Let the n = 1024-dimensional binary sparse word vector be x = [a_1, ..., a_n], in which the number of elements a_n equal to 1 is w = 40; the number of 0s is then much larger than the number of 1s, conforming to the brain-like sparse representation. A 1 represents a neuron activated by a signal stimulus and a 0 represents a non-activated neuron; by activating w = 40 neurons at different locations, different word patterns are represented. For example, x1 = [0 1 0 0 0 1 ... 0 0 1 1 1 0 0] and x2 = [1 1 0 0 1 1 ... 0 0 0 1 1 0 1] represent different word vectors.
The overlap degree of two binary sparse word vectors is calculated with the function overlap(x, y) = x·y, which is used to judge how close two words are. A threshold λ = 40 × 80% = 32 is set; when the overlap exceeds the threshold 32, the two words are considered to match: match(x, y) = (overlap(x, y) ≥ 32).
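The overlap and match rules above can be sketched directly:

```python
import numpy as np

# Sketch of the brain-like sparse word vector overlap test described above:
# 1024-dimensional binary vectors with w = 40 active bits, matched when at
# least lambda = 32 active bits coincide.
N_DIM, W_ACTIVE, THRESHOLD = 1024, 40, 32
rng = np.random.default_rng(1)

def random_word_vector():
    """Make a binary sparse word vector with exactly 40 ones."""
    x = np.zeros(N_DIM, dtype=np.int32)
    x[rng.choice(N_DIM, size=W_ACTIVE, replace=False)] = 1
    return x

def overlap(x, y):
    """overlap(x, y) = x . y : the number of shared active bits."""
    return int(np.dot(x, y))

def match(x, y):
    """Two words match when their overlap reaches the threshold of 32."""
    return overlap(x, y) >= THRESHOLD

x = random_word_vector()
y = random_word_vector()
# A word always matches itself (overlap = 40); two independent random words
# share on average only about 1.6 of 1024 positions, far below 32.
```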
4) the training method of the hierarchical temporal memory model in 2) above is shown in Fig. 7; the specific steps are as follows:
The semantic words obtained after segmenting the question-answer corpus are coded as brain-like sparse word vectors to form semantic text with temporal features: y = [x_1, ..., x_t, ..., x_T], where x_t is the n-dimensional binary sparse word vector at time t. For example, in the corpus formed by the phrase "submit a report", "submit" is the word at time t = 1 and "report" is the word at time t = 2; the two words can be represented by the binary sparse word vectors x_{t=1} and x_{t=2}, respectively.
According to the temporal order, the binary sparse word vector is taken as the training input of the model, input_t = x_t, and the binary sparse word vector of time t+1 is taken as the training output, output_t = x_{t+1}; that is, "submit" above serves as the training input and its corresponding output is "report". A model trained in this way has a semantic prediction function: when the words are input in temporal order and one question-answer pair is completed, the question-answer training of one text sequence is completed.
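The (input_t, output_t) pairing in step 4) amounts to shifting the sequence by one step; a minimal sketch (using plain words in place of their sparse vectors):

```python
# Sketch of building the (input_t, output_t) training pairs in step 4): each
# word vector is paired with the next word vector in the sequence. Plain
# strings stand in for the binary sparse word vectors.
def make_training_pairs(sequence):
    """Pair each element with its successor: input_t = x_t, output_t = x_{t+1}."""
    return [(sequence[t], sequence[t + 1]) for t in range(len(sequence) - 1)]

# "submit a report" style example: "submit" at t=1 predicts "report" at t=2.
pairs = make_training_pairs(["submit", "report"])
# pairs == [("submit", "report")]
```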
5) the process of testing and using the trained model is shown in Fig. 8. First, different situation modes are selected according to the situation mode information fed back by the white light communication; the text recognized by the brain-like speech recognition cognitive model is then segmented into words, the segmented semantic words are coded as brain-like sparse word vectors, and these are sent in temporal order into the trained hierarchical temporal memory model. When the last question word is input, input_N = x_N, the corresponding prediction output is the first semantic word of the answer, output_N = z_1, where z_1 is the predicted n-dimensional binary sparse word vector at time N+1. The word vector z_1 is then fed back to the input end as the input at time N+1, input_{N+1} = z_1; after cyclic feedback, the prediction text answer to the final question is obtained. For example, after word segmentation "What day is it today?" enters the model as input, and the prediction output is "Friday" with probability r%, where r is the probability value of the credibility of the prediction result, 0 ≤ r ≤ 100.
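The cyclic feedback decoding in step 5) can be sketched as follows. The tiny next-word table and the `<end>` marker are stand-ins for the trained hierarchical temporal memory model and its stopping behavior, which the embodiment does not specify in detail.

```python
# Sketch of the cyclic feedback decoding in step 5): the model's predicted
# word is fed back as the next input until an end marker appears. The
# next-word table is a hypothetical stand-in for the trained model.
NEXT_WORD = {
    "what-day-is-today": "Friday",
    "Friday": "<end>",
}

def answer(last_question_word, max_steps=10):
    """Feed predictions back as inputs, collecting the answer words."""
    words, current = [], last_question_word
    for _ in range(max_steps):
        predicted = NEXT_WORD.get(current, "<end>")
        if predicted == "<end>":
            break
        words.append(predicted)    # z_1, z_2, ... become the answer text
        current = predicted        # input_{N+1} = z_1, and so on
    return words

result = answer("what-day-is-today")
# result == ["Friday"]
```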
The STM32-cored embedded system receives, through the photoelectric receiving sensor, the position and situation information coded and sent by the white light LED array on the indoor ceiling; the decoded position and situation data guide the selection of the corresponding semantic library for the online speech recognition and the brain-like semantic analysis and interaction system:
1) the position and situation information receiving system is composed of the high-speed SFH203P PIN photodiode array 7, the STM32 controller 1 and the signal demodulating circuit 6;
2) the transmitting end is modulated by binary frequency shift keying: a 200 kHz modulated optical signal is emitted for digital 1, and a 0 Hz (unmodulated) optical signal for digital 0;
3) at the demodulating end, the circuit mainly consists of a band-pass filter centered at 200 kHz, an amplifier and a voltage comparator. When the 200 kHz modulated signal is received, other interference signals are filtered out by the band-pass filter, the 200 kHz modulated signal is coherently demodulated, and the demodulated quantity obtained through a low-pass filter is compared with 0 V by the voltage comparator: level 1 is output when the 200 kHz optical signal is received, and level 0 is output when no modulated optical signal is received;
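The tone-present/tone-absent decision of steps 2)-3) can be sketched as a correlation detector. The 1 MHz sample rate, bit length and threshold are assumptions for illustration; the real receiver does this with the analog band-pass/comparator chain described above.

```python
import numpy as np

# Sketch of the 200 kHz on/off keying detection in steps 2)-3): correlate the
# received samples with the 200 kHz carrier and threshold the energy.
FS = 1_000_000          # assumed sample rate (Hz)
F_CARRIER = 200_000     # 200 kHz carrier from the embodiment
BIT_SAMPLES = 500       # samples per bit, assumed

t = np.arange(BIT_SAMPLES) / FS
ref = np.sin(2 * np.pi * F_CARRIER * t)

def detect_bit(samples, threshold=0.1):
    """Return 1 if the 200 kHz tone is present in one bit period, else 0."""
    energy = abs(np.dot(samples, ref)) / BIT_SAMPLES
    return 1 if energy > threshold else 0

one = np.sin(2 * np.pi * F_CARRIER * t)     # tone present  -> digital 1
zero = np.zeros(BIT_SAMPLES)                # tone absent   -> digital 0
bits = [detect_bit(one), detect_bit(zero)]
# bits == [1, 0]
```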
4) on the basis of frequency shift keying, the NEC infrared communication protocol is used to realize the transmission of the digital signals;
5) at the demodulating end, the photoelectric sensor converts the reflected optical signal into an electrical signal carrying the audio; the electrical signal is decoded by a decoder composed of a phase discriminator, a low-pass filter and an AD analog-to-digital converter. The phase-detection frequency of the phase discriminator is set to 200 kHz, consistent with the carrier frequency of the transmitting end. The output of the low-pass filter is the received analog signal, which is converted into a digital signal by the analog-to-digital converter. The demodulation chip used here is based on the CD4046;
6) for the indoor spaces of different situations, the ceiling-mounted white light LEDs carry independent position and situation identification information (in the implementation, two position situations are set: study room and dining room), and constantly send their situation flag data and prompt voice information to the region. When the receiving end enters the coverage area of a light source, it can decode the position, situation and prompt voice information, thereby extracting the indoor positioning and situation data. When no situation feedback information can be obtained, all trained models can be used to analyze and predict the currently input speech text in turn; the situation mode is determined from the prediction output with the maximum probability, i.e., the situation mode of the trained model with the maximum output probability is taken as the current situation mode.
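The fallback at the end of step 6) can be sketched as follows. The stand-in callables returning (answer, probability) are hypothetical; in the real system each would be one of the trained situation-specific hierarchical temporal memory models.

```python
# Sketch of the fallback in step 6): when no situation feedback arrives over
# the white light link, run every situation-specific model on the current
# speech text and keep the one with the highest prediction probability.
def pick_situation(models, speech_text):
    """Return (situation, answer) from the model with maximum probability."""
    best = max(
        ((name, model(speech_text)) for name, model in models.items()),
        key=lambda item: item[1][1],          # compare the probability r
    )
    name, (reply, _prob) = best
    return name, reply

models = {                                    # hypothetical stand-in models
    "study": lambda text: ("open the lamp", 0.40),
    "dining room": lambda text: ("serve dinner", 0.85),
}
situation, reply = pick_situation(models, "I am hungry")
# situation == "dining room"
```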
The offline voice collection and recognition system realizes the acquisition and front-end processing of voice, and judges whether the system is online. When the system is offline, offline speech recognition and output are realized as follows:
1) the ARM11 embedded system 14 communicates with the server once every 6 s; if a cloud server response is received, the system is online, otherwise it is offline and an audible and visual alarm prompt is given;
2) in the offline state, speech recognition is realized by the LD3320: before offline speech recognition, the voice data to be recognized is downloaded into the LD3320 speech recognition module through serial communication, completing the construction of the keyword list;
3) during offline recognition, the audio data stream is fed in, the speech recognition chip detects by endpoint detection that the user has stopped speaking, and after computing on the voice data between when the user starts and stops speaking, it gives the recognition result.
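The endpoint detection of step 3) can be sketched with a short-time-energy rule. The frame size and threshold are assumptions; the LD3320 chip performs its own internal endpoint detection, so this only illustrates the principle.

```python
import numpy as np

# Sketch of energy-based endpoint detection as in step 3): frames whose
# short-time energy exceeds a threshold are speech; the utterance runs from
# the first such frame to the last. Frame size and threshold are assumed.
FRAME = 160  # samples per frame, assumed

def endpoints(samples, threshold=0.01):
    """Return (start_frame, end_frame) of the detected utterance, or None."""
    n_frames = len(samples) // FRAME
    energies = [
        float(np.mean(samples[i * FRAME:(i + 1) * FRAME] ** 2))
        for i in range(n_frames)
    ]
    voiced = [i for i, e in enumerate(energies) if e > threshold]
    return (voiced[0], voiced[-1]) if voiced else None

silence = np.zeros(FRAME * 5)
speech = 0.5 * np.sin(np.linspace(0, 200 * np.pi, FRAME * 5))
signal = np.concatenate([silence, speech, silence])   # silence-speech-silence
# endpoints(signal) == (5, 9): frames 5..9 hold the utterance
```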
When the system is online, the voice data is sent to the cloud speech recognition platform, the recognized speech text information is sent to the intelligent robot brain-like semantic interaction platform for analysis, and the optimum answer is obtained from the knowledge base of the corresponding situation; the answer is then returned to the cloud speech synthesis platform for voice data synthesis, and finally the intelligent robot plays the synthesized voice by speaking, completing the intelligent human-machine interaction:
1) the ARM11-based robot control system performs endpoint detection on the collected voice data, generates mp3 files from the raw voice data, and sends the voice data to be recognized to the speech recognition platform sentence by sentence;
2) after the cloud brain-like voice semantic recognition system receives the voice data, it decodes it and performs speech recognition; the optimal recognition result is sent in text form to the intelligent robot brain-like semantic interaction platform, together with the position information and situation mode received over the white light communication;
3) the intelligent robot brain-like semantic interaction platform performs brain-like semantic analysis according to the received situation mode and context information, chooses the corresponding situation semantic library, matches the optimal feedback semantic data from it, and sends the result in text form to the cloud speech synthesis platform;
4) the cloud speech synthesis platform performs speech synthesis according to the received text, generates an mp3-format voice file and returns it to the ARM11-based robot control system; after the robot control system receives the voice, it plays the voice output through the external audio output circuit, continues to acquire and receive the next voice signal, and completes the continuous brain-like intelligent semantic interaction.
Claims (4)
1. An intelligent robot semantic interaction system based on white light communication and brain-like cognition, characterized by including an offline voice collection and recognition hardware system, a brain-like semantic recognition and cognition hardware system, and a white light communication and indoor situation positioning system, the offline voice collection and recognition hardware system being communicatively connected to the brain-like semantic recognition and cognition hardware system and to the white light communication and indoor situation positioning system, respectively,
wherein the offline voice collection and recognition hardware system includes an embedded control system, a speech recognition module and an audio processing circuit, the embedded control system being communicatively connected to the speech recognition module and the audio processing circuit, respectively, and each place where scene recognition is required being provided with one speech recognition module and one audio processing circuit;
the brain-like semantic recognition and cognition hardware system includes an embedded control device, a remote communication module and a remote semantic recognition device, the embedded control device being communicatively connected to the remote speech and semantic recognition device through the remote communication module and also communicatively connected to the offline voice collection and recognition hardware system;
the white light communication and indoor situation positioning system includes multiple LED white light circuits and an equal number of white light identification circuits, each place where scene recognition is required being provided with one LED white light circuit and one white light identification circuit for identifying the light emitted by the LED white light circuit, and each white light identification circuit being communicatively connected to the offline voice collection and recognition hardware system.
2. The intelligent robot semantic interaction system based on white light communication and brain-like cognition according to claim 1, characterized in that the embedded control system of the offline voice collection and recognition hardware system includes an STM32 embedded system, the speech recognition module includes an LD3320 speech recognition module, and the audio processing circuit includes an audio filter circuit, an audio amplifier circuit, multiple microphone arrays and multiple audio playing circuits; each place where scene recognition is required is mounted with one microphone array connected to the STM32 embedded system through the audio amplifier circuit and the audio filter circuit, the LD3320 speech recognition module and the multiple audio playing circuits are each connected to the STM32 embedded system, and each place where scene recognition is required is mounted with one audio playing circuit.
3. The intelligent robot semantic interaction system based on white light communication and brain-like cognition according to claim 1, characterized in that the brain-like semantic cognition hardware system includes an embedded control device, a remote communication module and a remote speech semantic recognition device; the embedded control device includes an ARM11 embedded system; the remote communication module includes a WiFi communication module, a 4G mobile communication module and a WLan router; the remote semantic recognition device includes a cloud voice semantic recognition platform, a cloud intelligent robot brain-like semantic interaction platform and a cloud speech synthesis platform; the ARM11 embedded system is connected to the WLan router through the WiFi communication module or the 4G mobile communication module; the cloud voice semantic recognition platform is connected in sequence to the cloud intelligent robot brain-like semantic interaction platform and the cloud speech synthesis platform; the cloud semantic interaction platform and the cloud speech synthesis platform are each communicatively connected to the WLan router; and the ARM11 embedded system is connected to the embedded control device of the offline voice collection and recognition hardware system.
4. The intelligent robot semantic interaction system based on white light communication and brain-like cognition according to claim 1, characterized in that the LED white light circuit of the white light communication and indoor situation positioning system includes a white light LED array, an LED array drive circuit, an LED white light communication signal modulation and demodulation circuit, and a white light driving and communication system STM32 controller; the white light LED array is correspondingly arranged at a place where scene recognition is required; the white light driving and communication system STM32 controller is communicatively connected to the white light LED array through the LED array drive circuit and the LED white light communication signal modulation and demodulation circuit; the white light identification circuit includes a high-speed photodiode sensor array and an LED white light demodulator circuit; the high-speed photodiode sensor array is correspondingly arranged at a place where scene recognition is required and is irradiated by the white light LED array; and the input end of the LED white light demodulator circuit is communicatively connected to the high-speed photodiode sensor array, with the output end communicatively connected to the offline voice collection and recognition hardware system.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201820632770.1U CN208335743U (en) | 2018-04-28 | 2018-04-28 | A kind of intelligent robot Semantic interaction system based on white light communication and the cognition of class brain |
Publications (1)
Publication Number | Publication Date |
---|---|
CN208335743U true CN208335743U (en) | 2019-01-04 |
Family
ID=64779506
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201820632770.1U Active CN208335743U (en) | 2018-04-28 | 2018-04-28 | An intelligent robot semantic interaction system based on white light communication and brain-like cognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN208335743U (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108717852A (en) * | 2018-04-28 | 2018-10-30 | Hunan Normal University | An intelligent robot semantic interaction system and method based on white light communication and brain-like cognition |
CN108717852B (en) * | 2018-04-28 | 2024-02-09 | Hunan Normal University | Intelligent robot semantic interaction system and method based on white light communication and brain-like cognition |
CN110174105A (en) * | 2019-06-14 | 2019-08-27 | Southwest University of Science and Technology | An intelligent agent autonomous navigation algorithm and system in complex environments |
CN110174105B (en) * | 2019-06-14 | 2022-02-11 | Southwest University of Science and Technology | Intelligent agent autonomous navigation algorithm and system in complex environment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108717852A (en) | | An intelligent robot semantic interaction system and method based on white light communication and brain-like cognition |
CN104700829B (en) | | Animal sound emotion recognition system and method |
CN107030691B (en) | | Data processing method and device for a nursing robot |
CN108000526B (en) | | Dialogue interaction method and system for an intelligent robot |
CN108877801A (en) | | Multi-turn dialogue semantic understanding subsystem based on a multimodal emotion recognition system |
CN108805087A (en) | | Semantic-temporal fusion association judgment subsystem based on a multimodal emotion recognition system |
CN108899050A (en) | | Speech signal analysis subsystem based on a multimodal emotion recognition system |
CN108805089A (en) | | Multimodal emotion recognition method |
CN106997243B (en) | | Speech scene monitoring method and device based on an intelligent robot |
CN108805088A (en) | | Physiological signal analysis subsystem based on a multimodal emotion recognition system |
CN106790628A (en) | | Smart home butler central control system with motion-sensing function and control method thereof |
CN108108340A (en) | | Dialogue interaction method and system for an intelligent robot |
CN102932212A (en) | | Intelligent home control system based on multichannel interaction |
CN106773766A (en) | | Smart home butler central control system with learning function and control method thereof |
CN108202334A (en) | | A dancing robot that can recognize music beat and style |
CN106896767A (en) | | A portable robot |
CN105744434A (en) | | Intelligent speaker control method and system based on gesture recognition |
CN104134060A (en) | | Sign language interpreting, display and sound production system based on electromyographic signals and motion sensors |
CN105139450B (en) | | A three-dimensional character construction method and system based on face simulation |
CN109176535A (en) | | Interaction method and system based on an intelligent robot |
CN208335743U (en) | | An intelligent robot semantic interaction system based on white light communication and brain-like cognition |
CN107942695A (en) | | Emotionally intelligent speech system |
CN114926837B (en) | | Emotion recognition method based on human-object spatio-temporal interaction behavior |
CN108197123A (en) | | A cloud translation system and method based on a smartwatch |
CN110457661A (en) | | Spatial term method, apparatus, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
GR01 | Patent grant | ||
GR01 | Patent grant |