Disclosure of Invention
To achieve the above object, the present invention provides an artificial intelligence robot cooperating with a human and a communication method, which greatly save precious wireless spectrum resources.
In order to achieve the aim, the invention provides an artificial intelligence robot cooperating with a human, which is characterized by comprising a voice recognition module, a first coding module and an image recognition module, wherein the voice recognition module generates a text unit string from received voice data or audio waveforms; the first coding module encodes each text unit in the text unit string and generates a binary instruction for controlling a servo mechanism of the robot; and the image recognition module detects an attention object captured in an image, acquires position information of the attention object on a first feature map, corrects the position information so that it corresponds to the resolution of a second feature map, which is an area range containing the image of the attention object on a feature map generated before the Nth stage, sets an attention region at the position indicated by the corrected position information on the first feature map, and extracts feature information on the attention object from the attention region.
Preferably, the voice recognition module comprises at least a convolutional neural network, and a transformation module generates a time-frequency-intensity 3D spectrogram from the voice data or audio waveforms to be transmitted; the convolutional neural network includes a plurality of convolutional layers that divide the speech data or audio waveforms into a plurality of words forming the text unit string according to a time-frequency 2D spectrogram in the 3D spectrogram.
Preferably, the speech recognition module is configured to train the weights for each channel of the convolutional neural network from at least one sampled segment of the speech data or audio waveform.
Preferably, the artificial intelligence robot further comprises a second encoding module which encodes the characteristic information to generate a binary string to be transmitted.
Preferably, each robot arm of the robot servo mechanism is driven by a bearingless motor using a winding that has both the function of generating torque and that of generating magnetic supporting force, the winding being selectively caused to generate supporting force or torque in accordance with the rotor rotation angle.
In order to achieve the purpose, the invention also provides a communication method comprising the steps of: generating a text unit string from received voice data or audio waveforms by means of the voice recognition module; encoding each text unit in the text unit string by means of a first coding module to generate a first binary character string to be sent; detecting an attention object captured in an image and acquiring position information of the attention object on a first feature map; correcting the position information so that it corresponds to the resolution of a second feature map, which is an area range containing the image of the attention object on a feature map generated before the Nth stage; setting an attention region at the position indicated by the corrected position information on the first feature map; and extracting feature information on the attention object from the attention region.
Preferably, the voice recognition module comprises at least a convolutional neural network, and a transformation module generates a time-frequency-intensity 3D spectrogram from the voice data or audio waveforms to be transmitted; the convolutional neural network includes a plurality of convolutional layers that divide the speech data or audio waveforms into a plurality of words forming the text unit string according to a time-frequency 2D spectrogram in the 3D spectrogram.
Preferably, the speech recognition module is configured to train the weights for each channel of the convolutional neural network from at least one sampled segment of the speech data or audio waveform.
Compared with the prior art, the artificial intelligence robot cooperating with a human and the communication method provided by the invention have the advantages that, at the transmitting end, the voice is divided into text units and the text units are encoded to generate the binary code string to be transmitted, while an attention region is generated from the image information to be transmitted, feature information related to the attention object is extracted from the corrected attention region, and the feature information related to the attention object is then encoded; as a result, the code stream to be transmitted is greatly reduced, the voice and video coding rate is lowered, and precious wireless spectrum resources are greatly saved.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings.
As used herein, the singular forms "a", "an" and "the" include plural referents unless the context clearly dictates otherwise. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. In the present invention, the term "comprising" means "including but not limited to", unless otherwise defined.
The terms "speech recognition module", "encoding module", "decoding module", "image recognition module" and "AI module" each refer to a device configured to be implemented, by hardware or software, as an integrated circuit having programmed functions, the integrated circuit containing electronic circuitry on a semiconductor material (e.g., silicon) for performing certain functions. For example, the integrated circuit may be a microprocessor, a Programmable Array Logic (PAL) device, an Application Specific Integrated Circuit (ASIC), or the like.
Fig. 1 is a block diagram of an electric control system of an artificial intelligence robot cooperating with a person according to the present invention. As shown in fig. 1, the electric control system includes a sound collector 1, a camera 2, a processor 6, a servo driver 5, a communication subsystem 7, a memory, a sound generation device 3 and a display screen 4. The sound collector 1, for example a microphone, converts audio information into an audio waveform electric signal. The camera 2 is a device that converts optical information into an electrical image and may be, for example, an infrared camera. The memory stores programs, data and a corpus image database 9. The processor 6 calls the programs and implements the functions of speech recognition, text encoding, image recognition, image encoding, decoding, speech synthesis and image synthesis. The electric control system further comprises a voice transformation module 61, which converts the audio waveform generated by the sound collector, or the voice data read from the memory, into time-frequency-intensity 3D spectrogram data; that is, the voice transformation module 61 performs framing, windowing, Fourier transformation and logarithm taking on the time-domain signal of the voice source to obtain a 3D spectrogram. The speech recognition module 62 generates an independent text unit string from a time-frequency 2D spectrogram in the 3D spectrogram. The text encoding module 63 encodes each text unit in the text unit string to generate a non-binary code string, and then converts the non-binary code string into a binary character string to be transmitted and into instruction information for controlling the robot servo driver 5. The image recognition module 64 generates feature information of the object of interest from the captured image. The image encoding module 65 encodes the feature information of the object of interest to generate a binary string to be transmitted or instruction information for controlling the servo driver 5. The decoding module 67 decodes the binary information sent by the user terminal to generate image feature information and the coded information corresponding to the text. The AI module 66 synthesizes the image feature information with background image information retrieved from the corpus image database 9 to generate image information and displays the image on the display screen 4; it also retrieves voice information from the corpus image database 9 according to the coded information, synthesizes audio information, and emits it through the sound generation device 3 to communicate with on-site personnel. The AI module 66 further synthesizes, from the corpus image database 9, audio information responding to the on-site personnel based on the voice recognition result of the speech recognition module 62 and emits it through the sound generation device 3.
In the present invention, each robot arm of the robot servo mechanism is driven by a bearingless motor using a winding that has both the function of generating torque and that of generating magnetic supporting force, the winding being selectively caused to generate supporting force or torque in accordance with the rotor rotation angle.
The electronic control system of the artificial intelligence robot cooperating with the human further comprises a communication subsystem 7, which includes a baseband unit 71 and a radio frequency unit 72. The baseband unit performs channel coding on the binary character string to be transmitted and converts it into a binary sequence to be transmitted, maximizing the average information carried by each code element of that sequence while ensuring correct information transmission; it also performs channel decoding on the received binary sequence, converts it into the binary information transmitted by the user terminal, and provides this information to the processor 6. The radio frequency unit 72 includes a modulator for modulating the signal output from the baseband unit to a high frequency, a final power amplifier for amplifying the signal output from the modulator, an output filter for matching the output impedance of the final power amplifier with the input impedance of the antenna 8, and the antenna, which converts the amplified electric signal into an electromagnetic wave and radiates it into space. The radio frequency unit further includes a small-signal amplifier for amplifying the electric signal received by the antenna 8, a mixer for down-converting the signal amplified by the small-signal amplifier with the local oscillator signal generated by a local oscillator to form an intermediate-frequency signal, and an analog-to-digital converter for performing analog-to-digital conversion on the intermediate-frequency signal to form a data signal.
According to one embodiment of the present invention, the speech recognition module 62 includes at least a Convolutional Neural Network (CNN) comprising a plurality of convolutional layers that convert the speech data or audio waveforms to be transmitted into a plurality of words forming text unit strings according to a time-frequency 2D spectrogram in the 3D spectrogram.
In the present invention, the transmitting device generates a time-frequency-intensity 3D sequence of the received speech using the speech transformation module 61. For example, each time-frequency-intensity 3D sequence may be a spectrogram. The 3D spectrogram may include an array of pixels (x, y, z).
Fig. 2 is a time-frequency-intensity 3D map provided by the present invention. As shown in fig. 2, x represents time within a segment of the audio waveform, y represents frequency within that segment, and each pixel (x, y) has a value z representing the audio intensity of the waveform segment at time x and frequency y. Additionally, the speech recognition module provided by the present invention may optionally generate mel-frequency cepstral coefficients (MFCCs) based on the time-frequency array, so that the value z of each pixel in the time-frequency array becomes an MFCC. In some cases, the MFCC array may provide a uniformly distributed power spectrum for data encoding, which may allow the speech recognition module to extract speaker-independent features. Each time-frequency 2D array may represent a 2D spectrogram of the speech signal at a time step. In speech recognition, each time step in the time-frequency 2D array sequence may be chosen to be small in order to capture certain transient characteristics of the speech signal.
The time steps on the time axis x of a time-frequency 2D spectrogram can be equally spaced, e.g., 10 ms or 50 ms in a speech application; in other words, each 2D spectrogram in the sequence can represent a time-frequency array over a 10 ms or 50 ms span. The duration represents a time period within the audio waveform of the speech signal. The sequence of time-frequency 2D arrays may be loaded into the first layer of the CNN of the speech recognition module. A small time step may allow the first layer of the CNN to see more samples within a small time window. Each time-frequency 2D array in the sequence may, however, have a low resolution, which allows a CNN layer to include data covering a longer time span in the audio waveform; as a result, the accuracy of speech recognition may be improved. Because a filter in the CNN can then cover a longer time frame, it can capture certain transient characteristics of speech, such as pitch and short or long sounds.
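By way of a non-limiting illustration of the framing, windowing, Fourier transform and logarithm steps used to obtain such a time-frequency-intensity map, a minimal sketch follows; the sampling rate, frame length and 10 ms hop are assumed example values, not values prescribed by the invention:

```python
import numpy as np

def log_spectrogram(waveform, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Frame, window, Fourier-transform and take the logarithm of a 1-D waveform.

    Returns a 2-D array indexed by (time frame x, frequency bin y) whose
    values z are log intensities, i.e. the time-frequency-intensity map."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    hop_len = int(sample_rate * hop_ms / 1000)       # samples between frame starts
    window = np.hanning(frame_len)                   # windowing
    frames = []
    for start in range(0, len(waveform) - frame_len + 1, hop_len):  # framing
        frame = waveform[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame)) ** 2   # Fourier transform -> power
        frames.append(np.log(spectrum + 1e-10))      # logarithm taking
    return np.stack(frames)                          # shape: (time, frequency)

# Example: one second of synthetic audio at 16 kHz with the 10 ms hop mentioned above.
wave = np.random.randn(16000).astype(np.float32)
spec = log_spectrogram(wave)
print(spec.shape)  # (number of time frames, number of frequency bins)
```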
In the present invention, the CNN training method may include: receiving a set of sample training speech data, which may include one or more segments of audio waveforms; and generating one or more sample 3D time-frequency-intensity sequences from the set of sample training speech data. The CNN training process may further include training one or more weights of the CNN using the one or more sample 3D spectrogram sequences, the trained weights being used to generate speech recognition results. In training the one or more weights of the CNN, the method may include receiving, for each set of sample training speech data, an indication of the class to which the sample training speech data belongs. The type and number of classes depend on the speech recognition task. For example, the task may be to recognize whether the speech is from a male or a female speaker; the speech recognition task may then comprise a binary classifier assigning any input data to the male-speaker or female-speaker class, and accordingly the training process may comprise receiving an indication of whether each training sample is from a male or a female speaker. The speech recognition task may also be designed to verify the identity of a speaker based on the speaker's speech, or to recognize the content of the speech input, such as syllables, words, phrases or sentences. In each of these cases, the CNN may include a multi-class classifier that assigns each input speech data segment to one of a plurality of classes.
Alternatively, in some scenarios, the speech recognition task may include feature extraction, where the speech recognition results may include vectors that are invariant for a given class of samples. Similar methods can be used in the CNN for both training and recognition; for example, the system may use the output of any fully connected layer of the CNN as the feature vector.
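A minimal sketch of such training and feature extraction is given below, assuming a binary male/female classification task as in the example above; the network dimensions, spectrogram size and optimizer settings are hypothetical choices for illustration only:

```python
import torch
import torch.nn as nn

class SmallSpeechCNN(nn.Module):
    def __init__(self, n_classes=2):              # e.g. male / female speaker
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.fc = nn.Linear(32 * 16 * 16, 64)      # fully connected feature layer
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):
        feat = self.fc(self.features(x).flatten(1))
        return self.classifier(feat), feat         # class logits and feature vector

model = SmallSpeechCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

spectrograms = torch.randn(8, 1, 64, 64)           # eight sample 2D spectrograms (assumed size)
labels = torch.randint(0, 2, (8,))                 # class indication per training sample
logits, features = model(spectrograms)
loss = loss_fn(logits, labels)                     # compare prediction with class indication
loss.backward()
optimizer.step()                                   # one update of the CNN weights
print(features.shape)                              # (8, 64): speaker-level feature vectors
```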
The speech recognition module of the present invention may adopt any speech recognition module in the prior art. Fig. 3 is a block diagram of the speech recognition module provided by the present invention. As shown in fig. 3, the speech recognition module includes a Convolutional Neural Network (CNN) that takes the time-frequency 2D spectrogram as input and, through a combination of a large number of convolutional layers and pooling layers, models the whole sentence so as to decompose a speech segment into a text unit string.
The Convolutional Neural Network (CNN) has five convolutional layers, three pooling layers, two fully-connected layers and a regression layer. The first convolutional layer 21-1 convolves the 2D spectrogram with a Con3 × 3 convolution kernel; it has 32 filters and outputs 32 features, after which the first max pooling 22-1 extracts the maximum parameters. The second convolutional layer convolves the spectrogram output by the first max pooling layer with a Con3 × 3 convolution kernel; it has 64 filters and outputs 64 features, after which the second max pooling 22-2 extracts the maximum parameters. The third convolutional layer 23-1 convolves the spectrogram output by the second max pooling layer with a Con3 × 3 convolution kernel; it has 128 filters and outputs 128 features. The fourth convolutional layer 23-2 convolves the spectrogram output by the third convolutional layer with a Con3 × 3 convolution kernel; it has 128 filters and outputs 128 features. The fifth convolutional layer 23-3 convolves the spectrogram output by the fourth convolutional layer with a Con3 × 3 convolution kernel; it has 128 filters and outputs 128 features, after which the third max pooling 23-4 extracts the maximum parameters. The output then passes through the two sequentially connected fully-connected layers 24-1 and 24-2 and finally enters the regression layer 25, which performs regression to distinguish text units. The speech recognition module 62 may use the last fully connected layer to store the feature vectors. Various configurations are possible depending on the size of the feature vector: a large feature vector may yield a large capacity and high accuracy for the classification task, whereas an overly large feature vector may reduce the efficiency of performing the speech recognition task.
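The layer stack described above may be sketched, for illustration only, as follows; the input spectrogram size, the widths of the fully-connected layers and the number of text-unit classes at the regression layer are assumptions, not values fixed by the invention:

```python
import torch
import torch.nn as nn

# Five Con3x3 convolutional layers (32, 64, 128, 128, 128 filters),
# three max-pooling layers, two fully-connected layers and a final
# regression layer, following the description above.
speech_cnn = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),     # conv 21-1: 32 features
    nn.MaxPool2d(2),                               # first max pooling 22-1
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),    # second conv: 64 features
    nn.MaxPool2d(2),                               # second max pooling 22-2
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),   # conv 23-1: 128 features
    nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),  # conv 23-2: 128 features
    nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),  # conv 23-3: 128 features
    nn.MaxPool2d(2),                               # third max pooling 23-4
    nn.Flatten(),
    nn.Linear(128 * 8 * 8, 256), nn.ReLU(),        # fully-connected 24-1 (width assumed)
    nn.Linear(256, 128), nn.ReLU(),                # fully-connected 24-2 (width assumed)
    nn.Linear(128, 50))                            # regression layer 25 (50 text-unit classes assumed)

x = torch.randn(1, 1, 64, 64)                      # one 64 x 64 2D spectrogram (assumed size)
print(speech_cnn(x).shape)                         # torch.Size([1, 50])
```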
According to an embodiment of the present invention, the artificial intelligence robot electric control system further includes an image recognition module using a Convolutional Neural Network (CNN). The image recognition module takes an image frame input by the camera as an image Im, detects an object of interest captured in the image Im, estimates the position of the detected object of interest, and generates feature information according to the position of the object of interest.
Fig. 4 is a flow chart showing the work flow of the image recognition module provided by the present invention. As shown in fig. 4, the image recognition module uses a convolutional neural network and comprises at least a generating unit, an acquiring unit, a correcting unit and an extracting unit. The generating unit generates, from an input image, feature maps whose resolution decreases from the 1st level to the Nth level, and generates a first feature map using the feature map of the Nth level. The acquiring unit detects an attention object captured in the image and acquires position information of the attention object on the first feature map. The correcting unit corrects the position information so that it corresponds to the resolution of a second feature map, which is the range of the attention object image on a feature map generated before the Nth level. The extracting unit sets a region of interest at the position indicated by the corrected position information on the first feature map and extracts feature information on the attention object from the region of interest.
For example, the generation unit includes an input layer 51 and an N-level feature extraction unit, where N is 2 or more, for example N = 5. The convolutional layer 52-1 and the pooling layer 53-1 constitute level 1: the convolutional layer 52-1 convolves the image input by the input layer 51 to generate 10 feature maps M1-M10, whose size is the same as that of the image Im, 1024 pixels × 1024 pixels, and the pooling layer 53-1 pools the 10 feature maps respectively to generate 10 feature maps M11-M20, which are smaller than the feature maps M1-M10, at 512 pixels × 512 pixels. The convolutional layer 52-2 and the pooling layer 53-2 constitute level 2: the convolutional layer 52-2 performs convolution on the 10 feature maps M11-M20 respectively to generate 10 feature maps M21-M30 of 512 pixels × 512 pixels, and the pooling layer 53-2 pools the 10 feature maps M21-M30 respectively to generate 10 feature maps M31-M40 of 256 pixels × 256 pixels. The convolutional layer 52-3 and the pooling layer 53-3 constitute level 3: the convolutional layer 52-3 performs convolution on the 10 feature maps M31-M40 respectively to generate 10 feature maps M41-M50 of 256 pixels × 256 pixels, and the pooling layer 53-3 pools the 10 feature maps M41-M50 respectively to generate 10 feature maps M51-M60 of 128 pixels × 128 pixels. The convolutional layer 52-4 and the pooling layer 53-4 constitute level 4: the convolutional layer 52-4 performs convolution on the 10 feature maps M51-M60 respectively to generate 10 feature maps M61-M70 of 128 pixels × 128 pixels, and the pooling layer 53-4 pools the 10 feature maps M61-M70 respectively to generate 10 feature maps M71-M80 of 64 pixels × 64 pixels. The convolutional layer 52-5 and the pooling layer 53-5 constitute level 5: the convolutional layer 52-5 performs convolution on the 10 feature maps M71-M80 respectively to generate 10 feature maps M81-M90 of 64 pixels × 64 pixels, and the pooling layer 53-5 pools the 10 feature maps M81-M90 respectively to generate 10 feature maps M91-M100 of 32 pixels × 32 pixels. In an alternative embodiment, the pooling layers 53 may be omitted. As the resolution of the feature maps M decreases from level 1 to level 5, whenever the longitudinal and lateral sizes of a feature map M are halved, the longitudinal and lateral sizes of the range S are also halved.
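For illustration, a minimal sketch of such a generating unit follows, keeping the 10 feature maps per level, the 1024 pixel × 1024 pixel input image Im and the halving of resolution at each of the 5 levels; the kernel size and the assumption of a 3-channel input image are illustrative choices:

```python
import torch
import torch.nn as nn

class PyramidGenerator(nn.Module):
    """Five levels, each a convolution (resolution preserved) followed by a
    2x2 pooling that halves the resolution, yielding 10 feature maps per level."""
    def __init__(self, levels=5, channels=10):
        super().__init__()
        self.levels = nn.ModuleList()
        in_ch = 3                                     # assumed 3-channel input image Im
        for _ in range(levels):
            self.levels.append(nn.Sequential(
                nn.Conv2d(in_ch, channels, 3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2)))                     # halves height and width
            in_ch = channels

    def forward(self, im):
        maps = []
        x = im
        for level in self.levels:
            x = level(x)
            maps.append(x)                            # M11-M20, M31-M40, ... per level
        return maps                                   # the last entry plays the role of the first feature map

gen = PyramidGenerator()
im = torch.randn(1, 3, 1024, 1024)                    # image Im, 1024 x 1024 pixels
for i, m in enumerate(gen(im), start=1):
    print(f"level {i}:", tuple(m.shape))              # sides 512, 256, 128, 64, 32
```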
The RPN layer 54 detects the object of interest and its position information P from the features of the feature maps M91-M100. The RPN layer 54 has the function of the acquisition unit: it detects the person OB captured in the image Im using the first feature map generated at the last of the plurality of stages, and acquires the position information P of the person on the first feature map. In this embodiment, the first feature map is the feature maps M91-M100.
Referring to fig. 4, the selection unit 59 obtains the second feature map from a stage other than the last stage, at which the first feature map is obtained. More specifically, the second feature map is the attention object image range S on a feature map M generated in a stage preceding the 5th stage. The selection unit 59 operates switches so that any one of the attention object image range S (48 pixels × 48 pixels) on the feature maps M11-M20 obtained by the 1st-level pooling layer 53-1, the attention object image range S (24 pixels × 24 pixels) on the feature maps M31-M40 obtained by the 2nd-level pooling layer 53-2, the attention object image range S (12 pixels × 12 pixels) on the feature maps M51-M60 obtained by the 3rd-level pooling layer 53-3, and the attention object image range S (6 pixels × 6 pixels) on the feature maps M71-M80 obtained by the 4th-level pooling layer 53-4 can be selected as the second feature map.
For example, the attention image range S (12 pixels × 12 pixels) on the feature maps M51-M60 obtained by the 3 rd-level pooling layer 53-3 is selected as the second feature map and is referred to as the attention region R. Since the feature information F does not include information related to the position if the size of the region of interest R is too small, the lower limit value of the size of the region of interest R is determined in advance so that the information related to the position is included in the feature information F. Since the resolution of the feature map M decreases from the 1 st level to the 5 th level, the range S of the attention object (range to be detected) captured in the image Im also decreases from the 1 st level to the 5 th level.
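One plausible reading of this selection rule is sketched below: pick the deepest candidate level whose attention object image range S still meets the predetermined lower limit; the lower-limit value and the mapping of ranges to levels are assumptions taken from the example figures above:

```python
# Attention object image range S (pixels per side) at each candidate level,
# as listed above: level 1 -> 48, level 2 -> 24, level 3 -> 12, level 4 -> 6.
RANGE_PER_LEVEL = {1: 48, 2: 24, 3: 12, 4: 6}
MIN_ROI_SIDE = 12          # assumed lower limit so that position information survives

def select_second_feature_map_level():
    """Pick the deepest level whose range S still meets the lower limit."""
    candidates = [lvl for lvl, side in RANGE_PER_LEVEL.items() if side >= MIN_ROI_SIDE]
    return max(candidates)  # deepest admissible level, e.g. level 3 (12 x 12)

print(select_second_feature_map_level())  # 3
```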
Referring to fig. 5, the correction unit 58 corrects the position information P generated by the RPN layer 54. The reason is that the position information P is the position information of the attention object image range S on the feature maps M91-M100, whose resolution differs from that of the second feature map. The position information P is given, for example, by the coordinates C1, C2, C3 and C4.
In this embodiment, the resolution of the feature maps M51-M60 is higher than that of the feature maps M91-M100. Therefore, the correcting unit 58 shown in fig. 4 corrects the position information P on the first feature map so that it corresponds to the resolution of the person image range (second feature map) on the feature maps M51-M60. The resolution of the attention object image range S is 48 pixels × 48 pixels on the feature maps M11-M20, 24 pixels × 24 pixels on the feature maps M31-M40, 12 pixels × 12 pixels on the feature maps M51-M60, 6 pixels × 6 pixels on the feature maps M71-M80, and 3 pixels × 3 pixels on the feature maps M91-M100.
The correction unit 58 corrects the position information P on the first feature map so that the area of the region of interest R indicated by the position information P is enlarged by 4 times as shown in fig. 6. Specifically, the correcting unit 58 corrects the coordinate C1 to the coordinate C5, corrects the coordinate C2 to the coordinate C6, corrects the coordinate C3 to the coordinate C7, and corrects the coordinate C4 to the coordinate C8. The region of interest R whose position is determined by the coordinates C5, C6, C7, and C8 is centered on the position region formed by the coordinates C1, C2, C3, and C4.
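Under one plausible reading of this correction, each corner coordinate is scaled by the ratio of the second feature map's resolution to the first feature map's resolution (here 12 / 3 = 4); the sketch below illustrates that reading with hypothetical coordinates, and the exact correction rule is of course the one shown in the figures:

```python
def correct_position(coords, first_res=3, second_res=12):
    """Scale RoI corner coordinates from the first feature map's resolution
    to the second feature map's resolution (e.g. C1..C4 -> C5..C8)."""
    scale = second_res / first_res          # 12 / 3 = 4
    return [(x * scale, y * scale) for (x, y) in coords]

# Position information P on the first feature map (hypothetical coordinates).
p = [(0, 0), (2, 0), (0, 2), (2, 2)]        # C1, C2, C3, C4
print(correct_position(p))                  # corrected coordinates C5, C6, C7, C8
```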
The correction unit 58 transfers the first feature map, together with the corrected position information P, to the RoI pooling layer 55. The RoI pooling layer 55 functions as the extraction unit, extracting the feature information F related to the object of interest from the region of interest R.
The RoI pooling layer 55 pools the regions of interest R to obtain the feature information F1 to F10 related to the object of interest, each shaped to the same fixed size, for example 4 pixels × 4 pixels.
The RoI pooling described above is now detailed further. As described above, RoI pooling is a process of extracting the region of interest R and converting it into a feature map of a fixed size (for example, 4 pixels × 4 pixels); this feature map becomes the feature information F. For example, when the size of the region of interest R is 12 pixels × 12 pixels and a 4 pixel × 4 pixel feature map (feature information F) is to be produced, the RoI pooling layer 55 divides the 12 pixel × 12 pixel region of interest R into a 4 × 4 grid of cells of 3 pixels × 3 pixels each and pools each cell. The same processing is performed even when the size of the region of interest R is not evenly divisible by the grid size.
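A minimal NumPy sketch of this pooling step follows, assuming max pooling within each grid cell and using the 12 pixel × 12 pixel to 4 pixel × 4 pixel case described above; when the region is not evenly divisible, the cell boundaries are simply rounded:

```python
import numpy as np

def roi_pool(region, out_h=4, out_w=4):
    """Divide a region of interest into an out_h x out_w grid and max-pool
    each cell, so any region collapses to the same fixed-size feature map."""
    h, w = region.shape
    ys = np.linspace(0, h, out_h + 1).astype(int)   # cell boundaries (rounded when
    xs = np.linspace(0, w, out_w + 1).astype(int)   # the size is not evenly divisible)
    out = np.empty((out_h, out_w), dtype=region.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

roi = np.arange(144, dtype=np.float32).reshape(12, 12)   # a 12 x 12 region of interest R
print(roi_pool(roi).shape)                                # (4, 4): the feature information F
```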
Referring to fig. 4, the RoI pooling layer 55 sends the feature information F1-F10 to the fully-connected layer 56. The fully-connected layer 56 performs regression analysis on the feature information F1-F10 to generate a regression result RR, and sends the regression result RR to the output layer 57. The output layer 57 sends the regression result RR to the image encoding module 65 shown in fig. 1.
In the present invention, the resolution of the second feature map is higher than the resolution of the attention object range S on the first feature map. Therefore, the feature information F extracted from the region of interest R set on the second feature map contains more information about position than feature information extracted from the attention object image range S set on the first feature map. Consequently, if the feature information F extracted from the region of interest R set on the second feature map is used, accurate position information of each part of the on-site personnel can be estimated.
In the invention, because the robot transmits to the remote user terminal the text coding information and the encoded feature information related to the attention object extracted from the attention region, the binary code stream that needs to be transmitted is greatly reduced, thereby saving wireless spectrum resources.
It will be readily understood that the overall solution of the invention as described in the description and the drawings can be designed in a number of different configurations. Thus, the more detailed description of various implementations as represented in the specification and drawings is not intended to limit the scope of the disclosure, but is merely representative of various exemplary implementations. While various aspects of the present solution are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated. The described embodiments of the invention are to be considered in all respects only as illustrative and not restrictive. Therefore, the protection scope of the invention is determined by the claims rather than by the detailed description of the specification, and all changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.