CN111508495B - Artificial intelligence robot cooperating with a human, and communication method


Info

Publication number
CN111508495B
CN111508495B
Authority
CN
China
Prior art keywords
feature map
image
recognition module
information
neural network
Prior art date
Legal status
Active
Application number
CN202010368696.9A
Other languages
Chinese (zh)
Other versions
CN111508495A (en)
Inventor
郭振峰
来春丽
张海滨
连芷萱
王忠斌
闵松阳
杨嘉琪
张瑜佳
马志
席跃东
席跃君
李敏
Current Assignee
Zhiren (Beijing) Technology Co.,Ltd.
Original Assignee
Beijing Hualande Technology Consulting Service Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Beijing Hualande Technology Consulting Service Co., Ltd.
Priority to CN202010368696.9A
Publication of CN111508495A
Application granted
Publication of CN111508495B

Classifications

    • G: PHYSICS
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00: Computing arrangements based on biological models
            • G06N 3/02: Neural networks
              • G06N 3/04: Architecture, e.g. interconnection topology
                • G06N 3/045: Combinations of networks
              • G06N 3/08: Learning methods
        • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V 10/00: Arrangements for image or video recognition or understanding
            • G06V 10/40: Extraction of image or video features
              • G06V 10/46: Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
                • G06V 10/462: Salient features, e.g. scale invariant feature transforms [SIFT]
          • G06V 20/00: Scenes; Scene-specific elements
            • G06V 20/10: Terrestrial scenes
      • G10: MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L 15/00: Speech recognition
            • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
              • G10L 15/063: Training
            • G10L 15/08: Speech classification or search
              • G10L 15/16: Speech classification or search using artificial neural networks
            • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
              • G10L 2015/223: Execution procedure of a spoken command
          • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
            • G10L 25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
              • G10L 25/24: Speech or voice analysis techniques in which the extracted parameters are the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

An artificial intelligence robot that cooperates with a person, and a communication method. In the robot and the communication method, an image recognition module performs image recognition through a convolutional neural network to generate feature information, and the feature information is encoded to generate a binary character string to be transmitted. The artificial intelligence robot cooperating with a human and the communication method provided by the invention can not only identify the details of an image but also save wireless resources.

Description

Artificial intelligence robot cooperating with a human, and communication method
Technical Field
The invention relates to an artificial intelligence robot cooperating with a human and a communication method, and belongs to the technical field of artificial intelligence.
Background
A robot can replace a person in performing various tasks in dangerous places, for example sites of infectious disease, thereby preventing the person from being infected. However, robots provided by the prior art need to occupy a large amount of wireless spectrum when interacting with a remote user terminal.
Disclosure of Invention
In view of the above, the present invention provides an artificial intelligence robot cooperating with a human, and a communication method, which greatly save precious wireless spectrum resources.
To this end, the invention provides an artificial intelligence robot cooperating with a human, characterized by comprising a voice recognition module, a first encoding module and an image recognition module, wherein the voice recognition module generates a text unit string from received voice data or audio waveforms, and the first encoding module encodes each text unit in the text unit string to generate a binary instruction for controlling a servo mechanism of the robot. The image recognition module performs image recognition through a convolutional neural network: feature maps whose resolution decreases from the 1st level to the Nth level are generated from an input image, and a first feature map is generated from the Nth-level feature map; an object of interest captured in the image is detected, and position information of the object of interest on the first feature map is acquired; the position information is corrected so that it corresponds to the resolution of a second feature map, which is the area range containing the image of the object of interest on a feature map generated before the Nth level; a region of interest located at the position indicated by the corrected position information is set on the first feature map, and feature information about the object of interest is extracted from the region of interest.
Preferably, the voice recognition module comprises at least a convolutional neural network and a transformation module; the transformation module generates a time-frequency-intensity 3D spectrogram from the voice data or audio waveform to be transmitted, and the convolutional neural network includes a plurality of convolutional layers that divide the voice data or audio waveform into a plurality of words to form the text unit string according to the time-frequency 2D spectrogram contained in the 3D spectrogram.
Preferably, the speech recognition module is configured to train the weights for each channel of the convolutional neural network from at least one sampled segment of the speech data or audio waveform.
Preferably, the artificial intelligence robot further comprises a second encoding module which encodes the characteristic information to generate a binary string to be transmitted.
Preferably, each robot arm of the robot servo mechanism is driven by a bearingless motor using a winding that serves the dual functions of generating torque and magnetic support force; the winding is selectively made to generate the support force or the torque according to the rotor rotation angle.
To achieve the same purpose, the invention also provides a communication method comprising: generating a text unit string from received voice data or audio waveforms by means of the voice recognition module; encoding each text unit in the text unit string by means of a first encoding module to generate a first binary character string to be transmitted; and generating feature information by means of an image recognition module and encoding the feature information by means of a second encoder to generate a second binary character string. The image recognition module performs image recognition through a convolutional neural network: feature maps whose resolution decreases from the 1st level to the Nth level are generated from an input image, and a first feature map is generated from the Nth-level feature map; an object of interest captured in the image is detected, and position information of the object of interest on the first feature map is acquired; the position information is corrected so that it corresponds to the resolution of a second feature map, which is the area range containing the image of the object of interest on a feature map generated before the Nth level; a region of interest located at the position indicated by the corrected position information is set on the first feature map, and feature information about the object of interest is extracted from the region of interest.
Preferably, the voice recognition module comprises at least a convolutional neural network and a transformation module; the transformation module generates a time-frequency-intensity 3D spectrogram from the voice data or audio waveform to be transmitted, and the convolutional neural network includes a plurality of convolutional layers that divide the voice data or audio waveform into a plurality of words to form the text unit string according to the time-frequency 2D spectrogram contained in the 3D spectrogram.
Preferably, the speech recognition module is configured to train the weights for each channel of the convolutional neural network from at least one sampled segment of the speech data or audio waveform.
Compared with the prior art, in the artificial intelligence robot cooperating with a human and the communication method provided by the invention, speech is divided into text units at the transmitting end and the text units are encoded to generate a binary code string to be transmitted; a region of interest is generated from the image information to be transmitted, feature information related to the object of interest is extracted from the corrected region of interest, and this feature information is then encoded. The code stream to be transmitted is thereby greatly reduced, the voice and video coding rate is lowered, and precious wireless spectrum resources are greatly saved.
Drawings
FIG. 1 is a block diagram of the electrical control system of an artificial intelligence robot cooperating with a human provided by the present invention;
FIG. 2 is a time-frequency-intensity 3D plot provided by the present invention;
FIG. 3 is a block diagram of the components of a speech recognition module provided by the present invention;
FIG. 4 is a flow chart of the operation of the image recognition module provided by the present invention;
FIG. 5 is a first feature map with a range of an image of an object of interest;
FIG. 6 is a first feature map in which the position information of the object of interest has been corrected.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings.
As used herein, the singular forms "a", "an" and "the" include plural referents unless the context clearly dictates otherwise. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. In the present invention, the term "comprising" means "including but not limited to", unless otherwise defined.
The terms "speech recognition module", "encoding module", "decoding module", "image recognition module" and "AI module" each refer to a device configured to be implemented by hardware or software as an integrated circuit having programmed functions, the integrated circuit "containing electronic circuitry on a semiconductor material (e.g., silicon) for performing certain functions. For example, the integrated circuit may be a microprocessor, a Programmable Array Logic (PAL) device, an Application Specific Integrated Circuit (ASIC), or the like.
Fig. 1 is a block diagram of the electric control system of an artificial intelligence robot cooperating with a person according to the present invention. As shown in Fig. 1, the electric control system includes a sound collector 1, a camera 2, a processor 6, a servo driver 5, a communication subsystem 7, a memory, a sound generation device 3 and a display screen 4. The sound collector 1, for example a microphone, converts audio information into an audio waveform electric signal. The camera 2 is a device that converts optical information into an electrical image and may be, for example, an infrared camera. The memory stores programs, data and a corpus image database 9. The processor 6 calls the programs and implements the functions of speech recognition, text encoding, image recognition, image encoding, decoding, speech synthesis and image synthesis. The electric control system further comprises a voice transformation module 61 for converting the audio waveform generated by the sound collector, or the voice data taken from the memory, into time-frequency-intensity 3D map voice data; that is, the voice transformation module 61 performs framing, windowing, Fourier transformation and logarithm taking on the time-domain signal of the voice source to obtain a 3D map. The speech recognition module 62 generates an independent text unit string according to the time-frequency 2D map in the 3D map, and the text encoding module 63 encodes each text unit in the text unit string to generate a non-binary code string, which is then converted into a binary character string to be transmitted and into instruction information for controlling the robot servo driver 5. The image recognition module 64 generates feature information of the object of interest from the captured image. The image encoding module 65 encodes the feature information of the object of interest to generate a binary string to be transmitted or instruction information for controlling the servo driver 5. The decoding module 67 decodes the binary information sent by the user terminal to generate the feature information of the image and the coded information corresponding to the text. The AI module 66 synthesizes the feature information of the image with background image information taken from the corpus image database 9 to generate image information and displays the image on the display screen 4; it also takes voice information from the corpus image database 9 according to the coded information, synthesizes audio information and plays it through the sound generation device 3 to communicate with on-site personnel. The AI module 66 further synthesizes, from the corpus image database 9, audio information responding to on-site personnel based on the voice recognition result of the speech recognition module 62 and plays it through the sound generation device 3.
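For illustration only (not part of the claimed subject matter), the following minimal Python sketch shows one way the framing, windowing, Fourier transformation and logarithm steps attributed to the voice transformation module 61 could be realized; the frame length, hop size and window type are assumptions chosen for the example.

```python
import numpy as np

def log_spectrogram(waveform, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Return a (time, frequency) array of log power values (the z-axis of the 3D map)."""
    frame_len = int(sample_rate * frame_ms / 1000)    # samples per frame
    hop_len = int(sample_rate * hop_ms / 1000)        # frame step along the time axis x
    window = np.hanning(frame_len)                    # windowing
    # assumes len(waveform) >= frame_len
    n_frames = 1 + (len(waveform) - frame_len) // hop_len
    rows = []
    for i in range(n_frames):                         # framing
        frame = waveform[i * hop_len : i * hop_len + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2       # Fourier transform (frequency axis y)
        rows.append(np.log(power + 1e-10))            # logarithm taking (intensity z)
    return np.stack(rows)                             # shape: (time steps, frequency bins)
```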
In the present invention, each robot arm of the robot servo mechanism is driven by a bearingless motor using a winding that serves the dual functions of generating torque and magnetic support force; the winding is selectively made to generate the support force or the torque according to the rotor rotation angle.
The electric control system of the artificial intelligence robot cooperating with a human further comprises a communication subsystem 7 that includes a baseband unit 71 and a radio frequency unit 72. The baseband unit performs channel coding on the binary character string to be transmitted and converts it into a binary sequence to be transmitted, maximizing the average amount of information carried by each symbol of the sequence while ensuring correct transmission; it also performs channel decoding on the received binary sequence, converts it into the binary information transmitted by the user terminal, and provides that information to the processor 6. The radio frequency unit 72 includes a modulator for modulating the signal output from the baseband unit to a high frequency, a final power amplifier for amplifying the signal output from the modulator, an output filter for matching the output impedance of the final power amplifier to the input impedance of the antenna 8, and the antenna 8 for converting the electric signal amplified by the final power amplifier into an electromagnetic signal and transmitting it into space. The radio frequency unit further includes a small-signal amplifier for amplifying the electric signal received by the antenna 8, a mixer for down-converting the amplified signal with the local oscillator signal generated by a local oscillator to form an intermediate frequency signal, and an analog-to-digital converter for converting the intermediate frequency signal into a data signal.
According to one embodiment of the present invention, the speech recognition module 62 includes at least a convolutional neural network (CNN) comprising a plurality of convolutional layers, which divides the speech data or audio waveform to be transmitted into a plurality of words forming a text unit string according to the time-frequency 2D spectrogram in the 3D spectrogram.
In the present invention, the transmitting device generates a time-frequency-intensity 3D sequence of the received speech using the speech transformation module 61. For example, each time-frequency-intensity 3D sequence may be a spectrogram. The 3D spectrogram may include an array of pixels (x, y, z).
Fig. 2 is a time-frequency-intensity 3D map provided by the present invention. As shown in Fig. 2, x represents time within a segment of the audio waveform, y represents frequency within that segment, and z is the value of each pixel (x, y), representing the audio intensity of the waveform segment at time x and frequency y. Additionally, the speech recognition module provided by the present invention may optionally generate a mel-frequency cepstrum (MFC) from the time-frequency array, so that each pixel in the time-frequency array becomes a mel-frequency cepstral coefficient (MFCC), i.e., a z value. In some cases, the MFCC array may provide a uniformly distributed power spectrum for data encoding, which may allow the speech recognition module to extract speaker-independent features. Each time-frequency 2D array may represent a 2D spectrogram of the speech signal at a time step. In speech recognition, each time step in the sequence of time-frequency 2D arrays may be chosen to be small in order to capture certain transient characteristics of the speech signal.
The time steps along the time axis x of a time-frequency 2D spectrogram can be equally spaced, e.g., 10 ms or 50 ms in a speech application; in other words, each 2D spectrogram in the sequence can represent a time-frequency array spanning 10 ms or 50 ms. This duration represents a time period within the audio waveform of the speech signal. The sequence of time-frequency 2D arrays may be loaded into the first layer of the CNN of the speech recognition module. A small time step allows the first layer of the CNN to see more samples within a small time window. At the same time, each time-frequency 2D array in the sequence may have a low resolution, which allows a CNN layer to include data covering a longer time span of the audio waveform; as a result, the accuracy of speech recognition may be improved, because a filter in the CNN can cover a longer time frame and can therefore capture certain transient characteristics of speech, such as pitch and short or long sounds.
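As an illustrative sketch only, the MFCC array mentioned above could be produced as follows; the use of the librosa library, the 13-coefficient setting and the 10 ms hop are assumptions made for the example rather than features of the invention.

```python
import numpy as np
import librosa  # assumed third-party library for this illustration

def mfcc_frames(waveform, sample_rate=16000, n_mfcc=13, hop_ms=10):
    """Return an array of shape (time steps, n_mfcc): one MFCC vector per 10 ms frame."""
    hop_len = int(sample_rate * hop_ms / 1000)
    mfcc = librosa.feature.mfcc(y=np.asarray(waveform, dtype=np.float32),
                                sr=sample_rate, n_mfcc=n_mfcc, hop_length=hop_len)
    return mfcc.T
```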
In the present invention, the CNN training method may include: receiving a set of sample training speech data, which may include one or more segments of audio waveforms, and generating one or more sample 3D time-frequency-intensity sequences from the set of sample training speech data. The CNN training process may further include training one or more weights of the CNN using the one or more sequences of sample 3D spectrograms, the trained weights then being used to generate speech recognition results. When training the one or more weights of the CNN, the method may include, for each set of sample training speech data, receiving an indication of the class to which the sample training speech data belongs. The type and number of classes depend on the speech recognition task. For example, the task may be designed to recognize whether the speech comes from a male or a female speaker; the speech recognition task may then comprise a binary classifier assigning any input data to the male-speaker or female-speaker class, and accordingly the training process may comprise receiving an indication of whether each training sample comes from a male or a female speaker. The speech recognition task may also be designed to verify the identity of a speaker based on the speaker's speech, or to recognize the content of the speech input, such as syllables, words, phrases or sentences. In each of these cases, the CNN may include a multi-class classifier that assigns each input speech data segment to one of a plurality of classes.
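A minimal training-loop sketch for the binary male/female classifier described above is given below for illustration; it assumes PyTorch, a spectrogram CNN `model` with two output classes, and a data loader yielding (spectrogram, label) batches, all of which are assumptions of the example rather than elements disclosed by the invention.

```python
import torch
import torch.nn as nn

def train_classifier(model, loader, epochs=10, lr=1e-3):
    """Train CNN weights on labelled sample 3D spectrograms (labels: 0 = male, 1 = female)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()                 # binary classification via two logits
    model.train()
    for _ in range(epochs):
        for spectrograms, labels in loader:         # spectrograms: (B, 1, T, F) tensors
            optimizer.zero_grad()
            loss = loss_fn(model(spectrograms), labels)
            loss.backward()                         # gradients w.r.t. the CNN weights
            optimizer.step()                        # weight update
    return model
```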
Alternatively, in some scenarios, the speech recognition task may include feature extraction, where the speech recognition results may include vectors that may be invariant for a given class of samples. In CNN, similar methods can be used for both training and recognition. For example, the system may use any fully connected layer in CNN.
The speech recognition module of the present invention may adopt any speech recognition module of the prior art. Fig. 3 is a block diagram of the components of the speech recognition module provided by the present invention. As shown in Fig. 3, the speech recognition module includes a convolutional neural network (CNN) that takes a time-frequency 2D spectrogram as input and, through a combination of a large number of convolutional and pooling layers, models the whole sentence so as to decompose a speech segment into a text unit string.
The convolutional neural network (CNN) has five convolutional layers, three pooling layers, two fully-connected layers and a regression layer. The first convolutional layer 21-1 convolves the 2D spectrogram with a 3 × 3 convolution kernel; it has 32 filters and outputs 32 feature maps, after which the first max pooling 22-1 extracts the maximum parameter. The second convolutional layer 22-1 convolves the spectrogram output by the first max pooling layer with a 3 × 3 convolution kernel; it has 64 filters and outputs 64 feature maps, after which the second max pooling 22-2 extracts the maximum parameter. The third convolutional layer 23-1 convolves the spectrogram output by the second max pooling layer with a 3 × 3 convolution kernel and has 128 filters outputting 128 feature maps. The fourth convolutional layer 23-2 convolves the output of the third convolutional layer with a 3 × 3 convolution kernel and has 128 filters outputting 128 feature maps. The fifth convolutional layer 23-3 convolves the output of the fourth convolutional layer with a 3 × 3 convolution kernel and has 128 filters outputting 128 feature maps, after which the third max pooling 23-4 extracts the maximum parameter. The output is then passed through the two sequentially connected fully-connected layers 24-1 and 24-2 and finally enters the regression layer 25, which performs regression to distinguish text units. The speech recognition module 62 may use the last fully-connected layer to store the feature vectors. Various configurations are possible depending on the size of the feature vector: a large feature vector may give the classification task large capacity and high accuracy, while an excessively large feature vector may reduce the efficiency of the speech recognition task.
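For illustration only, a hedged PyTorch sketch of the topology just described (five 3 × 3 convolutional layers with 32/64/128/128/128 filters, three max-pooling layers, two fully-connected layers and a final classification/regression layer) follows; the input size, the fully-connected widths and the number of output classes are assumptions, since the patent does not specify them.

```python
import torch
import torch.nn as nn

class SpeechCNN(nn.Module):
    def __init__(self, n_classes=100, in_hw=(128, 128)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # conv1 + pool1
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # conv2 + pool2
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),                   # conv3
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),                  # conv4
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # conv5 + pool3
        )
        flat = 128 * (in_hw[0] // 8) * (in_hw[1] // 8)  # three 2x2 poolings halve H and W three times
        self.fc1 = nn.Linear(flat, 512)                 # first fully-connected layer
        self.fc2 = nn.Linear(512, 256)                  # second fully-connected layer (feature vector)
        self.out = nn.Linear(256, n_classes)            # regression/classification over text units

    def forward(self, x):                               # x: (B, 1, T, F) spectrogram batch
        h = self.features(x).flatten(1)
        h = torch.relu(self.fc1(h))
        h = torch.relu(self.fc2(h))
        return self.out(h)
```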
According to an embodiment of the present invention, the electric control system of the artificial intelligence robot further includes an image recognition module using a convolutional neural network (CNN). The image recognition module takes an image frame input from the camera as the image Im, detects an object of interest appearing in the image Im, estimates the position of the detected object of interest, and generates feature information according to the position of the object of interest.
Fig. 4 is a flow chart of the operation of the image recognition module provided by the present invention. As shown in Fig. 4, the image recognition module uses a convolutional neural network and comprises at least a generating unit, an acquiring unit, a correcting unit and an extracting unit. The generating unit generates, from an input image, feature maps whose resolution becomes lower from the 1st level to the Nth level, and generates a first feature map using the Nth-level feature map. The acquiring unit detects the object of interest captured in the image and acquires position information of the object of interest on the first feature map. The correcting unit corrects the position information so that it corresponds to the resolution of a second feature map, which is the range of the object-of-interest image on a feature map generated before the Nth level. The extracting unit sets, on the first feature map, a region of interest located at the position indicated by the corrected position information, and extracts feature information about the object of interest from the region of interest.
For example, the generating unit includes an input layer 51 and N levels of feature extraction units, where N is 2 or more, for example N = 5. The convolutional layer 52-1 and the pooling layer 53-1 constitute level 1: the convolutional layer 52-1 convolves the image input by the input layer 51 to generate 10 feature maps M1-M10 whose size equals the 1024 pixels × 1024 pixels of the image Im, and the pooling layer 53-1 pools the 10 feature maps to generate 10 feature maps M11-M20 of a smaller size, 512 pixels × 512 pixels. The convolutional layer 52-2 and the pooling layer 53-2 constitute level 2: the convolutional layer 52-2 convolves the 10 feature maps M11-M20 to generate 10 feature maps M21-M30 of 512 pixels × 512 pixels, and the pooling layer 53-2 pools the 10 feature maps M21-M30 to generate 10 feature maps M31-M40 of 256 pixels × 256 pixels. The convolutional layer 52-3 and the pooling layer 53-3 constitute level 3: the convolutional layer 52-3 convolves the 10 feature maps M31-M40 to generate 10 feature maps M41-M50 of 256 pixels × 256 pixels, and the pooling layer 53-3 pools the 10 feature maps M41-M50 to generate 10 feature maps M51-M60 of 128 pixels × 128 pixels. The convolutional layer 52-4 and the pooling layer 53-4 constitute level 4: the convolutional layer 52-4 convolves the 10 feature maps M51-M60 to generate 10 feature maps M61-M70 of 128 pixels × 128 pixels, and the pooling layer 53-4 pools the 10 feature maps M61-M70 to generate 10 feature maps M71-M80 of 64 pixels × 64 pixels. The convolutional layer 52-5 and the pooling layer 53-5 constitute level 5: the convolutional layer 52-5 convolves the 10 feature maps M71-M80 to generate 10 feature maps M81-M90 of 64 pixels × 64 pixels, and the pooling layer 53-5 pools the 10 feature maps M81-M90 to generate 10 feature maps M91-M100 of 32 pixels × 32 pixels. In an alternative embodiment, the pooling layers 53 may be omitted. As the resolution of the feature maps M becomes lower from level 1 to level 5, whenever the longitudinal and lateral sizes of a feature map M are halved, the longitudinal and lateral sizes of the range S are halved as well.
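For illustration only, the following PyTorch sketch mirrors the five-level generating unit just described, with each level applying a resolution-preserving 3 × 3 convolution followed by a 2 × 2 pooling that halves the resolution; the channel count of 10 matches the example above, while the use of ReLU and the RGB input are assumptions.

```python
import torch
import torch.nn as nn

class FeatureMapGenerator(nn.Module):
    """Generates level-1..level-N feature maps whose resolution halves at every level."""
    def __init__(self, channels=10, levels=5, in_channels=3):
        super().__init__()
        stages = []
        for _ in range(levels):
            stages.append(nn.Sequential(
                nn.Conv2d(in_channels, channels, 3, padding=1), nn.ReLU(),  # e.g. layer 52-1
                nn.MaxPool2d(2)))                                           # e.g. layer 53-1, halves H and W
            in_channels = channels
        self.stages = nn.ModuleList(stages)

    def forward(self, image):                    # image Im: (B, 3, 1024, 1024)
        maps = []
        x = image
        for stage in self.stages:
            x = stage(x)                         # 512, 256, 128, 64, 32 pixels per side
            maps.append(x)
        return maps                              # maps[-1] plays the role of the first feature map

# Usage sketch: every level is kept so that an earlier, higher-resolution map
# can later be selected as the second feature map.
maps = FeatureMapGenerator()(torch.zeros(1, 3, 1024, 1024))
```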
The RPN layer 54 detects the object of interest and its position information P from the features of the feature maps M91-M100. The RPN layer 54 performs the function of the acquiring unit: it detects the person OB photographed in the image Im using the first feature map generated at the last of the plurality of stages, and acquires the position information P of the person on the first feature map. In this embodiment, the first feature map consists of the feature maps M91-M100.
Referring to Fig. 4, the selection unit 59 obtains the second feature map from a stage other than the last stage at which the first feature map is obtained. More specifically, the second feature map is the object-of-interest image range S on a feature map M generated at a stage preceding the 5th stage. The selection unit 59 switches its switches so that one of the following can be selected as the second feature map: the object-of-interest image range S (48 pixels × 48 pixels) on the feature maps M11-M20 obtained by the level-1 pooling layer 53-1, the object-of-interest image range S (24 pixels × 24 pixels) on the feature maps M31-M40 obtained by the level-2 pooling layer 53-2, the object-of-interest image range S (12 pixels × 12 pixels) on the feature maps M51-M60 obtained by the level-3 pooling layer 53-3, or the object-of-interest image range S (6 pixels × 6 pixels) on the feature maps M71-M80 obtained by the level-4 pooling layer 53-4.
For example, the attention image range S (12 pixels × 12 pixels) on the feature maps M51-M60 obtained by the 3 rd-level pooling layer 53-3 is selected as the second feature map and is referred to as the attention region R. Since the feature information F does not include information related to the position if the size of the region of interest R is too small, the lower limit value of the size of the region of interest R is determined in advance so that the information related to the position is included in the feature information F. Since the resolution of the feature map M decreases from the 1 st level to the 5 th level, the range S of the attention object (range to be detected) captured in the image Im also decreases from the 1 st level to the 5 th level.
Referring to Fig. 5, the correcting unit 58 corrects the position information P generated by the RPN layer 54, because the position information P is the position information of the object-of-interest image range S on the feature maps M91-M100. The position information P is given, for example, by the coordinates C1, C2, C3 and C4.
In this embodiment, the resolution of the feature maps M51-M60 is higher than that of the feature maps M91-M100. Therefore, the correcting unit 58 shown in Fig. 4 corrects the position information P on the first feature map so that it corresponds to the resolution of the object-of-interest image range (second feature map) on the feature maps M51-M60. The resolution of the object-of-interest image range S is 48 pixels × 48 pixels on the feature maps M11-M20, 24 pixels × 24 pixels on the feature maps M31-M40, 12 pixels × 12 pixels on the feature maps M51-M60, 6 pixels × 6 pixels on the feature maps M71-M80, and 3 pixels × 3 pixels on the feature maps M91-M100.
The correcting unit 58 corrects the position information P on the first feature map so that the area of the region of interest R indicated by the position information P is enlarged by 4 times, as shown in Fig. 6. Specifically, the correcting unit 58 corrects the coordinate C1 to the coordinate C5, the coordinate C2 to the coordinate C6, the coordinate C3 to the coordinate C7, and the coordinate C4 to the coordinate C8. The region of interest R whose position is determined by the coordinates C5, C6, C7 and C8 is centered on the position region formed by the coordinates C1, C2, C3 and C4.
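Purely as an illustration of the correction step, the sketch below enlarges a box about its center so that its area grows fourfold (each side doubled), which is consistent with the C1-C4 to C5-C8 correction described above; the coordinate convention (x1, y1, x2, y2) and the clamping to the feature-map bounds are assumptions of the example.

```python
def correct_position(box, scale=2.0, fmap_size=12):
    """Enlarge (x1, y1, x2, y2) about its center; scale=2.0 multiplies the area by 4."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w, half_h = scale * (x2 - x1) / 2.0, scale * (y2 - y1) / 2.0
    clamp = lambda v: min(max(v, 0.0), float(fmap_size))   # keep inside the feature map
    return (clamp(cx - half_w), clamp(cy - half_h),
            clamp(cx + half_w), clamp(cy + half_h))

# Example: a 3x3 box on the feature map becomes a 6x6 region of interest.
print(correct_position((4.0, 4.0, 7.0, 7.0)))   # -> (2.5, 2.5, 8.5, 8.5)
```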
The correcting unit 58 transfers the first feature map together with the corrected position information P to the RoI pooling layer 55. The RoI pooling layer 55 functions as the extracting unit that extracts the feature information F related to the object of interest from the region of interest R.
The RoI pooling layer 55 pools the regions of interest R into feature information F1 to F10 related to the object of interest, each shaped to the same fixed size, for example 4 pixels × 4 pixels.
The RoI pooling described above is now explained in more detail. As stated, RoI pooling is a process that extracts the region of interest R and turns it into a feature map of fixed size (for example, 4 pixels × 4 pixels); this feature map M becomes the feature information F. For example, when the region of interest R is 12 pixels × 12 pixels and a 4 pixel × 4 pixel feature map (feature information F) is to be produced, the RoI pooling layer 55 divides the 12 pixel × 12 pixel region of interest R into a grid whose cells are 3 pixels × 3 pixels and pools each cell. The same processing is performed even when the size of the region of interest R is not evenly divisible by the grid size.
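For illustration only, the fixed-size pooling just described can be sketched with torchvision's roi_pool operator as below; the use of torchvision, the max pooling inside each cell, and the box format (batch index followed by x1, y1, x2, y2) are assumptions of the example, not statements about the patented implementation.

```python
import torch
from torchvision.ops import roi_pool

def extract_feature_info(feature_map, rois, out_size=4):
    """Pool each region of interest to a fixed out_size x out_size grid."""
    # feature_map: (B, C, H, W); rois: (K, 5) rows of (batch_index, x1, y1, x2, y2)
    return roi_pool(feature_map, rois, output_size=(out_size, out_size))

# Example: pool a 12x12 region of a 10-channel, 128x128 feature map down to 4x4.
fmap = torch.randn(1, 10, 128, 128)
rois = torch.tensor([[0.0, 20.0, 30.0, 32.0, 42.0]])   # one 12x12 box
feature_info = extract_feature_info(fmap, rois)        # shape: (1, 10, 4, 4)
```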
Referring to Fig. 4, the RoI pooling layer 55 sends the feature information F1-F10 to the fully-connected layer 56. The fully-connected layer 56 performs regression analysis on the feature information F1-F10 to generate a regression result RR, which is then sent to the output layer 57. The output layer 57 sends the regression result RR to the image encoding module 65 shown in Fig. 1.
In the present invention, the resolution of the second feature map is higher than the resolution of the object-of-interest range S on the first feature map. Therefore, the feature information F extracted from the region of interest R set on the second feature map contains more information about position than feature information extracted from the object-of-interest image range S on the first feature map. Consequently, if this feature information F extracted from the region of interest R is used, accurate position information of each part of the on-site person can be estimated.
In the invention, because the robot transmits to the remote user terminal only the text coding information and the encoded feature information related to the object of interest extracted from the region of interest, the binary code stream that needs to be transmitted is greatly reduced, thereby saving wireless spectrum resources.
It will be readily understood that the overall solution of the invention as described in the specification and drawings can be designed in a number of different configurations. The more detailed description of various implementations represented in the specification and drawings is therefore not intended to limit the scope of the disclosure, but is merely representative of various exemplary implementations. Although various aspects of the present solution are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated. The described embodiments of the invention are to be considered in all respects only as illustrative and not restrictive. The protection scope of the invention is therefore determined by the claims rather than by the detailed description of the specification, and all changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.

Claims (8)

1. An artificial intelligence robot cooperating with a person, comprising a voice recognition module, a first encoding module, and an image recognition module, wherein the voice recognition module generates a text unit string from received voice data or audio waveforms; the image recognition module performs image recognition through a first convolutional neural network, which is realized by generating, from an input image, feature maps whose resolution becomes lower from the 1st level to the Nth level, and generating a first feature map by using the feature map of the Nth level; detecting an object of interest captured in the image and acquiring position information of the object of interest on the first feature map; correcting the position information so that the corrected position information corresponds to the resolution of a second feature map that is the area range containing the image of the object of interest on a feature map generated before the Nth level; and setting, on the first feature map, a region of interest located at the position indicated by the corrected position information and extracting feature information about the object of interest from the region of interest.
2. The artificial intelligence robot of claim 1, wherein the voice recognition module comprises at least a second convolutional neural network and a transformation module that generates a time-frequency-intensity 3D spectrogram from voice data or audio waveforms to be transmitted; the second convolutional neural network includes a plurality of convolutional layers that divide the speech data or audio waveform into a plurality of words to form a text unit string according to a time-frequency 2D spectrogram in the 3D spectrogram.
3. The artificial intelligence robot of claim 2, wherein the speech recognition module is configured to train a weight for each channel of the convolutional neural network from at least one sampled segment of the speech data or audio waveform.
4. The artificial intelligence robot of claim 3, further comprising a second encoding module that encodes the characterizing information to generate a binary string to be transmitted.
5. The artificial intelligence robot of claim 4, wherein each robot arm of the robot servo is driven by a bearingless motor using a winding having both functions of generating a torque and a magnetic supporting force, which is selectively caused to generate a supporting force or a torque by corresponding to a rotor rotation angle.
6. A communication method, characterized by comprising: generating a text unit string by utilizing a speech recognition module that receives speech data or audio waveforms; generating a first binary character string to be transmitted by utilizing a first encoding module to encode each text unit in the text unit string; generating feature information by utilizing an image recognition module; and generating a second binary character string by utilizing a second encoder to encode the feature information, wherein the image recognition module performs image recognition by utilizing a first convolutional neural network, which is realized by generating, from an input image, feature maps whose resolution becomes lower from the 1st level to the Nth level, and generating a first feature map by utilizing the Nth-level feature map; detecting an object of interest captured in the image and acquiring position information of the object of interest on the first feature map; correcting the position information so that the corrected position information corresponds to the resolution of a second feature map that is the area range containing the image of the object of interest on a feature map generated before the Nth level; and setting, on the first feature map, a region of interest located at the position indicated by the corrected position information and extracting feature information about the object of interest from the region of interest.
7. The communication method according to claim 6, wherein the voice recognition module comprises at least a second convolutional neural network and a transformation module, the transformation module generates a time-frequency-intensity 3D spectrogram from voice data or audio waveforms to be transmitted; the second convolutional neural network includes a plurality of convolutional layers that divide the speech data or audio waveform into a plurality of words to form a text unit string according to a time-frequency 2D spectrogram in the 3D spectrogram.
8. The communication method of claim 7, wherein the speech recognition module is configured to train a weight for each channel of the convolutional neural network based on at least one sampled segment of the speech data or audio waveform.
CN202010368696.9A 2020-05-02 2020-05-02 Artificial intelligent robot cooperating with human and communication method Active CN111508495B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010368696.9A | 2020-05-02 | 2020-05-02 | Artificial intelligence robot cooperating with a human, and communication method (CN111508495B)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010368696.9A | 2020-05-02 | 2020-05-02 | Artificial intelligence robot cooperating with a human, and communication method (CN111508495B)

Publications (2)

Publication Number Publication Date
CN111508495A CN111508495A (en) 2020-08-07
CN111508495B (en) 2021-07-20

Family

ID=71874971

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010368696.9A | Active | 2020-05-02 | 2020-05-02 | Artificial intelligence robot cooperating with a human, and communication method (CN111508495B)

Country Status (1)

Country Link
CN (1) CN111508495B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108281139A (en) * 2016-12-30 2018-07-13 深圳光启合众科技有限公司 Speech transcription method and apparatus, robot
CN108229268A (en) * 2016-12-31 2018-06-29 商汤集团有限公司 Expression Recognition and convolutional neural networks model training method, device and electronic equipment
CN106920545B (en) * 2017-03-21 2020-07-28 百度在线网络技术(北京)有限公司 Speech feature extraction method and device based on artificial intelligence
CN108269571B (en) * 2018-03-07 2024-01-09 佛山市云米电器科技有限公司 Voice control terminal with camera function
CN110738984B (en) * 2019-05-13 2020-12-11 苏州闪驰数控系统集成有限公司 Artificial intelligence CNN, LSTM neural network speech recognition system

Also Published As

Publication number Publication date
CN111508495A (en) 2020-08-07


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventors after: Guo Zhenfeng, Xi Yuedong, Xi Yuejun, Li Min, Lai Chunli, Zhang Haibin, Lian Zhixuan, Wang Zhongbin, Min Songyang, Yang Jiaqi, Zhang Yujia, Ma Zhi

Inventors before: Lian Zhixuan, Song Weiqi, Wang Zhongbin, Min Songyang, Yang Jiaqi, Zhang Yujia, Ma Zhi, Xi Yuedong, Xi Yuejun, Li Min

GR01 Patent grant
CB03 Change of inventor or designer information

Inventors after: Zhang Yi, Li Min, Guo Zhenfeng, Lai Chunli, Zhang Haibin, Lian Zhixuan, Xi Yuedong, Yang Jiaqi, Zhang Yujia, Ma Zhi, Wang Zhongbin, Min Songyang, Xi Yuejun

Inventors before: Guo Zhenfeng, Xi Yuedong, Xi Yuejun, Li Min, Lai Chunli, Zhang Haibin, Lian Zhixuan, Wang Zhongbin, Min Songyang, Yang Jiaqi, Zhang Yujia, Ma Zhi

TR01 Transfer of patent right

Effective date of registration: 20240718

Address after: Room A182, 1st Floor, Building 3, No. 18 Keyuan Road, Daxing District, Beijing 102600

Patentee after: Zhiren (Beijing) Technology Co.,Ltd.

Country or region after: China

Address before: 102200 Room 608, block C, building 2, courtyard 42, Qibei Road, Beiqijia Town, Changping District, Beijing

Patentee before: BEIJING HUALANDE TECHNOLOGY CONSULTING SERVICE CO.,LTD.

Country or region before: China