CN115145402A - Intelligent toy system with network interaction function and control method - Google Patents
- Publication number
- CN115145402A (application number CN202211063424.3A)
- Authority
- CN
- China
- Prior art keywords
- user
- data
- voice
- module
- characteristic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/017—Gesture based interaction, e.g. based on a set of recognized hand gestures
-
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63H—TOYS, e.g. TOPS, DOLLS, HOOPS OR BUILDING BLOCKS
- A63H33/00—Other toys
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/56—Extraction of image or video features relating to colour
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
Abstract
The invention discloses an intelligent toy system with a network interaction function, and a control method. The system comprises a feature acquisition module, a learning module, a processing module, a tracking module and an identification module. The feature acquisition module collects user features; the learning module trains a feature model inside the processing module from the user feature data and classifies the feature data; after receiving user feature data, the processing module selects the interaction data matched to the current feature data according to the feature model and controls the toy to interact with the user on that basis. Because the feature model is trained continuously while user features are collected, the system can recognize multiple users and interact differently according to each user's interests and hobbies, giving it wider applicability.
Description
Technical Field
The invention relates to the technical field of toy control systems, in particular to an intelligent toy system with a network interaction function and a control method.
Background
A toy, understood broadly, can be a natural object: non-artificial things such as sand, stones, mud and tree branches can all serve as toys, and a toy is not limited to articles sold on the street for play. Anything that can be played with, watched, listened to or touched may be called a toy. Toys suit not only children but also young and middle-aged people; they are tools for opening a skylight to wisdom, making people clever and quick-witted.
With the development of the times, intelligent toys have appeared as a market segment of the toy category. They integrate IT techniques with traditional toys, forming a novel class of toy different from conventional ones; because they can interact with people, their superior interactivity has made them widely loved.
The prior art has the following defect: existing intelligent toy control systems can only respond to fixed input control instructions when interacting with users. Because users range from children to young adults to the middle-aged and elderly, such a control system cannot choose interactions matched to user characteristics (for example, after sampling voice data it can only respond to the content of the speech, without judging whether the current user is a child, a young adult, or middle-aged or elderly), so its applicability is poor.
Disclosure of Invention
The invention aims to provide an intelligent toy system with a network interaction function and a control method, so as to solve the defects in the background technology.
In order to achieve the above purpose, the invention provides the following technical scheme: the intelligent toy system with the network interaction function comprises a feature acquisition module, a learning module, a processing module, a tracking module and an identification module;
the feature acquisition module collects user features; the learning module trains a feature model inside the processing module from the user feature data and classifies the feature data; after receiving user feature data, the processing module selects the interaction data matched to the current feature data according to the feature model and controls the toy to interact with the user based on that interaction data.
Preferably, the feature acquisition module comprises a gesture acquisition unit, the gesture acquisition unit comprises skin color extraction, fingertip extraction and finger number identification, the skin color extraction is used for extracting the skin color of the hand, and the fingertip extraction and the finger number identification are used for identifying the edge of the hand and the number of fingers.
Preferably, the fingertip extraction and finger-number identification operate on a binary image of the palm region via its convex hull, the convex polygon formed by connecting the outermost points of the binary image. The palm-center coordinates of the convex hull are extracted through the following formula:

x0 = (1/N) * Σ_{i=1}^{N} x_i,  y0 = (1/N) * Σ_{i=1}^{N} y_i

In the above formula, (x_i, y_i) are the coordinate values of the i-th pixel point in the gesture area, N is the total number of pixel points in the gesture area, and (x0, y0) are the coordinates of the palm center.
Preferably, the feature acquisition module further includes a voice acquisition unit comprising text analysis, prosody processing and speech synthesis: text analysis processes the input text, prosody processing predicts the prosodic features of the synthesized speech, and speech synthesis processes the text features and prosody model parameters obtained from the first two stages.
Preferably, the voice recognition unit recognizes voice through the following steps:
filtering out secondary information and environmental noise in an original voice signal;
analyzing a voice waveform and extracting a voice time sequence characteristic sequence;
and inputting the obtained voice characteristic parameters into an acoustic model for continuous training to obtain a model matched with a training output signal.
Preferably, the feature data includes user gesture data and user voice data acquired by the feature acquisition module.
Preferably, the system further includes the tracking module: while the feature acquisition module acquires user features, the tracking module continuously tracks the user so that feature acquisition can follow the user.
Preferably, the tracking module is a tracking camera: after the feature acquisition module begins collecting user features, the tracking camera delimits a user activity area and continuously tracks the user within it so that feature acquisition continues; once the user moves out of the activity area, the tracking camera stops tracking.
Preferably, the processing module comprises a processor and a signal transceiver, the signal transceiver is electrically connected with the processor, the processor is used for processing the characteristic data, and the processor is wirelessly connected with the mobile phone terminal through the signal transceiver based on a WiFi network.
The invention also provides a control method of the intelligent toy with the network interaction function, which comprises the following steps:
s1: collecting user characteristic data;
s2: training a feature model according to the user feature data;
s3: classifying the feature data;
s4: and selecting interactive data matched with the current characteristic data according to the characteristic model, and interacting with the control toy and the user based on the interactive data.
In the technical scheme, the invention provides the following technical effects and advantages:
1. According to the invention, the feature acquisition module collects user features; the learning module trains the feature model in the processing module from the user feature data and classifies the feature data; after receiving user feature data, the processing module selects the interaction data matched to the current feature data according to the feature model and controls the toy to interact with the user based on it, so the system can recognize multiple users and match its interactions to each user's interests.
2. The system extracts a binary image of the palm region through its convex hull, the convex polygon formed by connecting the outermost points, which contains every point in the point set. OpenCV's convexHull function is used to locate each vertex of this polygon and so determine the fingertip positions; through these steps the positions of the palm and fingertips can be accurately identified, and the number of fingers is obtained by counting the marked fingertip circles, improving the precision of the system's feature-data acquisition.
3. By analyzing the input text data, the system automatically generates continuous speech according to rules. The advantage of the pronunciation-rule synthesis method is that once accurate and fine-grained pronunciation rules are established, sentences with an unlimited vocabulary can be synthesized, giving the system great plasticity and adaptability.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below are only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a block diagram of the system of the present invention.
FIG. 2 is a flow chart of the gesture capturing unit according to the present invention.
FIG. 3 is a flow chart of the speech recognition unit of the present invention.
Fig. 4 is a schematic diagram of an overall framework of the voice collecting unit according to the present invention.
FIG. 5 is a schematic diagram of the convolutional neural network of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
Referring to fig. 1, the intelligent toy system with network interaction function according to this embodiment includes a feature collecting module, a learning module, a processing module, a tracking module, and an identifying module;
wherein:
a characteristic acquisition module: the system is used for collecting user characteristics;
a learning module: training a feature model inside the processing module according to the user feature data, and classifying the feature data;
a processing module: used for receiving the user feature data, selecting the interaction data matched to the current feature data according to the feature model, and controlling the toy to interact with the user based on the interaction data;
a tracking module: in the process of acquiring the user characteristics by the characteristic acquisition module, the tracking module continuously tracks the user characteristics, so that the characteristic acquisition module tracks and acquires the user characteristics;
an identification module: identifies the type of the feature data and assists the learning module in classifying it. Because the toy control system continuously trains the feature model on user feature data while collecting user features, the system can recognize multiple users and interact differently according to each user's interests and hobbies, giving it wider applicability.
The processing module establishes the feature model based on the function polyfit(); building a model with polyfit() belongs to the prior art and is not detailed here. The feature data comprise the user gesture data and user voice data acquired by the feature acquisition module.
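The patent names polyfit() but gives no concrete model; a minimal sketch with numpy.polyfit, using illustrative sample values and a polynomial degree that are assumptions rather than values from the text, could look like this:

```python
import numpy as np

def fit_feature_model(samples, responses, degree=1):
    """Fit a polynomial mapping feature values to interaction scores."""
    return np.polyfit(samples, responses, degree)

def predict(model, value):
    """Evaluate the fitted polynomial at a new feature value."""
    return np.polyval(model, value)

# Illustrative data: a linear relation y = 2x + 1 recovered from three samples.
model = fit_feature_model([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
```

In practice the samples would be the classified user feature data and the responses the matched interaction data, but the patent does not specify that mapping.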
The processing module comprises a processor and a signal transceiver electrically connected to it. The processor processes the feature data; through the signal transceiver's WiFi-based wireless connection, a user can achieve network interaction with a mobile phone terminal and control the processor's operation through a mobile phone APP.
The tracking module is a tracking camera. After the feature acquisition module begins collecting user features, the tracking camera delimits an activity area around the user and continuously tracks the user within it, so that the feature acquisition module continuously acquires features; once the user moves out of the activity area, the tracking camera stops tracking.
The feature acquisition module comprises a gesture acquisition unit;
wherein:
gesture collection unit: the gesture collection unit comprises skin color extraction, fingertip extraction and finger number identification, wherein the skin color extraction is used for extracting the skin color of a hand, and the fingertip extraction and the finger number identification are used for identifying the edge of the hand and the number of fingers;
(1) Skin color extraction: the gesture acquisition unit extracts hand skin color in the YCrCb color space. YCrCb separates chroma from luminance, clusters skin color well and is little affected by brightness changes, so it distinguishes skin regions reliably. The distribution of human skin color in the YCrCb chroma space is roughly 77 ≤ Cb ≤ 127 and 133 ≤ Cr ≤ 173, and this range is taken as the threshold for skin-color segmentation. The standard conversion between the RGB and YCrCb color spaces is:

Y = 0.299R + 0.587G + 0.114B
Cr = 0.713 (R − Y) + 128
Cb = 0.564 (B − Y) + 128
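The per-pixel skin test above can be sketched directly from the standard BT.601 conversion and the Cb/Cr thresholds in the text (in a real system OpenCV's cvtColor would perform the conversion on whole frames):

```python
def rgb_to_ycrcb(r, g, b):
    """Standard BT.601 RGB -> YCrCb conversion for one pixel."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cr = 0.713 * (r - y) + 128
    cb = 0.564 * (b - y) + 128
    return y, cr, cb

def is_skin(r, g, b):
    """True if the pixel falls inside the skin-colour chroma box
    77 <= Cb <= 127, 133 <= Cr <= 173 given in the description."""
    _, cr, cb = rgb_to_ycrcb(r, g, b)
    return 133 <= cr <= 173 and 77 <= cb <= 127
```

For example, a typical skin tone such as (R, G, B) = (200, 150, 120) lands inside the box, while saturated green does not.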
considering that background noise may interfere with hand extraction in the actual image acquisition process, therefore, the influence similar to skin color background noise needs to be eliminated, openCV provides a function for searching for a connected region, and can return the labels and pixel points of each connected region.
And in order to avoid the situation that the background area similar to skin color is mistakenly understood as the gesture area by the gesture acquisition unit when the hand is not in the image capturing range, the gesture area can be correctly divided without marking when the maximum connected area is smaller than 5000 pixel points.
(2) Fingertip extraction and finger number identification: the binary image of the palm region extracted via the convex hull is represented in the computer as a point set on a two-dimensional plane. The convex hull is the convex polygon formed by connecting the outermost points and contains every point of the set; OpenCV provides the convexHull function, which finds each vertex of this polygon to determine the fingertip positions.
In actual operation some fingertips are marked repeatedly and some irrelevant regions are also marked, which makes it difficult to count the fingertips accurately, so points that do not meet the requirements must be removed: when two convex-hull vertices are less than 500 pixel points apart only one of them is marked, and regions below the palm-center coordinate are not marked. The palm-center extraction formula is:

x0 = (1/N) * Σ_{i=1}^{N} x_i,  y0 = (1/N) * Σ_{i=1}^{N} y_i

In the above formula, (x_i, y_i) are the coordinate values of the i-th pixel point in the gesture area, N is the total number of pixel points in the gesture area, and (x0, y0) are the palm-center coordinates. Through these steps the positions of the palm and fingertips can be accurately identified, and the number of fingers is obtained by counting the marked fingertip circles.
Referring to fig. 2, moving-gesture information is recognized by the frame-difference method. The program maintains two variables, Hand and Count: Hand marks whether a hand is currently captured, and Count records how many consecutive frames the hand has been present. The following cases arise:
(1) (Count =0, hand = 0): no hand appears in the current image and the previous frame of image, which indicates that no hand enters or exits the camera capturing area in the period of time and no processing is performed.
(2) (Count =0, hand = 1): no hand appears in the current image, and the hand appears in the last frame, which indicates that the hand just leaves the camera capturing area, and the image information is saved.
(3) (Count =1, hand = 1): the current image has a hand present and is the first frame image, indicating that the hand has just entered the image capture area, saving the image information.
(4) (Count = K, hand = 1): indicating that the hand is always in the image capture area.
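The Count/Hand bookkeeping in cases (1)-(4) can be sketched as a small state machine; per-frame image content is abstracted here to a boolean "hand present" flag, which is an assumption for illustration:

```python
def track_hand(frames):
    """Return (enter_frames, leave_frames) indices for a boolean sequence
    of per-frame hand-presence flags."""
    count, hand = 0, 0
    enters, leaves = [], []
    for i, present in enumerate(frames):
        if present:
            count += 1
            if count == 1:          # case (3): hand just entered, save frame
                enters.append(i)
            hand = 1                # case (4) while count stays above 1
        else:
            if hand == 1:           # case (2): hand just left, save frame
                leaves.append(i)
            count, hand = 0, 0      # case (1): nothing in view, no processing
    return enters, leaves
```

The saved enter/leave frames are exactly the first and last frames used below for direction recognition.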
Combining case (2) and case (3), the four gesture movement directions (up, down, left and right) can be recognized by judging how the palm-center coordinates change between the first and last frames. The recognition process can be expressed as:

θ = arctan( (y_e − y_s) / (x_e − x_s) )

In the above formula, (x_s, y_s) and (x_e, y_e) are the palm-center coordinates of the first and last frames respectively, and θ is the angle between the two points; judging θ gives the general direction of the gesture movement. Using only the first and last frames, instead of the whole series of image frames from when the hand enters the capture area to when it leaves, also reduces programming complexity, and by computing the tangent of the angle between the palm coordinates of the two frames, up to eight movement directions can be recognized with good results.
For case (4), two kinds of dynamic gesture, open palm and fist, are recognized by the inter-frame difference method. An open palm is defined as five fingers detected in W consecutive frames, with the count unchanged in the subsequent frames; a fist is defined as five fingers detected in consecutive frames followed by a drop of T pixel points in the palm region within the next Y frames. The values of W, Y and T must be set according to the specific conditions of the system; here the thresholds are W = Y = 20 and T = 1000. If any frame fails its condition, the counters are immediately cleared and counting restarts, which completes the fist recognition.
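The palm-then-fist rule can be sketched with the thresholds from the text (W = Y = 20, T = 1000); the per-frame finger counts and palm-region pixel areas are assumed to be precomputed by the earlier steps:

```python
def detect_fist(finger_counts, region_areas, w=20, y=20, t=1000):
    """True if five fingers are held for w consecutive frames and the
    palm-region pixel count then drops by at least t within y frames."""
    streak = 0
    for i, fingers in enumerate(finger_counts):
        streak = streak + 1 if fingers == 5 else 0
        if streak >= w:
            base = region_areas[i]
            # look for an area drop of >= t pixels in the next y frames
            for j in range(i + 1, min(i + 1 + y, len(region_areas))):
                if base - region_areas[j] >= t:
                    return True
            streak = 0  # condition failed: clear and count again
    return False
```

A constant area sequence never triggers, while a sharp drop after 20 palm frames does.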
Example 2
The feature acquisition module also comprises a voice acquisition unit;
the voice acquisition unit comprises text analysis, rhythm processing and voice synthesis;
(1) Text analysis processes the input text so that the computer understands it: which pronunciation each word should take, how it should be pronounced, and where the words, phrases and sentences lie. The text analysis module first normalizes the input text, checking spelling errors and filtering out non-standard or unpronounceable words; it then performs word segmentation according to language and grammar rules to determine word boundaries, uses a dictionary to fix the pronunciation of polyphonic characters and proper nouns, and finally determines pronunciation tone and the changes of tone of voice at different moments from the textual context and the punctuation marks appearing at different positions in the text.
(2) The prosody processing module mainly predicts the prosodic features of the synthesized speech; the corresponding prosodic information in speech (such as intonation, rhythm and accent) is expressed through prosodic features (such as fundamental frequency, duration and spectrum). The module first collects a large amount of speech and text data to build a database, then extracts specific prosodic parameters according to the prosodic features in the speech, and finally feeds these parameters into a prosody model for training, continuously refining the model parameters.
(3) The speech synthesis is further processing after text characteristics and prosody model parameters are obtained through text analysis and prosody control, the speech synthesis module is realized through an acoustic model, and the model synthesizes the final speech meeting the requirements by using a parameter synthesizer.
The voice synthesis method comprises parameter synthesis, splicing synthesis and pronunciation rule-based synthesis.
The parametric synthesis method is also called as an analysis synthesis method, and an acoustic model is usually generated by simulating the vocal tract characteristics of the human mouth, and the process of synthesizing the speech is as follows:
(1) Recording the recording of all possible pronunciations of human according to a certain language, analyzing a voice signal according to a certain method, and extracting acoustic parameters of the voice;
(2) And during synthesis, proper acoustic parameters are selected from the sound library according to the requirement of the synthesized sound, and are sent to a parameter synthesizer together with the prosodic parameters obtained from the prosodic model, and finally the synthesized voice is obtained.
The advantage of the parametric speech synthesis method is that the acoustic library stores the encoded acoustic parameters, so the required storage space is generally small, and the whole speech synthesis system can adapt to a very wide prosodic feature range.
The splicing (concatenative) synthesis method differs from parametric synthesis, which stores acoustic parameters of speech: its sound library stores the natural speech waveforms of the synthesis units. At synthesis time, suitable splicing units are extracted from the library and formed into continuous synthesized speech by a splicing algorithm with prosody modification. Because the splicing units come from the sound library, the library capacity is large, although careful design reduces the complexity; and because the units are natural speech waveforms rather than encoded acoustic parameters, the synthesized speech surpasses the parametric method in timbre quality and naturalness.
Referring to fig. 3, the voice recognition unit recognizes the voice, including the following steps:
(1) Preprocessing: secondary information, environmental noise and other interfering factors are filtered out of the original speech signal. This compresses the information, reducing the system's computation and memory requirements, and also greatly reduces its error rate. Preprocessing is generally divided into several stages, such as filtering and sampling, pre-emphasis, framing, windowing, and endpoint detection.
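The pre-emphasis, framing and windowing stages listed above can be sketched as follows; the frame length, hop size and 0.97 coefficient are common defaults, not values taken from the patent:

```python
import numpy as np

def preprocess(signal, frame_len=400, hop=160, alpha=0.97):
    """Pre-emphasise, split into overlapping frames, apply a Hamming window."""
    # pre-emphasis: y[t] = x[t] - alpha * x[t-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # build an index matrix so each row selects one overlapping frame
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = (np.arange(frame_len)[None, :]
           + hop * np.arange(n_frames)[:, None])
    return emphasized[idx] * np.hamming(frame_len)

# one second of audio at 16 kHz yields 98 windowed frames of 400 samples
frames = preprocess(np.zeros(16000))
```

Endpoint detection would then discard leading and trailing frames whose energy falls below a threshold.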
(2) Speech feature extraction analyzes the speech waveform and extracts a time-sequential feature series. Extracting the speech feature parameters is the core of a speech recognition system and determines the final recognition result; the feature parameters should have the following properties:
(2.1) they express speech characteristics such as pronunciation and vocal-tract features well;
(2.2) the extracted feature vectors have dimensionality as low as possible, and the parameter vectors of each order are well mutually independent;
(2.3) they can be computed with efficient algorithms, so that the system can perform recognition in real time.
(3) Acoustic model and pattern matching: the extracted speech feature parameters are fed into an acoustic model for continuous training until an optimal model with the maximum probability of matching the training output signal is obtained; during recognition, the features of an unknown speech signal are input into this acoustic model and compared and matched against it to produce the final recognition result.
(4) Language model and language processing: the language model is either a grammar network formed from the recognized voice commands or a statistical language model; language processing analyzes grammar and semantics to determine the correctness of the recognition result.
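The preprocessing and feature-extraction steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the frame length, hop size, pre-emphasis coefficient, and the choice of log-energy and zero-crossing rate as stand-in features are all assumptions for demonstration.

```python
import numpy as np

def preprocess(signal, frame_len=400, hop=160, alpha=0.97):
    """Step (1): pre-emphasis, framing, and windowing of a raw speech signal."""
    # Pre-emphasis boosts high frequencies: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing: split the signal into overlapping frames
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop:i * hop + frame_len] for i in range(n_frames)])
    # Windowing tapers the frame edges to reduce spectral leakage
    return frames * np.hamming(frame_len)

def extract_features(frames):
    """Step (2): a per-frame time-series feature sequence
    (log-energy and zero-crossing rate as illustrative features)."""
    log_energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.column_stack([log_energy, zcr])

x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone at 16 kHz
features = extract_features(preprocess(x))
print(features.shape)  # one (log-energy, ZCR) pair per frame
```

The resulting feature sequence would then be fed to the acoustic model in step (3).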
The rule-based synthesis method generates speech from pronunciation rules: the system stores the acoustic parameters of the smallest constituent units of speech, the composition rules between phonemes, syllables, words, phrases, or sentences, and various control rules for prosodic information such as intonation, stress, and rhythm.
The voice acquisition unit is implemented with an FM1288 voice processing chip. The voice processing chip must handle noise and echo during a conversation in order to provide high-quality calls; the FM1288 chip uses the principle of acoustic echo cancellation to remove ambient noise and acoustic echo, and it is compatible with a wide range of host processors, which simplifies system integration. The key features of the FM1288 chip are:
an integrated digital signal processor (DSP), including a hardware arithmetic accelerator, ROM, and RAM;
an integrated analog-to-digital converter (ADC) and a digital-to-analog converter (DAC);
providing an IIS/PCM multiplexed digital audio interface;
providing a programmable gain amplifier (PGA) and dynamic range control (DRC);
a user-selectable dual-microphone input is provided to support full-duplex, echo-free communication, and noise suppression of upstream and downstream voice signals.
Referring to fig. 4, the voice acquisition unit uses an X1000 main chip as its core and expands peripheral I/O interfaces to connect keys and indicator lights. It exchanges voice and data with a BCM43438 Bluetooth chip through PCM and UART interfaces respectively, and uses the dedicated echo-cancellation and noise-suppression chip FM1288 as the voice processing unit, which, together with an external microphone and an audio power amplifier, forms an efficient and stable audio input/output system.
Example 3
The learning models of the learning module fall into three types: supervised learning, unsupervised learning, and semi-supervised learning, wherein:
Supervised learning algorithms: supervised learning builds a prediction model by training on labeled data; among classification methods, the supervised learning models include the support vector machine (SVM) model and the artificial neural network (ANN) model;
the support vector machine (SVM) model first maps the original data samples into a higher-dimensional space through a kernel function, and then constructs a separating hyperplane that maximizes the distance to the nearest data samples of each class; the specific formulation is:
$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i \qquad \text{s.t.}\quad y_i\left(w^{\mathsf T}\phi(x_i)+b\right) \ge 1-\xi_i,\quad \xi_i \ge 0,$$
where $x_i$ denotes the $i$-th data sample and $y_i$ its corresponding label; $w$ is the normal vector of the hyperplane, determining its direction; $C$ is the penalty factor; the slack variables $\xi_i$ measure constraint violations; $\phi$ is the mapping induced by the kernel function $K(x_i,x_j)=\phi(x_i)^{\mathsf T}\phi(x_j)$; and $b$ determines the offset of the hyperplane from the origin along the normal vector. The concept above is thus translated into a convex quadratic programming problem.
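The soft-margin SVM objective above (penalty factor, slack on violated margins) can be sketched numerically with per-sample sub-gradient descent and a linear kernel. The optimizer, learning rate, epoch count, and toy data are illustrative assumptions, not the patent's method:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Sub-gradient descent on (1/2)||w||^2 + C * sum of hinge losses."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in range(n):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:                      # margin constraint violated: hinge active
                w += lr * (C * y[i] * X[i] - w / n)
                b += lr * C * y[i]
            else:                               # only the regularizer contributes
                w -= lr * w / n
    return w, b

# Two linearly separable classes with labels +1 / -1
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
preds = np.sign(X @ w + b)
```

For nonlinear problems, the kernel trick replaces the dot product, as the formulation above indicates.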
The artificial neural network (ANN) model has many variants depending on its specific structure, including the error back-propagation network, the extreme learning machine, and the convolutional neural network; the simplest of the conventional networks is the multilayer perceptron.
The supervised learning algorithms further include regression analysis, which fits the relationship between a dependent variable and independent variables to predict the dependent variable's future trend.
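As a small illustration of regression analysis (the data points here are made up), a least-squares line fit lets the dependent variable's trend be extrapolated:

```python
import numpy as np

# Fit y = a*x + b by least squares on synthetic, noise-free data;
# the fitted coefficients predict the dependent variable at future x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
a, b = np.polyfit(x, y, 1)
forecast = a * 4.0 + b   # predicted value at the next time step
```

With noisy real data the fit minimizes the squared residuals rather than passing through every point.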
Unsupervised learning algorithms: samples are divided into different groups and subsets according to their similar attributes; unsupervised learning methods include density-based, partition-based, hierarchy-based, and grid-based clustering, among others;
density-based clustering: two parameters must first be defined, the neighborhood radius ε of a sample point and the minimum number of points MinPts required to form a dense region.
First, for each sample point, the points in its ε-neighborhood are found and the core objects are identified: a point is regarded as a core object if its ε-neighborhood contains at least MinPts samples;
secondly, the connected components of the core objects are found, ignoring all non-core points;
finally, a core object is selected from the current set of core objects and, based on its ε-neighborhood, non-core points are assigned to the cluster closest to them; the remaining points are treated as noise.
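The three density-based steps above can be sketched compactly. This is a minimal DBSCAN-style implementation; the ε, MinPts values and the toy data are illustrative assumptions:

```python
import numpy as np

def dbscan(X, eps=1.0, min_pts=3):
    """Find core objects, connect them, attach border points; label noise -1."""
    n = len(X)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = [np.where(dist[i] <= eps)[0] for i in range(n)]
    # Step 1: core objects have at least min_pts samples in their eps-neighborhood
    core = [i for i in range(n) if len(neighbors[i]) >= min_pts]
    labels = np.full(n, -1)
    cluster = 0
    for c in core:
        if labels[c] != -1:
            continue
        # Step 2: expand the connected component of core objects reachable from c
        stack, labels[c] = [c], cluster
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster      # step 3: border points join the cluster
                    if q in core:
                        stack.append(q)
        cluster += 1
    return labels

# Two dense groups far apart: expect two clusters, no noise
X = np.vstack([np.zeros((5, 2)), np.ones((5, 2)) * 10])
labels = dbscan(X, eps=1.0, min_pts=3)
```

Points that end with label -1 after all core objects are processed are the noise points the text mentions.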
Partition-based clustering: samples are divided into different categories according to the characteristics of the sample points and data similarity, usually measured by the distance between points, i.e., sample points in the same category should be as close as possible and sample points in different categories as far apart as possible; the specific formulas are:
in the above-mentioned formula, the compound has the following structure,representing data pointsThe category to which the user belongs to is,as a result of the data points,is the center of the cluster, and,are initial data points; wherein the formula (2) is centered for each classThe method comprises the following steps of (1) repeatedly calculating, wherein the specific processing logic is as follows: firstly, randomly selectingUsing sample point as initial clustering centerAnd repeatedly executing the formula (1) and the formula (2) until the termination condition is reached.
Semi-supervised learning algorithms: a large amount of unlabeled data is used to generate pseudo-labeled data, which, together with a small amount of truly labeled data, trains a classifier; taking the SVM as the model, once all unlabeled samples have been assigned labels, the hyperplane that maximizes the margin is found. Semi-supervised learning algorithms include self-training, co-training, and the semi-supervised SVM.
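The self-training idea above can be sketched with a deliberately simple base classifier (nearest centroid, with class labels 0..k-1); the base model, round count, and data are illustrative assumptions rather than the patent's method:

```python
import numpy as np

def self_train(X_lab, y_lab, X_unlab, rounds=5):
    """Self-training: pseudo-label the unlabeled pool with the current
    classifier, then retrain on labeled + pseudo-labeled data."""
    X, y = X_lab.copy(), y_lab.copy()
    for _ in range(rounds):
        # Base classifier: one centroid per class
        centroids = np.array([X[y == c].mean(axis=0) for c in np.unique(y)])
        # Pseudo-label each unlabeled point by its nearest centroid
        pseudo = np.argmin(
            np.linalg.norm(X_unlab[:, None] - centroids[None, :], axis=2), axis=1)
        # Retrain on the union of real and pseudo labels
        X = np.vstack([X_lab, X_unlab])
        y = np.concatenate([y_lab, pseudo])
    return y[len(y_lab):]   # final pseudo-labels for the unlabeled pool

X_lab = np.array([[0.0, 0.0], [10.0, 10.0]])
y_lab = np.array([0, 1])
X_unlab = np.array([[1.0, 1.0], [9.0, 9.0], [0.0, 1.0], [10.0, 9.0]])
pseudo = self_train(X_lab, y_lab, X_unlab)
```

With an SVM base model, the retraining step would instead refit the maximum-margin hyperplane after each pseudo-labeling round, as the text describes.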
In summary, supervised learning generally requires lengthy tuning, with parameters and model frameworks selected by repeated trial;
the theoretical basis of semi-supervised learning lies in the continuity and consistency of the distributions of labeled and unlabeled data, which the learning module can exploit for effective structural learning, enhancing the model's representation capability;
therefore, in this embodiment, an unsupervised learning algorithm is preferably used as the learning algorithm of the learning module of the training module, so that the feature data can be identified quickly.
Example 4
The recognition module recognizes the feature data based on a deep learning algorithm; deep learning is a learning network that stacks additional hidden layers on top of a neural network.
the processing logic of the deep learning algorithm is as follows:
Suppose a system L has n layers (L1, ..., Ln), with input I and output O; the process can be expressed as I => L1 => L2 => ... => Ln => O. If the output O equals the input I, i.e., no information is lost after the input I passes through the system, then no information is lost at any layer Li either, and every layer Li is another representation of the original information (the input I);
thus a series of hierarchical features of the input I, namely L1, ..., Ln, is obtained automatically; deep learning stacks multiple such layers, with the output of one layer serving as the input of the next, realizing a hierarchical representation of the input information.
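The layer-stacking idea above can be sketched as a forward pass that keeps every intermediate representation. The layer widths, tanh activation, and random weights are illustrative assumptions:

```python
import numpy as np

def layer(x, W, b):
    """One hidden layer Li: affine map followed by a nonlinearity."""
    return np.tanh(x @ W + b)

def deep_forward(x, params):
    """Stack layers: the output of Li is the input of Li+1, yielding a
    hierarchy of representations [I, L1(I), L2(L1), ..., Ln]."""
    reps = [x]
    for W, b in params:
        reps.append(layer(reps[-1], W, b))
    return reps

rng = np.random.default_rng(0)
dims = [8, 16, 16, 4]   # I -> L1 -> L2 -> O
params = [(rng.standard_normal((i, o)) * 0.1, np.zeros(o))
          for i, o in zip(dims, dims[1:])]
reps = deep_forward(rng.standard_normal(8), params)
print([r.shape for r in reps])  # [(8,), (16,), (16,), (4,)]
```

Training would then adjust each (W, b) so that the final representation preserves the information needed for the task.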
The deep learning algorithm includes the convolutional neural network, which exploits spatial relationships to reduce the number of parameters to be learned and thereby improve on the training performance of the general feed-forward BP algorithm. A small portion of the image (a local receptive region) serves as the input of the lowest layer of the hierarchy; information is passed through successive layers, each of which extracts the most salient features of the observed data through digital filters, yielding salient features that are invariant to translation, scaling, and rotation.
The convolutional neural network is a multilayer artificial neural network; each layer consists of several two-dimensional planes, and each plane consists of multiple independent neurons. The specific processing logic is as follows:
as shown in fig. 5, the input image is convolved with three filters plus an additive bias, producing three feature maps in layer C1. Groups of four adjacent pixels in each feature map are then summed and averaged, weighted and biased, and passed through an activation function (the Sigmoid function) to obtain the three feature maps of layer S2. These maps are filtered again to obtain layer C3, which produces S4 in the same way S2 was produced from C1; finally, the pixel values are rasterized into a one-dimensional vector and fed into a conventional neural network to obtain the output;
the convolutional neural network is characterized by local receptive fields, weight sharing, and sub-sampling in time and space, wherein,
local receptive field: some local features of the sample data can be found through the perception of the local area;
weight sharing: each layer in the convolutional neural network is composed of a plurality of feature maps, each feature map comprises a plurality of neural units, all the neural units of the same feature map share the same convolutional kernel (namely weight), and one convolutional kernel usually represents one class of features of a sample;
spatial sampling: the purpose of sampling is primarily to blur the specific location of a feature, because once a feature of the sample has been found, its exact position no longer matters; the system is only concerned with that feature's position relative to other features.
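The three CNN building blocks above can be sketched with a single shared kernel sliding over local receptive fields, followed by 2x2 average pooling (the kernel and image values are illustrative):

```python
import numpy as np

def conv2d(img, kernel):
    """'Valid' 2-D convolution: one shared kernel (weight sharing)
    applied to every local receptive field of the image."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def avg_pool2x2(fmap):
    """Spatial sampling: average adjacent 2x2 pixels, halving each dimension."""
    h, w = fmap.shape
    return fmap[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

img = np.arange(36, dtype=float).reshape(6, 6)
fmap = conv2d(img, np.ones((3, 3)) / 9)   # 6x6 image -> 4x4 feature map
pooled = avg_pool2x2(fmap)                # 4x4 -> 2x2, as in a C -> S layer pair
```

The pooling step is what makes the output insensitive to small shifts of a feature's exact position, as described above.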
In this embodiment, a convolutional neural network is used as a deep learning algorithm of the recognition model, so that:
(1) The input image can be well matched with the topological structure of the network;
(2) Feature extraction and pattern classification can be performed simultaneously and generated in network training;
(3) The weight sharing can reduce the training parameters of the network, so that the neural network structure becomes simpler and the adaptability is stronger.
The feature data (gesture images) are analyzed accurately through the deep learning algorithm: the feature acquisition module acquires single-frame images and transmits them to the trained deep learning model, which performs target detection on the frame content using the Yolo algorithm and analyzes it, thereby improving the detection accuracy and learning capability of the toy control system.
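The patent names the Yolo algorithm for single-frame target detection but gives no further detail. One core post-processing step in Yolo-style detectors is IoU-based non-maximum suppression, sketched here; the boxes, scores, and threshold are made-up values:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, thresh=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order):
        best, order = order[0], order[1:]
        keep.append(int(best))
        order = np.array([i for i in order if iou(boxes[best], boxes[i]) < thresh])
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
keep = nms(boxes, scores)   # the overlapping second box is suppressed
```

In a full detector, this step runs over the per-class candidate boxes produced by the network for each single frame.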
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above-described embodiments may be realized in whole or in part in the form of a computer program product, which comprises one or more computer instructions or computer programs. When the computer instructions or computer program are loaded or executed on a computer, the procedures or functions according to the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored on a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another by wire (e.g., coaxial cable, optical fiber, digital subscriber line) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium accessible by a computer, or a data storage device such as a server or data center containing one or more collections of available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a solid-state drive.
It should be understood that the term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; e.g., A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone, where A and B may be singular or plural. In addition, the "/" in this document generally indicates that the associated objects before and after it are in an "or" relationship, but it may also indicate an "and/or" relationship, which may be understood with reference to the surrounding context.
In the present application, "at least one" means one or more, "a plurality" means two or more. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of the singular or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or multiple.
It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a variety of media that can store program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (8)
1. An intelligent toy system with a network interaction function, characterized in that it comprises a feature acquisition module, a learning module, and a processing module;
the feature acquisition module acquires user features; the learning module trains a feature model in the processing module according to the user feature data and classifies the feature data; the processing module, upon receiving the user feature data, selects interaction data matching the current feature data according to the feature model and controls the toy to interact with the user based on the interaction data;
the feature acquisition module comprises a gesture acquisition unit, the gesture acquisition unit comprises skin color extraction, fingertip extraction and finger number identification, the skin color extraction is used for extracting the skin color of a hand, and the fingertip extraction and the finger number identification are used for identifying the edge of the hand and the number of fingers;
the fingertip extraction and finger-number recognition extract a binarized image of the palm region by means of a convex hull, the convex hull being a convex polygon formed by connecting the outer points of the binarized image, and the palm-center coordinates of the convex hull are extracted through the following formula:
2. The intelligent toy system with network interaction function of claim 1, wherein: the feature acquisition module further comprises a voice acquisition unit, the voice acquisition unit comprising text analysis, prosody processing, and voice synthesis, wherein the text analysis is used to process an input text, the prosody processing is used to predict the prosodic features of the synthesized speech, and the voice synthesis is used to process the text features and prosody model parameters obtained from the text analysis and prosody processing.
3. The intelligent toy system with network interaction function of claim 2, wherein: the voice acquisition unit recognizes voice and comprises the following steps:
filtering out secondary information and environmental noise in an original voice signal;
analyzing a voice waveform and extracting a voice time sequence characteristic sequence;
and inputting the obtained voice characteristic parameters into an acoustic model for continuous training to obtain a model matched with a training output signal.
4. The intelligent toy system with network interaction function of claim 1, wherein: the feature data comprises user gesture data and user voice data acquired by a feature acquisition module.
5. The intelligent toy system with network interaction function of claim 1, further comprising a tracking module: when the feature acquisition module acquires the user features, the tracking module continuously tracks the user features, so that the feature acquisition module can track and acquire them.
6. The intelligent toy system with network interaction function of claim 5, wherein: the tracking module is a tracking camera; when the feature acquisition module acquires user features, the tracking camera delimits a user activity area; the tracking module continuously tracks the user within the activity area while the feature acquisition module continuously acquires features, and the tracking camera stops tracking once the user moves out of the activity area.
7. An intelligent toy system with network interaction function as claimed in any one of claims 1-6, wherein: the processing module comprises a processor and a signal transceiver, the signal transceiver is electrically connected with the processor, the processor is used for processing the characteristic data, and the processor is wirelessly connected with the mobile phone terminal through the signal transceiver based on a WiFi network.
8. A control method of an intelligent toy with a network interaction function is characterized in that: the method comprises the following steps:
s1: collecting user characteristic data;
s2: training a feature model according to the user feature data;
s3: classifying the feature data;
s4: selecting interaction data matching the current feature data according to the feature model, and controlling the toy to interact with the user based on the interaction data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211063424.3A CN115145402A (en) | 2022-09-01 | 2022-09-01 | Intelligent toy system with network interaction function and control method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115145402A true CN115145402A (en) | 2022-10-04 |
Family
ID=83416655
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211063424.3A Pending CN115145402A (en) | 2022-09-01 | 2022-09-01 | Intelligent toy system with network interaction function and control method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115145402A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117492568A (en) * | 2023-11-15 | 2024-02-02 | 杭州稚爱教育科技有限公司 | Toy interaction identification method and system based on convolutional neural network |
CN117492568B (en) * | 2023-11-15 | 2024-04-26 | 杭州稚爱教育科技有限公司 | Toy interaction identification method and system based on convolutional neural network |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160132125A1 (en) * | 2014-11-07 | 2016-05-12 | HONG FU JIN PRECISION INDUSTRY ShenZhen) CO., LTD. | System and method for generating gestures |
CN106778670A (en) * | 2016-12-30 | 2017-05-31 | 上海集成电路研发中心有限公司 | Gesture identifying device and recognition methods |
CN109550233A (en) * | 2018-11-15 | 2019-04-02 | 东南大学 | Autism child attention training system based on augmented reality |
CN112462940A (en) * | 2020-11-25 | 2021-03-09 | 苏州科技大学 | Intelligent home multi-mode man-machine natural interaction system and method thereof |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108805087B (en) | Time sequence semantic fusion association judgment subsystem based on multi-modal emotion recognition system | |
CN108805089B (en) | Multi-modal-based emotion recognition method | |
CN108899050B (en) | Voice signal analysis subsystem based on multi-modal emotion recognition system | |
CN108877801B (en) | Multi-turn dialogue semantic understanding subsystem based on multi-modal emotion recognition system | |
CN109409296B (en) | Video emotion recognition method integrating facial expression recognition and voice emotion recognition | |
Yu et al. | A multimodal learning interface for grounding spoken language in sensory perceptions | |
CN110990543A (en) | Intelligent conversation generation method and device, computer equipment and computer storage medium | |
CN108346427A (en) | A kind of audio recognition method, device, equipment and storage medium | |
CN103996155A (en) | Intelligent interaction and psychological comfort robot service system | |
CN101187990A (en) | A session robotic system | |
JPH04329598A (en) | Message recognition method and apparatus using consolidation type information of vocal and hand writing operation | |
WO2015171646A1 (en) | Method and system for speech input | |
Liu et al. | Re-synchronization using the hand preceding model for multi-modal fusion in automatic continuous cued speech recognition | |
Hao et al. | A survey of research on lipreading technology | |
CN112784696A (en) | Lip language identification method, device, equipment and storage medium based on image identification | |
CN111554279A (en) | Multi-mode man-machine interaction system based on Kinect | |
Lim et al. | Emotion Recognition by Facial Expression and Voice: Review and Analysis | |
CN111462762B (en) | Speaker vector regularization method and device, electronic equipment and storage medium | |
Atkar et al. | Speech Emotion Recognition using Dialogue Emotion Decoder and CNN Classifier | |
Ballard et al. | A multimodal learning interface for word acquisition | |
Chinmayi et al. | Emotion Classification Using Deep Learning | |
Akinpelu et al. | Lightweight Deep Learning Framework for Speech Emotion Recognition | |
CN115455136A (en) | Intelligent digital human marketing interaction method and device, computer equipment and storage medium | |
CN115145402A (en) | Intelligent toy system with network interaction function and control method | |
US11681364B1 (en) | Gaze prediction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20221004 |