CN112543390A - Intelligent infant sound box and interaction method thereof - Google Patents

Intelligent infant sound box and interaction method thereof

Info

Publication number
CN112543390A
CN112543390A (application CN202011336049.6A)
Authority
CN
China
Prior art keywords
wolf
infant
module
voice
neural network
Prior art date
Legal status
Granted
Application number
CN202011336049.6A
Other languages
Chinese (zh)
Other versions
CN112543390B (en)
Inventor
岳莉亚
胡沛
韩璞
韩凌
杨植森
Current Assignee
Nanyang Institute of Technology
Original Assignee
Nanyang Institute of Technology
Priority date
Filing date
Publication date
Application filed by Nanyang Institute of Technology filed Critical Nanyang Institute of Technology
Priority to CN202011336049.6A priority Critical patent/CN112543390B/en
Publication of CN112543390A publication Critical patent/CN112543390A/en
Application granted granted Critical
Publication of CN112543390B publication Critical patent/CN112543390B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04RLOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00Details of transducers, loudspeakers or microphones
    • H04R1/02Casings; Cabinets ; Supports therefor; Mountings therein
    • H04R1/028Casings; Cabinets ; Supports therefor; Mountings therein associated with devices performing functions other than acoustics, e.g. electric candles
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065Adaptation
    • G10L15/07Adaptation to the speaker
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04RLOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00Details of transducers, loudspeakers or microphones
    • H04R1/02Casings; Cabinets ; Supports therefor; Mountings therein
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04RLOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00Details of transducers, loudspeakers or microphones
    • H04R1/02Casings; Cabinets ; Supports therefor; Mountings therein
    • H04R1/023Screens for loudspeakers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • G10L2015/0631Creating reference templates; Clustering
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • G10L2015/0638Interactive procedures
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223Execution procedure of a spoken command
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04RLOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2201/00Details of transducers, loudspeakers or microphones covered by H04R1/00 but not provided for in any of its subgroups
    • H04R2201/02Details casings, cabinets or mounting therein for transducers covered by H04R1/02 but not provided for in any of its subgroups

Abstract

The invention provides an intelligent infant sound box and an interaction method thereof. The intelligent infant sound box comprises a sound box body; a central processing unit, a storage and a network connector are arranged in the sound box body, and a display screen is arranged on the surface of the sound box body. A voice acquisition module, an infant voiceprint acquisition module, a wake-up module, an output module and an intelligent control module are arranged in the central processing unit; a storage module is arranged in the storage; the output module is connected with the display screen through a circuit, and the intelligent control module is electrically connected with the voice acquisition module, the infant voiceprint acquisition module, the wake-up module, the storage module and the output module. The voice acquisition module is used for acquiring adult voice information; the infant voiceprint acquisition module is used for acquiring infant voice signals; the wake-up module is used for waking up the smart speaker by voice; the output module is used for responding to user instructions, and its output content comprises sound and video; the intelligent control module is used for adult voice recognition, infant voice recognition, user instruction response and dynamic addition of infant wake-up words.

Description

Intelligent infant sound box and interaction method thereof
Technical Field
The invention relates to the technical fields of speech recognition and artificial intelligence, and in particular to an intelligent infant sound box and an interaction method thereof.
Background
With the maturing of artificial intelligence and the development of speech recognition technology, smart speakers have begun to enter people's daily lives. A smart speaker not only offers the audio and video playback of traditional audio equipment but also adds intelligence, interaction and control functions. The speakers currently popular on the market provide good interactivity and intelligence, yet give a poor experience to infants who have only recently learned to speak: for example, the wake-up words are too long and the infants' instructions cannot be recognized correctly.
A neural network imitates the thinking function of the human brain: it has strong self-learning and association capabilities and high accuracy, and it requires little manual intervention and little expert knowledge. A typical neural network architecture comprises an input layer, one or more hidden layers, and an output layer. Meta-heuristic algorithms can find a global solution in a multi-dimensional search space and are widely used to train neural network parameters. Nevertheless, neural networks have inherent drawbacks such as easily falling into local optima, limited precision and slow learning. In addition, the processors of existing smart speakers offer only average performance and poor data-processing capability.
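Purely for illustration (this sketch is not part of the patent), the following Python code shows a minimal feed-forward network of the kind described above, with a single tanh hidden layer and all weights held in one flat parameter vector, which is the form a meta-heuristic algorithm needs in order to train the network; the layer sizes and the activation function are assumptions.

```python
import numpy as np

def unpack(theta, n_in, n_hidden, n_out):
    """Split one flat parameter vector into the weights and biases of a
    network with a single hidden layer."""
    i = n_in * n_hidden
    W1 = theta[:i].reshape(n_in, n_hidden)
    b1 = theta[i:i + n_hidden]
    j = i + n_hidden
    W2 = theta[j:j + n_hidden * n_out].reshape(n_hidden, n_out)
    b2 = theta[j + n_hidden * n_out:]
    return W1, b1, W2, b2

def forward(theta, X, n_in, n_hidden, n_out):
    """Input layer -> tanh hidden layer -> linear output layer."""
    W1, b1, W2, b2 = unpack(theta, n_in, n_hidden, n_out)
    return np.tanh(X @ W1 + b1) @ W2 + b2

def n_params(n_in, n_hidden, n_out):
    """Total number of trainable parameters, i.e. the dimension of the
    search space a meta-heuristic has to explore."""
    return n_in * n_hidden + n_hidden + n_hidden * n_out + n_out
```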
Disclosure of Invention
The invention aims to solve the above problems in the prior art and provides an intelligent infant sound box and an interaction method thereof, in which an improved algorithm optimizes the neural network parameters so that the sound box can intelligently distinguish whether it has been woken up by an adult or by an infant.
The purpose of the invention is achieved by the following technical scheme: the intelligent infant sound box comprises a sound box body, wherein a central processing unit, a storage and a network connector are arranged in the sound box body, and a display screen is arranged on the surface of the sound box body; a voice acquisition module, an infant voiceprint acquisition module, an awakening module, an output module and an intelligent control module are arranged in the central processing unit, a storage module is arranged in the storage, the output module is connected with the display screen through a circuit, and the intelligent control module is electrically connected with the voice acquisition module, the infant voiceprint acquisition module, the awakening module, the storage module and the output module; the voice acquisition module is used for acquiring adult voice information and comprises a plurality of single voice acquisition modules; the infant voiceprint acquisition module is used for acquiring infant voice signals; the awakening module is used for awakening the intelligent sound box through voice, and comprises an adult awakening module and an infant awakening module; the storage module is used for storing adult voice recognition information, awakening words, common infant instructions, infant browsing history and cache data; the output module is used for responding to user instructions, and its output content comprises sound and video; the intelligent control module is used for adult voice recognition, infant voice recognition, responding to user instructions and dynamically adding infant awakening words; and the network connector is used for connecting the device to the Internet.
In the above intelligent infant sound box, the plurality of single voice acquisition modules specifically comprise a first adult administrator voice acquisition module, a second adult administrator voice acquisition module, a third adult administrator voice acquisition module, a fourth adult administrator voice acquisition module, a fifth adult administrator voice acquisition module and a sixth adult administrator voice acquisition module.
The voice acquisition module can therefore collect the voice information of up to six adults in total (for example, parents and grandparents); after recognition training by the intelligent control module, these six adults can control the infant's permission to operate the smart speaker.
The interaction method of the intelligent infant sound box comprises the following steps:
A. the method for recognizing adult speech comprises the following steps:
1) inputting adult sample voice;
2) extracting MFCC characteristic parameters;
3) constructing a neural network model;
4) inputting adult training voice;
5) extracting MFCC characteristic parameters;
6) carrying out neural network speech recognition training using the neural network model constructed in step 3), wherein the training method comprises the following steps:
a. inputting speech characteristic parameter training and testing data;
b. normalizing the training data and the test data;
c. constructing a neural network;
d. calling the compact grey wolf algorithm;
e. setting the neural network parameters to the trained parameters;
f. constructing a neural network through the normalized training data;
g. predicting and outputting a test result by a neural network;
B. the method for recognizing the infant voice comprises the following steps:
1) inputting a sample voice of the infant;
2) extracting MFCC characteristic parameters;
3) constructing a neural network model;
4) inputting a training voice of the infant;
5) extracting MFCC characteristic parameters;
6) carrying out neural network speech recognition training using the neural network model constructed in step 3), wherein the training method comprises the following steps:
a. inputting speech characteristic parameter training and testing data;
b. normalizing the training data and the test data;
c. constructing a neural network;
d. calling the compact grey wolf algorithm;
e. setting the neural network parameters to the trained parameters;
f. constructing a neural network through the normalized training data;
g. predicting and outputting the test results by the neural network.
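As a hedged illustration of steps 1)-5) and sub-steps a-b above, the sketch below extracts MFCC feature parameters and applies min-max normalization; librosa is an assumed third-party library, and the sampling rate, coefficient count and frame averaging are illustrative choices not specified in the patent.

```python
import numpy as np
import librosa  # assumed audio library; the patent does not name a toolkit

def extract_mfcc(wav_path, n_mfcc=13, sr=16000):
    """Load one recording and reduce it to a fixed-length MFCC feature vector."""
    signal, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    return mfcc.mean(axis=1)                                     # average over frames

def minmax_normalize(train, test):
    """Scale training and test feature matrices to [0, 1] using the
    training-set range only (sub-step b)."""
    lo, hi = train.min(axis=0), train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero
    return (train - lo) / span, (test - lo) / span
```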
In the interaction method of the intelligent infant sound box, the compact grey wolf algorithm comprises the following steps:
1) initializing the relevant parameters, for example the maximum number of iterations Max_iter = 500, the upper position limit ub = 1 and the lower position limit lb = 0, and randomly generating a grey wolf Position; mu and sicma are initialized as shown in formulas (1) and (2):
mu=zeros(3,dim); (1)
sicma=10*ones(3,dim); (2)
mu and sicma represent the mean and variance of the Gaussian distribution; dim is the dimension of the search space, namely the number of neural network parameters to be optimized;
2) initializing the alpha, beta and delta wolf positions according to formulas (3) to (5):
Alpha_pos=ub*generateIndividualR(mu(1),sigma2(1)); (3)
Beta_pos=ub*generateIndividualR(mu(2),sigma2(2)); (4)
Delta_pos=ub*generateIndividualR(mu(3),sigma2(3)); (5)
the generateIndividualR function generates a grey wolf position from the mean and variance of the Gaussian distribution;
3) the generateIndividualR(mu, sigma) function is computed according to formulas (6) to (9):
r=rand(); (6)
erfA=erf((mu+1)/(sqrt(2)*sigma)); (7)
erfB=erf((mu-1)/(sqrt(2)*sigma)); (8)
samplerand=erfinv(-erfA-r*erfB+r*erfA)*sigma*sqrt(2)+mu; (9)
rand() generates a random number in [0, 1]; erf() is the error function, i.e. the integral of the Gaussian probability density function; sqrt() is the square-root function; erfinv() is the inverse error function; samplerand is the value returned by the function;
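Formulas (6)-(9) can be transcribed almost directly into Python; the sketch below uses scipy's erf and erfinv, returns one coordinate sampled in [-1, 1], and is provided for illustration only.

```python
import numpy as np
from scipy.special import erf, erfinv

def generate_individual_r(mu, sigma):
    """Sample one coordinate of a wolf position from the truncated Gaussian
    described by (mu, sigma), following formulas (6)-(9)."""
    r = np.random.rand()                                 # (6): uniform in [0, 1)
    erf_a = erf((mu + 1.0) / (np.sqrt(2.0) * sigma))     # (7)
    erf_b = erf((mu - 1.0) / (np.sqrt(2.0) * sigma))     # (8)
    # (9): invert the error function to map r back into [-1, 1]
    return erfinv(-erf_a - r * erf_b + r * erf_a) * sigma * np.sqrt(2.0) + mu
```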
4) calling the objective function given by formula (10) to obtain the objective function values of the alpha, beta and delta wolves, denoted Alpha_score, Beta_score and Delta_score respectively;
f=(1/n)*sum((y-y')^2); (10)
n is the number of the neural network training samples, y is a training sample label, and y' represents a sample prediction result;
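Formula (10), reconstructed above as the mean squared error over the n training samples (an interpretation consistent with the variables described, not an explicit statement of the patent), can be sketched as:

```python
import numpy as np

def objective(y_true, y_pred):
    """f = (1/n) * sum((y - y')^2); lower values mean a better wolf."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((y_true - y_pred) ** 2)
```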
5) calculating the position to which the grey wolf moves next: traverse each dimension of the wolf in a loop and update according to formulas (11) to (15):
a=2-l*(2/Max_iter); (11)
X1=Alpha_pos(j)-(2*a*rand()-a)*abs(2*rand()*Alpha_pos(j)-Position(j)); (12)
X2=Beta_pos(j)-(2*a*rand()-a)*abs(2*rand()*Beta_pos(j)-Position(j)); (13)
X3=Delta_pos(j)-(2*a*rand()-a)*abs(2*rand()*Delta_pos(j)-Position(j)); (14)
Position(j)=(X1+X2+X3)/3; (15)
l is the current iteration number and j denotes the jth dimension of the wolf; a controls the balance between the global and local search capabilities of the algorithm; X1, X2 and X3 are the attractions of the alpha, beta and delta wolves toward the grey wolf, respectively; abs() is the absolute-value function;
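Formulas (11)-(15) translate line for line into the following illustrative sketch, which moves the candidate wolf toward the three leaders:

```python
import numpy as np

def update_position(position, alpha_pos, beta_pos, delta_pos, l, max_iter):
    """Apply formulas (11)-(15) to every dimension of the candidate wolf."""
    a = 2.0 - l * (2.0 / max_iter)                # (11): decreases linearly to 0
    new_pos = np.empty_like(position, dtype=float)
    for j in range(position.size):
        x1 = alpha_pos[j] - (2 * a * np.random.rand() - a) * abs(
            2 * np.random.rand() * alpha_pos[j] - position[j])        # (12)
        x2 = beta_pos[j] - (2 * a * np.random.rand() - a) * abs(
            2 * np.random.rand() * beta_pos[j] - position[j])         # (13)
        x3 = delta_pos[j] - (2 * a * np.random.rand() - a) * abs(
            2 * np.random.rand() * delta_pos[j] - position[j])        # (14)
        new_pos[j] = (x1 + x2 + x3) / 3.0                              # (15)
    return new_pos
```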
6) comparing the updated grey wolf position with the alpha wolf, winner1 being the wolf with the best objective function value, loser1 being the wolf with the worst objective function value;
7) updating mu(1) and sicma(1): traverse each dimension of the wolf and update according to formulas (16) to (21):
winner1(j)=(winner1(j)-(ub(j)+lb(j))/2)/((ub(j)-lb(j))/2); (16)
loser1(j)=(loser1(j)-(ub(j)+lb(j))/2)/((ub(j)-lb(j))/2); (17)
mut=mu(1,j); (18)
mu(1,j)=mu(1,j)+(1/200)*(winner1(j)-loser1(j)); (19)
t=sicma(1,j)^2+mut^2-mu(1,j)^2+(1/200)*(winner1(j)^2-loser1(j)^2); (20)
sicma(1,j)=sqrt(t); (21)
8) comparing the updated grey wolf position with the beta wolf, winner2 being the wolf with the best objective function value, loser2 being the wolf with the worst objective function value;
9) updating mu(2) and sicma(2): traverse each dimension of the wolf and update according to formulas (22) to (27):
winner2(j)=(winner2(j)-(ub(j)+lb(j))/2)/((ub(j)-lb(j))/2); (22)
loser2(j)=(loser2(j)-(ub(j)+lb(j))/2)/((ub(j)-lb(j))/2); (23)
mut=mu(2,j); (24)
mu(2,j)=mu(2,j)+(1/200)*(winner2(j)-loser2(j)); (25)
t=sicma(2,j)^2+mut^2-mu(2,j)^2+(1/200)*(winner2(j)^2-loser2(j)^2); (26)
sicma(2,j)=sqrt(t); (27)
10) comparing the updated grey wolf position with the gamma wolf, winner3 being the wolf with the best objective function value, loser3 being the wolf with the worst objective function value;
11) updating mu(3) and sicma(3): traverse each dimension of the wolf and update according to formulas (28) to (33):
winner3(j)=(winner3(j)-(ub(j)+lb(j))/2)/((ub(j)-lb(j))/2); (28)
loser3(j)=(loser3(j)-(ub(j)+lb(j))/2)/((ub(j)-lb(j))/2); (29)
mut=mu(3,j); (30)
mu(3,j)=mu(3,j)+(1/200)*(winner3(j)-loser3(j)); (31)
t=sicma(3,j)^2+mut^2-mu(3,j)^2+(1/200)*(winner3(j)^2-loser3(j)^2); (32)
sicma(3,j)=sqrt(t); (33)
12) the loop ends, and the optimal values of winner1, winner2 and winner3 are output.
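For illustration, steps 7), 9) and 11) all apply the same winner/loser update to one row of (mu, sicma); the sketch below follows formulas (16)-(21), with the 1/200 factor read as a virtual population size and with a small numerical floor added before the square root as a safeguard that is not part of the patent's formulas.

```python
import numpy as np

N_P = 200  # virtual population size implied by the 1/200 factor in (19)-(20)

def compact_update(mu_row, sigma_row, winner, loser, ub, lb):
    """Refresh one (mu, sicma) row from a winner/loser pair, following
    formulas (16)-(21)."""
    for j in range(mu_row.size):
        # (16)-(17): rescale winner and loser from [lb, ub] to [-1, 1]
        w = (winner[j] - (ub[j] + lb[j]) / 2.0) / ((ub[j] - lb[j]) / 2.0)
        s = (loser[j] - (ub[j] + lb[j]) / 2.0) / ((ub[j] - lb[j]) / 2.0)
        mut = mu_row[j]                                    # (18): keep the old mean
        mu_row[j] = mu_row[j] + (w - s) / N_P              # (19)
        t = sigma_row[j]**2 + mut**2 - mu_row[j]**2 + (w**2 - s**2) / N_P  # (20)
        sigma_row[j] = np.sqrt(max(t, 1e-12))              # (21), as reconstructed above
    return mu_row, sigma_row
```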
Compared with the prior art, the intelligent infant sound box and the interaction method thereof have the following advantages:
the method can dynamically add awakening words, efficiently identify infant voice instructions, intelligently control the authority of infants to access the intelligent sound box, construct an efficient neural network voice training model, optimize neural network parameters in an embedded CPU with limited operation capability by the improved compact wolf algorithm, avoid the problem that the neural network is trapped in a local trap, effectively improve the prediction accuracy and accelerate the prediction process.
Drawings
FIG. 1 is a system diagram of the present invention;
FIG. 2 is a block diagram of an adult speech recognition process of the present invention;
FIG. 3 is a block diagram of a baby speech recognition process according to the present invention;
FIG. 4 is a flow chart of neural network speech recognition training of the present invention;
FIG. 5 is a diagram of a neural network architecture of the present invention;
FIG. 6 is a flow chart of the improved compact wolf algorithm of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made with reference to the accompanying drawings:
as shown in fig. 1, the infant intelligent sound box comprises a sound box body, wherein a central processing unit, a memory and a network connector are arranged in the sound box body, and a display screen is arranged on the surface of the sound box body, and is characterized in that a voice acquisition module, an infant voiceprint acquisition module, a wake-up module, an output module and an intelligent control module are arranged in the central processing unit, a storage module is arranged in the memory, the output module is connected with the display screen through a circuit, and the intelligent control module is electrically connected with the voice acquisition module, the infant voiceprint acquisition module, the wake-up module, the storage module and the output module; the voice acquisition module is used for acquiring adult voice information and comprises a plurality of single voice acquisition modules; the infant voiceprint acquisition module is used for acquiring infant voice signals; the awakening module is used for awakening the intelligent sound box through voice and comprises an adult awakening module and an infant awakening module; the storage module is used for storing adult voice recognition information, awakening words, infant common instructions, infant historical browsing information and cache data; the output module is used for responding to a user instruction, and the output content of the output module comprises sound and video; the intelligent control module is used for adult voice recognition, infant voice recognition, user instruction response and dynamic addition of infant awakening words; the network connector is used for connecting the intelligent equipment with the Internet.
In the above intelligent infant sound box, the plurality of single voice acquisition modules specifically comprise a first adult administrator voice acquisition module, a second adult administrator voice acquisition module, a third adult administrator voice acquisition module, a fourth adult administrator voice acquisition module, a fifth adult administrator voice acquisition module and a sixth adult administrator voice acquisition module.
The voice acquisition module can therefore collect the voice information of up to six adults in total (for example, parents and grandparents); after recognition training by the intelligent control module, these six adults can control the infant's permission to operate the smart speaker.
The interaction method of the intelligent infant sound box comprises the following steps:
A. As shown in FIG. 2, the method for adult speech recognition comprises the following steps:
1) inputting adult sample voice;
2) extracting MFCC characteristic parameters;
3) constructing a neural network model;
4) inputting adult training voice;
5) extracting MFCC characteristic parameters;
6) as shown in FIG. 4, carrying out neural network speech recognition training using the neural network model constructed in step 3), wherein the training method comprises the following steps:
a. inputting speech characteristic parameter training and testing data;
b. normalizing the training data and the test data;
c. constructing a neural network (as shown in FIG. 5);
d. calling the compact grey wolf algorithm (as shown in FIG. 6);
e. setting the neural network parameters to the trained parameters;
f. constructing a neural network through the normalized training data;
g. predicting and outputting a test result by a neural network;
B. As shown in FIG. 3, the method for infant speech recognition comprises the following steps:
1) inputting a sample voice of the infant;
2) extracting MFCC characteristic parameters;
3) constructing a neural network model;
4) inputting a training voice of the infant;
5) extracting MFCC characteristic parameters;
6) as shown in FIG. 4, carrying out neural network speech recognition training using the neural network model constructed in step 3), wherein the training method comprises the following steps:
a. inputting speech characteristic parameter training and testing data;
b. normalizing the training data and the test data;
c. constructing a neural network (as shown in FIG. 5);
d. calling the compact grey wolf algorithm (as shown in FIG. 6);
e. setting the neural network parameters to the trained parameters;
f. constructing a neural network through the normalized training data;
g. predicting and outputting the test results by the neural network.
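Purely as an illustration of how sub-steps a-g fit together, the following sketch normalizes the feature data, hands a flat weight vector to an optimizer (standing in for the compact grey wolf algorithm of FIG. 6; any callable minimizing f(theta) over [lb, ub]^dim can be substituted), fixes the network parameters to the returned optimum and predicts the test outputs; the function names and the single-hidden-layer shape are assumptions, not taken from the patent.

```python
import numpy as np

def train_and_test(X_tr, y_tr, X_te, n_hidden, optimizer):
    # b. normalize training and test data with the training-set range
    lo, hi = X_tr.min(axis=0), X_tr.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    X_tr, X_te = (X_tr - lo) / span, (X_te - lo) / span

    # c. network shape determines the dimension of the search space
    n_in, n_out = X_tr.shape[1], 1
    dim = n_in * n_hidden + n_hidden + n_hidden * n_out + n_out

    def forward(theta, X):
        i = n_in * n_hidden
        W1 = theta[:i].reshape(n_in, n_hidden)
        b1 = theta[i:i + n_hidden]
        W2 = theta[i + n_hidden:i + n_hidden + n_hidden * n_out].reshape(n_hidden, n_out)
        b2 = theta[i + n_hidden + n_hidden * n_out:]
        return (np.tanh(X @ W1 + b1) @ W2 + b2).ravel()

    # d.-e. run the optimizer and keep the best weight vector it returns
    def loss(theta):
        return np.mean((np.asarray(y_tr, float) - forward(theta, X_tr)) ** 2)
    theta_best = optimizer(loss, dim, lb=0.0, ub=1.0)

    # f.-g. the trained network predicts and outputs the test results
    return forward(theta_best, X_te)
```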
As shown in FIG. 6, in the interaction method of the intelligent infant sound box, the compact grey wolf algorithm comprises the following steps:
1) initializing the relevant parameters, for example the maximum number of iterations Max_iter = 500, the upper position limit ub = 1 and the lower position limit lb = 0, and randomly generating a grey wolf Position; mu and sicma are initialized as shown in formulas (1) and (2):
mu=zeros(3,dim); (1)
sicma=10*ones(3,dim); (2)
mu and sicma represent the mean and variance of the Gaussian distribution; dim is the dimension of the search space, namely the number of neural network parameters to be optimized;
2) initializing the alpha, beta and delta wolf positions according to formulas (3) to (5):
Alpha_pos=ub*generateIndividualR(mu(1),sigma2(1)); (3)
Beta_pos=ub*generateIndividualR(mu(2),sigma2(2)); (4)
Delta_pos=ub*generateIndividualR(mu(3),sigma2(3)); (5)
the generateIndividualR function generates a grey wolf position from the mean and variance of the Gaussian distribution;
3) the generateIndividualR(mu, sigma) function is computed according to formulas (6) to (9):
r=rand(); (6)
erfA=erf((mu+1)/(sqrt(2)*sigma)); (7)
erfB=erf((mu-1)/(sqrt(2)*sigma)); (8)
samplerand=erfinv(-erfA-r*erfB+r*erfA)*sigma*sqrt(2)+mu; (9)
rand() generates a random number in [0, 1]; erf() is the error function, i.e. the integral of the Gaussian probability density function; sqrt() is the square-root function; erfinv() is the inverse error function; samplerand is the value returned by the function;
4) calling the objective function given by formula (10) to obtain the objective function values of the alpha, beta and delta wolves, denoted Alpha_score, Beta_score and Delta_score respectively;
f=(1/n)*sum((y-y')^2); (10)
n is the number of the neural network training samples, y is a training sample label, and y' represents a sample prediction result;
5) calculating the position to which the grey wolf moves next: traverse each dimension of the wolf in a loop and update according to formulas (11) to (15):
a=2-l*(2/Max_iter); (11)
X1=Alpha_pos(j)-(2*a*rand()-a)*abs(2*rand()*Alpha_pos(j)-Position(j)); (12)
X2=Beta_pos(j)-(2*a*rand()-a)*abs(2*rand()*Beta_pos(j)-Position(j)); (13)
X3=Delta_pos(j)-(2*a*rand()-a)*abs(2*rand()*Delta_pos(j)-Position(j)); (14)
Position(j)=(X1+X2+X3)/3; (15)
l is the current iteration number and j denotes the jth dimension of the wolf; a controls the balance between the global and local search capabilities of the algorithm; X1, X2 and X3 are the attractions of the alpha, beta and delta wolves toward the grey wolf, respectively; abs() is the absolute-value function;
6) comparing the updated grey wolf position with the alpha wolf, winner1 being the wolf with the best objective function value, loser1 being the wolf with the worst objective function value;
7) updating mu(1) and sicma(1): traverse each dimension of the wolf and update according to formulas (16) to (21):
winner1(j)=(winner1(j)-(ub(j)+lb(j))/2)/((ub(j)-lb(j))/2); (16)
loser1(j)=(loser1(j)-(ub(j)+lb(j))/2)/((ub(j)-lb(j))/2); (17)
mut=mu(1,j); (18)
mu(1,j)=mu(1,j)+(1/200)*(winner1(j)-loser1(j)); (19)
t=sicma(1,j)^2+mut^2-mu(1,j)^2+(1/200)*(winner1(j)^2-loser1(j)^2); (20)
sicma(1,j)=sqrt(t); (21)
8) comparing the updated grey wolf position with the beta wolf, winner2 being the wolf with the best objective function value, loser2 being the wolf with the worst objective function value;
9) updating mu(2) and sicma(2): traverse each dimension of the wolf and update according to formulas (22) to (27):
winner2(j)=(winner2(j)-(ub(j)+lb(j))/2)/((ub(j)-lb(j))/2); (22)
loser2(j)=(loser2(j)-(ub(j)+lb(j))/2)/((ub(j)-lb(j))/2); (23)
mut=mu(2,j); (24)
mu(2,j)=mu(2,j)+(1/200)*(winner2(j)-loser2(j)); (25)
t=sicma(2,j)^2+mut^2-mu(2,j)^2+(1/200)*(winner2(j)^2-loser2(j)^2); (26)
sicma(2,j)=sqrt(t); (27)
10) comparing the updated grey wolf position with the gamma wolf, winner3 being the wolf with the best objective function value, loser3 being the wolf with the worst objective function value;
11) updating mu(3) and sicma(3): traverse each dimension of the wolf and update according to formulas (28) to (33):
winner3(j)=(winner3(j)-(ub(j)+lb(j))/2)/((ub(j)-lb(j))/2); (28)
loser3(j)=(loser3(j)-(ub(j)+lb(j))/2)/((ub(j)-lb(j))/2); (29)
mut=mu(3,j); (30)
mu(3,j)=mu(3,j)+(1/200)*(winner3(j)-loser3(j)); (31)
t=sicma(3,j)^2+mut^2-mu(3,j)^2+(1/200)*(winner3(j)^2-loser3(j)^2); (32)
sicma(3,j)=sqrt(t); (33)
12) the loop ends, and the optimal values of winner1, winner2 and winner3 are output.
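Tying steps 1)-12) together, a main loop in the spirit of FIG. 6 could look as follows; it reuses the generate_individual_r, update_position and compact_update sketches given earlier, and the choice to replace a leader whenever the candidate wolf scores better is an interpretation of steps 6)-11) rather than an explicit statement of the patent.

```python
import numpy as np

def compact_gwo(loss, dim, max_iter=500, lb=0.0, ub=1.0):
    """Compact grey wolf optimizer sketch: one candidate wolf, three leaders,
    and a 3 x dim probability vector (mu, sicma) updated from winner/loser pairs."""
    ub_v, lb_v = np.full(dim, ub), np.full(dim, lb)
    mu = np.zeros((3, dim))                          # (1)
    sigma = 10.0 * np.ones((3, dim))                 # (2)
    position = lb_v + np.random.rand(dim) * (ub_v - lb_v)   # random grey wolf

    # (3)-(5): initial alpha, beta, delta positions sampled from the probability vector
    leaders = [ub_v * np.array([generate_individual_r(mu[k, j], sigma[k, j])
                                for j in range(dim)]) for k in range(3)]
    scores = [loss(p) for p in leaders]              # Alpha_score, Beta_score, Delta_score

    for l in range(max_iter):
        # (11)-(15): move the candidate wolf toward the three leaders
        position = update_position(position, leaders[0], leaders[1], leaders[2], l, max_iter)
        pos_score = loss(position)

        # steps 6)-11): winner/loser comparison against each leader, then PV update
        for k in range(3):
            if pos_score < scores[k]:
                winner, loser = position, leaders[k]
                leaders[k], scores[k] = position.copy(), pos_score
            else:
                winner, loser = leaders[k], position
            mu[k], sigma[k] = compact_update(mu[k], sigma[k], winner, loser, ub_v, lb_v)

    # step 12): output the best of the winners found
    return leaders[int(np.argmin(scores))]
```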
Compared with the prior art, the intelligent infant sound box and the interaction method thereof have the following advantages:
the method can dynamically add awakening words, efficiently identify infant voice instructions, intelligently control the authority of infants to access the intelligent sound box, construct an efficient neural network voice training model, optimize neural network parameters in an embedded CPU with limited operation capability by the improved compact wolf algorithm, avoid the problem that the neural network is trapped in a local trap, effectively improve the prediction accuracy and accelerate the prediction process.
It should be understood that the above description is not intended to limit the present invention, which is not limited to the above examples; those skilled in the art may make modifications, alterations, additions or substitutions within the spirit and scope of the present invention.

Claims (4)

1. An intelligent infant sound box, comprising a sound box body, wherein a central processing unit, a storage and a network connector are arranged in the sound box body, and a display screen is arranged on the surface of the sound box body, characterized in that a voice acquisition module, an infant voiceprint acquisition module, an awakening module, an output module and an intelligent control module are arranged in the central processing unit, a storage module is arranged in the storage, the output module is connected with the display screen through a circuit, and the intelligent control module is electrically connected with the voice acquisition module, the infant voiceprint acquisition module, the awakening module, the storage module and the output module; the voice acquisition module is used for acquiring adult voice information and comprises a plurality of single voice acquisition modules; the infant voiceprint acquisition module is used for acquiring infant voice signals; the awakening module is used for awakening the intelligent sound box through voice, and comprises an adult awakening module and an infant awakening module; the storage module is used for storing adult voice recognition information, awakening words, infant common instructions, infant historical browsing information and cache data; the output module is used for responding to a user instruction, and the output content of the output module comprises sound and video; the intelligent control module is used for adult voice recognition, infant voice recognition, user instruction response and dynamic addition of infant awakening words; and the network connector is used for connecting the intelligent device with the Internet.
2. The intelligent infant sound box as claimed in claim 1, wherein the plurality of single voice acquisition modules specifically comprise a first adult administrator voice acquisition module, a second adult administrator voice acquisition module, a third adult administrator voice acquisition module, a fourth adult administrator voice acquisition module, a fifth adult administrator voice acquisition module and a sixth adult administrator voice acquisition module.
3. An interaction method of the intelligent infant sound box as claimed in claim 1, comprising the following steps:
A. the method for recognizing adult speech comprises the following steps:
1) inputting adult sample voice;
2) extracting MFCC characteristic parameters;
3) constructing a neural network model;
4) inputting adult training voice;
5) extracting MFCC characteristic parameters;
6) carrying out neural network speech recognition training using the neural network model constructed in step 3), wherein the training method comprises the following steps:
a. inputting speech characteristic parameter training and testing data;
b. normalizing the training data and the test data;
c. constructing a neural network;
d. calling the compact grey wolf algorithm;
e. setting the neural network parameters to the trained parameters;
f. constructing a neural network through the normalized training data;
g. predicting and outputting a test result by a neural network;
B. the method for recognizing the infant voice comprises the following steps:
1) inputting a sample voice of the infant;
2) extracting MFCC characteristic parameters;
3) constructing a neural network model;
4) inputting a training voice of the infant;
5) extracting MFCC characteristic parameters;
6) carrying out neural network speech recognition training using the neural network model constructed in step 3), wherein the training method comprises the following steps:
a. inputting speech characteristic parameter training and testing data;
b. normalizing the training data and the test data;
c. constructing a neural network;
d. calling the compact grey wolf algorithm;
e. setting the neural network parameters to the trained parameters;
f. constructing a neural network through the normalized training data;
g. predicting and outputting the test results by the neural network.
4. The interaction method of the intelligent infant sound box as claimed in claim 3, wherein the compact grey wolf algorithm comprises the following steps:
1) initializing the relevant parameters, for example the maximum number of iterations Max_iter = 500, the upper position limit ub = 1 and the lower position limit lb = 0, and randomly generating a grey wolf Position; mu and sicma are initialized as shown in formulas (1) and (2):
mu=zeros(3,dim); (1)
sicma=10*ones(3,dim); (2)
mu and sicma represent the mean and variance of the Gaussian distribution; dim is the dimension of the search space, namely the number of neural network parameters to be optimized;
2) initializing the alpha, beta and delta wolf positions according to formulas (3) to (5):
Alpha_pos=ub*generateIndividualR(mu(1),sigma2(1)); (3)
Beta_pos=ub*generateIndividualR(mu(2),sigma2(2)); (4)
Delta_pos=ub*generateIndividualR(mu(3),sigma2(3)); (5)
the generateIndividualR function generates a grey wolf position from the mean and variance of the Gaussian distribution;
3) the generateIndividualR(mu, sigma) function is computed according to formulas (6) to (9):
r=rand(); (6)
erfA=erf((mu+1)/(sqrt(2)*sigma)); (7)
erfB=erf((mu-1)/(sqrt(2)*sigma)); (8)
samplerand=erfinv(-erfA-r*erfB+r*erfA)*sigma*sqrt(2)+mu; (9)
rand() generates a random number in [0, 1]; erf() is the error function, i.e. the integral of the Gaussian probability density function; sqrt() is the square-root function; erfinv() is the inverse error function; samplerand is the value returned by the function;
4) calling the objective function given by formula (10) to obtain the objective function values of the alpha, beta and delta wolves, denoted Alpha_score, Beta_score and Delta_score respectively;
f=(1/n)*sum((y-y')^2); (10)
n is the number of the neural network training samples, y is a training sample label, and y' represents a sample prediction result;
5) calculating the position to which the grey wolf moves next: traverse each dimension of the wolf in a loop and update according to formulas (11) to (15):
a=2-l*(2/Max_iter); (11)
X1=Alpha_pos(j)-(2*a*rand()-a)*abs(2*rand()*Alpha_pos(j)-Position(j)); (12)
X2=Beta_pos(j)-(2*a*rand()-a)*abs(2*rand()*Beta_pos(j)-Position(j)); (13)
X3=Delta_pos(j)-(2*a*rand()-a)*abs(2*rand()*Delta_pos(j)-Position(j)); (14)
Position(j)=(X1+X2+X3)/3; (15)
l is the current iteration number and j denotes the jth dimension of the wolf; a controls the balance between the global and local search capabilities of the algorithm; X1, X2 and X3 are the attractions of the alpha, beta and delta wolves toward the grey wolf, respectively; abs() is the absolute-value function;
6) comparing the updated grey wolf position with the alpha wolf, winner1 being the wolf with the best objective function value, loser1 being the wolf with the worst objective function value;
7) updating mu(1) and sicma(1): traverse each dimension of the wolf and update according to formulas (16) to (21):
winner1(j)=(winner1(j)-(ub(j)+lb(j))/2)/((ub(j)-lb(j))/2); (16)
loser1(j)=(loser1(j)-(ub(j)+lb(j))/2)/((ub(j)-lb(j))/2); (17)
mut=mu(1,j); (18)
mu(1,j)=mu(1,j)+(1/200)*(winner1(j)-loser1(j)); (19)
t=sicma(1,j)^2+mut^2-mu(1,j)^2+(1/200)*(winner1(j)^2-loser1(j)^2); (20)
sicma(1,j)=sqrt(t); (21)
8) comparing the updated grey wolf position with the beta wolf, winner2 being the wolf with the best objective function value, loser2 being the wolf with the worst objective function value;
9) updating mu(2) and sicma(2): traverse each dimension of the wolf and update according to formulas (22) to (27):
winner2(j)=(winner2(j)-(ub(j)+lb(j))/2)/((ub(j)-lb(j))/2); (22)
loser2(j)=(loser2(j)-(ub(j)+lb(j))/2)/((ub(j)-lb(j))/2); (23)
mut=mu(2,j); (24)
mu(2,j)=mu(2,j)+(1/200)*(winner2(j)-loser2(j)); (25)
t=sicma(2,j)^2+mut^2-mu(2,j)^2+(1/200)*(winner2(j)^2-loser2(j)^2); (26)
sicma(2,j)=sqrt(t); (27)
10) comparing the updated grey wolf position with the gamma wolf, winner3 being the wolf with the best objective function value, loser3 being the wolf with the worst objective function value;
11) updating mu(3) and sicma(3): traverse each dimension of the wolf and update according to formulas (28) to (33):
winner3(j)=(winner3(j)-(ub(j)+lb(j))/2)/((ub(j)-lb(j))/2); (28)
loser3(j)=(loser3(j)-(ub(j)+lb(j))/2)/((ub(j)-lb(j))/2); (29)
mut=mu(3,j); (30)
mu(3,j)=mu(3,j)+(1/200)*(winner3(j)-loser3(j)); (31)
t=sicma(3,j)^2+mut^2-mu(3,j)^2+(1/200)*(winner3(j)^2-loser3(j)^2); (32)
sicma(3,j)=sqrt(t); (33)
12) the loop ends, and the optimal values of winner1, winner2 and winner3 are output.
CN202011336049.6A 2020-11-25 2020-11-25 Intelligent infant sound box and interaction method thereof Active CN112543390B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011336049.6A CN112543390B (en) 2020-11-25 2020-11-25 Intelligent infant sound box and interaction method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011336049.6A CN112543390B (en) 2020-11-25 2020-11-25 Intelligent infant sound box and interaction method thereof

Publications (2)

Publication Number Publication Date
CN112543390A true CN112543390A (en) 2021-03-23
CN112543390B CN112543390B (en) 2023-03-24

Family

ID=75015144

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011336049.6A Active CN112543390B (en) 2020-11-25 2020-11-25 Intelligent infant sound box and interaction method thereof

Country Status (1)

Country Link
CN (1) CN112543390B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180277117A1 (en) * 2017-03-23 2018-09-27 Alex Lauren HERGENROEDER Method and Apparatus for Speech Interaction with Children
WO2019160396A2 (en) * 2019-04-11 2019-08-22 엘지전자 주식회사 Guide robot and operation method for guide robot
CN110534099A (en) * 2019-09-03 2019-12-03 腾讯科技(深圳)有限公司 Voice wakes up processing method, device, storage medium and electronic equipment
CN110696002A (en) * 2019-08-31 2020-01-17 左建 Intelligent early education robot
CN211063690U (en) * 2019-12-25 2020-07-21 安徽淘云科技有限公司 Drawing book recognition equipment
CN111638787A (en) * 2020-05-29 2020-09-08 百度在线网络技术(北京)有限公司 Method and device for displaying information
CN111816188A (en) * 2020-06-23 2020-10-23 漳州龙文维克信息技术有限公司 Man-machine voice interaction method for intelligent robot

Also Published As

Publication number Publication date
CN112543390B (en) 2023-03-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant