CN114708872A - Voice instruction response method and device, storage medium and electronic device - Google Patents


Info

Publication number
CN114708872A
Authority
CN
China
Prior art keywords
target
playing
voiceprint information
volume
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210284357.1A
Other languages
Chinese (zh)
Inventor
骆小菊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Haier Technology Co Ltd
Haier Smart Home Co Ltd
Original Assignee
Qingdao Haier Technology Co Ltd
Haier Smart Home Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Haier Technology Co Ltd, Haier Smart Home Co Ltd filed Critical Qingdao Haier Technology Co Ltd
Priority to CN202210284357.1A
Publication of CN114708872A
Legal status: Pending

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00 Details of transducers, loudspeakers or microphones
    • H04R1/20 Arrangements for obtaining desired frequency or directional characteristics
    • H04R1/22 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired frequency characteristic only
    • H04R1/222 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired frequency characteristic only, for microphones
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Otolaryngology (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a voice instruction response method and device, a storage medium and an electronic device. The method includes: acquiring a voice instruction of a target object, and determining target voiceprint information of the target object according to the voice instruction; inputting the target voiceprint information into a target neural network model to obtain an estimated age of the target object, where the target neural network model determines the corresponding estimated age from the input voiceprint information; in the case that the estimated age of the target object falls within a preset target age interval, determining a playing setting having a preset correspondence with the target age interval, where the playing setting includes a target playing volume; and sending the playing setting to a target device, and controlling the target device to respond to the voice instruction of the target object according to the playing setting. This technical solution solves the problem that different playing volumes cannot be set for users of different ages.

Description

Voice instruction response method and device, storage medium and electronic device
Technical Field
The present invention relates to the field of communications, and in particular, to a method and an apparatus for responding to a voice command, a storage medium, and an electronic apparatus.
Background
A smart speaker is an upgraded loudspeaker product: a tool with which household consumers access the internet by voice, for example to request songs, shop online or check the weather forecast, and which can also control smart home devices, for example opening curtains, setting the temperature of a refrigerator, or warming up a water heater in advance.
In practical application scenarios, users of different ages, including the elderly, young and middle-aged adults, and children, all control smart speakers, and users of different ages perceive and tolerate volume differently. Using the same volume control strategy for everyone therefore gives a poor user experience.
For example, an elderly person, whose hearing is generally weaker, may set the smart speaker to a high volume; if a child then uses the speaker during the same period at that high volume, long-term use can harm the child's hearing, and an unsuitable volume likewise affects the elderly person's normal use of the smart speaker. Existing devices do not set different volumes for different users.
For the problem in the related art that different playing volumes cannot be set for users of different ages, no effective solution has yet been proposed.
The related art therefore needs to be improved to overcome these drawbacks.
Disclosure of Invention
The embodiment of the invention provides a voice instruction response method and device, a storage medium and an electronic device, which at least solve the problem that different playing volumes cannot be set for users of different ages.
According to an aspect of the embodiments of the present invention, a method for responding to a voice instruction is provided, including: acquiring a voice instruction of a target object, and determining target voiceprint information of the target object according to the voice instruction; inputting the target voiceprint information into a target neural network model to obtain an estimated age of the target object; in the case that the estimated age of the target object falls within a preset target age interval, determining a playing setting having a preset correspondence with the target age interval, where the playing setting includes a target playing volume; and sending the playing setting to a target device, and controlling the target device to respond to the voice instruction of the target object according to the playing setting.
According to another aspect of the embodiments of the present invention, a device for responding to a voice instruction is further provided, including: a first determining module, configured to acquire a voice instruction of a target object and determine target voiceprint information of the target object according to the voice instruction; a second determining module, configured to input the target voiceprint information into a target neural network model to obtain an estimated age of the target object; a third determining module, configured to determine, in the case that the estimated age of the target object falls within a preset target age interval, a playing setting having a preset correspondence with the target age interval, where the playing setting includes a target playing volume; and a response module, configured to send the playing setting to a target device and control the target device to respond to the voice instruction of the target object according to the playing setting.
According to another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to execute the above-mentioned response method of the voice instruction when running.
According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the method for responding to the voice instruction through the computer program.
According to the invention, when a voice instruction of the target object is acquired, the target voiceprint information of the target object is determined according to the voice instruction, the estimated age of the target object and the target playing volume corresponding to that estimated age are determined according to the target voiceprint information, and the target device is then controlled to respond to the voice instruction of the target object according to the target playing volume. This technical solution solves the problem that different playing volumes cannot be set for users of different ages. Furthermore, because the age of a user can be determined from the user's voice and the playing volume of the device can be set according to that age, the user experience is improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 is a block diagram of a hardware configuration of a computer terminal of a response method of a voice instruction of an embodiment of the present invention;
FIG. 2 is a flow chart of a method of responding to a voice command according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a scenario of a response method of a voice command according to an embodiment of the present invention;
FIG. 4 is a flow chart of a method of responding to a voice command according to an embodiment of the present invention;
FIG. 5 is a block diagram (I) of the structure of a response apparatus for a voice command according to an embodiment of the present invention;
FIG. 6 is a block diagram (II) of the structure of a response apparatus for a voice instruction according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The method embodiments provided in the embodiments of the present application may be executed on a computer terminal or a similar computing device. Taking running on a computer terminal as an example, fig. 1 is a block diagram of the hardware configuration of a computer terminal for the voice instruction response method according to an embodiment of the present invention. As shown in fig. 1, the computer terminal may include one or more processors 102 (only one is shown in fig. 1), where a processor 102 may include, but is not limited to, a microprocessor (MPU) or a programmable logic device (PLD), and a memory 104 for storing data; in an exemplary embodiment, the computer terminal may further include a transmission device 106 for communication functions and an input/output device 108. Those skilled in the art will understand that the structure shown in fig. 1 is only illustrative and does not limit the structure of the computer terminal. For example, the computer terminal may include more or fewer components than shown in fig. 1, or have a different configuration with functionality equivalent to or greater than that shown in fig. 1.
The memory 104 may be used to store computer programs, for example, software programs and modules of application software, such as computer programs corresponding to the response method of the voice instruction in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the computer programs stored in the memory 104, so as to implement the above-mentioned method. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to a computer terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal. In one example, the transmission device 106 includes a Network adapter (NIC), which can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
To solve the above problem, this embodiment provides a method for responding to a voice instruction. The method of the embodiments of the present application may be applied to, but is not limited to, a cloud server. Fig. 2 is a flowchart of a method for responding to a voice instruction according to an embodiment of the present invention, and the flow includes the following steps:
step S202, acquiring a voice instruction of a target object, and determining target voiceprint information of the target object according to the voice instruction;
step S204, inputting the target voiceprint information into a target neural network model to obtain the estimated age of the target object;
it should be noted that the target neural network model is used for determining the corresponding estimated age according to the input voiceprint information;
step S206, determining playing setting with a preset corresponding relation with a target age interval under the condition that the estimated age of the target object is in a preset target age interval, wherein the playing setting comprises target playing volume;
for better understanding, assuming that the obtained estimated age is 15 years, in the (10-20) age interval, the playing volume corresponding to the preset (10-20) age interval is determined as the target playing volume. Assuming that the obtained estimated age is 65 years, in the age interval (60-70), the playing volume corresponding to the preset age interval (60-70) is determined as the target playing volume, it should be noted that, because the physiological characteristics of people of different ages are different, the acceptable volume is also different, and further, the playing volume corresponding to the age interval (10-20) is different from the playing volume corresponding to the age interval (60-70).
step S208, sending the playing setting to a target device, and controlling the target device to respond to the voice instruction of the target object according to the playing setting.
It should be noted that the target device of this embodiment includes, but is not limited to, a smart speaker or any other device with a voice function. The target object includes, but is not limited to, a user using the target device.
As an optional example, the step S208 may be implemented by: determining a response audio according to semantic information corresponding to the voice instruction; and controlling the target equipment to play the response audio according to the target playing volume.
That is to say, the cloud server may determine semantic information from the voice instruction, search the internet according to that semantic information, and then determine a response audio. For better understanding, as an optional example, assume the voice instruction recognized by the cloud server is "please play a piece of music"; the cloud server may then select target music according to the user's preferences, send the target music to the target device, and control the target device to play it at the target playing volume.
Through the above steps, when the voice instruction of the target object is acquired, the target voiceprint information of the target object is determined according to the voice instruction, the estimated age of the target object and the target playing volume corresponding to that estimated age are determined according to the target voiceprint information, and the target device is then controlled to respond to the voice instruction of the target object according to the target playing volume. This technical solution solves the problem that different playing volumes cannot be set for users of different ages. Furthermore, because the age of a user can be determined from the user's voice and the playing volume of the device can be set according to that age, the user experience is improved.
In an exemplary embodiment, after obtaining the target voiceprint information of the target object, it is further required to determine whether the target voiceprint information exists in a preset target voiceprint library.
If the target voiceprint information exists in the target voiceprint library, the target playing volume is determined according to a historical operation log corresponding to the target voiceprint information, where the historical operation log stores the historical playing volumes corresponding to the target voiceprint information.
It should be noted that the target voiceprint information uniquely identifies the user, and the historical playing volumes stored in the historical operation log are the volumes at which the target device previously played reply information for this user, so the target playing volume can be determined from the target voiceprint information.
If the target voiceprint information does not exist in the target voiceprint library, the target voiceprint information is input into the target neural network model to obtain the estimated age of the target object.
In an exemplary embodiment, determining the target playing volume according to the history operation log corresponding to the target voiceprint information may be implemented by:
in the case that historical playing volumes corresponding to the target voiceprint information exist in the historical operation log, the target playing volume is determined as the average of those historical playing volumes, the most recent playing volume among them, or their median;
in the case that no historical playing volume corresponding to the target voiceprint information exists in the historical operation log, the target playing volume is determined as the preset volume corresponding to the target voiceprint information.
That is, the cloud server may already hold the user's target voiceprint information because it was saved when the user registered, yet the user may not have used the device since registering. When such a user uses the device for the first time there is no play record, so the volume preset by the user is used as the target playing volume.
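As a non-authoritative sketch of this logic, the snippet below chooses the target playing volume from a historical operation log and falls back to the user's preset volume when no history exists; the list-based log format and the strategy names are assumptions, while the mean/last/median options mirror those described above.

    import statistics

    def target_volume_from_history(history_volumes, preset_volume, strategy="mean"):
        """Pick the target playing volume from the user's historical playing volumes.

        history_volumes: playing volumes recorded in the historical operation log
        preset_volume:   volume the user preset (used when there is no play record)
        strategy:        "mean", "last" or "median", matching the options above
        """
        if not history_volumes:         # e.g. a user who registered but never used the device
            return preset_volume
        if strategy == "mean":
            return statistics.mean(history_volumes)
        if strategy == "last":
            return history_volumes[-1]  # the most recent playing volume
        return statistics.median(history_volumes)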
In an exemplary embodiment, noise may suddenly appear around the user while the user is speaking, and the resulting interference in the captured audio can make the recognized voiceprint information inaccurate. To improve recognition accuracy, determining the target voiceprint information of the target object according to the voice instruction may be implemented as follows: split the voice instruction to obtain a plurality of sub-audios, determine sub-voiceprint information of the target object from each of the plurality of sub-audios to obtain a plurality of pieces of sub-voiceprint information, and determine the target voiceprint information of the target object from the plurality of pieces of sub-voiceprint information.
That is to say, the noise may not be present the whole time: some periods of the voice instruction may be noise-free. By splitting the voice instruction and determining a voiceprint for each sub-audio, the user's voiceprint information can be determined more accurately.
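Assuming a voiceprint is represented as an embedding vector and that an external extract_voiceprint helper exists (both assumptions, not part of this embodiment), the sketch below splits the instruction audio into fixed-length sub-audios and fuses the per-segment voiceprints by averaging, which is one simple way to realize the splitting-and-fusing step described above.

    import numpy as np

    def split_audio(samples, sample_rate, segment_seconds=1.0):
        """Split raw audio samples into fixed-length sub-audios (the last partial segment is kept)."""
        step = int(sample_rate * segment_seconds)
        return [samples[i:i + step] for i in range(0, len(samples), step)]

    def fuse_sub_voiceprints(sub_audios, extract_voiceprint):
        """Compute one sub-voiceprint per sub-audio and fuse them into the target voiceprint.

        extract_voiceprint is an assumed helper mapping a sub-audio to an embedding vector.
        """
        sub_voiceprints = [extract_voiceprint(audio) for audio in sub_audios]
        return np.mean(np.stack(sub_voiceprints), axis=0)  # simple fusion by averaging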
In an exemplary embodiment, determining the pre-estimated age of the target subject may be accomplished by:
step S1: splitting the voice instruction to obtain a plurality of sub-audios; determining a plurality of pieces of sub-voiceprint information corresponding to the target object according to each of the plurality of sub-audios, wherein it needs to be noted that the target voiceprint information includes the plurality of pieces of sub-voiceprint information;
step S2: inputting each piece of sub-voiceprint information in the sub-voiceprint information into the target neural network model respectively, and determining a sub-voiceprint characteristic corresponding to each piece of sub-voiceprint information;
step S3: classifying the sub-voiceprint features through the target neural network, and respectively determining the predicted age corresponding to each sub-voiceprint feature;
step S4: and carrying out weighted summation on the plurality of determined predicted ages to obtain the predicted age of the target object.
That is, the multiple pieces of sub-voiceprint information may be merged and the resulting target voiceprint information input into the target neural network model to obtain the estimated age; alternatively, each piece of sub-voiceprint information may be input into the target neural network model separately to obtain multiple predicted ages, which are then combined by weighted summation into the estimated age of the target object. Either approach improves recognition accuracy.
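As a sketch of the second option, the code below assumes the target neural network model returns, for each piece of sub-voiceprint information, a probability distribution over age classes; each segment then contributes an expected age, and the segment ages are combined by weighted summation. The weights are left to the caller and could, for example, reflect segment length or prediction confidence; all of this is an assumed realization rather than the model defined by this embodiment.

    import numpy as np

    def estimate_age(sub_voiceprints, model, age_bin_centers, weights=None):
        """Estimate the target object's age from several pieces of sub-voiceprint information.

        model(v) is assumed to return a probability vector over age classes;
        age_bin_centers gives the representative age of each class.
        """
        segment_ages = []
        for voiceprint in sub_voiceprints:
            probs = model(voiceprint)                                    # classify the sub-voiceprint feature
            segment_ages.append(float(np.dot(probs, age_bin_centers)))   # expected age for this segment
        segment_ages = np.array(segment_ages)
        if weights is None:
            weights = np.ones_like(segment_ages)                         # default: equal weights
        else:
            weights = np.asarray(weights, dtype=float)
        weights = weights / weights.sum()                                # normalize so the result stays an age
        return float(np.dot(weights, segment_ages))                      # weighted summation of predicted ages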
In an exemplary embodiment, before inputting the target voiceprint information into the target neural network model, the following steps are further performed:
step S1: acquiring a training sample set;
it should be noted that each training sample in the training sample set includes sample voiceprint information of a sample object and an actual age of the sample object.
Step S2: training an original neural network model to be trained through a training sample set, adjusting parameters in the original neural network model when a loss value between the estimated age of a sample object and the actual age of the sample object does not meet a preset loss condition, and continuing to train the original neural network model; when the loss value between the estimated age of the sample object and the actual age of the sample object meets a preset loss condition, ending the training, and determining the original neural network model when the training is ended as the target neural network model;
it should be noted that the estimated age of the sample object is the age determined by the original neural network model according to the sample voiceprint information in the training sample.
It should be noted that the original neural network model includes, but is not limited to, a convolutional neural network model, a recurrent neural network model, or a deep belief network model. Optionally, the loss value between the estimated age of the sample object and the actual age of the sample object may be calculated by a loss function.
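A minimal training sketch follows, assuming PyTorch, fixed-length voiceprint vectors and a small feed-forward network; the architecture, loss function and loss threshold are illustrative assumptions, while the stopping rule mirrors the preset loss condition described above.

    import torch
    import torch.nn as nn

    def train_age_model(voiceprints, ages, loss_threshold=1.0, max_epochs=100, dim=128):
        """Train an original neural network until the loss meets the preset loss condition.

        voiceprints: tensor of sample voiceprint vectors, shape (N, dim)
        ages:        tensor of actual ages, shape (N, 1)
        """
        model = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))
        criterion = nn.L1Loss()                          # mean absolute error in years
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

        for _ in range(max_epochs):
            optimizer.zero_grad()
            loss = criterion(model(voiceprints), ages)   # loss between estimated and actual ages
            if loss.item() <= loss_threshold:            # preset loss condition met: end training
                break
            loss.backward()                              # otherwise adjust the model parameters
            optimizer.step()                             # and continue training
        return model                                     # the resulting target neural network model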
In an exemplary embodiment, while the target device is being controlled to respond to the voice instruction of the target object according to the playing setting, if a volume adjustment instruction of the target object is acquired, a first volume adjustment amplitude corresponding to the target age interval is determined in response to that instruction, the target playing volume is adjusted to a first playing volume according to the first volume adjustment amplitude, and the target device is then controlled to continue responding at the first playing volume. It should be noted that the volume adjustment instruction is used to adjust the volume at which the target device plays the response audio corresponding to the voice instruction.
It should also be noted that the first volume adjustment amplitude may differ from the default adjustment amplitude. For example, the default adjustment amplitude may be 1 unit of volume, while the first volume adjustment amplitude is 2 units for the (10-20) age interval and 4 units for the (60-70) age interval.
For example, if a first user (65 years old) issues a volume adjustment instruction ("turn the volume up a little") while playing music on the target device, the target device increases the volume by 4 units from the current playing volume. If a second user (15 years old) issues the same instruction while playing music on the target device, the target device increases the volume by only 2 units from the current playing volume.
In an exemplary embodiment, after the target device has been controlled to respond to the voice instruction of the target object according to the playing setting, if a replay instruction issued by the target object is acquired, a second volume adjustment amplitude corresponding to the target age interval is determined in response to the replay instruction; the target playing volume is adjusted to a second playing volume according to the second volume adjustment amplitude; and the target device is controlled to respond to the voice instruction of the target object again according to the playing setting. It should be noted that the replay instruction instructs the target device to respond to the voice instruction again, and the second volume adjustment amplitude is the amount by which the volume is increased relative to the last playing volume when the replay instruction is acquired.
For example, when a user (65 years old) issues a replay instruction, the cloud server controls the target device to increase the volume by 4 units over the last playing volume and replay the response audio corresponding to the voice instruction. When a user (11 years old) issues a replay instruction, the cloud server controls the target device to increase the volume by 2 units over the last playing volume and replay the response audio corresponding to the voice instruction.
Optionally, the first volume adjustment amplitude and the second volume adjustment amplitude may be the same or different, and the second volume adjustment amplitude may likewise be the same as or different from a default second volume adjustment amplitude.
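The snippet below sketches how an age-interval-dependent adjustment amplitude might be applied for both the volume adjustment instruction and the replay instruction; the amplitude values (2 and 4 units) are taken from the examples above, and the table layout and clamping range are assumptions.

    # Illustrative adjustment amplitudes per age interval, in units of volume; the values
    # are assumed from the examples above. A second table could hold the replay amplitudes.
    ADJUST_AMPLITUDE = {(10, 20): 2, (60, 70): 4}
    DEFAULT_AMPLITUDE = 1

    def adjusted_volume(current_volume, age_interval, direction=+1, max_volume=100):
        """Raise (direction=+1) or lower (direction=-1) the playing volume by the
        adjustment amplitude associated with the user's age interval."""
        amplitude = ADJUST_AMPLITUDE.get(age_interval, DEFAULT_AMPLITUDE)
        return max(0, min(max_volume, current_volume + direction * amplitude))

    # A 65-year-old asking for "a little louder" gets +4 units; a 15-year-old gets +2 units.
    assert adjusted_volume(50, (60, 70)) == 54
    assert adjusted_volume(25, (10, 20)) == 27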
It is to be understood that the above-described embodiments are only a few, but not all, embodiments of the present invention. In order to better understand the response method of the voice command, the following describes the above process with reference to an embodiment, but the following is not intended to limit the technical solution of the embodiment of the present invention, and specifically:
in an alternative embodiment, fig. 3 is a schematic view of a scenario of a response method of a voice command according to an embodiment of the present invention, and specifically, a user using a speaker includes: children, middle-aged people, the elderly, etc.
Fig. 4 is a flowchart (ii) of a response method of a voice command according to an embodiment of the present invention, specifically, having the following steps:
Step S402, the smart speaker acquires a voice instruction issued by a user, and the smart speaker system acquires the user's voiceprint data (equivalent to the target voiceprint information in the above embodiments);
Step S404, the smart speaker system judges the user's age bracket from the voiceprint;
assume that user a is determined to be a 60 year old and user B is determined to be a 6 year old.
Step S406, the smart speaker system queries its age-volume table to obtain the volume parameter corresponding to that age;
continuing the above example, assume that a 60 year old person typically sets a volume of 50 units and a child under 10 years old typically sets a volume of 25 units for a smart speaker product.
Step S408, the smart speaker system adjusts the speaker volume parameter.
Continuing the above example, when user A uses the speaker, the system automatically adjusts the volume to 50 units; when user B uses it, the system automatically adjusts the volume to 25 units.
That is to say, the smart speaker system contains a voiceprint-age model (equivalent to the target neural network model in the above embodiments) and an age-volume table. When it receives a user operation instruction, the smart speaker system obtains the user's voiceprint, judges the user's age, and automatically and intelligently adjusts the smart speaker's volume according to the age-volume table.
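Tying the scenario together, the sketch below strings steps S402 to S408 into one pipeline. The age-volume table uses the 50-unit and 25-unit settings from the example, the middle row is purely an assumption, and the voiceprint extractor, age model and speaker interface are placeholders rather than components defined by this embodiment.

    # Assumed age-volume table: (min_age, max_age) -> volume in units.
    AGE_VOLUME_TABLE = [((0, 10), 25), ((10, 60), 40), ((60, 120), 50)]

    def handle_instruction(audio, extract_voiceprint, age_model, speaker):
        """S402-S408: voiceprint -> estimated age -> age-volume table lookup -> set volume."""
        voiceprint = extract_voiceprint(audio)           # S402: obtain the user's voiceprint data
        age = age_model(voiceprint)                      # S404: judge the user's age
        for (low, high), volume in AGE_VOLUME_TABLE:     # S406: query the age-volume table
            if low <= age < high:
                speaker.set_volume(volume)               # S408: adjust the speaker volume parameter
                return volume
        return None                                      # no matching row; keep the current volume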
In addition, with the technical solution of the embodiments of the present invention, users of different ages obtain a comfortable volume when using the smart speaker, which improves the user experience of the smart speaker as well as the intelligence and market competitiveness of the product.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
In this embodiment, a response device for a voice instruction is further provided. The device is used to implement the foregoing embodiments and preferred implementations, and what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the devices described in the following embodiments are preferably implemented in software, implementations in hardware, or a combination of software and hardware, are also possible and contemplated.
Fig. 5 is a block diagram (I) of a response apparatus of a voice command according to an embodiment of the present invention, the apparatus including:
the first determining module 52 is configured to obtain a voice instruction of a target object, and determine target voiceprint information of the target object according to the voice instruction;
a second determining module 54, configured to input the target voiceprint information into a target neural network model to obtain an estimated age of the target object;
a third determining module 56, configured to determine, when the estimated age of the target object is within a preset target age interval, a playing setting having a preset corresponding relationship with the target age interval, where the playing setting includes a target playing volume;
a response module 58, configured to send the play setting to a target device, and control the target device to respond to the voice instruction of the target object according to the play setting.
By the device, under the condition that the voice instruction of the target object is obtained, the target voiceprint information of the target object is determined according to the voice instruction, the estimated age of the target object and the target playing volume corresponding to the estimated age are determined according to the target voiceprint information, and then the target device is controlled to respond to the voice instruction of the target object according to the target playing volume. By adopting the technical scheme, the problem that different playing volumes cannot be set for users of different ages is solved. Furthermore, the age of the user can be determined according to the voice of the user, and the playing volume of the equipment can be determined according to the age, so that the experience of the user is improved.
Fig. 6 is a block diagram (II) of a structure of a response apparatus of a voice instruction according to an embodiment of the present invention, the apparatus including: a fourth determination module 60, and a training module 62.
In an exemplary embodiment, the fourth determining module 60 is configured to determine whether the target voiceprint information exists in a preset target voiceprint library; and under the condition that the target voiceprint information exists in the target voiceprint library, determining the target playing volume according to a historical operation log corresponding to the target voiceprint information, wherein the historical operation log is used for storing the historical playing volume corresponding to the target voiceprint information.
In an exemplary embodiment, the second determining module 54 is configured to, in a case that the target voiceprint information does not exist in the target voiceprint library, input the target voiceprint information into the target neural network model to obtain the estimated age of the target subject.
In an exemplary embodiment, the fourth determining module 60 is further configured to determine, when a history playing volume corresponding to the target voiceprint information exists in the history operation log, the target playing volume as an average value of the history playing volumes, or a previous playing volume in the history playing volumes; and under the condition that the historical playing volume corresponding to the target voiceprint information does not exist in the historical operation log, determining the target playing volume as the preset volume corresponding to the target voiceprint information.
In an exemplary embodiment, the first determining module 52 is further configured to split the voice instruction to obtain a plurality of sub-audios; determining sub-voiceprint information of the target object according to each sub-audio frequency in the plurality of sub-audio frequencies to obtain a plurality of sub-voiceprint information; and fusing the sub-voiceprint information to obtain the target voiceprint information.
In an exemplary embodiment, the first determining module 52 is further configured to split the voice instruction to obtain a plurality of sub-audios; determining a plurality of sub-voiceprint information corresponding to the target object according to each of the plurality of sub-audios, wherein the target voiceprint information comprises the plurality of sub-voiceprint information; the second determining module 54 is further configured to input each piece of sub-voiceprint information in the plurality of pieces of sub-voiceprint information into the target neural network model, and determine a sub-voiceprint feature corresponding to each piece of sub-voiceprint information; classifying the sub-voiceprint features through the target neural network, and respectively determining the predicted age corresponding to each sub-voiceprint feature; and carrying out weighted summation on the plurality of determined predicted ages to obtain the predicted age of the target object.
In an exemplary embodiment, training module 62 is configured to obtain a set of training samples, wherein each training sample in the set of training samples includes sample voiceprint information of a sample object and an actual age of the sample object; training an original neural network model to be trained through the training sample set, adjusting parameters in the original neural network model when a loss value between the estimated age of a sample object and the actual age of the sample object does not meet a preset loss condition, and continuing to train the original neural network model; and when the loss value between the estimated age of the sample object and the actual age of the sample object meets the preset loss condition, ending the training, and determining the original neural network model when the training is ended as the target neural network model, wherein the estimated age of the sample object is the age determined by the original neural network model according to the sample voiceprint information in the training sample.
In an exemplary embodiment, the response module 58 is further configured to determine a response audio according to semantic information corresponding to the voice command; and controlling the target equipment to play the response audio according to the target playing volume.
In an exemplary embodiment, the response module 58 is further configured to obtain a volume adjustment instruction in a process of controlling the target device to respond to the voice instruction of the target object according to the play setting, where the volume adjustment instruction is used to adjust a play volume of the target device; determining a first volume adjustment amplitude corresponding to the target age interval in response to the volume adjustment instruction; adjusting the target playing volume to a first playing volume according to the first volume adjusting amplitude; or after controlling the target device to respond to the voice instruction of the target object according to the playing setting, acquiring a replay instruction, wherein the replay instruction is used for indicating the target device to respond to the voice instruction again; determining a second volume adjustment amplitude corresponding to the target age interval in response to the replay instruction; adjusting the target playing volume to a second playing volume according to the second volume adjusting amplitude; and controlling the target equipment to respond to the voice instruction of the target object again according to the playing setting.
An embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to perform the steps in any of the above method embodiments when executed.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:
s1, acquiring a voice instruction of a target object, and determining target voiceprint information of the target object according to the voice instruction;
s2, inputting the target voiceprint information into a target neural network model to obtain the estimated age of the target object;
s3, determining playing setting with a preset corresponding relation with the target age interval under the condition that the estimated age of the target object is in a preset target age interval, wherein the playing setting comprises target playing volume;
and S4, sending the playing setting to the target equipment, and controlling the target equipment to respond to the voice instruction of the target object according to the playing setting.
In an exemplary embodiment, the computer readable storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
For specific examples in this embodiment, reference may be made to the examples described in the above embodiments and exemplary embodiments, and details of this embodiment are not repeated herein.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
s1, acquiring a voice instruction of a target object, and determining target voiceprint information of the target object according to the voice instruction;
s2, inputting the target voiceprint information into a target neural network model to obtain the estimated age of the target object;
s3, determining playing setting with a preset corresponding relation with the target age interval under the condition that the estimated age of the target object is in a preset target age interval, wherein the playing setting comprises target playing volume;
and S4, sending the playing setting to the target equipment, and controlling the target equipment to respond to the voice instruction of the target object according to the playing setting.
In an exemplary embodiment, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
For specific examples in this embodiment, reference may be made to the examples described in the above embodiments and exemplary embodiments, and details of this embodiment are not repeated herein.
It will be apparent to those skilled in the art that the modules or steps of the invention described above may be implemented with a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of computing devices, and they may be implemented with program code executable by computing devices, so that they may be stored in a storage device and executed by a computing device. In some cases the steps shown or described may be performed in an order different from that described here, or the modules or steps may be made into individual integrated circuit modules, or several of them may be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for responding to a voice command, comprising:
acquiring a voice instruction of a target object, and determining target voiceprint information of the target object according to the voice instruction;
inputting the target voiceprint information into a target neural network model to obtain the estimated age of the target object;
determining playing setting with a preset corresponding relation with the target age interval under the condition that the estimated age of the target object is located in a preset target age interval, wherein the playing setting comprises target playing volume;
and sending the playing setting to target equipment, and controlling the target equipment to respond to the voice instruction of the target object according to the playing setting.
2. The method of claim 1, further comprising:
determining whether the target voiceprint information exists in a preset target voiceprint library or not;
and under the condition that the target voiceprint information exists in the target voiceprint library, determining the target playing volume according to a historical operation log corresponding to the target voiceprint information, wherein the historical operation log is used for storing the historical playing volume corresponding to the target voiceprint information.
3. The method according to claim 2, wherein the determining the target playback volume according to the historical operation log corresponding to the target voiceprint information comprises:
determining the target playing volume as an average value of the historical playing volumes or a previous playing volume in the historical playing volumes under the condition that the historical playing volume corresponding to the target voiceprint information exists in the historical operation log;
and under the condition that the historical playing volume corresponding to the target voiceprint information does not exist in the historical operation log, determining the target playing volume to be equal to the preset volume corresponding to the target voiceprint information.
4. The method according to any one of claims 1 to 3,
determining target voiceprint information of the target object according to the voice instruction, wherein the determining comprises the following steps: splitting the voice command to obtain a plurality of sub-audios; determining a plurality of sub-voiceprint information corresponding to the target object according to each of the plurality of sub-audios, wherein the target voiceprint information comprises the plurality of sub-voiceprint information;
the inputting the target voiceprint information into a target neural network model to obtain the estimated age of the target object comprises the following steps:
inputting each piece of sub-voiceprint information in the plurality of pieces of sub-voiceprint information into the target neural network model respectively, and determining a sub-voiceprint characteristic corresponding to each piece of sub-voiceprint information;
classifying the sub-voiceprint features through the target neural network, and respectively determining the predicted age corresponding to each sub-voiceprint feature;
and carrying out weighted summation on the plurality of determined predicted ages to obtain the predicted age of the target object.
5. The method of claim 1, wherein prior to inputting the target voiceprint information into a target neural network model, the method further comprises:
acquiring a training sample set, wherein each training sample in the training sample set comprises sample voiceprint information of a sample object and an actual age of the sample object;
training an original neural network model to be trained through the training sample set, adjusting parameters in the original neural network model when a loss value between the estimated age of a sample object and the actual age of the sample object does not meet a preset loss condition, and continuing to train the original neural network model; and when the loss value between the estimated age of the sample object and the actual age of the sample object meets the preset loss condition, ending the training, and determining the original neural network model when the training is ended as the target neural network model, wherein the estimated age of the sample object is the age determined by the original neural network model according to the sample voiceprint information in the training sample.
6. The method of claim 1, wherein controlling the target device to respond to the voice command of the target object according to the play setting comprises:
determining a response audio according to semantic information corresponding to the voice instruction;
and controlling the target equipment to play the response audio according to the target playing volume.
7. The method of claim 1, further comprising:
in the process of controlling the target device to respond to the voice command of the target object according to the playing setting, acquiring a volume adjusting command, wherein the volume adjusting command is used for adjusting the playing volume of the target device; determining a first volume adjustment amplitude corresponding to the target age interval in response to the volume adjustment instruction; adjusting the target playing volume to a first playing volume according to the first volume adjusting amplitude; or
After controlling the target device to respond to the voice command of the target object according to the playing setting, acquiring a replay command, wherein the replay command is used for indicating the target device to respond to the voice command again; determining a second volume adjustment amplitude corresponding to the target age interval in response to the replay instruction; adjusting the target playing volume to a second playing volume according to the second volume adjusting amplitude; and controlling the target equipment to respond to the voice instruction of the target object again according to the playing setting.
8. A device for responding to a voice command, comprising:
the first determining module is used for acquiring a voice instruction of a target object and determining target voiceprint information of the target object according to the voice instruction;
the second determining module is used for inputting the target voiceprint information into a target neural network model to obtain the estimated age of the target object;
the third determining module is used for determining the playing setting which has a preset corresponding relation with the target age interval under the condition that the estimated age of the target object is located in a preset target age interval, wherein the playing setting comprises target playing volume;
and the response module is used for sending the playing setting to target equipment and controlling the target equipment to respond to the voice command of the target object according to the playing setting.
9. A computer-readable storage medium, comprising a stored program, wherein the program is operable to perform the method of any one of claims 1 to 7.
10. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method of any of claims 1 to 7 by means of the computer program.
CN202210284357.1A 2022-03-22 2022-03-22 Voice instruction response method and device, storage medium and electronic device Pending CN114708872A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210284357.1A CN114708872A (en) 2022-03-22 2022-03-22 Voice instruction response method and device, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210284357.1A CN114708872A (en) 2022-03-22 2022-03-22 Voice instruction response method and device, storage medium and electronic device

Publications (1)

Publication Number Publication Date
CN114708872A true CN114708872A (en) 2022-07-05

Family

ID=82168746

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210284357.1A Pending CN114708872A (en) 2022-03-22 2022-03-22 Voice instruction response method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN114708872A (en)

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20080033639A (en) * 2006-10-12 2008-04-17 삼성전자주식회사 Video playing apparatus and method of controlling volume in video playing apparatus
CN104394491A (en) * 2014-12-09 2015-03-04 北京奇虎科技有限公司 Intelligent headset, cloud server, volume adjustment method and volume adjustment system
CN107656461A (en) * 2016-07-26 2018-02-02 青岛海尔洗衣机有限公司 A kind of method and washing machine based on age of user regulation voice
CN107562403A (en) * 2017-08-09 2018-01-09 深圳市汉普电子技术开发有限公司 A kind of volume adjusting method, smart machine and storage medium
CN108174031A (en) * 2017-12-26 2018-06-15 上海展扬通信技术有限公司 A kind of volume adjusting method, terminal device and computer readable storage medium
CN108521618A (en) * 2018-03-13 2018-09-11 深圳市沃特沃德股份有限公司 Audio frequency playing method and device
CN108806681A (en) * 2018-05-28 2018-11-13 江西午诺科技有限公司 Sound control method, device, readable storage medium storing program for executing and projection device
US20200211579A1 (en) * 2018-12-28 2020-07-02 Unlimiter Mfa Co., Ltd. Sound playback system and output sound adjusting method thereof
CN110012164A (en) * 2019-03-29 2019-07-12 努比亚技术有限公司 A kind of sound playing method of equipment, wearable device and computer readable storage medium
CN111312286A (en) * 2020-02-12 2020-06-19 深圳壹账通智能科技有限公司 Age identification method, age identification device, age identification equipment and computer readable storage medium
WO2021212905A1 (en) * 2020-04-21 2021-10-28 珠海格力电器股份有限公司 Audio processing method and apparatus, electronic device, and storage medium
CN113763942A (en) * 2020-06-03 2021-12-07 广东美的制冷设备有限公司 Interaction method and interaction system of voice household appliances and computer equipment
CN111984222A (en) * 2020-07-21 2020-11-24 北京梧桐车联科技有限责任公司 Method and device for adjusting volume, electronic equipment and readable storage medium
CN213186436U (en) * 2020-09-09 2021-05-11 湖南视觉伟业智能科技有限公司 Intelligent microphone system based on face recognition
CN113194381A (en) * 2021-04-28 2021-07-30 国光电器(香港)有限公司 Volume adjusting method and device, sound equipment and storage medium
CN113593547A (en) * 2021-06-25 2021-11-02 青岛海尔科技有限公司 Voice control method and device

Similar Documents

Publication Publication Date Title
CN109147802B (en) Playing speed adjusting method and device
US11068235B2 (en) Volume adjustment method, terminal device, storage medium and electronic device
US9626964B2 (en) Voice recognition terminal, server, method of controlling server, voice recognition system, non-transitory storage medium storing program for controlling voice recognition terminal, and non-transitory storage medium storing program for controlling server
CN108335700B (en) Voice adjusting method and device, voice interaction equipment and storage medium
EP2160923B1 (en) Method for user individualized fitting of a hearing aid
US10453472B2 (en) Parameter prediction device and parameter prediction method for acoustic signal processing
CN110989968A (en) Intelligent sound effect processing method, electronic equipment, storage medium and multi-sound effect sound box
CN106886166A (en) Method, device and the audio amplifier of household electrical appliance are controlled by audio amplifier
CN107451242A (en) Data playback control method, system and computer-readable recording medium
CN105281693A (en) Voice playing method and system
CN105390144A (en) Audio processing method and audio processing device
CN111895631A (en) Air conditioning system control method, air conditioner, computer device, and computer-readable storage medium
CN108932947B (en) Voice control method and household appliance
CN113487038B (en) Scene determination method and device, storage medium and electronic device
CN105828254B (en) A kind of voice frequency regulating method and device
CN111294689B (en) Voice information acquisition method and device and wireless earphone device
CN114708872A (en) Voice instruction response method and device, storage medium and electronic device
CN113205802A (en) Updating method of voice recognition model, household appliance and server
CN110797048B (en) Method and device for acquiring voice information
CN110764731A (en) Multimedia file playing control method, intelligent terminal and server
CN112992170B (en) Model training method and device, storage medium and electronic device
CN111089396A (en) Method for controlling air conditioner and air conditioner
CN112992137B (en) Voice interaction method and device, storage medium and electronic device
CN114550719A (en) Method and device for recognizing voice control instruction and storage medium
CN109246554B (en) Terminal and regulation and control method of vibrator thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination