WO2020251088A1

WO2020251088A1 - Sound map generation method and sound recognition method using sound map

Info

Publication number: WO2020251088A1
Application number: PCT/KR2019/007150
Authority: WO
Inventors: 박지환
Original assignee: 엘지전자 주식회사
Priority date: 2019-06-13
Filing date: 2019-06-13
Publication date: 2020-12-17

Abstract

Disclosed are a sound map generation method and a sound recognition method using a sound map, which are applied to a robot capable of recognizing sounds. The sound map generation method comprises the steps of: receiving sounds; extracting directional sounds; detecting sound activities from the directional sounds; obtaining characteristic values of sounds present in the detected sound activities; generating sound event IDs including the obtained sound characteristic values; counting the number of times that sounds having the same sound event ID occur; and storing, in a sound map, the sound event IDs for the sounds of which the number of times of occurrence exceeds a set value.

Description

Sound map generation method and sound recognition method using sound map

The embodiment relates to a sound map generation method applied to a robot capable of recognizing sound including a user's voice command and a sound recognition method using the sound map.

The content described in this section merely provides background information on the embodiment and does not constitute the prior art.

With the development of technology, various services to which speech recognition technology is applied have recently been introduced in many fields. Speech recognition technology can be described as a series of processes that understand human speech and convert it into text information that a computer can handle. A speech recognition service using this technology may include a series of processes that recognize the user's voice and provide an appropriate service in response.

Meanwhile, voice recognition technology is applied to various robots provided for user convenience, and technology for carrying out commands by recognizing a user's voice command is being actively developed.

In a robot that recognizes sound, a technology that accurately distinguishes between a user's voice command and other noise is important.

Korean Patent Registration No. 10-1009854 discloses an apparatus and method for estimating noise contained in an acoustic signal while processing the acoustic signal in relation to speech recognition. However, this prior art does not disclose a solution for the case in which a robot mistakes a sound similar to a user's voice command, generated by a TV or the like, for a voice command.

US Patent Publication No. US 20180350379 A1 discloses a technology for processing multi-channel voice signals input through a plurality of microphones in relation to voice recognition. However, this prior art likewise does not disclose a solution for the case in which a robot mistakes a sound similar to a user's voice command, generated by a TV or the like, for a voice command.

One problem to be solved in the embodiment is to propose a method for improving the voice recognition function of the robot by allowing the robot to effectively distinguish between a user's voice command and similar noise.

Another object of the embodiment is to present a solution to the problem of not accurately recognizing the user's voice command due to the movement of the robot, particularly, rotation.

Another task of the embodiment is to propose a method for removing a risk to the user in a situation in which the robot recognizes an abnormal sound other than a user's voice command and everyday noise.

The technical problem to be achieved by the embodiment is not limited to the technical problem mentioned above, and other technical problems that are not mentioned will be clearly understood by those of ordinary skill in the technical field to which the embodiment belongs from the following description.

In order to achieve the above-described task, the robot may generate a sound map in which at least one sound event ID related to noise is stored.

The method of generating a sound map may include: receiving a sound; extracting a directional sound; detecting a sound activity from the directional sound; acquiring a characteristic value of the sound present in the detected sound section; generating a sound event ID including the acquired sound characteristic value; counting the number of occurrences of sounds having the same sound event ID; and storing, in the sound map, the sound event ID of a sound whose number of occurrences exceeds a set value.

The sound characteristic value may include a direction angle indicating a position of a sound source generating a directional sound as an angle, and a frequency of the directional sound.

In addition, the sound characteristic value may further include at least one of an amplitude, a sound pressure, and a tone color of a directional sound.

The sound map may store a plurality of sound event IDs having different sound characteristic values.
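The counting and storing steps described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the source does not say how two sounds are judged to have the "same" sound event ID, so the quantization steps `ANGLE_STEP_DEG` and `FREQ_STEP_HZ`, the `SoundMap` class, and the threshold of 3 are all assumptions made for the example. Only the direction angle and the frequency are used as characteristic values here.

```python
from collections import Counter
from dataclasses import dataclass, field

# Assumed quantization steps: the source does not specify how characteristic
# values are compared, so nearby angles/frequencies are binned together here.
ANGLE_STEP_DEG = 10
FREQ_STEP_HZ = 50


def make_event_id(angle_deg, freq_hz):
    """Build a sound event ID from a direction angle and a frequency."""
    angle_bin = (round(angle_deg / ANGLE_STEP_DEG) * ANGLE_STEP_DEG) % 360
    freq_bin = round(freq_hz / FREQ_STEP_HZ) * FREQ_STEP_HZ
    return (angle_bin, freq_bin)


@dataclass
class SoundMap:
    min_count: int = 3  # the "set value"; 3 is an arbitrary placeholder
    counts: Counter = field(default_factory=Counter)
    event_ids: set = field(default_factory=set)

    def observe(self, angle_deg, freq_hz):
        """Count one occurrence; store the ID once the count exceeds min_count."""
        event_id = make_event_id(angle_deg, freq_hz)
        self.counts[event_id] += 1
        if self.counts[event_id] > self.min_count:
            self.event_ids.add(event_id)
        return event_id


sound_map = SoundMap(min_count=3)
for _ in range(4):
    sound_map.observe(angle_deg=91.0, freq_hz=440.0)   # recurring noise, e.g. a TV
sound_map.observe(angle_deg=200.0, freq_hz=1210.0)     # heard only once
```

In this sketch the recurring sound is stored in the map after its fourth occurrence, while the sound heard once is counted but not stored.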

In order to achieve the above-described task, the robot can recognize the sound including the user's voice command using the sound map.

One embodiment of the sound recognition method may include: receiving a sound including a voice command; generating a steering vector in a noise direction using a sound map; receiving noise by performing sound beamforming in the direction in which the steering vector is generated; calculating a power spectral density (PSD) matrix of the input noise; calculating a PSD matrix of the sound including the voice command; and deriving the PSD matrix of the voice command as the difference between the PSD matrix of the sound including the voice command and the PSD matrix of the input noise.
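The final subtraction step can be illustrated with a small sketch. It assumes a single frequency bin, two microphone channels, and noise frames that are already available (standing in for the noise received through beamforming); steering vector generation and beamforming themselves are not shown. The function names and the toy signals are hypothetical, and the toy voice and noise sources are chosen to be uncorrelated so that the subtraction is exact.

```python
def psd_matrix(frames):
    """Average the outer product x * x^H over frames.

    Each frame holds one complex sample per microphone channel
    for a single frequency bin.
    """
    n = len(frames[0])
    total = [[0j] * n for _ in range(n)]
    for x in frames:
        for i in range(n):
            for j in range(n):
                total[i][j] += x[i] * complex(x[j]).conjugate()
    return [[value / len(frames) for value in row] for row in total]


def subtract(a, b):
    """Element-wise difference of two square matrices."""
    size = len(a)
    return [[a[i][j] - b[i][j] for j in range(size)] for i in range(size)]


# Toy two-channel data (all values are assumptions for illustration).
voice_dir = [1, 1j]            # channel response to the user's voice
noise_dir = [1, -1]            # channel response to the noise source
voice_src = [1, 1, 1, 1]       # voice samples over four frames
noise_src = [1, -1, 1, -1]     # noise samples, uncorrelated with the voice

noise_frames = [[q * b for b in noise_dir] for q in noise_src]
mixed_frames = [
    [s * a + q * b for a, b in zip(voice_dir, noise_dir)]
    for s, q in zip(voice_src, noise_src)
]

noise_psd = psd_matrix(noise_frames)
mixed_psd = psd_matrix(mixed_frames)
voice_psd = subtract(mixed_psd, noise_psd)  # PSD matrix of the voice command
```

With real recordings the voice and the noise are only approximately uncorrelated, so the difference is an estimate rather than an exact recovery.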

Another embodiment of the sound recognition method may include: a sound event ID comparison step of comparing a first sound event ID, generated in the step of generating a sound event ID, with a previously generated and stored second sound event ID; an abnormal sound detection step of determining that an abnormal sound has been detected if the first sound event ID does not match the second sound event ID; and an alarm step of notifying the user when the abnormal sound is detected.
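The comparison, detection, and alarm steps can be sketched as below. The tuple form of the event IDs, the function names, and the `notify` callback are assumptions for illustration; the source does not specify how IDs are represented or how the user is notified.

```python
def is_abnormal(first_event_id, stored_event_ids):
    """A sound is treated as abnormal if its ID matches no stored ID."""
    return first_event_id not in stored_event_ids


def handle_sound(first_event_id, stored_event_ids, notify):
    """Compare the new ID with the stored ones and alert the user if needed."""
    if is_abnormal(first_event_id, stored_event_ids):
        notify(f"Abnormal sound detected: {first_event_id}")
        return True
    return False


# Hypothetical stored IDs in the form (direction angle, frequency).
stored = {(90, 450), (180, 100)}
alerts = []  # stands in for a message sent to the user's terminal

handle_sound((90, 450), stored, alerts.append)     # everyday noise, no alert
handle_sound((270, 3000), stored, alerts.append)   # unknown sound, alert
```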

Another embodiment of the sound recognition method may include: receiving a sound; extracting a directional sound; detecting a sound activity from the directional sound; acquiring a characteristic value of the sound present in the detected sound section; generating a sound event ID including the acquired sound characteristic value; counting the number of occurrences of sounds having the same sound event ID; storing, in a sound map, the sound event ID of a sound whose number of occurrences exceeds a set value; receiving a sound including a voice command; generating a steering vector in the noise direction using the sound map; receiving noise by performing sound beamforming in the direction in which the steering vector is generated; calculating a power spectral density (PSD) matrix of the input noise; calculating a PSD matrix of the sound including the voice command; and deriving the PSD matrix of the voice command as the difference between the PSD matrix of the sound including the voice command and the PSD matrix of the input noise.

Another embodiment of the sound recognition method may include: receiving a sound; extracting a directional sound; detecting a sound section from the directional sound; acquiring a characteristic value of the sound present in the detected sound section; generating a sound event ID including the acquired sound characteristic value; counting the number of occurrences of sounds having the same sound event ID; storing, in a sound map, the sound event ID of a sound whose number of occurrences exceeds a set value; a sound event ID comparison step of comparing a first sound event ID, generated in the step of generating the sound event ID, with a previously generated second sound event ID; an abnormal sound detection step of determining that an abnormal sound has been detected if the first sound event ID does not match the second sound event ID; and an alarm step of notifying the user when the abnormal sound is detected.

In the embodiment, by generating a sound map and effectively removing noise input to the robot using the generated sound map, the robot can more clearly recognize the user's voice command.

In the embodiment, by recording characteristic values of everyday sounds that may cause confusion with a user's voice command in a sound map, and removing such everyday sounds using a sound map, malfunction of the robot can be effectively suppressed.

In an embodiment, it is possible to determine the direction of noise generation regardless of the rotation of the head of the robot, and to generate a steering vector for beamforming.

In the embodiment, the robot easily grasps whether an abnormal sound has occurred using a sound map, and the robot notifies the user of the occurrence of the abnormal sound, so that the user can quickly respond to an unexpected or emergency situation.

FIG. 1 is a diagram illustrating a robot that recognizes sound according to an exemplary embodiment.

FIG. 2 is a flowchart illustrating a method of generating a sound map according to an exemplary embodiment.

FIGS. 3 to 5 are views for explaining the operation of the robot according to an embodiment.

FIGS. 6 to 8 are flowcharts illustrating a method of recognizing a sound using a sound map according to an exemplary embodiment.

FIG. 9 is a flowchart illustrating a method for recognizing a sound using a sound map according to another exemplary embodiment.

Hereinafter, embodiments will be described in detail with reference to the accompanying drawings. Since the embodiments can be modified in various ways and have various forms, specific embodiments will be illustrated in the drawings and described in detail in the text. However, this is not intended to limit the embodiments to a specific form of disclosure, and it should be understood that all changes, equivalents, and substitutes included in the spirit and scope of the embodiments are included.

Terms such as "first" and "second" may be used to describe various elements, but the elements should not be limited by the terms. The terms are used for the purpose of distinguishing one component from another component. In addition, terms specifically defined in consideration of the configuration and operation of the embodiment are only for describing the embodiment, and do not limit the scope of the embodiment.

In the description of the embodiment, when an element is described as being formed "on or under" another element, this includes both the case in which the two elements are in direct contact with each other and the case in which one or more other elements are formed indirectly between the two elements. In addition, the expression "on or under" may cover not only the upward direction but also the downward direction with respect to one element.

In addition, relational terms such as "top" and "bottom" used below do not necessarily require or imply any physical or logical relationship or order between such entities or elements, and may be used only to distinguish one entity or element from another entity or element.

FIG. 1 is a diagram illustrating a robot 100 that recognizes sound according to an exemplary embodiment. The robot 100 may have a voice recognition function that recognizes the user's voice and performs the command requested by the user. The robot 100 can be used for home, industrial, and various other purposes.

The robot 100 may include a head unit 110, a body 120, and a display unit 130. The display unit 130 may display images necessary for the user and the robot 100 to interact, for example an operation mode, an operation state, an error state, a happy state, a depressed state, and other necessary images. The display unit 130 may be integrally combined with the head unit 110 and may move together with the head unit 110.

The robot 100 may include a rotatable head unit 110. Referring to FIG. 1, for example, the head unit 110 is coupled to the body 120 in a pivot form to allow a three-dimensional movement with respect to the body 120.

That is, as shown in FIG. 1(b), the head unit 110 is rotatable in the xy plane, and as shown in FIG. 1(c), the head unit 110 may also be provided so as to be rotatable in the zx plane. Accordingly, the head unit 110 can actively operate in three dimensions with respect to the body 120.

Meanwhile, the robot 100 includes a communication unit connected to a server, so that the robot 100 and the server may exchange information with each other. In addition, the robot 100 may include a computation unit and a control unit necessary to perform the computations and operations required in each step described in the embodiment, and may include a storage unit for storing necessary information.

In addition, the robot 100 may include a multi-channel microphone and a speaker. The multi-channel microphone can receive sound from at least two different channels, respectively. Each channel may include at least one independent microphone, and such a plurality of microphones may be provided at a predetermined distance from each other.

Each channel receives the same sound independently. Just as a human receives the same sound independently through two ears and recognizes the direction of the sound source from the difference between the two received signals, the robot 100 can determine the direction in which a sound is generated from the difference between the same sound as input through the different channels.
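One common way to turn such an inter-channel difference into a direction is to estimate the arrival-time difference by cross-correlation and convert it to an angle. The sketch below assumes two microphones, a far-field source, and the relation sin(angle) = c * delay / distance; the source does not state which localization technique the robot uses, and the function names, sample rate, and microphone spacing are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature


def estimate_delay(ch_a, ch_b, max_lag):
    """Return the lag in samples by which ch_b trails ch_a.

    The lag is taken at the peak of the cross-correlation,
    searched over lags in [-max_lag, max_lag].
    """
    def corr(lag):
        return sum(
            ch_a[i] * ch_b[i + lag]
            for i in range(len(ch_a))
            if 0 <= i + lag < len(ch_b)
        )
    return max(range(-max_lag, max_lag + 1), key=corr)


def direction_angle_deg(delay_samples, sample_rate, mic_distance):
    """Convert a delay into an angle from the broadside of the microphone pair."""
    ratio = SPEED_OF_SOUND * delay_samples / (sample_rate * mic_distance)
    ratio = max(-1.0, min(1.0, ratio))  # guard against rounding past +/-1
    return math.degrees(math.asin(ratio))


# Toy signals: the same pulse reaches channel B two samples after channel A.
channel_a = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0]
channel_b = [0, 0, 0, 0, 1, 2, 1, 0, 0, 0]

delay = estimate_delay(channel_a, channel_b, max_lag=4)
angle = direction_angle_deg(delay, sample_rate=16000, mic_distance=0.1)
```

A positive delay here means the source is closer to the microphone of channel A; with more than two channels, pairwise estimates could be combined to resolve the direction in three dimensions.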

The speaker can output the sound required by the user. The robot 100 may transmit a voice or an alarm to a user through a speaker or may play music to the user.

The multi-channel microphone and speaker may be placed in an appropriate position on the robot 100. In an embodiment, the multi-channel microphone may be provided on the head unit 110 of the robot 100. In this case, since the head unit 110 rotates, the multi-channel microphone may rotate together with the head unit 110 and thus move in three dimensions.

The robot 100 includes a mobile communication module, and may communicate with a server, a user terminal held by a user, and the like through the mobile communication module. Here, the mobile communication module supports technical standards or communication methods for mobile communication (for example, GSM (Global System for Mobile Communication), CDMA (Code Division Multi Access), CDMA2000 (Code Division Multi Access 2000), EV-DO (Enhanced Voice-Data Optimized or Enhanced Voice-Data Only), WCDMA (Wideband CDMA), HSDPA (High Speed Downlink Packet Access), HSUPA (High Speed Uplink Packet Access), LTE (Long Term Evolution), LTE-A (Long Term Evolution-Advanced), etc.) and 5G (5th Generation) communication.

Similarly, the robot 100, the server, and the user terminal may also include the 5G communication module described above. In this case, since the robot 100, the server, and the user terminal can transmit data at a speed of 100 Mbps to 20 Gbps, a large amount of voice or image data can be transmitted very quickly. Accordingly, the server and the user terminal can recognize a large amount of voice or image data transmitted from the robot 100 more quickly and more accurately.

The robot 100, server, and user terminal equipped with a 5G communication module can support various forms of intelligent communication between things (Internet of Things (IoT), Internet of Everything (IoE), Internet of Small Things (IoST), etc.), and the robot 100 may support machine to machine (M2M) communication, vehicle to everything (V2X) communication, and device to device (D2D) communication. Accordingly, the robot 100 can very efficiently share with various devices the information that can be acquired in a space.

The robot 100 may perform machine learning, such as deep learning, on an input user's voice command, and may store the data used for machine learning and the resulting data.

Machine learning is a branch of artificial intelligence, which can include a field of research that gives computers the ability to learn without explicit programming. Specifically, machine learning can be described as a technology for studying and building systems, and the algorithms for them, that learn from empirical data, make predictions, and improve their own performance. Rather than executing strictly defined static program instructions, machine learning algorithms build specific models to derive predictions or decisions from input data.

Many machine learning algorithms have been developed for classifying data. Representative examples include decision trees, Bayesian networks, support vector machines (SVM), and artificial neural networks (ANN).

The decision tree may include an analysis method that performs classification and prediction by charting decision rules into a tree structure.

The Bayesian network may include a model that expresses a probabilistic relationship (conditional independence) between multiple variables in a graph structure. Bayesian networks may be suitable for data mining through unsupervised learning.

The support vector machine is a model of supervised learning for pattern recognition and data analysis, and can be used mainly for classification and regression analysis.

Meanwhile, the robot 100 may be equipped with an artificial neural network, and may perform machine learning-based user recognition and user voice recognition using a received voice input signal as input data.

An artificial neural network models the operating principle of biological neurons and the connections between neurons, and can include an information processing system in which a number of neurons, called nodes or processing elements, are connected in a layered structure. Artificial neural networks are models used in machine learning and can include statistical learning algorithms, used in machine learning and cognitive science, that are inspired by biological neural networks (in particular the brain within an animal's central nervous system). Specifically, an artificial neural network may refer to an overall model in which artificial neurons (nodes), which form a network through synaptic connections, acquire problem-solving ability by changing the strength of those connections through learning.

The term artificial neural network may be used interchangeably with the term neural network.

The artificial neural network may include a plurality of layers, and each of the layers may include a plurality of neurons. In addition, artificial neural networks may include synapses that connect neurons and neurons.

An artificial neural network can generally be defined by three factors: (1) the connection pattern between neurons in different layers, (2) the learning process that updates the weights of the connections, and (3) the activation function that generates an output value from the weighted sum of the inputs received from the previous layer.

The robot 100 may include, as an artificial neural network, network models such as a DNN (Deep Neural Network), RNN (Recurrent Neural Network), BRDNN (Bidirectional Recurrent Deep Neural Network), MLP (Multilayer Perceptron), or CNN (Convolutional Neural Network), but is not limited thereto.

Artificial neural networks can be classified into Single-Layer Neural Networks and Multi-Layer Neural Networks according to the number of layers.

A general single-layer neural network can be composed of an input layer and an output layer. In addition, a general multilayer neural network may include an input layer, one or more hidden layers, and an output layer.

The input layer receives external data, and the number of neurons in the input layer equals the number of input variables. The hidden layer is located between the input layer and the output layer; it receives signals from the input layer, extracts features, and passes them to the output layer. The output layer receives a signal from the hidden layer and outputs an output value based on the received signal. The input signals between neurons are each multiplied by their connection strength (weight) and then summed. If the sum is greater than the neuron's threshold, the neuron is activated and outputs the value obtained through the activation function.
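The weighted-sum-and-threshold behavior described above can be sketched with a tiny hand-wired network. A step function is used as the activation purely for simplicity, and the weights and thresholds are chosen by hand for illustration (they are not learned and do not come from the source). The network has two inputs, one hidden layer of two neurons, and one output neuron, and it computes the exclusive OR of its inputs.

```python
def neuron_output(inputs, weights, threshold):
    """Fire (output 1) when the weighted sum of the inputs exceeds the threshold."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0


def layer_output(inputs, weight_rows, thresholds):
    """Apply every neuron of a layer to the same inputs."""
    return [
        neuron_output(inputs, weights, threshold)
        for weights, threshold in zip(weight_rows, thresholds)
    ]


def xor_network(inputs):
    # Hidden layer: the first neuron acts like OR, the second like AND.
    hidden = layer_output(inputs, [[1, 1], [1, 1]], [0.5, 1.5])
    # Output layer: fires when OR is active but AND is not.
    return layer_output(hidden, [[1, -1]], [0.5])[0]


results = {pair: xor_network(list(pair)) for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]}
```

A single-layer network cannot represent this function, which is one reason hidden layers are used.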

Meanwhile, a deep neural network including a plurality of hidden layers between an input layer and an output layer may be a representative artificial neural network that implements deep learning, a type of machine learning technology.

The artificial neural network can be trained using training data. Here, training may mean the process of determining the parameters of an artificial neural network using training data in order to achieve a purpose such as classifying, regressing, or clustering input data. Representative examples of parameters of an artificial neural network include weights applied to synapses and biases applied to neurons.

The artificial neural network learned by the training data may classify or cluster input data according to patterns of the input data.

Meanwhile, an artificial neural network trained using training data may be referred to as a trained model in this specification.

Next, a learning method of an artificial neural network performed by the robot 100 will be described.

Learning methods of artificial neural networks can be classified into supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.

Supervised learning may include a method of machine learning to infer a function from training data.

And among the functions to be inferred, outputting a continuous value is called regression, and predicting and outputting the class of an input vector can be called classification.

In supervised learning, an artificial neural network can be trained while a label for training data is given.

Here, the label may mean a correct answer (or result value) that the artificial neural network must infer when training data is input to the artificial neural network.

In this specification, when training data is input, the correct answer (or result value) to be inferred by the artificial neural network may be referred to as a label or labeling data.

In addition, in the present specification, setting a label on training data for learning an artificial neural network may be referred to as labeling the training data with labeling data.

In this case, the training data and the label corresponding to the training data constitute one training set, and may be input to the artificial neural network in the form of a training set.

Meanwhile, the training data represents a plurality of features, and labeling of the training data may mean that a label is attached to the feature represented by the training data. In this case, the training data may represent the characteristics of the input object in the form of a vector.

The artificial neural network can infer a function for the correlation between the training data and the labeling data using the training data and the labeling data. In addition, parameters of the artificial neural network may be determined (optimized) through evaluation of a function inferred from the artificial neural network.

Unsupervised learning is a kind of machine learning, and labels for training data may not be given.

Specifically, the unsupervised learning may be a learning method for training an artificial neural network to find and classify patterns in the training data itself, rather than an association relationship between training data and a label corresponding to the training data.

Examples of unsupervised learning include clustering or independent component analysis.

Examples of artificial neural networks using unsupervised learning include a generative adversarial network (GAN) and an autoencoder (AE).

The generative adversarial network is a machine learning method in which two different artificial intelligences, a generator and a discriminator, compete with each other to improve performance.

In this case, the generator is a model that creates new data and can create new data based on the original data.

Also, the discriminator is a model that recognizes a pattern of data, and may play a role of discriminating whether the input data is original data or new data generated by the generator.

In addition, the generator learns from the data that failed to deceive the discriminator, and the discriminator learns from the generated data by which it was deceived. Accordingly, the generator can evolve to deceive the discriminator as well as possible, and the discriminator can evolve to distinguish well between the original data and the data generated by the generator.

Auto encoders can include neural networks aiming to reproduce the input itself as an output.

The auto encoder may include an input layer, at least one hidden layer, and an output layer. In this case, since the number of nodes in the hidden layer is smaller than the number of nodes in the input layer, the dimension of the data is reduced, and compression or encoding may be performed accordingly.

In addition, data output from the hidden layer can enter the output layer. In this case, since the number of nodes in the output layer is greater than the number of nodes in the hidden layer, the dimension of the data increases, and accordingly, decompression or decoding may be performed.

Meanwhile, the auto-encoder adjusts the connection strength of neurons through learning, so that input data can be expressed as hidden layer data. In the hidden layer, information is expressed with fewer neurons than in the input layer, but being able to reproduce the input data as an output may mean that the hidden layer found and expressed a hidden pattern from the input data.

Semi-supervised learning is a kind of machine learning, and may mean a learning method using both labeled training data and unlabeled training data.

One technique of semi-supervised learning infers labels for unlabeled training data and then performs learning using the inferred labels. This technique can be useful when the cost of labeling is high.

Reinforcement learning is based on the theory that, given an environment in which an agent can judge what action to take at every moment, the agent can find the best path through experience, without data.

Reinforcement learning can be mainly performed by the Markov Decision Process (MDP).

The Markov decision process can be explained as follows. First, an environment is given that provides the information the agent needs to take its next action. Second, how the agent will behave in that environment is defined. Third, a reward is given when the agent does something well and a penalty is given when it fails. Fourth, the optimal policy is derived by repeating the experience until the future reward reaches its peak.

The structure of an artificial neural network is specified by the configuration of the model, the activation function, the loss function or cost function, the learning algorithm, the optimization algorithm, and so on. Hyperparameters are set in advance, before training, and model parameters are then set through training, so that the content of the network is specified.

For example, factors determining the structure of an artificial neural network may include the number of hidden layers, the number of hidden nodes included in each hidden layer, an input feature vector, a target feature vector, and the like.

Hyperparameters may include several parameters that must be initially set for learning, such as initial values of model parameters. And, the model parameter may include several parameters to be determined through learning.

For example, the hyperparameter may include an initial weight value between nodes, an initial bias value between nodes, a mini-batch size, a number of learning iterations, and a learning rate. In addition, the model parameters may include weights between nodes, biases between nodes, and the like.

The loss function can be used as an index (reference) for determining an optimal model parameter in the learning process of the artificial neural network. In artificial neural networks, learning refers to the process of manipulating model parameters to reduce the loss function, and the purpose of learning can be seen as determining model parameters that minimize the loss function.

The loss function may mainly use a mean squared error (MSE) or a cross entropy error (CEE), but the present invention is not limited thereto.

The cross entropy error may be used when the correct answer label is one-hot encoded. One-hot encoding may include an encoding method in which the correct answer label value is set to 1 for only neurons corresponding to the correct answer, and the correct answer label value is set to 0 for neurons that are not correct answers.
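The two loss functions and the one-hot encoding mentioned above can be written out as a short sketch. The function names and the small constant `eps`, added to avoid taking the logarithm of zero, are choices made for this example rather than details from the source.

```python
import math


def one_hot(label_index, num_classes):
    """Set 1 only at the position of the correct answer and 0 elsewhere."""
    return [1.0 if i == label_index else 0.0 for i in range(num_classes)]


def mean_squared_error(predicted, target):
    """Average of the squared differences between prediction and target."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def cross_entropy_error(predicted, target, eps=1e-12):
    """With a one-hot target, only the correct class contributes to the sum."""
    return -sum(t * math.log(p + eps) for p, t in zip(predicted, target))


target = one_hot(2, 4)                    # the correct answer is class 2
predicted = [0.1, 0.1, 0.7, 0.1]          # hypothetical network output

mse = mean_squared_error(predicted, target)
cee = cross_entropy_error(predicted, target)
```

Because the target is one-hot, the cross entropy error here reduces to the negative logarithm of the probability assigned to the correct class.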

In machine learning or deep learning, a learning optimization algorithm can be used to minimize the loss function. Learning optimization algorithms include gradient descent (GD), stochastic gradient descent (SGD), momentum, NAG (Nesterov Accelerated Gradient), Adagrad, AdaDelta, RMSProp, Adam, Nadam, and the like.

The gradient descent method may include a technique of adjusting a model parameter in a direction to reduce a loss function value by considering the slope of the loss function in the current state.

The direction of adjusting the model parameters may be referred to as a step direction, and the adjusting size may be referred to as a step size.

In this case, the step size may mean a learning rate.

In the gradient descent method, a gradient is obtained by partially differentiating the loss function with respect to each model parameter, and the model parameters may be updated by changing them by the learning rate along the obtained gradient, in the direction that reduces the loss.

The stochastic gradient descent method may include a technique in which the frequency of gradient descent updates is increased by dividing the training data into mini-batches and performing gradient descent for each mini-batch.
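The update rule can be sketched on a toy problem. The quadratic loss, its hand-derived gradient, the learning rate of 0.1, and the 100 iterations below are all illustrative assumptions; in a real network the gradient would come from differentiating the loss over the training data.

```python
def gradient_descent(grad_fn, params, learning_rate, steps):
    """Repeatedly move each parameter against its partial derivative."""
    for _ in range(steps):
        grads = grad_fn(params)
        params = [p - learning_rate * g for p, g in zip(params, grads)]
    return params


def loss(params):
    """Toy loss with its minimum at w = 3, b = -1."""
    w, b = params
    return (w - 3.0) ** 2 + (b + 1.0) ** 2


def loss_grad(params):
    """Partial derivatives of the toy loss with respect to w and b."""
    w, b = params
    return [2.0 * (w - 3.0), 2.0 * (b + 1.0)]


initial = [0.0, 0.0]
fitted = gradient_descent(loss_grad, initial, learning_rate=0.1, steps=100)
```

The step direction is the negative gradient and the step size is the learning rate; too large a learning rate would make this loop diverge, and too small a one would make it converge slowly.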

Adagrad, AdaDelta, and RMSProp may include techniques for increasing optimization accuracy by adjusting the step size in SGD. In SGD, momentum and NAG may include a technique to increase optimization accuracy by adjusting the step direction. Adam can include a technique that increases optimization accuracy by adjusting the step size and step direction by combining momentum and RMSProp. Nadam may include a technique to increase optimization accuracy by adjusting step size and step direction by combining NAG and RMSProp.

The learning speed and accuracy of an artificial neural network depend strongly not only on the structure of the network and the type of learning optimization algorithm but also on the hyperparameters. Therefore, in order to obtain a good learning model, it may be important not only to determine an appropriate artificial neural network structure and learning algorithm, but also to set appropriate hyperparameters.

Typically, hyperparameters are determined experimentally: the artificial neural network is trained with various values, and the values that yield stable learning speed and accuracy are selected.

The sound that the robot 100 recognizes with a multi-channel microphone may include a user's voice including a starting word and other noise. The robot 100 may perform a command by recognizing a user's voice, and may notify the user when an abnormal sound is recognized.

Here, the abnormal sound refers to a sound that is not normally heard in a space in which the robot 100 resides, such as a voice of a third party other than a user or a sound generated when an object is damaged.

When an abnormal sound is generated, the intrusion of a third party or occurrence of an accident is suspected, and the robot 100 notifies the user so that the user can cope with this situation.

When the robot 100 recognizes the user's voice and receives a voice similar to the user's voice, there is a high risk of malfunction. For example, the sound output from the TV 10-2 includes a human voice, and the robot 100 may confuse the human voice with the user's voice, and the robot 100 may malfunction.

Therefore, the embodiment prepares in advance information about the noise that is normally heard in the space where the robot 100 resides and, by taking this previously prepared noise information into account when processing the sound received by the robot 100, provides a way to accurately recognize the user's voice command and to recognize abnormal sounds.

Information on noise may be created as a sound map and stored in the robot 100 or a server connected to the robot 100. The sound map may store at least one sound event ID related to noise that occurs routinely in a space where the robot 100 resides. Hereinafter, a method of generating a sound map will be described.

FIG. 2 is a flowchart illustrating a method of generating a sound map according to an exemplary embodiment. The sound map generation method may have the following steps.

The robot 100 may receive ambient sound (S110). In this case, sound may be input to a multi-channel microphone provided in the robot 100 that recognizes the sound.

The robot 100 may extract a directional sound from the input sound (S120). Since the multi-channel microphone independently receives the same sound from each channel, the robot 100 can extract a directional sound by grasping a difference in the same sound input through each channel.

For example, since the microphones connected to the channels of the multi-channel microphone are spaced apart from each other, the robot 100 can extract the directional sound by analyzing differences in the arrival time of the signal at each microphone, structural differences in the sound waveform, the frequency response, and the like. In the same way, the robot 100 may determine the direction of the sound source that generates the directional sound input to the multi-channel microphone.
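As one illustration of using the arrival-time difference, the following sketch estimates the direction of a sound from two microphone channels by finding the lag that maximizes their cross-correlation. The two-microphone geometry, the speed of sound, the sample rate, and the function names are assumptions made for this example; the embodiment does not specify a particular estimation technique.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def best_lag(sig_a, sig_b, max_lag):
    """Lag (in samples) of sig_b relative to sig_a that maximizes cross-correlation."""
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, a in enumerate(sig_a):
            j = i + lag
            if 0 <= j < len(sig_b):
                score += a * sig_b[j]
        if score > best_score:
            best, best_score = lag, score
    return best


def arrival_angle_deg(lag_samples, sample_rate, mic_spacing):
    """Far-field angle from the array broadside; positive means the source is nearer microphone A."""
    tau = lag_samples / sample_rate  # arrival-time difference in seconds
    s = max(-1.0, min(1.0, SPEED_OF_SOUND * tau / mic_spacing))
    return math.degrees(math.asin(s))


# Hypothetical signals: the same pulse reaches microphone B three samples later.
mic_a = [0.0] * 50
mic_b = [0.0] * 50
mic_a[10], mic_a[11] = 1.0, 0.5
mic_b[13], mic_b[14] = 1.0, 0.5
lag = best_lag(mic_a, mic_b, max_lag=5)
print(lag, arrival_angle_deg(lag, sample_rate=16000, mic_spacing=0.1))
```

A sound with no clear correlation peak across channels would give an unreliable lag, which corresponds to the non-directional sound discussed in the text.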

For a sound without directionality, the direction angle to be recorded in the sound map, described below, cannot be determined. Recording such a sound could degrade the sound recognition performance of the robot 100 based on the sound map, so it is appropriate not to record it in the sound map.

In the step of extracting the directional sound, the directional sound may be extracted by removing non-directional sound from the input sound by spatial filtering. For example, spatial filtering of sound may be performed by removing a spatial frequency band in which non-directional sound is distributed.

The robot 100 may detect a sound section (sound activity) from the directional sound (S130). In this case, the sound section may mean a section in which a sound exists in a continuous sound input.
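A simple way to picture the detection of sound sections is frame-wise energy thresholding, sketched below. The frame length, the threshold, and the function name detect_sound_sections are hypothetical; the embodiment does not state which detection technique is used.

```python
def detect_sound_sections(samples, frame_len, energy_threshold):
    """Return (start, end) sample indices of runs of frames whose mean energy exceeds the threshold."""
    sections = []
    start = None
    n_frames = len(samples) // frame_len
    for k in range(n_frames):
        frame = samples[k * frame_len:(k + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        active = energy > energy_threshold
        if active and start is None:
            start = k * frame_len                    # a sound section begins
        elif not active and start is not None:
            sections.append((start, k * frame_len))  # the section ends
            start = None
    if start is not None:                            # section still open at the end of the input
        sections.append((start, n_frames * frame_len))
    return sections


# Hypothetical input: silence, a burst of sound, then silence again.
signal = [0.0] * 8 + [1.0] * 8 + [0.0] * 8
print(detect_sound_sections(signal, frame_len=4, energy_threshold=0.1))
```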

The robot 100 may acquire a characteristic value of the sound existing in the detected sound section (S140). In this case, the sound characteristic value may include a direction angle and a frequency, and may further include amplitude, sound pressure, tone, and the like.

As described above, since only directional sound information is recorded in the sound map, the sound characteristic values may be the frequency, amplitude, sound pressure, and tone of the directional sound. The robot 100 may distinguish different sounds based on sound characteristic values and clearly distinguish input sounds based on this.
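As an illustration of obtaining characteristic values from a detected sound section, the sketch below estimates a dominant frequency with a naive discrete Fourier transform and an amplitude as the root-mean-square level. Both estimators, the function names, and the test tone are assumptions for this example only.

```python
import cmath
import math


def dominant_frequency(samples, sample_rate):
    """Frequency (Hz) of the strongest non-DC bin of a naive DFT."""
    n = len(samples)
    best_k, best_mag = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k * sample_rate / n


def rms_amplitude(samples):
    """Root-mean-square level of the section."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


# Hypothetical section: a 100 Hz tone sampled at 800 Hz for 80 samples.
tone = [math.sin(2 * math.pi * 100 * i / 800) for i in range(80)]
print(dominant_frequency(tone, 800), rms_amplitude(tone))
```

The naive transform is quadratic in the section length; a practical implementation would more likely use a fast Fourier transform.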

The direction angle may be an angle indicating a position of a sound source generating the directional sound. Hereinafter, the direction angle will be described with reference to FIGS. 3 to 5. FIGS. 3 to 5 are views for explaining the operation of the robot 100 according to an embodiment.

For example, it is assumed that the robot 100 is placed on a table in front of a sofa where the user mainly stays in the user's residence. A sound source that generates everyday sounds may be disposed around the robot 100.

For example, it is assumed that the sound sources are speaker 1 (10-1), the TV (10-2), speaker 2 (10-3), and the air purifier (10-4). Since speaker 1 (10-1), the TV (10-2), and speaker 2 (10-3) can generate sounds similar to the user's voice command, the robot 100 may confuse these sounds with a voice command and malfunction. Such sounds are recorded in the sound map, and the robot 100 can improve its speech recognition function by removing them when receiving a voice input. This will be described in detail below.

The everyday sounds generated by speaker 1 (10-1), the TV (10-2), speaker 2 (10-3), and the air purifier (10-4) are recorded in the sound map, and the robot 100 can detect an abnormal sound among the input sounds based on the sound map. This will be described in detail below.

As described above, the robot 100 may determine the direction of sound input to the multi-channel microphone. Of course, sound without directionality can be eliminated by spatial filtering. The direction of the sound can be expressed as a direction angle and stored in the sound event ID and the sound map.

Referring to FIG. 3, the direction angle is an angle indicating the position of the sound source from the reference line L1. For example, the TV 10-2, the speaker 2 10-3, and the air purifier 10-4 may have direction angles a1, a2, and a3, respectively, and the speaker 1 10-1 may have a direction angle a4. The direction angles a1 to a4 may be measured by the multi-channel microphone, as described above.

Meanwhile, when the head unit 110 of the robot 100 rotates, the multi-channel microphone may also rotate. When the multi-channel microphone is rotated, its orientation with respect to the sound source changes, so the angle of the sound source measured by the multi-channel microphone may differ from the direction angle measured from the reference line L1.

In this case, the direction angle may be derived by compensating the angle measured by the multi-channel microphone, that is, the angle of the sound source relative to the multi-channel microphone, for the angle by which the head unit 110 has rotated.

Referring to FIG. 4, the head unit 110 and the multi-channel microphone are rotated counterclockwise by b1 with respect to the reference line L1. In this case, the angle b1 rotated by the head unit 110 and the angle b2 measured by the multi-channel microphone may be summed to derive the direction angle a3 for the air purifier 10-4.

Referring to FIG. 5, the head unit 110 and the multi-channel microphone are rotated by c1 clockwise with respect to the reference line L1. In this case, the direction angle a3 for the air purifier 10-4 can be derived by subtracting the angle c1 rotated by the head unit 110 from the angle c2 measured by the multi-channel microphone.

By compensating the angle at which the head unit 110 rotates in the above-described manner, it is possible to derive the direction angle of each sound source existing at a certain position regardless of the rotation of the head unit 110.
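The compensation described with reference to FIGS. 4 and 5 can be sketched as a single signed addition. The sign convention (counterclockwise head rotation positive, clockwise negative), the wrap into the range 0 to 360 degrees, and the function name are assumptions made for this example.

```python
def direction_angle(measured_deg, head_rotation_deg):
    """Direction angle from the reference line L1.

    measured_deg: angle of the source as measured by the rotated microphone.
    head_rotation_deg: head rotation from L1, counterclockwise positive
    (a clockwise rotation is passed as a negative value).
    """
    return (measured_deg + head_rotation_deg) % 360.0


# FIG. 4 case: head rotated counterclockwise by b1, so a3 = b1 + b2.
print(direction_angle(measured_deg=30.0, head_rotation_deg=20.0))
# FIG. 5 case: head rotated clockwise by c1, so a3 = c2 - c1.
print(direction_angle(measured_deg=70.0, head_rotation_deg=-20.0))
```

Both calls describe the same source seen with different head orientations, so they yield the same direction angle.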

When the head unit 110 of the robot 100 rotates on the z-x plane, it may also rotate together with the multi-channel microphone. Even in this case, the direction angle may be derived by compensating for the angle by which the head unit 110 has rotated, in the same way as when the head unit 110 rotates on the x-y plane.

The robot 100 may generate a sound event ID including the acquired sound characteristic value (S150). The sound event ID is a record including the sound characteristic value, and an independent sound event ID may be generated for each sound of a different type.
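One possible representation of a sound event ID is a small immutable record whose fields are quantized characteristic values, so that repeated observations of the same sound yield the same ID. The record layout, the bin widths (angle_step, freq_step), and the names SoundEventID and make_event_id are hypothetical; the embodiment only states that the ID includes the characteristic values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoundEventID:
    direction_bin: int  # quantized direction angle
    frequency_bin: int  # quantized frequency


def make_event_id(direction_deg, frequency_hz, angle_step=10.0, freq_step=50.0):
    """Quantize the characteristic values so that small measurement noise maps to the same ID."""
    num_angle_bins = int(round(360.0 / angle_step))
    direction_bin = int(round((direction_deg % 360.0) / angle_step)) % num_angle_bins
    frequency_bin = int(round(frequency_hz / freq_step))
    return SoundEventID(direction_bin, frequency_bin)


# Two slightly different measurements of the same hypothetical source share one ID.
print(make_event_id(42.0, 1010.0) == make_event_id(38.0, 990.0))
```

Because the record is frozen, it is hashable and can be used directly as a dictionary key when occurrences are counted.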

The robot 100 may count the number of occurrences of sounds having the same sound event ID (S160). The purpose is to exclude sounds that are neither routine nor recurring among the sounds received by the robot 100, since such sounds do not need to be registered in the sound map. The robot 100 may record the counted number of occurrences in the sound event ID.

The robot 100 may store a sound event ID for a sound in which the number of occurrences exceeds a set value in the sound map (S170). As a result, information of everyday sounds can be stored in the sound map.
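Steps S160 and S170 can be sketched as counting identical sound event IDs and keeping only those whose count exceeds the set value. Here the IDs are plain tuples of hypothetical (direction bin, frequency bin) values, and the threshold is an arbitrary example value.

```python
from collections import Counter


def build_sound_map(observed_event_ids, min_occurrences):
    """Keep only the event IDs whose occurrence count exceeds the set value."""
    counts = Counter(observed_event_ids)
    return {event_id: n for event_id, n in counts.items() if n > min_occurrences}


# Hypothetical observations: one frequent source, one borderline source, one one-off sound.
observed = [(4, 20)] * 5 + [(0, 12)] * 3 + [(9, 3)]
print(build_sound_map(observed, min_occurrences=3))
```

The comparison is strict, matching the wording that the number of occurrences must exceed the set value.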

The sound map stores a plurality of sound event IDs having different sound characteristic values, and the generated sound map may be stored in the robot 100 or a server connected to the robot 100.

By continuously generating a new sound map by the above-described method, the sound map may be updated to include a newly added everyday sound, and to remove a previous everyday sound.

The placement position of the robot 100 can be moved by a user. In addition, in the case of the robot 100 capable of traveling by itself, the position may be moved by traveling. Even in this case, the robot 100 may update the sound map by continuously generating a new sound map and replacing the existing sound map.

FIGS. 6 to 8 are flowcharts illustrating a method of recognizing a sound using a sound map according to an exemplary embodiment. Hereinafter, a method of more accurately recognizing a user's voice command by using the sound map generated by the above-described method will be described.

Referring to FIG. 6, the robot 100 may receive a sound including a voice command (S210). The sound recognized by the robot 100 in the above step may include both a user's voice command and a daily sound registered in the sound map.

The robot 100 may recognize a daily sound, that is, a sound currently generated from a sound source that generates noise, according to information on the daily sound registered in the sound map. To this end, the robot 100 may generate a steering vector in a noise direction using the previously registered sound map (S220).

In the step of forming the steering vector (S220), the robot 100 may acquire information on the direction of noise from a sound map previously generated and stored, and may generate a steering vector in the obtained noise direction. In this case, the robot 100 may read sound characteristic values such as a direction angle recorded in the sound map and generate a steering vector in the noise direction according to this. The steering vector may include information on an angle from the reference line L1 to a noise direction.

As described above, when the head unit 110 rotates, the multi-channel microphone may also rotate. Accordingly, the steering vector may be derived by compensating the direction angle for the angle by which the head unit 110 has rotated.

That is, a value obtained by compensating the direction angle recorded in the pre-registered sound map for the angle by which the head unit 110 has rotated may be used as the angle information of the steering vector. This angle compensation method is as described above with reference to FIGS. 3 to 5.
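The following sketch shows one common form of a steering vector, computed for a noise direction read from the sound map and compensated for the head rotation. The far-field plane-wave model, the uniform circular arrangement of four microphones, the array radius, and the sign convention for head rotation (counterclockwise positive) are all assumptions for this example; the embodiment does not specify the array geometry.

```python
import cmath
import math


def steering_vector(direction_deg, head_rotation_deg, freq_hz,
                    num_mics=4, radius=0.04, speed_of_sound=343.0):
    """Per-microphone phase factors for a plane wave arriving from the noise direction."""
    # Angle of the source relative to the rotated microphone array.
    theta = math.radians(direction_deg - head_rotation_deg)
    vec = []
    for m in range(num_mics):
        phi = 2.0 * math.pi * m / num_mics  # angular position of microphone m on the circle
        # Microphones nearer the source receive the wave earlier (negative delay).
        delay = -radius * math.cos(theta - phi) / speed_of_sound
        vec.append(cmath.exp(-2j * math.pi * freq_hz * delay))
    return vec


# Hypothetical case: noise recorded at 50 degrees from L1, head rotated by 20 degrees.
print(steering_vector(50.0, 20.0, 1000.0))
```

Only the angle of the source relative to the array matters, so a recorded direction of 50 degrees with a 20 degree head rotation gives the same vector as 30 degrees with no rotation.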

The robot 100 may receive noise by performing sound beam forming in the direction in which the steering vector is generated (S230). The robot 100 beamforms in the direction of the steering vector, so that noise generated in the direction recorded in the sound map can be input more clearly and efficiently.

The robot 100 may selectively recognize a sound input from a specific direction indicated by the steering vector through beamforming. In beamforming, for example, by giving different input values to each of the microphones provided in the multi-channel microphone, noise generated in the direction of the steering vector can be more accurately input.
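Giving a different weight to each microphone can be illustrated with a delay-and-sum beamformer at a single frequency. Delay-and-sum is used here only as a familiar example technique; the circular four-microphone geometry, the radius, the frequency, and the function names are hypothetical.

```python
import cmath
import math


def array_response(angle_deg, freq_hz, num_mics=4, radius=0.04, speed_of_sound=343.0):
    """Phase factors seen at each microphone for a plane wave from angle_deg (assumed circular array)."""
    theta = math.radians(angle_deg)
    out = []
    for m in range(num_mics):
        phi = 2.0 * math.pi * m / num_mics
        delay = -radius * math.cos(theta - phi) / speed_of_sound
        out.append(cmath.exp(-2j * math.pi * freq_hz * delay))
    return out


def beamform(mic_spectra, look_deg, freq_hz):
    """Delay-and-sum: phase-align each channel to the look direction, then average."""
    weights = array_response(look_deg, freq_hz, num_mics=len(mic_spectra))
    return sum(w.conjugate() * x for w, x in zip(weights, mic_spectra)) / len(mic_spectra)


def beam_gain(look_deg, source_deg, freq_hz=2000.0):
    """Magnitude of the beamformer output for a unit source at source_deg."""
    return abs(beamform(array_response(source_deg, freq_hz), look_deg, freq_hz))


print(beam_gain(90.0, 90.0))   # source in the look direction
print(beam_gain(90.0, 270.0))  # source in the opposite direction
```

A source in the look direction passes with unit gain, while a source from another direction is attenuated, which is how sound from the steering-vector direction is received selectively.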

Referring to FIG. 7, in the step S230 of receiving noise by performing the sound beamforming, the following steps may be performed. The robot 100 may receive a sound generated in a direction indicated by the steering vector (S231).

The robot 100 may remove a human voice that is not registered in the previously generated and stored sound map (S232). Such a human voice may be a voice command issued by a user located in a direction in which noise exists. Step S232 is necessary to prevent the user's voice command from being removed in the subsequent step of removing the received noise (S260).

The robot 100 may calculate a power spectral density (PSD) matrix of the noise input through beamforming (S240). The robot 100 may determine the spatial arrangement and distribution state of the input noise through the calculated PSD matrix.

The robot 100 may calculate a PSD matrix of sound including the received voice command (S250). In the PSD matrix calculated in the above step, a user's voice command and daily noise may exist together.

The robot 100 may derive the PSD matrix of the voice command as the difference between the PSD matrix of the sound including the voice command and the PSD matrix of the input noise (S260). That is, a more accurate user's voice command can be derived by removing the noise acquired through beamforming from the sound in which the user's voice command and ordinary noise exist together. The robot 100 may recognize and execute a command from the PSD matrix of the user's voice command derived by the above method.
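Steps S240 to S260 can be sketched for a single frequency bin as follows. The PSD matrix is estimated here as the average outer product of the per-frame microphone spectra, and the subtraction assumes that the voice command and the noise are uncorrelated; both choices, as well as the function names and the numeric values, are assumptions for illustration.

```python
def psd_matrix(frames):
    """Average of x * x^H over frames; each frame holds one complex value per microphone."""
    m = len(frames[0])
    acc = [[0j] * m for _ in range(m)]
    for x in frames:
        for i in range(m):
            for j in range(m):
                acc[i][j] += x[i] * x[j].conjugate()
    return [[value / len(frames) for value in row] for row in acc]


def subtract_psd(total, noise):
    """PSD matrix of the voice command as the element-wise difference (S260)."""
    size = len(total)
    return [[total[i][j] - noise[i][j] for j in range(size)] for i in range(size)]


# Hypothetical two-microphone example with directly specified matrices.
total_psd = [[3 + 0j, 1 + 0j], [1 + 0j, 3 + 0j]]  # voice command plus noise
noise_psd = [[1 + 0j, 1 + 0j], [1 + 0j, 1 + 0j]]  # noise obtained via beamforming
print(subtract_psd(total_psd, noise_psd))
print(psd_matrix([[1 + 0j, 1j]]))
```

In practice the estimated difference may not be positive semi-definite because of estimation error, so an implementation would likely need some form of regularization.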

Referring to FIG. 8, the robot 100 may calculate a power spectral density (PSD) of an input noise (S233). The robot 100 may calculate a PSD matrix of the input noise from a plurality of PSDs of the input noise.

A plurality of sound sources generating noise may be present in the space in which the robot 100 resides. Therefore, the robot 100 calculates a PSD for the noise input from each sound source and combines the plurality of calculated PSDs into a single PSD matrix, from which the spatial arrangement, distribution, and the like of the combined noises can be grasped.
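One way to combine per-source PSDs into a single PSD matrix is to weight the outer product of each source's steering vector by that source's PSD and sum the results. This rank-one model, which treats the noise sources as mutually uncorrelated, is an assumption made for this sketch and is not stated in the embodiment; the function name and values are hypothetical.

```python
def noise_psd_matrix(source_psds, steering_vectors):
    """Sum over sources of psd_k * a_k * a_k^H (assumes mutually uncorrelated sources)."""
    m = len(steering_vectors[0])
    total = [[0j] * m for _ in range(m)]
    for psd, a in zip(source_psds, steering_vectors):
        for i in range(m):
            for j in range(m):
                total[i][j] += psd * a[i] * a[j].conjugate()
    return total


# Hypothetical two-microphone example with two noise sources.
matrix = noise_psd_matrix(
    source_psds=[2.0, 1.0],
    steering_vectors=[[1 + 0j, 1j], [1 + 0j, 1 + 0j]],
)
print(matrix)
```

The diagonal entries hold the total noise power at each microphone, and the off-diagonal entries capture how the noise is distributed spatially.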

In the embodiment, by generating a sound map and effectively removing noise input to the robot 100 using the generated sound map, the robot 100 can more clearly recognize the user's voice command.

In the embodiment, by recording in the sound map the characteristic values of ordinary sounds that may be confused with the user's voice commands, and removing such ordinary sounds using the sound map, malfunction of the robot 100 can be effectively suppressed.

In an embodiment, it is possible to determine the direction of noise generation irrespective of the rotation of the head unit 110 of the robot 100, and to generate a steering vector for beamforming.

Using the pre-registered sound map, the robot 100 recognizes abnormal sounds, for example, the voice of a third party other than the user or a sound generated when an object is damaged, and when an abnormal sound occurs, it can notify the user so that the user can respond.

Hereinafter, a sound recognition method in which the robot 100 detects an abnormal sound using a sound map will be described. FIG. 9 is a flowchart illustrating a method for recognizing a sound using a sound map according to another exemplary embodiment. The sound recognition method according to the embodiment may include a sound event ID comparison step, an abnormal sound detection step, and an alarm step.

In the sound event ID comparison step, the robot 100 may compare the first sound event ID generated in the step of generating the sound event ID (S150) with the previously generated and stored second sound event ID (S310). At this time, the second sound event ID is registered in the sound map.

In the sound event ID comparison step, it is determined whether the first sound event ID and the second sound event ID match each other, and sound characteristic values held by each event ID are compared with each other.

For example, the robot 100 may determine whether the sound event IDs match by comparing the direction angle, which indicates as an angle the position of the sound source generating the directional sound, and the frequency of the directional sound held by the first sound event ID with those held by the second sound event ID.

Additionally, the robot 100 may compare at least one of the amplitude, sound pressure, and tone of the directional sound held by the first sound event ID and the second sound event ID to determine whether the sound event IDs match.

In the abnormal sound detection step, when the first sound event ID does not match the second sound event ID, it may be determined that the abnormal sound has been detected (S320). Of course, when the sound event IDs coincide with each other or are very similar to each other within a set range, the robot 100 may determine that an abnormal sound has not been generated.
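The comparison and detection steps (S310, S320) can be sketched as checking a new sound event against every entry registered in the sound map, treating values within a set range as a match. The dictionary layout, the tolerances, and the function name is_abnormal are hypothetical.

```python
def is_abnormal(new_event, sound_map, angle_tol=10.0, freq_tol=50.0):
    """True if no registered event matches the new event within the set tolerances."""
    for known in sound_map:
        diff = abs(new_event["direction_deg"] - known["direction_deg"]) % 360.0
        angle_diff = min(diff, 360.0 - diff)  # shortest angular distance
        freq_diff = abs(new_event["frequency_hz"] - known["frequency_hz"])
        if angle_diff <= angle_tol and freq_diff <= freq_tol:
            return False  # matches an everyday sound
    return True


# Hypothetical sound map holding one registered everyday sound.
registered = [{"direction_deg": 45.0, "frequency_hz": 1000.0}]
print(is_abnormal({"direction_deg": 48.0, "frequency_hz": 1020.0}, registered))
print(is_abnormal({"direction_deg": 200.0, "frequency_hz": 3000.0}, registered))
```

When the function returns True, the alarm step would follow; further characteristic values such as amplitude or tone could be added to the comparison in the same way.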

In the alarm step, when the abnormal sound is detected, the user is notified, so that the user can prepare for a situation in which the abnormal sound has occurred and take immediate action (S330). The alarm can use various methods, such as a method of reproducing or lighting an image on the display unit 130, and generating a warning sound.

In the embodiment, the robot 100 can easily determine whether an abnormal sound has occurred by using the sound map and notifies the user of the occurrence, so that the user can quickly respond to an unexpected or emergency situation.

The above-described embodiments may be implemented in the form of a computer program that can be executed through various components on a computer, and such a computer program may be recorded in a computer-readable medium. In this case, the medium may include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.

Meanwhile, the computer program may be specially designed and configured for the present invention, or may be known and usable to those skilled in the computer software field. Examples of the computer program may include not only machine language codes produced by a compiler but also high-level language codes that can be executed by a computer using an interpreter or the like.

Although only a few embodiments have been described above, various other implementations are possible. The technical contents of the above-described embodiments may be combined in various forms, unless they are incompatible with each other, and may thereby be implemented as new embodiments.

Claims

In the sound map generation method for generating a sound map in which at least one sound event ID related to noise is stored,

Receiving a sound input;

Extracting directional sound;

Detecting a sound activity from a directional sound;

Obtaining a characteristic value of a sound existing in the detected sound section;

Generating a sound event ID including the obtained sound characteristic value;

Counting the number of occurrences of sounds having the same sound event ID; And

Storing a sound event ID for a sound whose occurrence count exceeds a set value in the sound map

A sound map generating method comprising the above steps.
The method of claim 1,

The step of extracting the directional sound,

A sound map generation method that performs spatial filtering of non-directional sound from the input sound.
The method of claim 1,

Wherein the sound characteristic value comprises:

A direction angle indicating, as an angle, a position of a sound source generating the directional sound; and

The frequency of the directional sound.
The method of claim 3,

The sound characteristic value is,

Sound map generating method further comprising at least one of the amplitude, sound pressure and tone of the directional sound.
The method of claim 1,

The sound map,

A sound map generation method for storing a plurality of sound event IDs having different sound characteristic values.
In the sound recognition method using the sound map generated by the sound map generation method of claim 1,

Receiving a sound including a voice command;

Generating a steering vector in the direction of noise by using the sound map;

Receiving noise by performing sound beam forming in the direction in which the steering vector is generated;

Calculating a power spectral density matrix of the input noise;

Calculating a PSD matrix of sound including voice commands; And

Deriving the PSD matrix of the voice command as the difference between the PSD matrix of the sound including the voice command and the PSD matrix of the input noise

A sound recognition method comprising the above steps.
The method of claim 6,

The step of forming the steering vector,

A sound recognition method for acquiring information on the direction of noise from a pre-generated and stored sound map, and generating a steering vector in the direction of the acquired noise.
The method of claim 6,

Wherein the step of receiving noise by performing the sound beamforming comprises:

Receiving a sound generated in a direction indicated by the steering vector; and

Removing human voices that are not registered in the previously created sound map.
The method of claim 6,

Further comprising calculating the power spectral density (PSD) of the input noise,

Wherein a PSD matrix of the input noise is calculated from a plurality of PSDs of the input noise.
The method of claim 6,

A sound recognition method in which sound is input into a multi-channel microphone provided in a robot that recognizes sound.
The method of claim 10,

Wherein the robot has a rotatable head unit, and

The multi-channel microphone is provided in the head unit and rotates together with the head unit.
The method of claim 11,

Wherein the sound characteristic value includes a direction angle indicating, as an angle, a position of a sound source generating the directional sound, and

The direction angle is derived by compensating the angle measured by the multi-channel microphone for the angle by which the head unit has rotated.
The method of claim 12,

Wherein the steering vector is derived by compensating the direction angle for the angle by which the head unit has rotated.
In the sound recognition method using the sound map generated by the sound map generation method of claim 1,

A sound event ID comparison step of comparing the first sound event ID generated in the step of generating the sound event ID and the previously generated and stored second sound event ID with each other;

An abnormal sound detection step of determining that an abnormal sound has been detected when the first sound event ID does not match the second sound event ID; And

Alarm step notifying the user when the abnormal sound is detected

A sound recognition method comprising the above steps.
The method of claim 14,

The sound event ID comparison step,

A sound recognition method for comparing a direction angle indicating a position of a sound source that generates a directional sound held by the first sound event ID and the second sound event ID in an angle, and a frequency of the directional sound.
The method of claim 15,

The sound event ID comparison step,

A sound recognition method for comparing at least one of an amplitude, a sound pressure, and a tone color of the directional sound held by the first sound event ID and the second sound event ID.
Receiving a sound input;

Extracting directional sound;

Detecting a sound activity from a directional sound;

Obtaining a characteristic value of a sound existing in the detected sound section;

Generating a sound event ID including the obtained sound characteristic value;

Counting the number of occurrences of sounds having the same sound event ID;

Storing a sound event ID for a sound whose occurrence frequency exceeds a set value in the sound map;

Receiving a sound including a voice command;

Generating a steering vector in the direction of noise by using the sound map;

Receiving noise by performing sound beam forming in the direction in which the steering vector is generated;

Calculating a power spectral density matrix of the input noise;

Calculating a PSD matrix of sound including voice commands; And

Deriving the PSD matrix of the voice command as the difference between the PSD matrix of the sound including the voice command and the PSD matrix of the input noise

A sound recognition method using a sound map comprising the above steps.
Receiving a sound input;

Extracting directional sound;

Detecting a sound section from the directional sound;

Obtaining a characteristic value of a sound existing in the detected sound section;

Generating a sound event ID including the obtained sound characteristic value;

Counting the number of occurrences of sounds having the same sound event ID; And

Storing a sound event ID for a sound whose occurrence frequency exceeds a set value in the sound map;

A sound event ID comparison step of comparing the first sound event ID generated in the step of generating the sound event ID and the previously generated and stored second sound event ID with each other;

An abnormal sound detection step of determining that an abnormal sound has been detected when the first sound event ID does not match the second sound event ID; And

Alarm step notifying the user when the abnormal sound is detected

A sound recognition method using a sound map comprising the above steps.
A computer program stored in a computer-readable recording medium to execute the method of any one of claims 1 to 18 using a computer.