WO2022029305A1 - Smart learning method and apparatus for soothing and prolonging sleep of a baby


Info

Publication number: WO2022029305A1
Application number: PCT/EP2021/072037
Authority: WO (WIPO PCT)
Prior art keywords: sound, neural network, playing, melody, baby
Other languages: French (fr)
Inventor: Ignacio VALLEDOR DE LA CAÑINA
Original Assignee: Lullaai Networks, SL
Application filed by Lullaai Networks, SL
Publication of WO2022029305A1


Classifications

    • G06N3/08 Learning methods
    • G06N3/006 Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G10H1/0025 Automatic or semi-automatic music composition, e.g. producing random music, applying rules from music theory or modifying a musical piece
    • G10H1/26 Selecting circuits for automatically producing a series of tones
    • G10H2210/036 Musical analysis of musical genre, i.e. analysing the style of musical pieces, usually for selection, filtering or classification
    • G10H2210/041 Musical analysis based on MFCC [mel-frequency spectral coefficients]
    • G10H2220/351 Environmental parameters, e.g. temperature, ambient light, atmospheric pressure, humidity, used as input for musical purposes
    • G10H2240/085 Mood, i.e. generation, detection or selection of a particular emotional content or atmosphere in a musical piece
    • G10H2250/311 Neural networks for electrophonic musical instruments or musical processing, e.g. for musical recognition or control, automatic composition or improvisation
    • G10L25/09 Speech or voice analysis characterised by the extracted parameters being zero crossing rates
    • G10L25/21 Speech or voice analysis characterised by the extracted parameters being power information
    • G10L25/24 Speech or voice analysis characterised by the extracted parameters being the cepstrum
    • G10L25/30 Speech or voice analysis characterised by the analysis technique using neural networks
    • G10L25/51 Speech or voice analysis specially adapted for comparison or discrimination


Abstract

In particular examples of the present disclosure, a first neural network, or convolutional neural network, may first classify sound signals as cry or not cry; secondly, a second neural network, or playing reinforcement learning neural network, may adapt to the different sleep routines of different babies. Finally, a third neural network, or recurrent neural network, RNN, may generate music.

Description

SMART LEARNING METHOD AND APPARATUS FOR SOOTHING AND PROLONGING SLEEP OF A BABY
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority from patent application number EP20382741, filed on 7 August 2020.
FIELD
[0002] The present disclosure relates to methods for playing a customized sound, wherein the customized sound is based on a historical pattern of an incoming sound signal and a probability that the incoming sound signal comprises a pattern.
BACKGROUND
[0003] Considerable research has been directed towards natural language processing. In particular, a human voice or a baby's cry signal, both of which serve as means of human communication, may be analyzed by some systems in the time and frequency domains. Based on audio features, some methods may classify acoustic signals into specific cry meanings for cry language recognition. Once the meaning is understood, however, current methods fail to provide means for effectively calming the baby.
[0004] Attempts have been made to develop methods that induce calm in an infant and calming or sleep devices that deliver sound conveniently, such as CN 110197677 "Playing control method and device and playing apparatus" or WO2015102921 "Modifying operations based on acoustic ambience classification", which disclose methods for detecting the emotional mood of a baby by a neural network and playing a pre-established sound depending on the emotional mood detected. Such infant calming or sleep methods and devices typically deliver a fixed, unchangeable sound regardless of the baby. Each baby, however, may have preferences as to the kind and level of sound that induces calm and sleep most efficiently.
[0005] There is a need to develop methods which give an effective and customizable response to a baby’s needs.
SUMMARY
[0006] A purpose of the present disclosure is to provide a computer implemented method for playing a sound comprising:
- analyzing one or more sound signals;
- providing a probability that the sound signal comprises an objective sound pattern;
- playing a customized sound wherein the customized sound comprises at least a sound sequence customized by a playing neural network depending on:
- the provided probability;
- a historical pattern of the one or more sound signals.
[0007] In the context of the present disclosure, a sound signal may be received by the disclosed methods by means of a microphone or any acoustic sensor. A sound signal may be composed of acoustic waves representing a sound, typically using a level of electrical voltage for analog signals, and a series of binary numbers for digital signals. Audio signals may have frequencies in the audio frequency range of 20 to 20,000 Hz, which correspond to the lower and upper limits of human hearing.
[0008] Analyzing a sound signal may comprise analyzing or evaluating acoustic features of the sound signal. In some examples, the acoustic features or extracted features comprise sequences of the root mean square (RMS), Mel-frequency cepstral coefficients (MFCCs) and zero crossing rate (ZCR) of the sound signal.
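The features named above (RMS, MFCCs and ZCR) may, for instance, be computed frame by frame from the raw signal. The following is a minimal, non-authoritative sketch in Python using the librosa library; the library choice, file name, frame and hop sizes are assumptions, and the count of 38 coefficients is taken from the examples given later in this description.

# Illustrative sketch only: extracting RMS, MFCC and ZCR sequences from a
# sound signal. Library choice (librosa) and frame sizes are assumptions.
import numpy as np
import librosa

def extract_features(path, n_mfcc=38, frame_length=2048, hop_length=512):
    y, sr = librosa.load(path, sr=None, mono=True)          # raw audio signal
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_length, hop_length=hop_length)
    rms = librosa.feature.rms(y=y, frame_length=frame_length,
                              hop_length=hop_length)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame_length,
                                             hop_length=hop_length)
    # One row per frame: 38 MFCCs + 1 RMS + 1 ZCR = 40 features
    return np.vstack([mfcc, rms, zcr]).T

features = extract_features("baby_audio.wav")   # hypothetical file; shape (n_frames, 40)
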
[0009] The methods of the present disclosure provide a probability that the sound signal comprises an objective sound pattern, based on the analyzed features of the sound signal. A probability that the sound signal comprises an objective sound pattern may comprise a probability that the sound signal comprises at least a time sequence in a predetermined range of frequencies. In some examples, a range of frequencies of the human voice may be detected during a time sequence, wherein, for example, the voice of an adult male may have a fundamental frequency from 85 to 180 Hz, and that of an adult female from 165 to 255 Hz. In some examples, the acoustic characteristics of a baby cry, from the 3rd to 5th day of life up to the age of 1 year, show a fundamental frequency increasing from about 441.8 to 502.9 Hz, well above that of an adult. An objective sound pattern may comprise an adult voice pattern or a child voice pattern or a baby cry pattern, wherein a pattern may be an acoustic frequency. A pattern may also comprise a pitch, an amplitude of the analyzed sound signal or any other acoustic feature.
[0010] The probability that the sound signal comprises an objective sound pattern may be provided by any algorithm for detecting an acoustic pattern known in the art. In some examples, a speech recognition system may provide a probability that a male adult is speaking, or a probability that an adult is shouting or that an animal is producing a sound, or a probability that a child is crying. The speech recognition system may be based on any existing method for discriminating a particular acoustic or sound pattern. In some examples of the present disclosure, a probability that the sound signal comprises an objective sound pattern may be provided by an analyzing neural network, which may be a convolutional neural network, or any of the neural networks for detecting an emotional mood known in the prior art. The present disclosure therefore defines analyzing, which in an example is defined as "receiving at least extracted features of the sound signal by an analyzing neural network". In an example, the present disclosure therefore defines a first neural network, or 1st neural network, for providing a probability that the sound signal comprises an objective sound pattern.
[0011] A neural network comprises a collection of connected or linked units or nodes called artificial neurons, which are intended to model the neurons in a biological brain. Each connection or link between neurons may comprise a weight or synaptic weight, which determines the strength of a node's influence on another. A convolutional network may be advantageous in cases where different filters or convolutions or patterns are applied to one ordered matrix of data, for example an image or a spectrogram. In the context of the present disclosure, the neural network uses audio features, for example Mel-frequency cepstral coefficients, MFCCs, for which convolutions, also known as filters, are useful for detecting certain patterns in the audio features. Such audio features help identify a speech feature or a baby's cry or any other objective sound pattern.
[0012] In some examples, analyzing a sound signal comprises receiving at least extracted features of the sound signal by the analyzing neural network, or 1st neural network. The features of the sound signal may be extracted by any processing means or by a feature extracting module which may implement acoustic signal treatment or audio signal processing such as time and frequency domain processing. The extracted features may comprise harmonicity, pitch, energy, zero-crossing rate, entropy, Mel-frequency-cepstral coefficients, MFCCs, and/or any other audio feature. The extracted features may be received by the analyzing neural network.
[0013] In some examples, the analyzing neural network may receive at least a sequence of: a number N of Mel-frequency-cepstral coefficients of the sound signal, N preferably being 38 coefficients; a Zero Crossing Rate, ZCR, of the sound signal; and a root mean square, RMS, of the sound signal.
[0014] Advantageously, receiving a sequence of 38 Mel-frequency cepstral coefficients provides for correct learning by the neural network, and 38 may be an optimal number of coefficients for applications such as voice processing.
[0015] In some examples, more than one sequence of audio features is received by the analyzing neural network (NN), advantageously providing the neural network with more than one incoming sequence such that the analysis may become more accurate. The use of various audio sequences may permit the corresponding NN algorithms to use sound variation(s) occurring over time. Such variation(s) may constitute highly relevant data for the analyzing NN to determine, for example, whether a specific sound corresponds to a baby cry or another event, whether sound intensity is increasing or decreasing, etc.
[0016] In some examples, each of all or part of said input sequences may be a sequence of 4 time intervals with 40 experimentally obtained audio characteristics (training input), and a label identifying whether there is a baby cry or not (expected output for the training input). This sequence of 4 time intervals may include e.g. 38 MFCC features, 1 RMS feature and 1 ZCR feature, since numerous experiments have demonstrated that said sequence configuration enables the neural network to learn optimally. The training with such sequences may further include using an ADAM-type optimizer and a binary cross-entropy loss function, for a number of epochs sufficient to reduce the error without falling into overfitting.
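As a hedged illustration of the training configuration described in this paragraph (input sequences of 4 time intervals x 40 features, a binary cry label, an ADAM optimizer and a binary cross-entropy loss), the sketch below builds a small convolutional classifier in Keras; the layer sizes and the early-stopping criterion are assumptions, not taken from the disclosure.

# Sketch of the 1st (analyzing) network: 4 time intervals x 40 audio features
# (38 MFCC + 1 RMS + 1 ZCR) in, probability of cry out. Layer sizes are assumed.
import tensorflow as tf

def build_cry_classifier():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(4, 40)),            # 4 intervals, 40 features each
        tf.keras.layers.Conv1D(32, kernel_size=2, activation="relu"),
        tf.keras.layers.Conv1D(64, kernel_size=2, activation="relu"),
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),   # probability of cry
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Hypothetical training call; x_train has shape (n_samples, 4, 40) and y_train
# holds 0/1 cry labels. Early stopping guards against overfitting, as above.
# model = build_cry_classifier()
# model.fit(x_train, y_train, validation_split=0.2, epochs=100,
#           callbacks=[tf.keras.callbacks.EarlyStopping(patience=5,
#                                                       restore_best_weights=True)])
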
[0017] The Mel coefficients have proved to be representative of speech features, which in the case of the present disclosure is an advantage for recognizing either a speaker or sounds emitted by a baby.
[0018] In some examples, analyzing a sound signal comprises using a trained neural network to analyze a sound signal, wherein the neural network has been trained with sound data from one or more sound signals. Sound data may comprise extracted features or audio features of one or more sound signals.
[0019] In some examples, the neural network (NN) has been trained with sound data wherein the sound data may comprise a number N of coefficients of a Mel filter of one or more sound signals, N preferably being 38 coefficients, a Zero Crossing Rate, ZCR, of the one or more sound signals and a root mean square, RMS, of the one or more sound signals. In some examples, the neural network may have further been trained with one or more labels identifying whether there is a cry in the sound data. Training a neural network may be performed iteratively (during various epochs) until validation loss and accuracy do not improve, or even begin to worsen (overfitting). Validation loss and accuracy may correspond to e.g. a sum of the errors made for each example in the training or validation datasets. The loss value may indicate how poorly or well a model behaves after each iteration of optimization. The accuracy value may be used to measure the NN's performance in an interpretable way.
[0020] Once the probability that the sound signal comprises an objective sound pattern is provided, the methods of the present disclosure comprise playing a customized sound. The playing may be performed by means of any type of electroacoustic transducer, such as a device converting electric energy into acoustic energy, for example, a speaker, or any other device.
[0021] In some examples, the incoming sound signal is a human voice or human cry and the played sound is a melody. In some examples the played sound is an alert or an alarm. The methods of the present disclosure comprise playing the customized sound which comprises at least a sound sequence customized by a playing neural network, or 2nd neural network, depending at least on the provided probability and a historical pattern of the one or more sound signals. In such a manner, the playing neural network, or 2nd neural network, may adapt to different sleep routines of different babies and may also generate music.
[0022] In some examples, before playing the customized sound, the methods of the present disclosure may comprise generating or synthesizing such customized sound to be played. Generating or synthesizing such customized sound may be performed by the playing neural network or by a further neural network, or 3rd neural network, or recurrent neural network, RNN. The methods may comprise any type of generating or synthesizing method and may generate or synthesize at least a sound sequence, wherein generating the sound sequence may comprise receiving a historical pattern of sound signals and the provided probability. In some examples, generating the sound sequence may comprise further receiving at least an extracted feature of the sound signal and current time and/or date.
[0023] As disclosed, the customized sound comprises at least a sound sequence generated by a neural network depending on incoming parameters comprising, but not limited to, a historical pattern of sound signals and the provided probability. The sound sequence may also be generated depending on further incoming parameters such as at least an extracted feature of the sound signal, and a current time and/or date. For example, an extracted feature of the sound signal may comprise a root mean square, RMS, of the sound signal, indicative of the mean power or the peak value of the sound signal. The sound sequence may also be generated depending on further incoming parameters such as the current playing status: either in silence or reproducing a sound. A further incoming parameter may comprise the current volume being played by the methods of the present disclosure. The sound being reproduced may be any type of customized sound which has been chosen or generated by the playing neural network - the 2nd neural network - which may comprise lullabies, nature sounds, white noise, or an infinite melody generated by the third neural network. Generating an infinite melody by an RNN - the 3rd neural network - is a step performed further to the previous steps implemented by the 1st and the 2nd neural networks. The 2nd neural network, based on the particular baby in question, decides whether or not, for a particular probability of cry or other pattern, an infinite melody is to be played. The 2nd neural network might have learnt that if a particular baby is crying at 3am, white noise works for calming him/her. But in a different implementation of the method with a different baby, the 2nd neural network might have learnt that if the different baby is crying at 3am, an infinite melody works for calming him/her. In the case where the 2nd neural network has learnt that an infinite melody works, the 3rd neural network comes into play and generates an infinite melody for that baby, for that moment in time and for that pattern (for example, cry). An output of the playing neural network or 2nd neural network may also comprise a volume indicator, indicating the volume at which the output sound must be reproduced. The volume may be comprised between predefined values, for example 0 to 100. The volume may be determined by the playing neural network or 2nd neural network depending on the incoming parameters comprising at least the provided probability and a historical pattern of the sound signals. In particular examples, the 1st neural network detects a probability that a particular baby is crying, and the 2nd neural network receives such probability and decides, based on the probability and the historical pattern of the particular baby, which sound must be reproduced or output, at which volume, etc. The decision is taken based on the particular baby's tastes and depending on what has worked historically with such particular baby.
[0024] Advantageously, a playing neural network may learn or be trained or be built while historical records are examined from a database, or received and stored by any storage medium, or stored in the playing neural network. In some examples, the playing neural network may use an ability to learn and associate historical patterns of the one or more sound signals and the provided probability to customize a personalized sound. For example, if the provided probability indicates a probability that an adult is screaming out for help, the customized sound may be an SOS alert. If, for example, the provided probability indicates a probability that a baby is crying, the customized sound may be a calming sound or an alert for the parents, depending on what the neural network has learnt, or learns, from the historical pattern to be the appropriate response to such incoming parameters.
[0025] In examples, all neural networks used in methods according to the present disclosure may be deep neural networks and accordingly multilayer networks, i.e. made up of more than one layer of neurons.
[0026] In the present disclosure, the learning may be based on training the playing neural network with incoming parameters such as the provided probability and the historical pattern of the one or more sound signals, which may be referred to as training parameters. The training parameters may be received iteratively, for example several times a day over a period of time, and the playing neural network may change the synaptic weights until they converge to a set of optimal weights that provide the most effective response. For example, in the case that the method is used for calming a baby, the most effective result may be that the baby does not cry during the implementation time; the synaptic weights are then adjusted to give a customized sound. The synaptic weights may converge to a different optimal weight each time the method is implemented over a long period of time; for example, the training may be continuous for some years and evolve with the evolution of the incoming parameters. In the particular case where the incoming sound is a baby sound, the method may be iterated several times a day over a long period of time and the playing neural network - the 2nd neural network - may optimize the synaptic weights over time to evolve with the evolution of the baby. For the playing neural network to learn, the methods of the present disclosure may comprise a supervised learning neural network, or an unsupervised learning network, or a network using a hybrid formula, wherein the neural network learns whether the construction of the synaptic weights is correct by analyzing a newly provided probability. Such a type of network may be referred to as a reinforcement learning playing neural network or as a Deep Reinforcement Neural Network.
[0027] The (deep) reinforcement neural network - the 2nd neural network - may be trained during a learning period of time, for example over several weeks. It is to be understood that the training period may be comprised in the working period; in other words, the playing neural network may be used while being trained. In this way, the first time the playing neural network is used it may not work properly and may not provide an effective customized sound, but as the playing neural network continues to be used, the weights become more and more optimized for providing effective results. As the playing neural network is a reinforcement neural network, hyperparameters are limited to the decisions taken by the neural network and are updated as the neural network runs. Parameters used for training the playing neural network may comprise the time a baby has cried for or the time the baby has not cried, and the hour of the day, which may have an impact on the metabolism of the baby, on external noises, etc. The training of the reinforcement neural network does not comprise validation or testing, since running the network is the actual validation and testing. A reinforcement neural network is advantageous as the playing neural network since it learns while it is being used with the incoming sound for which the network needs to be optimized. In the case of using the methods of the present disclosure with a baby, the playing neural network, being a deep reinforcement neural network, learns and is trained while it is being used with the baby, and the weights, nodes, layers, etc. are defined while output sounds are being reproduced. The playing neural network, being a Deep Reinforcement Neural Network, is based on operant conditioning (also called instrumental conditioning) developed by B.F. Skinner (1904-1990).
[0028] In the present disclosure, an example implementation may be that a baby cries, i.e. the provided probability indicates that a baby may be crying. In such an example implementation, the playing neural network may synthesize a sound which is played by a transducer. If a following provided probability indicates that the baby may still be crying, the reinforcement learning playing neural network (or Deep Reinforcement Neural Network) may change the synaptic weights. If the new configuration of the reinforcement learning playing neural network, i.e. with the synaptic weights changed, results in a further provided probability being lower than the two previous provided probabilities, then the reinforcement learning playing neural network learns that the synaptic weights are being changed in the right direction, since this would mean that the baby is crying with less probability, i.e. the baby may have stopped crying. Such a playing process may be repeated iteratively until a minimum probability is reached, or until the probability of cry is zero. If the method is implemented another day or week, the weights may no longer be optimized and a minimum probability may no longer be reached; in this case, the playing neural network continues learning and changing the synaptic weights to adapt to the new scenario. In some examples, an assembly of the provided probabilities at each iteration may serve to build what is referred to as the historical pattern of sound signals. In some examples, the historical pattern of sound signals may be received from a database.
[0029] In some examples, the methods of the present disclosure comprise customizing the sound sequence, wherein customizing the sound sequence comprises: receiving at least a feature of the sound signal, a historical pattern of sound signals, the provided probability, a current timestamp, and a state and, periodically or iteratively, providing an output sound for maximizing a reward and evaluating the reward after a predefined amount of time, for example after 2 minutes. In some examples, the iterations last until the reward is maximized or reached. The playing neural network may be called periodically, for example every minute or every 2 minutes, and the playing neural network may provide an output sound, for example a white noise which may be played at a certain volume. The objective of the neural network is to maximize a reward, wherein a reward may comprise a number of seconds a baby does not cry within a predefined amount of time, for example within the last minute or 2 minutes or within the last hour. After such a predefined amount of time, the playing neural network evaluates what happened, for example how many seconds the baby has cried within the last minute, and uses the evaluation as a performance measure. For example, a Q-learning algorithm may be used, which tries to maximize the reward, the reward being that the baby does not cry for the longest amount of time, which may take a numerical value of "1", or for a longer amount of time than in the last iteration, wherein the last iteration may correspond to the last time the playing neural network was called. In some examples, the playing neural network is called every two minutes. After such 2 minutes, at every iteration, the playing neural network evaluates the reward: if the baby did not cry for the 2 minutes, the reward may be "1"; if the baby did cry during the 2 minutes, the reward may be "-1". Apart from a partial reward every two minutes, as has been explained herein, the method may evaluate a "global reward" or "delayed reward". The delayed reward may give feedback about the long-term results of decisions taken at some point during the iterations. For example, the playing neural network - the 2nd neural network - may have determined to play white noise after a 2-minute iteration. The result may be that the baby cried during the 2 minutes, for which the partial reward is "-1". The baby may however stop crying after 10 minutes, so even if the "partial reward" was negative, the "delayed reward" shows that the decision to play white noise was not actually such a bad decision. The weights of the playing neural network may be updated accordingly.
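The reward logic above can be summarized, under assumptions, with a standard Q-learning update: a partial reward of +1 or -1 is computed for each 2-minute window, and the discounted bootstrap term is what lets a later, "delayed" outcome (e.g. the baby stopping 10 minutes later) be credited back to an earlier decision. The learning rate, discount factor and discrete state/action encoding below are illustrative assumptions only, not the disclosed implementation.

# Sketch of the partial reward and a Q-learning style update described above.
# The discrete state/action encoding and the alpha/gamma values are assumptions.
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9            # learning rate and discount factor (assumed)
q_table = defaultdict(float)       # Q[(state, action)] -> expected return

def partial_reward(seconds_crying_in_window):
    """+1 if the baby did not cry at all during the 2-minute window, else -1."""
    return 1.0 if seconds_crying_in_window == 0 else -1.0

def q_update(state, action, reward, next_state, actions):
    """Standard Q-learning update. The discounted max over the next state's
    actions is how a delayed outcome (e.g. the baby stopping later) propagates
    back to earlier decisions such as having played white noise."""
    best_next = max(q_table[(next_state, a)] for a in actions)
    q_table[(state, action)] += ALPHA * (
        reward + GAMMA * best_next - q_table[(state, action)]
    )
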
[0030] In some examples, the reinforcement learning playing neural network (or Deep Reinforcement Neural Network) may also customize a sound depending on at least an extracted feature of the sound signal, which may comprise an RMS, in which case the customizing process may be repeated iteratively until the RMS falls below a predetermined threshold.
[0031] In some examples, the reinforcement learning playing neural network (or Deep Reinforcement Neural Network), or a further neural network called the synthesizing neural network or 3rd neural network, for example a (deep) recurrent neural network, RNN, may further synthesize a sound depending on a current time, in which case the synaptic weights may be adapted to different times of the day and the synthesizing process may be repeated iteratively until one or more of the provided probability and the RMS fall below respective thresholds. In some examples, the neural networks may further synthesize a sound depending on a measured room or body temperature, in which case the synaptic weights may be adapted to different environmental conditions.
[0032] In some examples, playing a sound further comprises generating the at least one sound sequence, wherein generating the sound sequence comprises generating an infinite melody. In some examples, the infinite melody is generated by a (deep) recurrent neural network, RNN. An infinite melody may be a melody which is based on a base acoustic sequence and which may be built depending on an objective. For example, an RNN may provide a first acoustic sequence and a second acoustic sequence which may be juxtaposed. The RNN may observe whether an objective has been reached or approached. If the result diverges from the objective, then the RNN may modify the second acoustic sequence and juxtapose it to the first one in order to further observe the new resulting melody. The RNN may try different combinations in order to approach or achieve the objective. An undefined, or "infinite", number of sequences may be juxtaposed in order to build or create an infinite melody.
[0033] The RNN may be independently trained for each particular baby. The RNN may be trained from inputted melodies or a set of lullabies or melodies that may be in e.g. ASCII format. If melodies are in a format other than ASCII, such as e.g. MIDI format, a conversion from MIDI into ASCII may be performed. Once in ASCII format, they may be used to train each of all or part of the neural networks. These inputted melodies or set of lullabies may correspond to those melodies that have been determined as most liked by the particular baby. The inputted melodies may be inputted by a user or by the parents of a baby. Therefore, a tailored neural network may be obtained for each baby based on his/her tastes. The neural network may function character by character. For example, each character may represent a MIDI note in ASCII format. The melodies most liked by the particular baby may be determined from accumulated output produced by the playing neural network.
[0034] Once the melodies most liked by the baby have been selected, all (or some) of the different notes forming such melodies may be identified. Training may be performed according to a sliding window approach on the identified notes. Given a melody whose notes have been selected to train the NN, a window W of N (e.g. 15) consecutive notes may be defined as the training input, and the next consecutive note following the last note within the window may be defined as the expected output for said input. Once the current melody has been processed, with as many training iterations as there are consecutive notes in the current melody, a next (selected) melody may be processed according to the same or a similar sliding window approach. Training of the RNN generating the infinite melodies may be based on the melodies which are inputted. Such a set of inputted melodies may be separated into 3 groups: 70% of the melodies for training, 20% of the melodies for validation and 10% of the melodies for testing. The hyperparameters may include the diversity or temperature, which may be 0.2, 0.5, 1.5 or 1.2, and the number and type of layers, which in some examples may be an LSTM of 256 units.
[0035] Once the neural network has been trained, when in use, it may generate first a random note and later generate new notes based on the notes generated before. In this way, the neural network may generate new melodies infinitely, i.e. without a predefined or expected end, in the sense that there is always a next possible note. This neural network may have as input the last 10 notes or 20 - 30 notes (in ASCII format) generated so far. The output may be a probability distribution (e.g. a softmax function) of which note among the possible notes is more appropriate to use as the next note.
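A minimal sketch of the sliding-window preparation and the next-note network described in paragraphs [0034]-[0035] follows; the window of 15 notes, the LSTM size of 256 and the softmax output follow the text, while the embedding size, loss function and framework are assumptions.

# Sketch of the sliding-window next-note model described above. Window length
# (15) and LSTM size (256) follow the text; the rest is assumed.
import numpy as np
import tensorflow as tf

WINDOW = 15

def make_windows(note_sequence, note_to_idx):
    """Slide a 15-note window over a melody; the following note is the label."""
    xs, ys = [], []
    for i in range(len(note_sequence) - WINDOW):
        xs.append([note_to_idx[n] for n in note_sequence[i:i + WINDOW]])
        ys.append(note_to_idx[note_sequence[i + WINDOW]])
    return np.array(xs), np.array(ys)

def build_melody_rnn(vocab_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(WINDOW,)),
        tf.keras.layers.Embedding(vocab_size, 64),
        tf.keras.layers.LSTM(256),
        tf.keras.layers.Dense(vocab_size, activation="softmax"),  # next-note distribution
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model
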
[0036] In a second aspect, there is disclosed an apparatus comprising a storage medium with instructions which, when executed by a processor, cause the processor to carry out one of the methods according to the present disclosure. The apparatus may be a smartphone configured with instructions for implementing any of the methods described herein. The instructions may be provided by means of a smartphone application or Mobile App. The instructions may be provided by means of remote instructions which may be read from a cloud server by the smartphone. A neural network may be running locally on the smartphone, thereby providing better performance than in the case where a neural network runs remotely. Furthermore, privacy issues may be overcome when running a neural network locally. The Mobile App may run on the smartphone in Airplane Mode and may be equipped with logging requirements, user information, night or day records, and so on.
[0037] The apparatus may be a child's toy, or a child object. The child's toy may be equipped with any peripheral, such as gesture sensors, a battery, a charger, a microprocessor configured to implement any of the methods disclosed herein, etc. In some examples, the apparatus or child's toy or child object may be equipped with sensors configured to sense a room or body temperature. In some examples, the apparatus or child's toy or child object may be equipped with sensors configured to sense environmental conditions. In such examples, the reinforcement learning playing neural network (or Deep Reinforcement Neural Network) may synthesize a sound depending on a measured room or body temperature or on environmental conditions. In such cases, the synaptic weights may be adapted to different environmental conditions or temperatures.
[0038] The apparatus may be a case comprising a storage medium.
[0039] The features, functions, and advantages that have been discussed can be achieved independently in various examples or may be combined in yet other examples, further details of which can be seen with reference to the following description and drawings.
BRIEF DESCRIPTION OF THE FIGURES
[0040] FIG. 1 is a flow chart of an example of a method 100 according to the present disclosure.
[0041] FIG. 2 is a flow chart of an example of a method, which is implemented in a particular system 200.
[0042] FIG. 3 is a flow chart of an example of a method 300 according to the present disclosure.
[0043] FIG. 4 is a block diagram of an example of a system according to the present disclosure.
DETAILED DESCRIPTION
[0044] The following detailed description of examples refers to the accompanying drawings, which illustrate specific examples of the disclosure. Other examples having different structures and operations do not depart from the scope of the present disclosure. Reference numerals may refer to the same element or component in the different drawings.
[0045] The present disclosure may be implemented by a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present disclosure.
[0046] The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
[0047] Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
[0048] Computer readable program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some examples, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.
[0049] Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus or systems and computer program products according to examples of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
[0050] These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
[0051] FIG. 1 is a flow chart of an example of a method 100, which may be implemented by a processing means, according to the present disclosure. In block 101 the method comprises analyzing a sound signal; in block 102 the method comprises providing a probability that the sound comprises an objective sound pattern; in block 103 the method comprises customizing, by a playing neural network or 2nd neural network, a sound wherein the customized sound comprises at least a sound sequence depending on the provided probability and a historical pattern of sound signals; in block 104 the method comprises playing the customized sound wherein the customized sound comprises at least a sound sequence customized by a playing neural network depending on the provided probability and a historical pattern of sound signals. In examples, the sound signal analyzed (101) is the voice of a baby. The analysis provides a probability that the baby is crying. The playing neural network or 2nd neural network infers for such baby which sound should be reproduced depending on the probability and the input voice, and customizes (103) an output sound which is played (104) by the method of the present example.
[0052] In some examples, which may be part of the example of figure 1, the playing neural network customizes a sound in the following manner: the playing neural network selects a sound from a list of 30-40 predefined sounds and decides when to activate them, for example at night, depending on a baby's sleep patterns. Such a playing neural network may be directly trained on a user's mobile phone, building a customized neural network for each baby. For example, a baby may wake up every day around 12 o'clock at night because a garbage truck passes by the house at that time making noise; the playing neural network may therefore learn to anticipate those moments and may select white noise or any other sound the playing neural network understands works best to keep the baby sleeping despite the noise that the garbage truck makes. In addition, the sound volume may also be modified to try to improve the baby's sleep. Different input variables may be used to create the playing neural network, which in some examples is a reinforcement learning neural network (or Deep Reinforcement Neural Network).
[0053] A method for customizing the sound sequence may comprise receiving:
- a historical pattern of sound signals, for example: a time, for example in seconds, that a baby has not cried, and/or a time the baby has been crying and/or a time since the baby started sleeping and/or age of the baby, for example in months,
- the probability which may represent a current state of cry or not cry,
- a feature of the sound signal, which may be a current noise or its estimation through an RMS value,
- a current timestamp or current date and/or time.
[0054] The method may further comprise periodically providing an output sound for maximizing a reward and evaluating the reward after a predefined amount of time. For example, the playing neural network may be called periodically, for example every minute, and the playing neural network may provide a customized output sound, for example a white noise which may be played at a certain volume. The objective of the playing neural network is to maximize a reward, wherein a reward may comprise a number of seconds a baby does not cry within a predefined amount of time, for example within the last minute or within the last hour. After such a predefined amount of time, the playing neural network evaluates what happened, for example how many seconds the baby has cried within the last minute, and uses the evaluation as a performance measure. For example, a Q-learning algorithm may be used, which tries to maximize the reward, the reward being that the baby does not cry for the longest amount of time, or for a longer amount of time than the last time the playing neural network was called.
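To make the inputs listed above concrete, the sketch below packs them into a single state representation such as might be fed to the playing neural network at every periodic call; the field names and normalization constants are illustrative assumptions only.

# Illustrative state representation for the playing (2nd) neural network, built
# from the inputs listed above. Field names and normalizations are assumptions.
from dataclasses import dataclass

@dataclass
class PlayingState:
    seconds_not_crying: float        # historical pattern
    seconds_crying: float
    seconds_asleep: float
    age_months: float
    cry_probability: float           # output of the 1st (analyzing) network
    current_rms: float               # feature of the incoming sound signal
    seconds_since_midnight: float    # current timestamp

    def as_vector(self):
        """Roughly normalized vector handed to the network at each call."""
        return [
            self.seconds_not_crying / 3600.0,
            self.seconds_crying / 3600.0,
            self.seconds_asleep / 3600.0,
            self.age_months / 24.0,
            self.cry_probability,
            self.current_rms,
            self.seconds_since_midnight / 86400.0,
        ]
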
[0055] In some examples, the methods may comprise further generating the sound sequence by a synthesizing neural network. The generation of a sound sequence may comprise receiving:
- a historical pattern of sound signals, for example: a time, for example in seconds, that a baby has not cried, and/or a time the baby has been crying and/or a time since the baby started sleeping and/or age of the baby, for example in months,
- the probability which may represent a current state of cry or not cry,
- a feature of the sound signal, which may be a current noise or its estimation through an RMS value,
- a current timestamp or current date and/or time.
[0056] The method may further comprise periodically providing an output sound for maximizing a reward and evaluating the reward after a predefined amount of time. For example, the synthesizing neural network may be called periodically, for example every minute, and the synthesizing neural network may provide an output sound, for example a white noise which may be played at a certain volume. The objective of the synthesizing neural network is to maximize a reward, wherein a reward may comprise a number of seconds a baby does not cry within a predefined amount of time, for example within the last minute or within the last hour. After such a predefined amount of time, the synthesizing neural network evaluates what happened, for example how many seconds the baby has cried within the last minute, and uses the evaluation as a performance measure. For example, a Q-learning algorithm may be used, which tries to maximize the reward, the reward being that the baby does not cry for the longest amount of time, or for a longer amount of time than the last time the playing neural network was called. For maximizing the reward, the synthesizing neural network may create melodies or may juxtapose different acoustic sequences which may approach an objective, for example calming a baby.
[0057] FIG. 2 is a flow chart of an example of a method 200 according to the present disclosure. The method comprises providing 201 a sound signal. In particular examples, the sound is a cry and is received by a microphone 202. The method further comprises analyzing 203 the sound signal, wherein analyzing 203 the sound signal comprises receiving 204, by a convolutional analyzing neural network, 4 time sequences of 38 coefficients of a Mel filter of the sound signal, a Zero Crossing Rate, ZCR, of the sound signal and a root mean square, RMS, of the sound signal. The method 200 comprises providing 205 a probability P that the sound comprises a baby's cry. The method 200 comprises customizing 206, by a playing neural network, a customized sound wherein the customized sound comprises at least a sound sequence depending on the provided probability P and a historical pattern of the sound signal provided at 201, which may comprise historical records of the baby crying, the intensity of the cry, historical audio parameters, such as RMS, and times of the day when the baby has historically cried. The method comprises playing 207 the customized sound. The method may be implemented in a mobile phone comprising the microphone 202 and a speaker 208. The playing neural network may be a reinforcement learning neural network, and more particularly a Deep Reinforcement Learning neural network (NN).
[0058] This (Deep) Reinforcement Learning NN may receive as input all or some of the following data: a temporal sequence of current audio features (e.g. RMS value(s) that may indicate current noise in a room or the like), the output from the convolutional analyzing neural network (the baby cry detector, to know if the baby is currently crying or not), the time during which the baby has been sleeping, the time of day in seconds (from 00:00), etc., although other input parameters could be relevant and accordingly used. The (deep) reinforcement learning NN may produce as output: data to reproduce some predefined sound, and/or a call to the recurrent neural network (RNN) to generate customized lullabies, and/or simply nothing. These outputs may further comprise data indicating the volume at which the predefined sound and/or customized lullaby or lullabies will be reproduced.
[0059] The (deep) reinforcement learning NN may produce, for example, two outputs. The first output may be a value indicating a sound to be played selected from a predefined list of sounds, or that nothing is to be played at this time, or that the recurrent neural network is to be called for customized lullaby generation for a given time (e.g. about 10 minutes). The second output may be a value (e.g. between 0 and 1) indicating the volume at which the predefined sound or customized lullaby indicated by the first output is to be played. Continuous calls to the (deep) reinforcement learning NN may be performed to determine which sound is to be reproduced at each time. Such continuous calls may be suspended during reproduction of a sound/lullaby and resumed once the sound/lullaby has ended.
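As a hedged sketch of how the two outputs just described might be interpreted (the disclosure does not fix an encoding), the first output is treated below as an index into a predefined sound list plus two reserved values for "play nothing" and "call the melody RNN", and the second output as a volume clamped to the range 0 to 1; the sound names and the encoding are assumptions.

# Hedged sketch of interpreting the playing network's two outputs: a sound
# selector and a volume in [0, 1]. The action encoding is an assumption.
PREDEFINED_SOUNDS = ["white_noise", "rain", "heartbeat"]   # hypothetical list
SILENCE = len(PREDEFINED_SOUNDS)            # reserved index: play nothing
GENERATE_LULLABY = SILENCE + 1              # reserved index: call the melody RNN

def interpret_outputs(sound_index, volume):
    volume = min(max(volume, 0.0), 1.0)     # clamp the second output to [0, 1]
    if sound_index == SILENCE:
        return ("silence", 0.0)
    if sound_index == GENERATE_LULLABY:
        return ("rnn_lullaby", volume)      # e.g. generate for about 10 minutes
    return (PREDEFINED_SOUNDS[sound_index], volume)
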
[0060] FIG. 3 is a flow chart of an example of a method for training a convolutional neural network for classifying a baby’s cry or for giving a probability P that a baby is crying. The example method for training comprises providing 301 a set of 4 time sequences of audio features, each comprising:
- a label identifying whether there is a cry in a sound and
- 40 audio features: 38 Mel coefficients, an RMS value and a ZCR value.
[0061] Said sequences are used to train the neural network by iterating 302 over the provided labels and audio features. An optimizer of the ADAM type may be used. A loss function of the "binary cross-entropy" type may also be used. The training may be performed for a period such that a balance is achieved between good training and not falling into an overfitting situation.
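A hedged sketch of such a training set-up is given below, using an Adam optimizer, a binary cross-entropy loss and early stopping as the balance against overfitting. The exact layer stack, the framework (TensorFlow/Keras) and the remaining hyper-parameters are assumptions of the sketch; the disclosure only fixes the inputs, the optimizer type and the loss type.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_cry_classifier(n_steps=4, n_features=40):
    """Small 1-D convolutional classifier over 4 steps of 40 audio features."""
    model = tf.keras.Sequential([
        layers.Conv1D(32, kernel_size=2, activation="relu",
                      input_shape=(n_steps, n_features)),
        layers.Conv1D(64, kernel_size=2, activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(32, activation="relu"),
        layers.Dense(1, activation="sigmoid"),     # probability P of a cry
    ])
    model.compile(optimizer="adam",                # ADAM-type optimizer
                  loss="binary_crossentropy",      # binary cross-entropy loss
                  metrics=["accuracy"])
    return model

def train(model, x, y):
    """x: (n_samples, 4, 40) feature sequences; y: (n_samples,) cry labels."""
    stop = tf.keras.callbacks.EarlyStopping(       # guard against overfitting
        monitor="val_loss", patience=5, restore_best_weights=True)
    return model.fit(x, y, validation_split=0.2, epochs=100, callbacks=[stop])
```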
[0062] In particular examples of the present disclosure, a first neural network, or convolutional neural network, may first classify sound signals as cry or not cry; secondly, a second neural network, or playing reinforcement learning neural network, may adapt to the different sleep routines of different babies. Finally, a third neural network, or recurrent neural network, RNN, may generate music. Such a process provides for intelligent baby sound tracking based on a particular baby’s sleep routines. In some examples, the generation of music may generate infinite lullabies or infinite melodies based on the baby’s preferred sounds. The generation of infinite lullabies may be performed in an infinite loop, and with different notes each time, to help the baby sleep well. The recurrent neural network, RNN, may create the sound of the lullaby in a text format which may be transformed to MIDI format and be played.
[0063] The RNN may be trained from a set of lullabies or melodies which may be in ASCII format. If necessary, melodies in a format different from ASCII may be reformatted into ASCII format for proper training. For example, melodies in MIDI format may be transformed into ASCII format prior to the training. Melodies in ASCII format may then be utilized to train the corresponding RNN(s), which may thus operate on a character-by-character basis. The melodies to be used for training the RNN(s) may have been previously selected and inputted to the network. For example, in the case of a baby, the melodies may have been determined as the most appropriate or most liked melodies according to each baby’s tastes, so that tailored RNN(s) may be provided depending on each targeted baby. The melodies most liked by a particular baby may be determined depending on or from the accumulated output produced by the (reinforcement learning) playing neural network during a given time. Output(s) of the playing neural network may be saved and/or accumulated to determine which of the reproduced melodies or lullabies return a greater or better "reward" (e.g. the baby cries less).
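Purely as an illustrative assumption, the following sketch shows one possible ASCII encoding of notes, operating on MIDI note numbers that are assumed to have already been extracted from the MIDI files; the particular character mapping is not taken from the disclosure.

```python
def notes_to_ascii(midi_notes):
    """Encode MIDI note numbers (0-127) as printable ASCII characters."""
    # Shift into the printable range 33 ('!') .. 126 ('~'); clamp the rest.
    return "".join(chr(33 + min(max(n, 0), 93)) for n in midi_notes)

def ascii_to_notes(text):
    """Inverse mapping back to note numbers."""
    return [ord(c) - 33 for c in text]
```

For example, notes_to_ascii([60, 62, 64]) would return the three-character string "]_a", which could then be used for character-by-character training.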
[0064] Once the melodies most liked by the baby have been selected and inputted, all (or some) of the different notes forming such melodies may be identified. A dictionary of notes may be formed from the identified notes, which may be suitably formatted to provide the NN to be trained with appropriate training (and validation) input and expected output. A window of several (e.g. 15) identified notes may be used to train the NN. Training may be performed according to a sliding window approach. Given a melody whose notes have been selected to train the NN, a window W of N (e.g. 15) consecutive notes may be defined as input, and the next consecutive note following the last note within the window may be defined as expected output for said input. When window W (to be inputted) includes from melody(X) to melody(Y), that is, from the Xth to the Yth consecutive note in the melody, the consecutive note Y+1 in the melody, or melody(Y+1), may be defined as expected output (i.e. the label or prediction that the NN has to learn) for said input.
[0065] Once a current melody has been processed as defined above, with as many training iterations as there are consecutive notes forming the current melody, a next (selected) melody may be processed according to the same or a similar sliding window approach.
[0066] At first training iteration (I = 1), it may be set that W(1) = melody(1), W(2..N) = 0, and label/output = melody(2). At second training iteration (I = 2), it may be set that W(1..2) = melody(1..2), W(3..N) = 0, and label/output = melody(3). At Nth training iteration (I = N), it may be set that W(1..N) = melody(1..N), and label/output = melody(N+1). At (N+1)th training iteration (I = N+1), it may be set that W(1..N) = melody(2..N+1), and label/output = melody(N+2). At (N+2)th training iteration (I = N+2), it may be set that W(1..N) = melody(3..N+2), and label/output = melody(N+3). This algorithm may be formally expressed as follows: at each iteration I <= N, window W may be provided as input with W(1..I) = melody(1..I) and W(I+1..N) = 0, and label L may be provided as expected output with L = melody(I+1); and at each iteration I > N, window W may be provided as input with W(1..N) = melody(1+I-N..I), and label L may be provided as expected output with L = melody(I+1).
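The sliding-window scheme of paragraph [0066] may be transcribed directly into code as in the following sketch, where notes are assumed to be already encoded as integers (e.g. as indices into the note dictionary) and 0 is used as the padding value.

```python
def sliding_window_pairs(melody, n=15, pad=0):
    """Yield (window, label) pairs for one melody (a list of integer notes)."""
    for i in range(1, len(melody)):          # the label needs note I+1 to exist
        if i <= n:
            window = melody[:i] + [pad] * (n - i)   # W(1..I) filled, rest zero
        else:
            window = melody[i - n:i]                # W(1..N) = melody(1+I-N..I)
        yield window, melody[i]                     # label L = melody(I+1)
```

For instance, with n = 3 the melody [60, 62, 64, 65] yields the pairs ([60, 0, 0], 62), ([60, 62, 0], 64) and ([60, 62, 64], 65), matching the iterations described above.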
[0067] Once the training has been completed, because a sufficient or desired variety of samples has been used for training and/or because post-training verifications have confirmed adequate functioning of the neural network, the RNN(s) may operate by firstly generating a random note and, subsequently, new notes based on previously generated notes. This may imply that the RNN(s) may generate new musical notes infinitely, i.e. without a predefined or expected final musical note. The RNN(s) may take into account a history of previously generated notes. Said history may include e.g. the last 20 - 30 notes (in ASCII format) generated before by the RNN itself. The RNN(s) may output as a result a probability distribution over notes, denoting which particular note from all possible notes is probably the most appropriate note to be used next. This probability distribution may be generated by using a function aimed at that purpose, such as e.g. the softmax function.
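The generation loop may be sketched as follows; the rnn_predict callable stands for the trained model's next-note scoring function and, together with the temperature parameter, is an assumption of this sketch rather than a feature of the disclosure.

```python
import numpy as np

def generate_notes(rnn_predict, note_vocab, history_len=30, temperature=1.0):
    """Endlessly yield notes sampled from the RNN's softmax distribution."""
    history = [np.random.choice(note_vocab)]           # the first note is random
    while True:
        scores = np.asarray(rnn_predict(history[-history_len:]))
        probs = np.exp(scores / temperature)
        probs /= probs.sum()                            # softmax over all notes
        history.append(np.random.choice(note_vocab, p=probs))
        yield history[-1]
```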
[0068] Generating an infinite lullaby may comprise (a schematic sketch is given after this list):
- selecting a first melody feature parameter from a plurality of melody feature parameters;
- generating a first melody based on the first melody feature parameter and a music style;
- generating a second melody based on the first melody, the second melody being different from the first melody, and a music style of the second melody being the same as the music style of the first melody;
- juxtaposing the second melody to the first melody.
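The sketch below schematically illustrates these steps; the melody feature parameters and the generate_melody placeholder are purely illustrative and do not form part of the disclosure, which delegates actual melody generation to the trained RNN.

```python
import random

MELODY_FEATURES = ["slow_tempo", "soft_timbre", "narrow_pitch_range"]  # assumed

def generate_melody(feature, style):
    """Placeholder generator: a real implementation would call the trained RNN
    conditioned on the feature parameter while keeping the given music style."""
    return [random.randint(48, 72) for _ in range(32)]   # 32 illustrative notes

def infinite_lullaby(style="lullaby"):
    """Juxtapose melody after melody, all sharing the same style."""
    feature = random.choice(MELODY_FEATURES)   # first melody feature parameter
    while True:
        for note in generate_melody(feature, style):
            yield note
```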
[0069] In some examples, the generated sound may comprise dialogs based on decision graphs which may understand a user's language, and a mobile App may provide conversational chatbot intent analysis or a conversational decision graph based on the interaction of a user or parent expert.
[0070] In some examples, intelligent baby sound tracking, for example night tracking, may allow the methods, apparatuses and systems of the present disclosure to provide calming sounds, environmental tips about the room, such as the temperature or the light, etc. Depending on the behavior of the child, a personalized plan may be provided which may support and enhance parenting. The systems of the present disclosure may be understood as scalable and evolving health systems, which in some cases may be low-cost systems, due to the implementation of the methods of the present disclosure in mobile applications.
[0071] FIG. 4 is a block diagram of an example of a system according to the present disclosure. It shows how each of the three neural networks making up this example may be integrated and may work in coordination, and what the input and output of each neural network are. As shown in the figure, systems according to the present disclosure may comprise a convolutional neural network 401 to classify input sound signal(s) 400 as cry or non-cry, a deep reinforcement learning neural network 404 to select optimal actions or to determine which type of customized sound is to be played depending on the classification performed by the convolutional neural network 401, and a deep recurrent neural network 406 to generate customized infinite melody or melodies if the reinforcement learning neural network 404 has selected or determined such an action as to be performed. Details about such convolutional neural network 401, deep reinforcement learning neural network 404 and deep recurrent neural network 406 that may be used in systems according to the present disclosure are provided in other parts of the disclosure.
[0072] As illustrated in FIG. 4, the convolutional neural network 401 may receive as input 400 one or more sequences of audio features, such as e.g. MFCC, RMS, ZCR, etc., to detect cry or non-cry depending on them. Such audio features that may be received and processed by the convolutional neural network 401 are described in detail in other parts of the disclosure. Also according to FIG. 4, Q-learning may be used by the deep reinforcement learning neural network 404 for optimal action selection. Apart from the cry or non-cry indication produced by the convolutional neural network 401, the deep reinforcement learning neural network 404 may further receive environmental parameters as input to perform fully customized action selection depending also on them. Such environmental parameters may comprise e.g. crying time, sleeping time, RMS (audio feature), temperature, etc. 402 and/or playing state, volume, time, day of year, etc. 403. The playing state may indicate e.g. whether some melody or sound(s) is being played, which melody or sound(s) is being played, the point or part of the melody that is being played, etc. It is further shown in the figure that the deep reinforcement learning neural network 404 may play predefined sounds/melodies (SOUND-1, SOUND-2 ... SOUND-N) or stop sounds/melodies 405 instead of instructing the deep recurrent neural network 406 to generate or play an infinite personalized lullaby or lullabies.
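A possible coordination of the three networks of FIG. 4 may be sketched as follows; the callables record_audio, cnn_model, rl_agent, rnn_generator and player, as well as the one-minute polling interval, are assumptions of the sketch only.

```python
import time

def control_loop(record_audio, cnn_model, rl_agent, rnn_generator, player,
                 interval_s=60):
    """Coordinate the three networks of FIG. 4 with periodic calls."""
    while True:
        features = record_audio(interval_s)            # e.g. Mel/RMS/ZCR frames
        p_cry = cnn_model(features)                    # cry / non-cry probability
        env = {"p_cry": p_cry, "time": time.time(),
               "playing": player.state()}              # environmental parameters
        action, volume = rl_agent(env)                 # optimal action selection
        if action == "rnn":
            player.play(rnn_generator(), volume)       # infinite personalized lullaby
        elif action == "stop":
            player.stop()                              # stop sounds/melodies
        else:
            player.play_predefined(action, volume)     # SOUND-1 ... SOUND-N
        time.sleep(interval_s)                         # next periodic call
```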
[0073] The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various examples of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
[0074] The terminology used herein is for the purpose of describing particular examples only and is not intended to be limiting of examples of the disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “include,” “includes,” "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
[0075] The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present examples has been presented for purposes of illustration and description but is not intended to be exhaustive or limited to examples in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of examples.
[0076] Although specific examples have been illustrated and described herein, those of ordinary skill in the art appreciate that any arrangement which is calculated to achieve the same purpose may be substituted for the specific examples shown and that the examples have other applications in other environments. This application is intended to cover any adaptations or variations. The following claims are in no way intended to limit the scope of examples of the disclosure to the specific examples described herein.

Claims

1. A computer implemented method (100) for playing a sound comprising:
- analyzing one or more sound signals which comprises analyzing or evaluating acoustic features of the one or more sound signals;
- providing a probability that the sound signal comprises an objective sound pattern based on the analyzed features of the one or more sound signals;
- playing a customized sound wherein the customized sound comprises at least a sound sequence customized by a playing neural network (404) depending on:
- the provided probability;
- a historical pattern of the one or more sound signals.
2. The computer implemented method (100) for playing a sound according to claim 1 wherein analyzing a sound signal comprises receiving at least extracted features of the sound signal by an analyzing neural network (401).
3. The computer implemented method (100) for playing a sound according to any one of claims 1 to 2 wherein analyzing a sound signal comprises receiving, by an analyzing neural network, at least a sequence of:
- a number N of coefficients of a Mel filter of the sound signal, N preferably being 38 coefficients; and
- a Zero Crossing Rate, ZCR, of the sound signal; and
- a root mean square, RMS, of the sound signal.
4. The computer implemented method (100) for playing a sound according to any one of claims 1 to 3 wherein analyzing a sound signal comprises
- training (300) a neural network (401) using sound data; and
- using the trained neural network to analyze (101 , 203) a sound signal.
5. The computer implemented method (100) for playing a sound according to claim 4 wherein training (300) the neural network (401) using sound data comprises:
- providing (301) to the neural network
- a label identifying whether there is a cry in the sound data, and
- audio features of the sound data;
- iterating (302) until the validation loss and/or validation accuracy of the neural network (401) stop improving or begin to worsen, or until overfitting occurs.
6. The computer implemented method (100) for playing a sound according to any one of claims 1 to 5 further comprising training the playing neural network (404), wherein training the playing neural network comprises:
- iteratively receiving training parameters, the training parameters being a provided probability that the sound signal comprises an objective sound pattern, and the historical pattern of the one or more sound signals;
- changing synaptic weights forming the playing neural network (404) until they converge to a set of optimal weights;
- adjusting the synaptic weights to give a customized sound.
7. The computer implemented method of claim 6 wherein the playing neural network (404) is a deep reinforcement Neural Network.
8. The computer implemented method of claim 7 wherein
- the training parameters comprise a provided probability that the sound signal comprises an objective sound pattern wherein the sound pattern is a baby crying sound signal; and wherein the method further comprises:
- synthesizing a sound which is played by a transducer,
- changing the synaptic weights by the reinforcement playing neural network (404) if a following provided probability indicates that the baby continues crying,
- if a further provided probability is lower than the two previous provided probabilities, then the reinforcement playing neural network (404) learns that the synaptic weights are being changed in the right direction, meaning that the baby is crying with less probability.
9. The computer implemented method (100) for playing a sound according to any of claims 1 to 8 wherein customizing the sound sequence comprises:
- receiving at least a feature of the sound signal, a historical pattern of sound signals, the probability, a current timestamp, and a state;
- periodically providing an output sound for maximizing a reward and evaluating the reward after a predefined amount of time.
10. The computer implemented method (100) for playing a sound according to any one of claims 1 to 9 wherein playing a sound further comprises generating the at least sound sequence, wherein generating the sound sequence comprises generating an infinite melody by a recurrent neural network (406) RNN, wherein an infinite melody is a melody without predefined or expected end in the sense that there always exists a next possible note.
11. The computer implemented method according to claim 10 wherein the recurrent neural network, RNN, is trained based on inputted melodies.
12. The computer implemented method according to claim 11 wherein the recurrent neural network (406), RNN, is trained according to a sliding window approach, which is based on a window of notes sliding along consecutive notes conforming a current melody within melodies most liked by a particular baby.
13. The computer implemented method according to claim 12 wherein the RNN (406) is trained from a set of melodies in ASCII format, wherein the training may comprise:
- inputting the set of melodies to the RNN;
- identifying the notes forming such melodies;
- defining a window of several identified notes as input;
- defining a next consecutive note following a last consecutive note within the window as expected output for said input.
14. The computer implemented method according to claim 12 wherein the sliding window approach includes performing as many training iterations as consecutive notes conform the current melody, wherein: at each iteration I <= N, window W is provided as input with W(1..I) = melody(1..I) and W(I+1..N) = 0, and label L is provided as expected output with L = melody(I+1); and at each iteration I > N, window W is provided as input with W(1..N) = melody(1+I-N..I), and label L is provided as expected output with L = melody(I+1); wherein
I = 1, 2... to number of consecutive notes forming current melody,
W is the window of notes sliding along consecutive notes forming current melody,
W(X..Y) is part of the window W from Xth position to Yth position in W,
melody(X..Y) is part of the melody from Xth to Yth consecutive note in melody, and
melody(X) is Xth consecutive note in melody.
15. The computer implemented method (100) for playing a sound according to any one of claims 1 to 14 wherein the sound signal is a human voice or human cry and the sound played is a melody.
16. An apparatus comprising a storage medium with instructions which, when executed by a processor, cause the processor to carry out the method (100) according to any one of claims 1 to 15.
17. An apparatus according to claim 16 wherein the apparatus comprises a smartphone, or a tablet, or a child’s toy or a child object.
18. A computer program product comprising instructions which, when executed by a processor, cause the processor to carry out a method for generating at least a sound sequence according to any of claims 1 to 15.
PCT/EP2021/072037 2020-08-07 2021-08-06 Smart learning method and apparatus for soothing and prolonging sleep of a baby WO2022029305A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP20382741 2020-08-07
EP20382741.5 2020-08-07

Publications (1)

Publication Number Publication Date
WO2022029305A1 true WO2022029305A1 (en) 2022-02-10

Family

ID=72039527

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2021/072037 WO2022029305A1 (en) 2020-08-07 2021-08-06 Smart learning method and apparatus for soothing and prolonging sleep of a baby

Country Status (1)

Country Link
WO (1) WO2022029305A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6297439B1 (en) * 1998-08-26 2001-10-02 Canon Kabushiki Kaisha System and method for automatic music generation using a neural network architecture
WO2015102921A1 (en) 2014-01-03 2015-07-09 Gracenote, Inc. Modifying operations based on acoustic ambience classification
US20190103094A1 (en) * 2017-09-29 2019-04-04 Udifi, Inc. Acoustic and Other Waveform Event Detection and Correction Systems and Methods
CN110197677A (en) 2019-05-16 2019-09-03 北京小米移动软件有限公司 A kind of control method for playing back, device and playback equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DUA MOHIT ET AL: "An Improved RNN-LSTM based Novel Approach for Sheet Music Generation", PROCEDIA COMPUTER SCIENCE, ELSEVIER, AMSTERDAM, NL, vol. 171, 1 January 2020 (2020-01-01), pages 465 - 474, XP086172231, ISSN: 1877-0509, [retrieved on 20200604], DOI: 10.1016/J.PROCS.2020.04.049 *
NIKHIL KOTECHA ET AL: "Generating Music using an LSTM Network", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 18 April 2018 (2018-04-18), XP081226795 *
SANIDHYA MANGAL ET AL: "LSTM Based Music Generation System", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 3 August 2019 (2019-08-03), XP081455245, DOI: 10.17148/IARJSET.2019.6508 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022154667A1 (en) * 2021-01-14 2022-07-21 Colic Button As An audio device, method and system for alleviating symptoms of colic in infants
CN117045930A (en) * 2023-10-12 2023-11-14 北京动亮健康科技有限公司 Training method, system, improving method, equipment and medium for sleep improving model
CN117045930B (en) * 2023-10-12 2024-01-02 北京动亮健康科技有限公司 Training method, system, improving method, equipment and medium for sleep improving model


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21755780

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 19.05.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21755780

Country of ref document: EP

Kind code of ref document: A1