CN111276159A - Infant pronunciation analysis method and server - Google Patents


Info

Publication number
CN111276159A
CN111276159A (application CN201811480184.0A)
Authority
CN
China
Prior art keywords
audio data
physiological state
spectrogram
judgment result
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811480184.0A
Other languages
Chinese (zh)
Inventor
赵志英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Health Information Technology Ltd
Original Assignee
Alibaba Health Information Technology Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Health Information Technology Ltd filed Critical Alibaba Health Information Technology Ltd
Priority claimed from application CN201811480184.0A
Published as CN111276159A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 — characterised by the analysis technique
    • G10L25/30 — using neural networks
    • G10L25/48 — specially adapted for particular use
    • G10L25/51 — for comparison or discrimination
    • G10L25/63 — for estimating an emotional state
    • G10L25/66 — for extracting parameters related to health condition

Abstract

The application discloses an infant pronunciation analysis method and a server. The method includes: framing audio data of a sound made by an infant, and determining spectral values of one or more audio frames in the audio data; generating a spectrogram of the audio data from the spectral values of the one or more audio frames; and determining, from the spectrogram, a physiological state judgment result corresponding to the audio data by using a machine learning component. The physiological state corresponding to the sound made by the infant can thereby be analyzed accurately.

Description

Infant pronunciation analysis method and server
Technical Field
The embodiments of the specification relate to the technical field of speech processing, and in particular to an infant pronunciation analysis method and a server.
Background
This section is intended to provide a background or context to the embodiments of the specification that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
Speech data contains a great deal of information. For example, when an infant has physiological needs or shows signs of illness, this is often expressed through crying. Medical research shows that an infant's crying signal has great potential medical value and carries clues to various diseases. If the meanings of different infant cries can be correctly distinguished, this can be a useful aid for detecting diseases early and for caring for infants.
Therefore, there is a need for a solution that can accurately analyze the information contained in audio data of an infant's cry.
Disclosure of Invention
The embodiments of the present disclosure provide an infant pronunciation analysis method and a server, which are used to accurately analyze the physiological state corresponding to a sound made by an infant.
The infant pronunciation analysis method and the server provided by the embodiments of the specification are implemented as follows:
in a first aspect, a method for analyzing baby pronunciation is provided, which is applied to a server, and the method includes:
framing audio data of a sound made by an infant, and determining spectral values of one or more audio frames in the audio data; generating a spectrogram of the audio data from the spectral values of the one or more audio frames; and determining, from the spectrogram, a physiological state judgment result corresponding to the audio data by using a machine learning component.
In a second aspect, a server is provided, including:
a preprocessing module, configured to frame audio data of a sound made by an infant and determine spectral values of one or more audio frames in the audio data;
a spectrogram determining module, configured to generate a spectrogram of the audio data according to the spectral values of the one or more audio frames;
and the result obtaining module is used for determining a physiological state judgment result corresponding to the audio data from the spectrogram by using a machine learning component.
In a third aspect, a method for analyzing pronunciation of a baby is provided, which is applied to a user device, and includes:
sending audio data of a sound made by an infant to a server; and receiving a physiological state judgment result corresponding to the audio data fed back by the server; wherein the physiological state judgment result corresponding to the audio data is obtained by the server as follows: framing the audio data of the sound made by the infant, and determining spectral values of one or more audio frames in the audio data; generating a spectrogram of the audio data from the spectral values of the one or more audio frames; and determining, from the spectrogram, a physiological state judgment result corresponding to the audio data by using a machine learning component.
In a fourth aspect, a user equipment is provided, including:
the sending module is used for sending audio data of sound emitted by the baby to the server;
the receiving module is used for receiving a physiological state judgment result corresponding to the audio data fed back by the server; wherein the physiological state judgment result corresponding to the audio data is obtained by the server as follows: framing the audio data of the sound made by the infant, and determining spectral values of one or more audio frames in the audio data; generating a spectrogram of the audio data from the spectral values of the one or more audio frames; and determining, from the spectrogram, a physiological state judgment result corresponding to the audio data by using a machine learning component.
In a fifth aspect, a method for analyzing pronunciation of an infant is provided, which is applied to a machine learning component, and the method includes:
receiving, from a server, a spectrogram of audio data of a sound made by an infant, wherein the spectrogram of the audio data is generated by the server by framing the audio data of the sound made by the infant, determining spectral values of one or more audio frames in the audio data, and generating the spectrogram from the spectral values of the one or more audio frames; obtaining a physiological state judgment result corresponding to the audio data by using a machine learning component according to the spectrogram of the audio data; and returning the physiological state judgment result corresponding to the audio data to the server.
In a sixth aspect, a machine learning component is provided, comprising:
the receiving module is used for receiving, from a server, a spectrogram of audio data of a sound made by an infant, wherein the spectrogram of the audio data is generated by the server by framing the audio data of the sound made by the infant, determining spectral values of one or more audio frames in the audio data, and generating the spectrogram from the spectral values of the one or more audio frames;
the physiological state judgment module is used for obtaining a physiological state judgment result corresponding to the audio data by using a machine learning component according to the spectrogram of the audio data;
and the feedback module is used for returning the physiological state judgment result corresponding to the audio data to the server.
In a seventh aspect, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of the first, third and fifth aspects when executing the computer program.
In an eighth aspect, there is provided a computer-readable storage medium storing a computer program for executing the method of any one of the first, third, and fifth aspects.
According to the above technical solution, audio data of a sound made by an infant is framed, and the spectral values of one or more audio frames in the audio data are determined; a spectrogram of the audio data is generated from the spectral values of the one or more audio frames; and a physiological state judgment result corresponding to the audio data is determined from the spectrogram by using a machine learning component. The physiological state corresponding to the sound made by the infant can thereby be analyzed accurately.
In order to make the aforementioned and other objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present specification, and that those skilled in the art can obtain other drawings from them without creative effort. In the drawings:
fig. 1 is a schematic diagram of an applicable scenario provided in an embodiment of the present specification;
fig. 2 is a schematic interactive flow chart of a method for analyzing an infant pronunciation according to an embodiment of the present disclosure;
FIG. 3 is a schematic spectrogram provided in an example of the present specification;
FIG. 4 is a schematic diagram of a conventional convolutional neural network model provided in an embodiment of the present specification;
fig. 5 is a schematic diagram of a server provided in an embodiment of the present specification;
fig. 6 is a schematic diagram of a user equipment provided in an embodiment of the present specification;
FIG. 7 is a schematic illustration of a machine learning component provided in an embodiment of the present description;
fig. 8 is a flowchart illustrating a method for analyzing pronunciation of an infant according to an embodiment of the present disclosure;
FIG. 9 is a flow chart of another method for analyzing pronunciation of an infant provided in an embodiment of the present disclosure;
FIG. 10 is a flow chart of another method for analyzing pronunciation of an infant provided in an embodiment of the present disclosure;
fig. 11 is an interactive flow diagram of a method for analyzing an infant pronunciation according to an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the embodiments of the present disclosure are described in further detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the embodiments are provided to explain the embodiments and should not be construed as limiting the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The principles and spirit of the embodiments of the present disclosure are explained in detail below with reference to several representative embodiments of the present disclosure.
Although the embodiments herein provide the method operation steps or apparatus structures as shown in the following embodiments or figures, more or less operation steps or module units may be included in the method or apparatus based on conventional or non-inventive labor. In the case of steps or structures which do not logically have the necessary cause and effect relationship, the execution order of the steps or the block structure of the apparatus is not limited to the execution order or the block structure shown in the embodiments or the drawings of the present specification. The described methods or modular structures, when applied in an actual device or end product, may be executed sequentially or in parallel according to embodiments or the methods or modular structures shown in the figures.
Deep learning, a machine learning method based on learning representations of data, originates from research on artificial neural networks. An observation (e.g., an image) can be represented in many ways, such as a vector of per-pixel intensity values, or more abstractly as a set of edges, regions of particular shapes, and so on. The benefit of deep learning is that it replaces manual feature engineering with efficient algorithms for unsupervised or semi-supervised feature learning and hierarchical feature extraction. The motivation is to build neural networks that simulate the analysis and learning of the human brain, combining low-level features into more abstract high-level attribute categories or features, so that data can be interpreted through distributed feature representations in a way that imitates the mechanisms of the human brain.
The embodiments of the present disclosure provide a method and a server for analyzing voice data that can accurately analyze the information contained in the voice data; for example, they can be applied to analyzing an infant's cry to obtain the physiological state and the coping strategy corresponding to the cry. One implementation provided by the embodiments of the specification can be applied to a system framework comprising a client, a server, and a machine learning component.
Fig. 1 shows a scenario to which the embodiments apply. Referring to fig. 1, when an infant cries, a user starts the physiological-state pre-judging software or function on a user device (which has at least a recording function), and the user device records the ambient sound to capture the infant's cry. The user device is, for example, a smartphone, a tablet computer, or a smart wearable device. The client may be provided with a communication module and may be communicatively connected to a remote server for data transmission. The server may be a single computer device, a server cluster formed by a plurality of servers, or a distributed server architecture. The server preprocesses the recorded speech, transmits the preprocessing result to the deep learning model component, obtains the physiological state judgment result corresponding to the speech and a matching coping strategy, and returns the physiological state judgment result to the client for display to the user. The user acts according to the obtained physiological state judgment result and the matching coping strategy and gives feedback through the client.
It should be noted that the embodiments of the present specification can be applied to the scenarios including, but not limited to, the above scenarios.
A specific embodiment of the present application is described below. Fig. 2 is an interaction flow diagram of an infant pronunciation analysis method provided by an embodiment of the present disclosure, which is applied to analyze a sound made by an infant so as to obtain a physiological state judgment result for that sound and a matching coping strategy. As shown in fig. 2, the process includes the following steps (an illustrative client-side sketch follows the step list):
step 201: the user equipment transmits collected voice data of the sound made by the baby to the server.
Step 202: the server preprocesses the voice data to obtain a spectrogram of the voice data.
Step 203: the server transmits the spectrogram to the machine learning component.
Step 204: The machine learning component processes the spectrogram of the voice data to obtain physiological state judgment result information corresponding to the spectrogram.
Step 205: The machine learning component returns the physiological state judgment result information to the server.
Step 206: The server matches a coping strategy according to the physiological state judgment result information.
Step 207: The server returns the physiological state judgment result and the matching coping strategy to the user equipment.
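The transport between the user equipment and the server is not fixed by this flow; purely as an illustration, a minimal client-side sketch of steps 201 and 207 might look as follows, assuming an HTTP upload (the endpoint URL, field names, and use of the requests library are hypothetical, not taken from this description):

```python
# Hypothetical client-side sketch of steps 201 and 207. The transport protocol,
# endpoint URL, and response fields are illustrative assumptions only.
import requests

def analyze_cry(wav_path: str, server_url: str = "http://example-server/analyze"):
    # Step 201: upload the recorded audio data of the infant's cry to the server.
    with open(wav_path, "rb") as f:
        resp = requests.post(server_url, files={"audio": f}, timeout=30)
    resp.raise_for_status()
    result = resp.json()
    # Step 207: the server returns the physiological state judgment result
    # and the matching coping strategy.
    return result.get("physiological_state"), result.get("coping_strategy")
```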
In one or more embodiments of the present specification, in step 202, the server performs a Fourier transform on one or more speech frames contained in the received speech data of the sound made by the infant, converts the resulting amplitude values to decibels, and normalizes them to obtain the spectral values of one or more audio frames of the audio data. A spectrogram of the speech data is then generated from the spectral values of the one or more audio frames.
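As a minimal sketch of this preprocessing, assuming NumPy and illustrative frame and hop sizes (the window type, frame length, and hop are not specified in this description):

```python
# Preprocessing sketch for step 202: framing, Fourier transform, decibel
# conversion, and normalization. Frame/hop sizes and the epsilon are assumptions.
import numpy as np

def frame_spectral_values(signal: np.ndarray, frame_len: int = 512, hop: int = 256) -> np.ndarray:
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    spectra = []
    for frame in frames:
        windowed = frame * np.hanning(frame_len)        # window to reduce spectral leakage
        amplitude = np.abs(np.fft.rfft(windowed))       # Fourier transform -> amplitude values
        db = 20.0 * np.log10(amplitude + 1e-10)         # decibel conversion
        spectra.append(db)
    spectra = np.array(spectra)
    # Normalize all spectral values to the range [0, 1].
    spectra = (spectra - spectra.min()) / (spectra.max() - spectra.min() + 1e-10)
    return spectra  # shape: (num_frames, frame_len // 2 + 1)
```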
In one or more embodiments of the present description, in step 204, the spectrogram can be input to the machine learning component to extract deep learning features; the spectrogram can then be classified by a classifier according to the deep learning features, so as to obtain the physiological state judgment result corresponding to the spectrogram.
In one or more embodiments of the present description, the neural network model in the machine learning component can be trained as follows. First, audio data of sounds made by infants and the physiological state judgment results of the audio data are acquired; the audio data is framed, and the spectral values of one or more audio frames in the audio data are determined; a spectrogram of the audio data is then generated from the spectral values of the one or more audio frames, and spectrogram features of the audio data are extracted; finally, the neural network model is trained according to the spectrogram features and the physiological state judgment results of the audio data. When infant voice data is collected, original samples of infant crying can be gathered; the sample data can be obtained in cooperation with pediatric hospitals, postpartum care centers, child welfare institutions, and the like, and can also be collected continuously while parents use mobile or wearable devices. In addition, pairing of crying data with doctors' diagnosis results can be completed in cooperation with clinics and hospitals during infant visits. The obtained voice data is preprocessed in advance to obtain spectrograms, and deep learning is then performed to obtain the neural network model.
In one or more embodiments of the present specification, in step 206, the server matches, according to the physiological state judgment result information, a physiological state judgment strategy (coping strategy) corresponding to the physiological state judgment result. The strategy can be formulated in light of current medical knowledge. For example, a flat, intermittent cry often means "Mom, I am hungry, I am thirsty"; such crying stops after feeding. A violent, sharp cry often means "Mom, I am in pain! I am uncomfortable"; such pain is often caused by a sudden blow, prick, or burn. Paroxysmal crying is often caused by intestinal spasm due to various acute intestinal infections or dyspepsia; acute abdominal conditions such as intussusception often manifest as sudden, restless howling accompanied by a pale complexion, sweating, and similar symptoms. The physiological state judgment result and the physiological state judgment strategy are then sent to the user equipment; the physiological state includes disease states. The user obtains the information through the user equipment, which assists in caring for the infant. Furthermore, the machine learning model can be continuously adjusted according to the user's feedback, so that the machine learning component classifies more accurately.
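As an illustration of step 206, the strategy matching can be as simple as a lookup keyed by the judgment result; the labels and advice strings below merely paraphrase the examples above and are not an authoritative medical mapping:

```python
# Illustrative coping-strategy lookup for step 206; keys and advice text are
# examples paraphrased from the description, not medical guidance.
COPING_STRATEGIES = {
    "hungry_or_thirsty": "Flat, intermittent crying usually stops after feeding.",
    "pain": "Violent, sharp crying may follow a blow, prick, or burn; check the infant.",
    "intestinal_spasm": ("Paroxysmal crying may indicate intestinal infection or dyspepsia; "
                         "seek medical advice, especially with pale complexion or sweating."),
}

def match_strategy(judgment_result: str) -> str:
    return COPING_STRATEGIES.get(judgment_result, "No matching strategy; consult a doctor.")
```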
In one or more embodiments of the present specification, after step 207, the server may further receive correction information fed back by the user equipment for the physiological state judgment result and the corresponding coping strategy; if the fed-back correction information indicates that the physiological state judgment result and the physiological state judgment strategy are inaccurate, the server corrects the machine learning component according to the correction information.
The method is described in detail below with reference to some examples of the present disclosure.
As shown in fig. 3, in one or more embodiments of the present disclosure, a spectrogram can be used for speech recognition. Compared with conventional speech recognition methods, a spectrogram presents information in the time domain and the frequency domain at the same time and can express speech information better. The plotted result of Fourier analysis of a speech signal is called a spectrogram (sonogram). A spectrogram is a three-dimensional spectrum: a plot of the speech spectrum as it changes over time, with frequency on the vertical axis and time on the horizontal axis. The intensity of any given frequency component at a given moment is represented by the gray level or color shade of the corresponding point. Analyzing speech with a spectrogram is also known as spectral analysis. A spectrogram displays a large amount of information related to the characteristics of an utterance; it combines the characteristics of a frequency spectrum and a time-domain waveform and clearly shows how the speech spectrum changes over time, i.e., a dynamic spectrum.
For example, a speech signal x(t) is first framed to obtain x(m, n), where n is the frame length and m is the frame number. Next, an FFT is performed to obtain X(m, n), and the periodogram Y(m, n) = X(m, n) · X(m, n)' is computed. Then 10 · log10(Y(m, n)) is taken, m is converted to the time scale M, and n is converted to the frequency scale N. Finally, (M, N, 10 · log10(Y(m, n))) is drawn as a two-dimensional image to obtain the spectrogram (it can also be drawn as a three-dimensional figure).
As shown in fig. 3, the spectrogram shows how the signal intensity in different speech frequency bands changes over time. The places that appear as stripes are where dark points cluster, and the stripes extend along the time axis. The darker a point, the stronger the energy of the corresponding frequency in the speech at that moment, the larger its proportion in the whole signal, and the stronger its effect on human perception. Because ordinary speech is largely periodic, the frequency distribution of high-energy points is also periodic in frequency; that is, if there are high-intensity points at 300 Hz, they also appear at n × 300 Hz, which is why the spectrogram appears striped.
Therefore, in the embodiments of the present specification, the spectral values of one or more speech frames contained in the speech data can be obtained by performing a Fourier transform on the one or more speech frames; a spectrogram of the audio data is generated from the spectral values of the one or more speech frames; the horizontal coordinate of the spectrogram is the time corresponding to a speech frame, the vertical coordinate is the frequency components contained in the speech frame, and the value at each coordinate point is the spectral value.
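A minimal sketch of drawing such a spectrogram, reusing the hypothetical frame_spectral_values helper from the preprocessing sketch above and assuming matplotlib for display:

```python
# Spectrogram drawing sketch; frame_spectral_values is the hypothetical helper
# from the preprocessing sketch, and matplotlib is an assumed dependency.
import numpy as np
import matplotlib.pyplot as plt

def plot_spectrogram(signal: np.ndarray, sample_rate: int, frame_len: int = 512, hop: int = 256):
    spectra = frame_spectral_values(signal, frame_len, hop)     # (frames, bins), normalized dB
    times = np.arange(spectra.shape[0]) * hop / sample_rate     # horizontal axis: time
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)     # vertical axis: frequency
    plt.pcolormesh(times, freqs, spectra.T, shading="auto")     # point value: spectral value
    plt.xlabel("Time (s)")
    plt.ylabel("Frequency (Hz)")
    plt.show()
```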
In practical applications, the speech feature extraction method in one or more embodiments of the present specification may also be Mel-frequency cepstral coefficients (MFCCs); the embodiments of the present specification are not limited in this respect.
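Where MFCC features are preferred, a library such as librosa could be used; a minimal sketch (the use of librosa and its parameter values are assumptions, not part of this description):

```python
# Minimal MFCC extraction sketch; librosa and the n_mfcc value are assumptions.
import librosa

def extract_mfcc(wav_path: str, n_mfcc: int = 13):
    signal, sr = librosa.load(wav_path, sr=None)                  # load at native sample rate
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, frames)
```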
The construction of the neural network model in the machine learning component is described further below. The neural network model in one or more embodiments of the present specification is preferably based on a convolutional neural network (CNN) model, but in practical applications the machine learning component may also be based on a recurrent neural network model; the embodiments of the present specification are not limited in this respect.
For the convolutional neural network model, the model information of the convolutional neural network and the spectrogram samples serving as its input data are transferred from the host side (e.g., the host's external memory, hard disk, etc.) to memory. During training of the convolutional neural network, spectrogram samples are distributed to a group of stream processors, which perform convolution, channel selection, backward error computation, and other operations using parallel computing techniques, and the resulting data (e.g., the model data generated during training) is transferred back to memory. During recognition, the spectrogram sample to be recognized is distributed to the group of stream processors, which perform convolution, channel selection, and other operations using parallel computing techniques, and the pre-judgment result corresponding to the spectrogram sample is finally transferred back to memory.
In one or more embodiments of the present description, the convolutional neural network model is a pre-trained model. Fig. 4 shows a typical example of a conventional convolutional neural network model used for multi-channel image recognition. In the embodiments of the present specification, after the spectrogram is input to the convolutional neural network, the recognition result is output after several processing stages (the "stages" in the figure). One stage of processing may include convolution, normalization, and downsampling. After a local receptive field (e.g., 5 × 5 pixels) of the spectrogram is input to each neuron of the input layer, each neuron performs convolution and other processing on it to obtain different feature information. The pooling layer compresses the input feature map, reducing its size and simplifying the network's computational complexity, while also compressing features and extracting the main features. The fully connected layer connects all features, maps the learned feature representation to the label space of the samples, and sends the output values to the classifier. The classifier can then output the result directly.
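A minimal sketch of such a stage-based convolutional network for spectrogram classification, written in PyTorch; the framework, layer sizes, and number of physiological state classes are illustrative assumptions, since the description does not fix a concrete architecture:

```python
# Hypothetical CNN for spectrogram classification; framework, layer sizes, and
# the number of classes are illustrative assumptions.
import torch
import torch.nn as nn

class CrySpectrogramCNN(nn.Module):
    def __init__(self, num_classes: int = 4):
        super().__init__()
        # Two "stages": convolution -> normalization -> downsampling (pooling).
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=5, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        # Fully connected layer maps the learned features to the label space.
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, 1, frequency_bins, time_frames)
        x = self.features(spectrogram)
        x = torch.flatten(x, 1)
        return self.classifier(x)            # raw scores for each physiological state class

    @torch.no_grad()
    def predict(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # Softmax classifier: turn the scores into class probabilities.
        return torch.softmax(self.forward(spectrogram), dim=1)
```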
In one or more embodiments of the present description, after the data processing of the convolutional neural network, the target feature map is input to a Softmax classifier, which separates the data and identifies the physiological state information corresponding to the features. In a specific implementation, the embodiments of the present description adopt a Softmax classifier; a support vector machine (SVM) classifier is also supported.
An embodiment of the present specification further provides a server, as shown in fig. 5, where the server includes:
The preprocessing module 501 is configured to frame audio data of a sound made by an infant and determine spectral values of one or more audio frames in the audio data.
A spectrogram determining module 502, configured to generate a spectrogram of the audio data according to the spectral values of the one or more audio frames.
A result obtaining module 503, configured to determine, from the spectrogram, a physiological state judgment result corresponding to the audio data by using a machine learning component.
In one or more embodiments of the present description, the audio data of the sound made by the infant is audio data of the infant crying.
In one or more embodiments of the present description, the preprocessing module 501 is specifically configured to: performing Fourier transform on one or more audio frames in the audio data to obtain amplitude values of the one or more audio frames in the audio data; and carrying out decibel conversion and normalization processing on the amplitude values of one or more audio frames of the audio data to obtain the spectral values of the one or more audio frames of the audio data.
In one or more embodiments of the present description, the result obtaining module 503 is specifically configured to: and inputting the spectrogram of the audio data into a neural network model to obtain a physiological state judgment result corresponding to the audio data.
In one or more embodiments of the present specification, the server further includes a strategy matching module, configured to match a coping strategy according to the physiological state judgment result and to send out the physiological state judgment result and the coping strategy.
In one or more embodiments of the present specification, the server further includes a correction module, configured to receive correction information fed back for the physiological state judgment result and the coping strategy, correct the physiological state judgment result according to the correction information, and send out a correction result so that the neural network model can be corrected according to it.
In one or more embodiments of the present description, the neural network model is trained based on the following steps: acquiring audio data of sound made by the baby and a physiological state judgment result of the audio data; framing the audio data, and determining spectral values of one or more audio frames in the audio data; generating a spectrogram of the audio data according to the spectral values of one or more audio frames in the audio data; extracting spectrogram features of the audio data; and training a neural network model according to the spectrogram characteristics and the physiological state judgment result of the audio data.
An embodiment of the present specification further provides a user equipment, as shown in fig. 6, where the user equipment includes:
the sending module 601 is configured to send audio data of sounds made by the baby to the server.
A receiving module 602, configured to receive a physiological state judgment result corresponding to the audio data fed back by the server; wherein the physiological state judgment result corresponding to the audio data is obtained by the server as follows: framing the audio data of the sound made by the infant, and determining spectral values of one or more audio frames in the audio data; generating a spectrogram of the audio data from the spectral values of the one or more audio frames; and determining, from the spectrogram, a physiological state judgment result corresponding to the audio data by using a machine learning component.
In one or more embodiments of the present description, the receiving module 602 is further configured to: receive the physiological state judgment result and the matching coping strategy fed back by the server.
In one or more embodiments of the present specification, the sending module 601 is further configured to: feed back correction information to the server for the physiological state judgment result and the matching coping strategy, so that the server corrects the physiological state judgment result according to the correction information.
Embodiments of the present specification further provide a machine learning component, as shown in fig. 7, including:
the receiving module 701 is configured to receive a spectrogram of audio data of a sound emitted by a baby from a server, where the spectrogram of the audio data is generated by framing the audio data of the sound emitted by the baby by the server, determining spectral values of one or more audio frames in the audio data, and generating the spectral values according to the spectral values of the one or more audio frames.
A physiological state determining module 702, configured to obtain a physiological state determining result corresponding to the audio data by using a machine learning component according to the spectrogram of the audio data.
The feedback module 703 is configured to return a physiological state determination result corresponding to the audio data to the server.
In one or more embodiments of the present description, the physiological state determination module 702 is specifically configured to: and inputting the spectrogram of the audio data into a neural network model to obtain a physiological state judgment result corresponding to the audio data.
In one or more embodiments of the present description, the neural network model is trained based on the following steps:
acquiring audio data of sound made by the baby and a physiological state judgment result of the audio data; framing the audio data, and determining spectral values of one or more audio frames in the audio data; generating a spectrogram of the audio data according to the spectral values of one or more audio frames in the audio data; extracting spectrogram features of the audio data; and training a neural network model according to the spectrogram characteristics and the physiological state judgment result of the audio data.
Fig. 8 illustrates an infant pronunciation analysis method provided by an embodiment of the present specification, which may be applied to a server, and as shown in fig. 8, the method includes the following steps:
step 801: the method comprises the steps of framing audio data of sound emitted by a baby, and determining spectral values of one or more audio frames in the audio data.
Step 802: and generating a spectrogram of the audio data according to the spectral values of the one or more audio frames.
Step 803: and determining a physiological state judgment result corresponding to the audio data from the spectrogram by using a machine learning component.
In one or more embodiments of the present description, the audio data of the sound emitted by the infant may be audio data of crying of the infant.
In one or more embodiments of the present specification, in step 801, fourier transform may be performed on one or more audio frames in the audio data, so as to obtain amplitude values of the one or more audio frames in the audio data; and carrying out decibel conversion and normalization processing on the amplitude values of one or more audio frames of the audio data to obtain the spectral values of the one or more audio frames of the audio data.
In one or more embodiments of the present specification, in step 803, a spectrogram of the audio data may be input into a neural network model, and a physiological state determination result corresponding to the audio data may be obtained.
In one or more embodiments of the present disclosure, a coping strategy may be further matched according to the physiological state determination result, and then the physiological state determination result and the coping strategy are sent out.
In one or more embodiments of the present disclosure, correction information fed back for the physiological state judgment result and the coping strategy may also be received; the physiological state judgment result is corrected according to the correction information; and a correction result is further sent out so that the neural network model can be corrected according to the correction result.
In one or more embodiments of the present description, the neural network model in the machine learning component can be trained as follows: first, audio data of sounds made by infants and the physiological state judgment results of the audio data are acquired; the audio data is framed, and the spectral values of one or more audio frames in the audio data are determined; a spectrogram of the audio data is then generated from the spectral values of the one or more audio frames, and spectrogram features of the audio data are extracted; finally, the neural network model is trained according to the spectrogram features and the physiological state judgment results of the audio data.
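A minimal training-loop sketch for the steps above; the model class (the hypothetical CrySpectrogramCNN from the earlier sketch), the optimizer settings, and the data loader are assumptions rather than details taken from this description:

```python
# Illustrative training loop for the neural network model; the data loader is
# assumed to yield (spectrogram, label) batches, where each label is the
# physiological state judgment result of the corresponding audio data.
import torch
import torch.nn as nn

def train(model: nn.Module, loader, epochs: int = 10, lr: float = 1e-3) -> nn.Module:
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for spectrograms, labels in loader:
            optimizer.zero_grad()
            scores = model(spectrograms)      # forward pass on spectrogram features
            loss = criterion(scores, labels)  # compare with physiological state labels
            loss.backward()                   # backward error computation
            optimizer.step()
    return model
```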
Fig. 9 illustrates an infant pronunciation analysis method provided in another embodiment of the present specification, which can be applied to a user device, as shown in fig. 9, and includes the following steps:
step 901: and sending the audio data of the sound emitted by the baby to a server.
Step 902: receiving a physiological state judgment result corresponding to the audio data fed back by the server; wherein the physiological state judgment result corresponding to the audio data is obtained by the server as follows: framing the audio data of the sound made by the infant, and determining spectral values of one or more audio frames in the audio data; generating a spectrogram of the audio data from the spectral values of the one or more audio frames; and determining, from the spectrogram, a physiological state judgment result corresponding to the audio data by using a machine learning component.
In one or more embodiments of the present specification, after steps 901 and 902, the physiological state judgment result and the matching coping strategy fed back by the server may also be received.
In one or more embodiments of the present specification, after the physiological state judgment result and the matching coping strategy fed back by the server are received, correction information may be fed back to the server for them, so that the server corrects the physiological state judgment result according to the correction information. The specific steps by which the server corrects the physiological state judgment result according to the correction information can be found in the other embodiments of this specification and are not repeated here.
Fig. 10 illustrates an infant pronunciation analysis method provided in another embodiment of the present specification, which can be applied to a machine learning component, as shown in fig. 10, and includes the following steps:
step 1001: the method comprises the steps of receiving a spectrogram of audio data of sound emitted by a baby from a server, wherein the spectrogram of the audio data is generated by the server by framing the audio data of the sound emitted by the baby, determining spectral values of one or more audio frames in the audio data and generating the spectral values according to the spectral values of the one or more audio frames.
Step 1002: and obtaining a physiological state judgment result corresponding to the audio data by using a machine learning component according to the spectrogram of the audio data.
Step 1003: and returning the physiological state judgment result corresponding to the audio data to the server.
In one or more embodiments of the present disclosure, in step 1002, a spectrogram of the audio data may be input into a neural network model, and a physiological state determination result corresponding to the audio data may be obtained.
In one or more embodiments of the present description, the neural network model may be trained based on the following steps:
a. acquiring audio data of sound made by the baby and a physiological state judgment result of the audio data;
b. framing the audio data, and determining spectral values of one or more audio frames in the audio data;
c. generating a spectrogram of the audio data according to the spectral values of one or more audio frames in the audio data;
d. extracting spectrogram features of the audio data;
e. and training a neural network model according to the spectrogram characteristics and the physiological state judgment result of the audio data.
To better understand the infant pronunciation analysis method described in the embodiments of the present specification, its flow is now described by way of example. As shown in fig. 11, the infant pronunciation analysis method may have the following steps:
step 1101: when the baby cries, the parents start software or functions on the mobile device or wearable device.
Step 1102: the user equipment records the baby cry and compresses the recorded cry voice.
Step 1103: and uploading the processed voice data to a server.
Step 1104: The server preprocesses the audio data (sorting, labeling, and the like) and generates a spectrogram.
Step 1105: The server transmits the spectrogram of the audio data to the machine learning component.
Step 1106: The machine learning component processes the spectrogram to obtain a physiological state judgment result corresponding to the audio data.
Step 1107: The machine learning component returns the physiological state judgment result corresponding to the audio data to the server.
Step 1108: The server matches, according to the physiological state judgment result, a physiological state judgment strategy corresponding to the physiological state judgment result.
Step 1109: The server sends the physiological state judgment result and its physiological state judgment strategy to the user equipment.
Step 1110: after acquiring the information sent in step 1109, the user equipment displays the physiological state judgment result and the physiological state judgment policy thereof on the software or the function.
Step 1111: the user obtains a physiological state judgment result and a physiological state judgment strategy thereof.
Step 1112: The user acts on the physiological state judgment result and its judgment strategy and determines whether they are accurate.
Step 1113: The user feeds the determination back to the user equipment.
Step 1114: The user equipment uploads the determination to the server.
Step 1115: The server receives the feedback result from the user equipment; if the feedback result indicates that the physiological state judgment result and its judgment strategy are inaccurate, the user's determination in the feedback is used as the updated physiological state information corresponding to the spectrogram.
Step 1116: The server transmits the updated physiological state information to the machine learning component.
Step 1117: the machine learning component modifies the machine learning model based on the updated physiological state information.
Although an infant cannot express his or her needs through speech or gestures, the method provided by one or more embodiments of the present disclosure can effectively analyze the infant's cry. For example, in step 1104, speech features are extracted by means of the spectrogram. Further, in step 1105, the server transmits the spectrogram to the machine learning component; for example, a 128 × n dimensional spectrogram is input into a convolutional neural network model. The convolutional neural network model includes a plurality of convolutional layers and pooling layers: the convolutional layers extract image features, computed window by window as a sliding window moves over the preceding input layer, while the pooling layers abstract the original feature signals, greatly reducing the number of training parameters and the degree of overfitting of the model. The "learned" feature representation is then mapped to the label space of the samples by the fully connected layer, which acts as the "classifier" of the whole convolutional neural network. Finally, the data passed on by the fully connected layer is classified with a Softmax classifier. Therefore, in step 1106, the machine learning component processes the spectrogram to obtain the physiological state information corresponding to the spectrogram.
In step 1108, the server matches, according to the physiological state judgment result, the physiological state judgment strategy corresponding to the physiological state judgment result. The strategy can be formulated in light of current medical knowledge. For example, a flat, intermittent cry often means "Mom, I am hungry, I am thirsty"; such crying stops after feeding. A violent, sharp cry often means "Mom, I am in pain! I am uncomfortable"; such pain is often caused by a sudden blow, prick, or burn. Paroxysmal crying is often caused by intestinal spasm due to various acute intestinal infections or dyspepsia; acute abdominal conditions such as intussusception often manifest as sudden, restless howling accompanied by a pale complexion, sweating, and similar symptoms. The physiological state judgment result and the physiological state judgment strategy are then sent to the user equipment, and the user obtains the information through the user equipment, which assists in caring for the infant. Furthermore, the machine learning model can be continuously adjusted according to the user's feedback, so that the machine learning component classifies more accurately.
An embodiment of the present specification further provides a computing device, including: a processor adapted to implement instructions, and a storage device storing a plurality of instructions adapted to be loaded by the processor to execute the above infant pronunciation analysis method.
The present specification also proposes a computer-readable storage medium, which stores a computer program for executing the above-mentioned infant pronunciation analysis method.
In summary, in the method provided by one or more embodiments of the present specification, audio data of a sound made by an infant is framed and the spectral values of one or more audio frames in the audio data are determined; a spectrogram of the audio data is generated from the spectral values of the one or more audio frames; and a physiological state judgment result corresponding to the audio data is determined from the spectrogram by using a machine learning component. The physiological state corresponding to the sound made by the infant can thereby be analyzed accurately, and the physiological state information and judgment strategy obtained from the voice data provided by the user can be returned to the user, assisting in the prevention and treatment of the corresponding condition. For example, the method can be used to analyze infant crying, obtain physiological state information and physiological state judgment strategies, and combine wearable or mobile devices with cloud artificial intelligence, making the system more convenient to use and achieving the goal of helping users care for infants.
In the present specification, each embodiment of the method is described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. Reference is made to the description of the method embodiments.
It is noted that while the operations of the methods of the present invention are depicted in the drawings in a particular order, this is not a requirement or suggestion that the operations must be performed in this particular order or that all of the illustrated operations must be performed to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
Although the present application provides method steps as in embodiments or flowcharts, additional or fewer steps may be included based on conventional or non-inventive approaches. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. When an apparatus or client product in practice executes, it may execute sequentially or in parallel (e.g., in a parallel processor or multithreaded processing environment, or even in a distributed data processing environment) according to the embodiments or methods shown in the figures. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, the presence of additional identical or equivalent elements in a process, method, article, or apparatus that comprises the recited elements is not excluded.
The units, devices, modules, etc. set forth in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. Of course, in implementing the present application, the functions of each module may be implemented in one or more software and/or hardware, or a module implementing the same function may be implemented by a combination of a plurality of sub-modules or sub-units, and the like. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Those skilled in the art will also appreciate that, in addition to implementing the controller as pure computer readable program code, the same functionality can be implemented by logically programming method steps such that the controller is in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller may therefore be considered as a hardware component, and the means included therein for performing the various functions may also be considered as a structure within the hardware component. Or even means for performing the functions may be regarded as being both a software module for performing the method and a structure within a hardware component.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, classes, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, or the like, and includes several instructions for enabling a computer device (which may be a personal computer, a mobile terminal, a server, or a network device) to execute the method according to the embodiments or some parts of the embodiments of the present application.
The embodiments in the present specification are described in a progressive manner, and the same or similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable electronic devices, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The above-mentioned embodiments are further described in detail for the purpose of illustrating the invention, and it should be understood that the above-mentioned embodiments are only illustrative of the present invention and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements, etc. made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (28)

1. An analysis method for infant pronunciation, comprising:
framing audio data of sound emitted by a baby, and determining spectral values of one or more audio frames in the audio data;
generating a spectrogram of the audio data according to the spectral values of the one or more audio frames;
and determining a physiological state judgment result corresponding to the audio data from the spectrogram by using a machine learning component.
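Purely for illustration (and not as part of the claim language), a minimal Python sketch of this three-step pipeline might look as follows; the frame sizes, window choice, classifier, and state labels are assumptions rather than limitations of the claim:

    import numpy as np

    def analyze_infant_audio(audio, classifier, frame_len=1024, hop=512):
        # Step 1: frame the audio and compute a spectral value vector per frame.
        frames = [audio[i:i + frame_len] for i in range(0, len(audio) - frame_len + 1, hop)]
        spectra = [np.abs(np.fft.rfft(f * np.hanning(frame_len))) for f in frames]
        # Step 2: stack the per-frame spectra into a time-frequency spectrogram.
        spectrogram = np.stack(spectra, axis=0)
        # Step 3: a trained machine learning component maps the spectrogram to a
        # physiological state judgment result (e.g. hunger, discomfort, sleepiness).
        return classifier(spectrogram)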
2. The method of claim 1, wherein the audio data of the sound emitted by the baby is audio data of the baby crying.
3. The method of claim 1, wherein framing the audio data of the sound emitted by the baby and determining spectral values of one or more audio frames in the audio data comprises:
performing Fourier transform on one or more audio frames in the audio data to obtain amplitude values of the one or more audio frames in the audio data;
and carrying out decibel conversion and normalization processing on the amplitude values of one or more audio frames of the audio data to obtain the spectral values of the one or more audio frames of the audio data.
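A possible NumPy rendering of these two sub-steps, offered only as an illustrative sketch; the decibel reference and the min-max normalization range are assumptions that the claim does not fix:

    import numpy as np

    def spectral_values(frames):
        """frames: (n_frames, frame_len) array of audio samples, one row per frame."""
        # Fourier transform each frame and take the magnitude as the amplitude value.
        amplitudes = np.abs(np.fft.rfft(frames, axis=1))
        # Decibel conversion of the amplitude values (the small constant avoids log(0)).
        decibels = 20.0 * np.log10(amplitudes + 1e-10)
        # Normalization to the [0, 1] range so spectra are comparable across recordings.
        return (decibels - decibels.min()) / (decibels.max() - decibels.min() + 1e-10)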
4. The method of claim 1, wherein the model of the machine learning component is a neural network model.
5. The method of claim 4, further comprising:
matching a coping strategy according to the physiological state judgment result, and sending out the physiological state judgment result and the coping strategy.
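As an illustrative sketch only, the strategy matching could be as simple as a lookup table keyed by the judgment result; the state names and strategies below are hypothetical examples, not content of the claim:

    # Hypothetical mapping from a physiological state judgment result to a coping strategy.
    COPING_STRATEGIES = {
        "hungry": "Offer a feeding.",
        "sleepy": "Dim the lights and soothe the baby to sleep.",
        "uncomfortable": "Check the diaper, clothing, and room temperature.",
    }

    def match_and_send(judgment, send):
        """Match a coping strategy to the judgment result and send both out via `send`."""
        strategy = COPING_STRATEGIES.get(judgment, "Observe the baby and consult a caregiver.")
        send(judgment, strategy)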
6. The method of claim 5, further comprising:
receiving correction information fed back according to the physiological state judgment result and the coping strategy;
correcting the physiological state judgment result according to the correction information;
and sending out a correction result to correct the neural network model according to the correction result.
7. The method of claim 4, wherein the neural network model is trained based on the steps of:
acquiring audio data of sound made by the baby and a physiological state judgment result of the audio data;
framing the audio data, and determining spectral values of one or more audio frames in the audio data;
generating a spectrogram of the audio data according to the spectral values of one or more audio frames in the audio data;
extracting spectrogram features of the audio data;
and training a neural network model according to the spectrogram features and the physiological state judgment result of the audio data.
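The training steps enumerated above could, for instance, be realized with a small convolutional network over the spectrograms; the PyTorch sketch below is one hypothetical rendering, with the network shape, label count, and hyperparameters chosen only for illustration:

    import torch
    import torch.nn as nn

    class CrySpectrogramNet(nn.Module):
        """A small CNN: the convolutional stack extracts spectrogram features,
        and the linear head maps them to a physiological state judgment."""
        def __init__(self, n_states=4):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            )
            self.head = nn.Linear(32, n_states)

        def forward(self, spectrogram):              # shape: (batch, 1, time, freq)
            return self.head(self.features(spectrogram).flatten(1))

    def train_model(model, loader, epochs=10):
        """`loader` yields (spectrogram, physiological_state_label) pairs."""
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
        loss_fn = nn.CrossEntropyLoss()
        model.train()
        for _ in range(epochs):
            for spec, label in loader:
                optimizer.zero_grad()
                loss = loss_fn(model(spec), label)
                loss.backward()
                optimizer.step()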
8. An infant pronunciation analysis method, comprising:
sending audio data of the sound made by the baby to a server;
receiving a physiological state judgment result corresponding to the audio data fed back by the server; wherein the physiological state judgment result corresponding to the audio data is obtained by the server in the following manner: framing audio data of sound emitted by a baby, and determining spectral values of one or more audio frames in the audio data; generating a spectrogram of the audio data according to the spectral values of the one or more audio frames; and determining a physiological state judgment result corresponding to the audio data from the spectrogram by using a machine learning component.
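On the user-device side the exchange could be as simple as the sketch below; the endpoint URL, form field name, and response key are assumptions made only for illustration and are not specified by the claim:

    import requests

    def request_judgment(audio_bytes, server_url="https://example.com/infant-audio/analyze"):
        """Send recorded audio to the server and return the physiological state judgment it feeds back."""
        response = requests.post(
            server_url,
            files={"audio": ("cry.wav", audio_bytes, "audio/wav")},  # hypothetical field name
        )
        response.raise_for_status()
        return response.json()["physiological_state"]  # hypothetical response key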
9. The method of claim 8, further comprising:
and receiving the physiological state judgment result and a matched coping strategy fed back by the server.
10. The method of claim 9, further comprising:
and feeding back correction information to the server according to the physiological state judgment result and the matched coping strategy so as to enable the server to correct the physiological state judgment result according to the correction information.
11. An infant pronunciation analysis method, comprising:
receiving a spectrogram of audio data of sound emitted by a baby, sent by a server, wherein the spectrogram of the audio data is generated by the server by framing the audio data of the sound emitted by the baby, determining spectral values of one or more audio frames in the audio data, and generating the spectrogram according to the spectral values of the one or more audio frames;
obtaining a physiological state judgment result corresponding to the audio data by using a machine learning component according to the spectrogram of the audio data;
and returning the physiological state judgment result corresponding to the audio data to the server.
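A minimal sketch of this component-side flow, with the label set and array layout assumed purely for illustration (the component might equally run in the same process as the server):

    import numpy as np

    STATE_LABELS = ("hungry", "sleepy", "uncomfortable", "other")  # hypothetical label set

    def handle_spectrogram(spectrogram, model):
        """Map a spectrogram received from the server to a physiological state judgment result."""
        scores = model(spectrogram[np.newaxis, np.newaxis, :, :])  # add batch and channel axes
        return STATE_LABELS[int(np.argmax(scores))]                # caller returns this to the server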
12. The method of claim 11, wherein the machine learning component is a neural network model component.
13. The method of claim 12, further comprising:
training a neural network model based on the following steps:
acquiring audio data of sound made by the baby and a physiological state judgment result of the audio data;
framing the audio data, and determining spectral values of one or more audio frames in the audio data;
generating a spectrogram of the audio data according to the spectral values of one or more audio frames in the audio data;
extracting spectrogram features of the audio data;
and training a neural network model according to the spectrogram features and the physiological state judgment result of the audio data.
14. A server, comprising:
a preprocessing module, configured to frame audio data of sound emitted by a baby and determine the spectral values of one or more audio frames in the audio data;
a spectrogram determining module, configured to generate a spectrogram of the audio data according to the spectral values of the one or more audio frames;
and a result obtaining module, configured to determine a physiological state judgment result corresponding to the audio data from the spectrogram by using a machine learning component.
15. The server of claim 14, wherein the audio data of the sound emitted by the baby is audio data of the baby crying.
16. The server according to claim 14, wherein the preprocessing module is specifically configured to:
performing Fourier transform on one or more audio frames in the audio data to obtain amplitude values of the one or more audio frames in the audio data;
and carrying out decibel conversion and normalization processing on the amplitude values of one or more audio frames of the audio data to obtain the spectral values of the one or more audio frames of the audio data.
17. The server of claim 14, wherein the model of the machine learning component is a neural network model.
18. The server of claim 17, further comprising:
and the strategy matching module is used for matching the coping strategy according to the physiological state judgment result and sending out the physiological state judgment result and the coping strategy.
19. The server of claim 18, further comprising:
the correction module is used for receiving correction information fed back according to the physiological state judgment result and the coping strategy; correcting the physiological state judgment result according to the correction information; and sending out a correction result to correct the neural network model according to the correction result.
20. The server of claim 17, wherein the neural network model is trained based on:
acquiring audio data of sound made by the baby and a physiological state judgment result of the audio data;
framing the audio data, and determining spectral values of one or more audio frames in the audio data;
generating a spectrogram of the audio data according to the spectral values of one or more audio frames in the audio data;
extracting spectrogram features of the audio data;
and training a neural network model according to the spectrogram features and the physiological state judgment result of the audio data.
21. A user device, comprising:
the sending module is used for sending audio data of sound emitted by the baby to the server;
the receiving module is used for receiving a physiological state judgment result corresponding to the audio data fed back by the server; wherein the physiological state judgment result corresponding to the audio data is obtained by the server in the following manner: framing audio data of sound emitted by a baby, and determining spectral values of one or more audio frames in the audio data; generating a spectrogram of the audio data according to the spectral values of the one or more audio frames; and determining a physiological state judgment result corresponding to the audio data from the spectrogram by using a machine learning component.
22. The user device of claim 21, wherein the receiving module is further configured to:
and receiving the physiological state judgment result and a matched coping strategy fed back by the server.
23. The user equipment of claim 22, wherein the sending module is further configured to:
and feeding back correction information to the server according to the physiological state judgment result and the matched coping strategy so that the server corrects the physiological state judgment result according to the correction information.
24. A machine learning component, comprising:
the receiving module is used for receiving, from a server, a spectrogram of audio data of sound emitted by a baby, wherein the spectrogram of the audio data is generated by the server by framing the audio data of the sound emitted by the baby, determining spectral values of one or more audio frames in the audio data, and generating the spectrogram according to the spectral values of the one or more audio frames;
the physiological state judgment module is used for obtaining a physiological state judgment result corresponding to the audio data by using a machine learning component according to the spectrogram of the audio data;
and the feedback module is used for returning the physiological state judgment result corresponding to the audio data to the server.
25. The machine learning component of claim 24, wherein the machine learning component is a neural network model component.
26. The machine learning component of claim 25, further comprising:
a training module for training the neural network model based on the following steps:
acquiring audio data of sound made by the baby and a physiological state judgment result of the audio data;
framing the audio data, and determining spectral values of one or more audio frames in the audio data;
generating a spectrogram of the audio data according to the spectral values of one or more audio frames in the audio data;
extracting spectrogram features of the audio data;
and training a neural network model according to the spectrogram features and the physiological state judgment result of the audio data.
27. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 13 when executing the computer program.
28. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for executing the method of any one of claims 1 to 13.
CN201811480184.0A 2018-12-05 2018-12-05 Infant pronunciation analysis method and server Pending CN111276159A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811480184.0A CN111276159A (en) 2018-12-05 2018-12-05 Infant pronunciation analysis method and server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811480184.0A CN111276159A (en) 2018-12-05 2018-12-05 Infant pronunciation analysis method and server

Publications (1)

Publication Number Publication Date
CN111276159A true CN111276159A (en) 2020-06-12

Family

ID=70999934

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811480184.0A Pending CN111276159A (en) 2018-12-05 2018-12-05 Infant pronunciation analysis method and server

Country Status (1)

Country Link
CN (1) CN111276159A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103426438A (en) * 2012-05-25 2013-12-04 洪荣昭 Method and system for analyzing baby crying
CN104347066A (en) * 2013-08-09 2015-02-11 盛乐信息技术(上海)有限公司 Deep neural network-based baby cry identification method and system
US20180317876A1 (en) * 2015-10-30 2018-11-08 The Johns Hopkins University Programmable electronic stethoscope devices, algorithms, systems, and methods
US20170193362A1 (en) * 2016-01-03 2017-07-06 Gracenote, Inc. Responding to remote media classification queries using classifier models and context parameters
CN107657963A (en) * 2016-07-25 2018-02-02 韦创科技有限公司 Sob identification system and sob discrimination method
CN106653059A (en) * 2016-11-17 2017-05-10 沈晓明 Automatic identification method and system for infant crying cause
CN106653001A (en) * 2016-11-17 2017-05-10 沈晓明 Baby crying identifying method and system
CN107818779A (en) * 2017-09-15 2018-03-20 北京理工大学 A kind of infant's crying sound detection method, apparatus, equipment and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHUAN-YU CHANG et al.: "Application of deep learning for recognizing infant cries", 2016 International Conference on Consumer Electronics-Taiwan *
HONG Wenxue et al., National Defense Industry Press *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112382311A (en) * 2020-11-16 2021-02-19 谭昊玥 Infant crying intention identification method and device based on hybrid neural network
CN112382311B (en) * 2020-11-16 2022-08-19 谭昊玥 Infant crying intention identification method and device based on hybrid neural network
CN113116299A (en) * 2021-03-05 2021-07-16 上海市第一妇婴保健院 Pain level evaluation method, pain level evaluation device, equipment and storage medium
WO2022262281A1 (en) * 2021-06-15 2022-12-22 青岛海尔空调器有限总公司 Control method and apparatus for air conditioner, and air conditioner
CN113689663A (en) * 2021-08-05 2021-11-23 浙江晨逸新声物联科技有限公司 System and method for monitoring infant emergency

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200612)