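WO2023008260A1 - Information processing system, information processing method, and information processing program - Google Patents
Information processing system, information processing method, and information processing program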
- Publication number
- WO2023008260A1 (application PCT/JP2022/028075)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sound
- information
- sound information
- output
- stationary
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/93—Discriminating between voiced and unvoiced parts of speech signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/93—Discriminating between voiced and unvoiced parts of speech signals
- G10L2025/937—Signal energy in various frequency bands
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R2227/00—Details of public address [PA] systems covered by H04R27/00 but not provided for in any of its subgroups
- H04R2227/005—Audio distribution systems for home, i.e. multi-room use
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R29/00—Monitoring arrangements; Testing arrangements
- H04R29/007—Monitoring arrangements; Testing arrangements for public address systems
Definitions
- The present disclosure relates to technology for estimating user behavior from sound.
- Patent Document 1 discloses a behavior estimation device that classifies sound detected by a microphone as either TV sound or real-environment sound, identifies the source of the sound classified as real-environment sound, and estimates the behavior of a user in the home based on the identification result.
- However, Patent Document 1 does not consider applying the behavior estimation device to a network environment such as a cloud, so further improvement is needed to reduce the load on the network.
- The present disclosure has been made to solve this problem, and provides a technology that can reduce the load on the network.
- An information processing system according to one aspect of the present disclosure is a system in which a terminal and a computer are connected via a network. The terminal includes a sound collector that collects sound, and a first estimator that inputs sound information indicating the collected sound to a first trained model that estimates whether the sound indicated by the sound information is a stationary sound or a non-stationary sound and, when the sound is estimated to be a non-stationary sound, outputs the sound information estimated as the non-stationary sound to the computer via the network as output sound information. The computer includes an acquisition unit that acquires the output sound information, and a second estimator that inputs the acquired output sound information to a second trained model indicating the relationship between the output sound information and behavior information related to human behavior, and estimates the output result as the person's behavior.
- FIG. 1 is a block diagram showing an example of the configuration of an information processing system according to Embodiment 1 of the present disclosure.
- FIG. 2 is a diagram showing how an autoencoder constituting a first trained model performs machine learning.
- FIG. 3 is a diagram showing how the autoencoder constituting the first trained model performs estimation.
- FIG. 4 is a diagram showing a first example of image information of a spectrogram.
- FIG. 5 is a diagram showing a first example of image information of frequency characteristics.
- FIG. 6 is a diagram showing a second example of image information of a spectrogram.
- FIG. 7 is a diagram showing a second example of image information of frequency characteristics.
- FIG. 8 is a diagram showing a third example of image information of a spectrogram.
- FIG. 9 is a diagram showing a third example of image information of frequency characteristics.
- FIG. 10 is a diagram showing a fourth example of image information of a spectrogram.
- FIG. 11 is a diagram showing a fourth example of image information of frequency characteristics.
- FIG. 12 is a diagram showing how a convolutional neural network constituting a second trained model performs machine learning.
- FIG. 13 is a diagram showing how the convolutional neural network constituting the second trained model performs estimation.
- FIG. 14 is a flowchart showing an example of processing of the information processing system according to Embodiment 1 of the present disclosure.
- FIG. 15 is a diagram showing an example of the threshold setting processing used when the terminal determines whether a sound is stationary or non-stationary.
- FIG. 16 is a flowchart showing an example of processing of the information processing system when the server transmits a control signal to a device.
- FIG. 17 is a flowchart showing an example of processing when the first trained model is retrained.
- FIG. 18 is a flowchart showing an example of processing when the second trained model is retrained.
- FIG. 19 is a block diagram showing an example of the configuration of an information processing system according to Embodiment 2 of the present disclosure.
- FIG. 20 is an explanatory diagram of frequency conversion processing.
- FIG. 21 is a flowchart showing an example of the details of the process of step S14 of FIG. 14 in Embodiment 2 of the present disclosure.
- FIG. 22 is a block diagram showing an example of the configuration of an information processing system according to Embodiment 3 of the present disclosure.
- FIG. 23 is a block diagram showing an example of the configuration of an information processing system according to Embodiment 4 of the present disclosure.
- FIG. 24 is an explanatory diagram of Modification 3 of the present disclosure.
- FIG. 25 is an explanatory diagram of Modification 4 of the present disclosure.
- Sound in the audible range collected in a home is susceptible to various kinds of noise, so it is difficult to say that human behavior can be estimated from it with high accuracy. For this reason, the use of sound in the ultrasonic band, which is less susceptible to noise, is also being studied for behavior estimation.
- However, when sound in the ultrasonic band is used, the amount of data transmitted to the network becomes much larger than when only audible sound is used, and the network is heavily loaded. There are two reasons: the ultrasonic band is wider than the audible band, so the amount of data is large, and its frequencies are higher than those of the audible band, so a short sampling period is required.
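As a rough illustration of why ultrasonic capture inflates the data volume, the following sketch compares uncompressed data rates. The 16-bit sample width and single channel are assumptions for illustration only, and the 96 kHz upper limit is borrowed from the band allocation given later in this description. The sampling rate is taken as twice the highest frequency of interest, which is the Nyquist minimum.

```python
def min_sampling_rate_hz(max_frequency_hz: float) -> float:
    """Nyquist minimum: sample at least twice the highest frequency of interest."""
    return 2.0 * max_frequency_hz


def raw_data_rate_bytes_per_s(max_frequency_hz: float, bytes_per_sample: int = 2) -> float:
    """Uncompressed single-channel data rate (bytes_per_sample is an assumption)."""
    return min_sampling_rate_hz(max_frequency_hz) * bytes_per_sample


audible = raw_data_rate_bytes_per_s(20_000)     # audible band only
ultrasonic = raw_data_rate_bytes_per_s(96_000)  # audible plus ultrasonic band
ratio = ultrasonic / audible
```

Under these assumptions the raw stream grows by the same factor as the upper frequency limit, which is the load the two-stage configuration is meant to avoid.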
- The present inventors therefore found that the load on the network, the terminal, and the computer can be reduced by a two-stage configuration for behavior estimation, consisting of a terminal and a computer connected to the terminal via a network, in which the terminal outputs only sound estimated to be non-stationary to the computer and the computer performs behavior estimation based on that non-stationary sound.
- An information processing system according to one aspect of the present disclosure is an information processing system in which a terminal and a computer are connected via a network. The terminal includes a sound collector that collects sound, and a first estimator that inputs sound information indicating the collected sound to a first trained model that estimates whether the sound indicated by the sound information is a stationary sound or a non-stationary sound and, when the sound is estimated to be the non-stationary sound, outputs the sound information estimated as the non-stationary sound to the computer via the network as output sound information.
- The computer includes an acquisition unit that acquires the output sound information, and a second estimator that inputs the output sound information acquired by the acquisition unit to a second trained model indicating the relationship between the output sound information and behavior information related to human behavior, and estimates the output result as the person's behavior.
- According to this configuration, the sound information indicating the sound picked up by the sound collector is input to the first trained model, which estimates whether the sound is stationary or non-stationary. When the sound is estimated to be non-stationary, the sound information indicating the non-stationary sound is output as output sound information from the terminal to the computer via the network, and the computer estimates human behavior from the output sound information.
- The terminal therefore does not output all of the sound information picked up by the sound collector to the computer, but only the sound information indicating non-stationary sound. As a result, the amount of data flowing through the network is reduced, and the load on the network can be reduced.
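The two-stage flow described here can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names are hypothetical, and the two stand-in models (a peak-amplitude rule and a constant label) merely take the place of the first and second trained models.

```python
from typing import Callable, Iterable, List, Sequence

Frame = Sequence[float]


def frames_to_send(
    frames: Iterable[Frame],
    is_non_stationary: Callable[[Frame], bool],
) -> List[Frame]:
    """Terminal side: keep only frames the first-stage model flags as non-stationary."""
    return [frame for frame in frames if is_non_stationary(frame)]


def estimate_behaviors(
    received: Iterable[Frame],
    behavior_model: Callable[[Frame], str],
) -> List[str]:
    """Computer side: run the second-stage model on every frame that arrived."""
    return [behavior_model(frame) for frame in received]


# Stand-in models (hypothetical): a peak-amplitude rule replaces the first
# trained model and a constant label replaces the second trained model.
def toy_first_stage(frame: Frame) -> bool:
    return max(abs(x) for x in frame) > 0.5


def toy_second_stage(frame: Frame) -> str:
    return "some_behavior"


captured = [[0.01, 0.02], [0.9, -0.7], [0.0, 0.03], [0.6, 0.1]]
sent = frames_to_send(captured, toy_first_stage)
behaviors = estimate_behaviors(sent, toy_second_stage)
```

In this toy run only two of the four captured frames cross the network, which is the source of the load reduction.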
- The output sound information may be image information of a spectrogram of the sound picked up by the sound collector, or image information of its frequency characteristics.
- According to this configuration, because the sound information output from the first estimator is image information of the spectrogram or of the frequency characteristics of the sound, the amount of sound-information data output to the network can be greatly reduced compared with transmitting the time-series sound pressure data picked up by the sound collector.
- The first estimator may extract, from the sound information estimated as the non-stationary sound, sound information in a first frequency band, which is the frequency band with the maximum sound pressure level; convert the extracted sound information of the first frequency band into sound information of a second frequency band that is lower than the first frequency band; and generate the converted sound information of the second frequency band as the output sound information.
- According to this configuration, the sound information in the first frequency band is extracted from the sound information indicating the non-stationary sound, the extracted sound information is converted into sound information in the second frequency band, which is lower than the first frequency band, and the converted sound information is output from the terminal to the computer as the output sound information. The amount of output sound-information data transmitted to the network can therefore be greatly reduced compared with transmitting the time-series sound pressure data picked up by the sound collector.
- The output sound information may include incidental information indicating the range of the first frequency band.
- According to this configuration, because the incidental information indicating the first frequency band is output from the terminal to the computer together with the sound information of the second frequency band, the computer can identify the first frequency band using the incidental information, and the accuracy of behavior estimation can be improved.
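A minimal sketch of the band extraction and conversion, under several assumptions: the sound information is already a magnitude spectrum with equally spaced bins, the "maximum sound pressure level" is read as the highest peak within a candidate band, and the second frequency band is taken to start at 0 Hz. The function and key names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Band = Tuple[float, float]  # (low_hz, high_hz)


def band_bins(band: Band, bin_width_hz: float, n_bins: int) -> range:
    """Indices of spectrum bins whose start frequency lies inside the band."""
    low, high = band
    start = int(low // bin_width_hz)
    stop = min(n_bins, int(high // bin_width_hz))
    return range(start, stop)


def shift_loudest_band(
    magnitudes: Sequence[float],
    bin_width_hz: float,
    candidate_bands: Sequence[Band],
) -> Dict[str, object]:
    """Pick the candidate band with the highest peak level and move it to a lower band."""
    def peak(band: Band) -> float:
        idx = band_bins(band, bin_width_hz, len(magnitudes))
        return max((magnitudes[i] for i in idx), default=0.0)

    first_band = max(candidate_bands, key=peak)
    idx = band_bins(first_band, bin_width_hz, len(magnitudes))
    shifted: List[float] = [magnitudes[i] for i in idx]  # bin 0 now means 0 Hz
    width = first_band[1] - first_band[0]
    return {
        "second_band_hz": (0.0, width),           # lower band after conversion (assumed to start at 0 Hz)
        "magnitudes": shifted,
        "incidental_first_band_hz": first_band,   # lets the receiver recover the original range
    }


# Toy spectrum: 100 bins of 1 kHz each, with a peak placed at 45 kHz.
spectrum = [0.1] * 100
spectrum[45] = 3.0
bands = [(20_000.0, 40_000.0), (40_000.0, 60_000.0), (60_000.0, 80_000.0)]
result = shift_loudest_band(spectrum, 1_000.0, bands)
```

Only 20 of the 100 bins are kept in this toy case, and the incidental field carries the original 40 kHz to 60 kHz range so that the receiver can tell where the content came from.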
- The second trained model may be a model obtained by machine learning the relationship between the behavior information and the combination of the sound information of the second frequency band and the incidental information.
- According to this configuration, because the second trained model has machine-learned the relationship between the behavior information and the sound information of the second frequency band together with the incidental information, behavior can be estimated with high accuracy.
- The first frequency band may be the frequency band, among a plurality of predetermined frequency bands in the ultrasonic band, that has the maximum sound pressure level.
- According to this configuration, the sound information of the ultrasonic frequency band containing the most non-stationary sound among the plurality of predetermined frequency bands is extracted as the sound information of the first frequency band, so the sound information of the first frequency band can be extracted easily.
- The first estimator may estimate that the sound indicated by the sound information is the non-stationary sound when the estimation error of the first trained model is equal to or greater than a threshold, and the threshold may be changed so that the frequency (rate of occurrence) with which sound is estimated to be non-stationary is equal to or lower than a reference frequency.
- According to this configuration, the threshold on the estimation error of the first trained model is changed so that the frequency with which sound is estimated to be non-stationary is equal to or lower than the reference frequency, so the load on the network can be further reduced.
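One way to picture the threshold change is the sketch below, which raises the threshold in fixed steps until the share of recent frames flagged as non-stationary falls to the reference level. The step size, the use of a fraction of recent frames as the "frequency", and all names are assumptions made for illustration; the text does not specify the update rule.

```python
from typing import Sequence


def flagged_fraction(errors: Sequence[float], threshold: float) -> float:
    """Fraction of frames whose estimation error is at or above the threshold."""
    if not errors:
        return 0.0
    return sum(1 for e in errors if e >= threshold) / len(errors)


def adjust_threshold(
    errors: Sequence[float],
    threshold: float,
    reference_fraction: float,
    step: float = 0.05,
    max_steps: int = 1000,
) -> float:
    """Raise the threshold step by step until the flagged fraction is at or below the reference."""
    for _ in range(max_steps):
        if flagged_fraction(errors, threshold) <= reference_fraction:
            break
        threshold += step
    return threshold


recent_errors = [0.10, 0.12, 0.11, 0.42, 0.57, 0.13, 0.91, 0.09, 0.14, 0.63]
new_threshold = adjust_threshold(recent_errors, threshold=0.30, reference_fraction=0.2)
```

With these toy errors the initial threshold flags four frames out of ten; after adjustment only two remain above it.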
- The information processing system may further include a determination unit that determines whether or not the output result of the second trained model is erroneous and inputs determination result information indicating the determination result to the second estimator. When determination result information indicating that the output result is correct is input, the second estimator may retrain the second trained model using the output sound information corresponding to that output result.
- According to this configuration, when determination result information indicating that the output result of the second trained model is correct is input, the second trained model is retrained using the output sound information corresponding to the output result, so the estimation accuracy of the second trained model can be improved.
- The determination unit may input to a device a control signal for controlling the device according to the behavior information indicating the behavior estimated by the second estimator, and may determine that the output result is erroneous when an instruction to cancel the control indicated by the control signal is obtained from the device.
- When the determination result information is input, the second estimator may output the determination result information to the terminal via the network.
- The first estimator may retrain the first trained model using the sound information estimated as the stationary sound by the first trained model.
- According to this configuration, the first trained model is retrained using the sound information estimated to be stationary sound, so the estimation accuracy of the first trained model can be improved.
- The sound information may include sound information of the environmental sound of the space in which the sound collector is installed.
- The sound information acquired by the sound collector may include sound in the ultrasonic band.
- According to this configuration, because the user's behavior is estimated using sound information in the ultrasonic band, the estimation accuracy of the user's behavior can be improved. Furthermore, although the amount of data of sound information in the ultrasonic band is much larger than that of sound information in the audible band, only the sound information estimated to be non-stationary is output to the computer, so the load on the network and the computer can be reduced.
- The first estimator may extract sound information in a plurality of first frequency bands from the sound information indicating the sound picked up by the sound collector, convert the extracted sound information of each first frequency band into sound information of a second frequency band, which is the lowest of the plurality of first frequency bands, synthesize the plurality of pieces of converted sound information of the second frequency band, and generate the synthesized sound information as the output sound information.
- According to this configuration, the sound information indicating the non-stationary sound is compressed by frequency conversion before being output to the computer, so the amount of data flowing through the network can be further reduced.
- The first estimator may extract, from the sound information estimated as the non-stationary sound, sound information in a plurality of first frequency bands that include the non-stationary sound, convert the extracted sound information of each first frequency band into sound information of the second frequency band, which is the lowest of the plurality of first frequency bands, synthesize the converted sound information of the second frequency band, and generate the synthesized sound information as the output sound information.
- According to this configuration, the sound information in the first frequency bands that include the non-stationary sound is extracted, compressed into the second frequency band, and transmitted to the computer, so the amount of data flowing through the network can be further reduced.
- An information processing method according to another aspect of the present disclosure is an information processing method in an information processing system in which a terminal and a computer are connected via a network. The terminal collects sound, inputs sound information indicating the collected sound to a first trained model that estimates whether the sound indicated by the sound information is a stationary sound or a non-stationary sound, and, when the sound is estimated to be the non-stationary sound, outputs the sound information estimated as the non-stationary sound to the computer via the network as output sound information. The computer acquires the output sound information, inputs the acquired output sound information to a second trained model indicating the relationship between the output sound information and behavior information related to human behavior, and estimates the output result as the person's behavior.
- An information processing program according to another aspect of the present disclosure is an information processing program for an information processing system in which a terminal and a computer are connected via a network. The program causes the terminal to collect sound, to input sound information indicating the collected sound to a first trained model that estimates whether the sound indicated by the sound information is a stationary sound or a non-stationary sound, and, when the sound is estimated to be the non-stationary sound, to output the sound information estimated as the non-stationary sound to the computer via the network as output sound information. The program causes the computer to acquire the output sound information and to execute a process of inputting the acquired output sound information to a second trained model indicating the relationship between the output sound information and behavior information related to human behavior, and estimating the output result as the person's behavior.
- Such an information processing program can be distributed via a computer-readable non-transitory recording medium such as a CD-ROM, or via a communication network such as the Internet.
- FIG. 1 is a block diagram showing an example of the configuration of an information processing system 1 according to Embodiment 1 of the present disclosure.
- The information processing system 1 includes a terminal 2 and a server 3 (an example of a computer).
- The terminal 2 is installed in a house 6 where the user whose behavior is to be estimated resides.
- The terminal 2 and the server 3 are connected via a network 5 so that they can communicate with each other.
- Examples of the installation location of the terminal 2 are a hallway, stairs, an entrance, and a room of the house 6.
- Examples of rooms are a dressing room, kitchen, closet, living room, and dining room.
- The network 5 is a public communication line including, for example, the Internet and a mobile phone communication network.
- The server 3 is, for example, a cloud server located on the network 5.
- The device 4 is installed in the house 6 and operates according to a control signal based on the user's behavior estimated by the server 3.
- The terminal 2 and the device 4 are installed in the house 6 here, but this is only an example, and they may be installed in facilities such as factories or offices.
- The terminal 2 is, for example, a stationary computer.
- The terminal 2 includes a microphone 21 (an example of a sound collector), a first processor 22 (an example of a first estimator), a communication device 23, and a memory 24.
- The microphone 21 is sensitive to, for example, sound in the audible band (audible sound) and sound in the ultrasonic band (inaudible sound). Therefore, the sounds picked up by the microphone 21 include audible sounds and inaudible sounds.
- An example of the audible band is 0 to 20 kHz. Inaudible sound is sound in a frequency band of 20 kHz or higher.
- The microphone 21 may be a microphone that is sensitive only in the ultrasonic band.
- An example of the microphone 21 is a MEMS (Micro Electro Mechanical Systems) microphone.
- The microphone 21 picks up audible and inaudible sounds generated by the actions of a user (an example of a person) present in the house 6.
- The microphone 21 converts the collected sound into an electrical signal to generate a sound signal, and inputs the generated sound signal to the first estimation unit 2
- Examples of objects that exist in the residence 6 are housing equipment, home appliances, furniture, and daily necessities.
- Examples of residential fixtures are taps, showers, stoves, windows, doors, and the like.
- Examples of home appliances include washing machines, dishwashers, vacuum cleaners, air conditioners, blowers, lighting equipment, hair dryers, and televisions.
- Examples of furniture are desks, chairs, beds, and the like.
- Examples of household items are trash cans, storage boxes, umbrella stands, pet supplies, and the like.
- The first processor 22 is configured by, for example, a central processing unit, and includes a first estimation unit 221.
- The first estimation unit 221 is implemented by the central processing unit executing an information processing program. However, this is only an example, and the first estimation unit 221 may be configured with a dedicated hardware circuit such as an ASIC.
- The first estimation unit 221 inputs the sound information indicating the sound picked up by the microphone 21 to the first trained model 241 to estimate whether the sound indicated by the sound information is a stationary sound or a non-stationary sound. When the sound is estimated to be a non-stationary sound, the first estimation unit 221 generates output sound information for outputting the sound information estimated to be the non-stationary sound, and outputs the generated output sound information to the server 3 using the communication device 23.
- The first trained model 241 is a trained model created in advance for estimating whether the sound indicated by the sound information is a stationary sound or a non-stationary sound. An example of the first trained model 241 is an autoencoder.
- The sound information is information with a predetermined time width in which digital sound pressure data, AD-converted at a predetermined sampling period, are arranged in time series.
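The notion of sound information as fixed-width runs of sampled sound pressure data can be sketched as below. The frame duration, the absence of overlap between frames, and the dropping of a trailing partial frame are assumptions chosen for simplicity; the text only states that the time width and sampling period are predetermined.

```python
from typing import List, Sequence


def split_into_frames(
    samples: Sequence[float],
    sampling_rate_hz: int,
    frame_seconds: float,
) -> List[List[float]]:
    """Cut a sample stream into consecutive, non-overlapping frames of fixed duration."""
    frame_len = int(round(sampling_rate_hz * frame_seconds))
    if frame_len <= 0:
        raise ValueError("frame must contain at least one sample")
    n_full = len(samples) // frame_len  # a trailing partial frame is dropped
    return [list(samples[i * frame_len:(i + 1) * frame_len]) for i in range(n_full)]


stream = [float(i) for i in range(10)]
frames = split_into_frames(stream, sampling_rate_hz=4, frame_seconds=1.0)
```

Each resulting frame plays the role of one piece of sound information passed to the first trained model.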
- The first estimation unit 221 repeats the process of generating sound information while the sound signal is being input from the microphone 21.
- The input sound signal may include a silent sound signal.
- Stationary sounds include environmental sounds that are always present in the house 6.
- Environmental sounds include the vibration sounds of household equipment and electric appliances that are always in operation.
- An example of an environmental sound is the vibration sound of a refrigerator.
- Non-stationary sounds are sounds that occur less frequently than stationary sounds, and include sounds that occur in association with human actions. Examples of non-stationary sounds include the sound of a refrigerator door opening and closing, the sound of the user walking in a hallway, the sound of water running from a faucet, the sound of clothes rubbing, and the sound of the user combing their hair.
- FIG. 2 is a diagram showing how the autoencoder 500 that constitutes the first trained model 241 performs machine learning.
- The autoencoder 500 includes an input layer 501, an intermediate layer 502, and an output layer 503.
- In the example of FIG. 2, the intermediate layer 502 includes three layers, and the autoencoder 500 has five layers in total, but this is only an example; the number of intermediate layers 502 may be one, or may be four or more.
- Both the input layer 501 and the output layer 503 have 36 nodes. The first and third intermediate layers 502 each have 18 nodes. The second intermediate layer 502 has 9 nodes.
- The 36 nodes of the input layer 501 and the output layer 503 are assigned 36 frequency bands obtained by dividing the frequency band from 20 kHz to 96 kHz into 1.9 kHz intervals. Specifically, frequency bands are allocated to the nodes of the input layer 501 and the output layer 503 in the order 94.1 to 96 kHz, 92.2 to 94.1 kHz, ....
- Sound pressure data in the assigned frequency band is input to each node of the input layer 501 as sound information, and sound pressure data in the assigned frequency band is output from each node of the output layer 503 as sound information.
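The band-to-node allocation can be tabulated with a short sketch. It follows the enumerated bands (94.1 to 96 kHz, 92.2 to 94.1 kHz, and so on) and steps downward in 1.9 kHz widths; the function name and the choice to start from the top edge are assumptions. Note that 36 bands of 1.9 kHz counted down from 96 kHz end at 27.6 kHz, so the sketch keeps the top edge, width, and count as parameters rather than fixing the overall range.

```python
from typing import List, Tuple


def descending_bands(top_hz: float, width_hz: float, count: int) -> List[Tuple[float, float]]:
    """Return (low, high) band edges, starting at top_hz and stepping downward."""
    bands = []
    for k in range(count):
        high = top_hz - k * width_hz
        bands.append((round(high - width_hz, 6), round(high, 6)))
    return bands


# Node 0 gets the highest band, node 35 the lowest.
node_bands = descending_bands(top_hz=96_000.0, width_hz=1_900.0, count=36)
```

Indexing `node_bands[i]` then gives the band whose sound pressure data would feed node `i` of the input layer in this sketch.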
- An example of the teacher data used for machine learning of the autoencoder 500 is sound information indicating stationary sounds collected in advance in the house 6.
- Sound information indicating a stationary sound input to each node of the input layer 501 is successively compressed in dimension through the first and second intermediate layers 502, and is restored to its original dimension through the third intermediate layer 502 and the output layer 503.
- The autoencoder 500 performs machine learning so that the sound pressure data output from each node of the output layer 503 is equal to the sound pressure data input to the corresponding node of the input layer 501.
- The autoencoder 500 performs such machine learning using a large amount of sound information representing stationary sounds. Note that the number of nodes in each layer shown in FIG. 2 is not limited to the numbers described above, and various numbers can be adopted. Likewise, the frequency bands assigned to the input layer 501 and the output layer 503 are not limited to the values described above, and various values may be adopted.
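The 36-18-9-18-36 layout and the reconstruction objective can be sketched with an untrained forward pass. The tanh activation, the absence of bias terms, the random initialization, and the use of mean squared error as the reconstruction gap are all assumptions for illustration; the training loop itself is omitted.

```python
import math
import random
from typing import List

LAYER_SIZES = [36, 18, 9, 18, 36]  # input, three intermediate layers, output


def init_weights(sizes: List[int], seed: int = 0) -> List[List[List[float]]]:
    """One weight matrix per pair of adjacent layers, with small random values."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(x: List[float], weights: List[List[List[float]]]) -> List[List[float]]:
    """Return the activations of every layer, starting with the input itself."""
    activations = [x]
    for matrix in weights:
        prev = activations[-1]
        activations.append(
            [math.tanh(sum(w * v for w, v in zip(row, prev))) for row in matrix]
        )
    return activations


def reconstruction_mse(x: List[float], weights: List[List[List[float]]]) -> float:
    """Gap between output and input; training on stationary sounds would minimize it."""
    y = forward(x, weights)[-1]
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


weights = init_weights(LAYER_SIZES)
sample = [0.5] * 36
layer_widths = [len(a) for a in forward(sample, weights)]
loss = reconstruction_mse(sample, weights)
```

Because the model is fitted only to stationary sounds, inputs unlike the training data would be expected to reconstruct poorly, which is what the later error threshold exploits.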
- The memory 24 stores the first trained model 241 created in advance through such machine learning.
- Here the first trained model 241 is composed of the autoencoder 500, but the present disclosure is not limited to this, and any machine learning model capable of learning stationary sounds may be adopted.
- Another example of the first trained model 241 is a convolutional neural network (CNN).
- When the first trained model 241 is composed of a convolutional neural network, machine learning is performed using teacher data in which sound information indicating stationary sounds is labeled as stationary sound and sound information indicating non-stationary sounds is labeled as non-stationary sound.
- FIG. 3 is a diagram showing how the autoencoder 500 that constitutes the first trained model 241 performs estimation.
- The first estimation unit 221 converts the input time-domain sound information into frequency-domain sound information by performing a Fourier transform.
- The first estimation unit 221 divides the frequency-domain sound information into the frequency bands assigned to the nodes of the input layer 501, and inputs the sound information (sound pressure data) of each frequency band to the corresponding node.
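The transform-then-split step can be sketched with a naive discrete Fourier transform. Summing bin magnitudes as the per-band value, and the toy signal parameters, are assumptions; a practical implementation would use an FFT library rather than this O(N^2) loop.

```python
import cmath
import math
from typing import List, Sequence, Tuple


def dft_magnitudes(samples: Sequence[float]) -> List[float]:
    """Naive discrete Fourier transform; returns magnitudes for bins 0..N/2."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags


def band_levels(
    magnitudes: Sequence[float],
    sampling_rate_hz: float,
    n_samples: int,
    bands: Sequence[Tuple[float, float]],
) -> List[float]:
    """Sum the bin magnitudes that fall inside each (low_hz, high_hz) band."""
    bin_hz = sampling_rate_hz / n_samples
    levels = []
    for low, high in bands:
        levels.append(sum(m for i, m in enumerate(magnitudes) if low <= i * bin_hz < high))
    return levels


# Toy signal: a 3 Hz sine sampled at 16 Hz for one second (values chosen for readability).
rate, n = 16.0, 16
signal = [math.sin(2 * math.pi * 3 * t / rate) for t in range(n)]
levels = band_levels(dft_magnitudes(signal), rate, n, bands=[(0.0, 2.0), (2.0, 4.0), (4.0, 8.0)])
```

In the toy case all of the energy lands in the middle band, so only the node assigned to that band would receive a non-negligible input.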
- The first estimation unit 221 calculates an estimation error between the sound information output from each node of the output layer 503 and the sound information input to each node of the input layer 501.
- An example of the estimation error is the cross-entropy error.
- The first estimation unit 221 determines whether or not the estimation error is equal to or greater than a threshold. The first estimation unit 221 determines that the input sound information is a non-stationary sound if the estimation error is equal to or greater than the threshold, and that the input sound information is a stationary sound if the estimation error is less than the threshold.
- The estimation error is not limited to the cross-entropy error; the mean squared error, mean absolute error, root mean squared error, mean squared logarithmic error, and the like may also be employed.
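The alternative error measures and the threshold decision can be written compactly as follows. The cross-entropy error is left out because its exact form for this model is not given in the text; the threshold value and sample vectors are illustrative assumptions.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def mean_squared_error(x: Vector, y: Vector) -> float:
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def mean_absolute_error(x: Vector, y: Vector) -> float:
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def root_mean_squared_error(x: Vector, y: Vector) -> float:
    return math.sqrt(mean_squared_error(x, y))


def mean_squared_log_error(x: Vector, y: Vector) -> float:
    """Assumes non-negative values, since log1p is undefined at or below -1."""
    return sum((math.log1p(a) - math.log1p(b)) ** 2 for a, b in zip(x, y)) / len(x)


def is_non_stationary(
    model_input: Vector,
    model_output: Vector,
    threshold: float,
    error_fn: Callable[[Vector, Vector], float] = mean_squared_error,
) -> bool:
    """Non-stationary when the estimation error is at or above the threshold."""
    return error_fn(model_input, model_output) >= threshold


well_reconstructed = is_non_stationary([1.0, 2.0, 3.0], [1.0, 2.1, 2.9], threshold=0.5)
poorly_reconstructed = is_non_stationary([1.0, 2.0, 3.0], [3.0, 0.0, 1.0], threshold=0.5)
```

Passing a different `error_fn` swaps the measure without changing the decision rule.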
- When the first trained model 241 is composed of a convolutional neural network, the output layer consists of, for example, a first node and a second node, each composed of a softmax function; stationary sound is assigned to the first node and non-stationary sound to the second node.
- In this case, the first estimation unit 221 may estimate that the sound is stationary when the output value of the first node is greater than the output value of the second node, and that the sound is non-stationary when the output value of the second node is greater than the output value of the first node.
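The two-node comparison can be sketched as below. The input scores and function names are hypothetical, and treating an exact tie as non-stationary is an assumption, since the text does not say how ties are handled.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the output-node scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits: Sequence[float]) -> str:
    """Index 0 is the stationary node, index 1 the non-stationary node."""
    p_stationary, p_non_stationary = softmax(logits)
    # A tie falls through to "non-stationary" (assumption).
    return "stationary" if p_stationary > p_non_stationary else "non-stationary"


label_a = classify([2.0, 0.5])
label_b = classify([0.1, 1.3])
```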
- When the first estimation unit 221 estimates that the input sound information is a non-stationary sound, it generates image information indicating the characteristics of this sound information as the output sound information.
- The image information is spectrogram image information or frequency characteristic image information.
- The spectrogram image information is, for example, an image in a two-dimensional coordinate space, with time on one axis and frequency on the other, in which the temporal change of the frequency-domain sound pressure data is displayed as shades.
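A spectrogram image of this kind can be sketched as a grid of gray levels. The non-overlapping frames, the lack of a window function, the linear 0 to 255 scaling, and the naive transform are simplifying assumptions, and the function names are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def frame_magnitudes(frame: Sequence[float]) -> List[float]:
    """Naive DFT magnitudes for bins 0..N/2 of one frame."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def spectrogram_image(samples: Sequence[float], frame_len: int) -> List[List[int]]:
    """Rows are frequency bins, columns are time frames, values are 0-255 gray levels."""
    n_frames = len(samples) // frame_len
    columns = [
        frame_magnitudes(samples[i * frame_len:(i + 1) * frame_len])
        for i in range(n_frames)
    ]
    peak = max((m for col in columns for m in col), default=0.0) or 1.0
    n_bins = frame_len // 2 + 1
    return [
        [round(255 * columns[t][k] / peak) for t in range(n_frames)]
        for k in range(n_bins)
    ]


# Toy input: one frame of silence, then one frame containing a tone at bin 2.
frame_len = 8
tone = [math.sin(2 * math.pi * 2 * t / frame_len) for t in range(frame_len)]
image = spectrogram_image([0.0] * frame_len + tone, frame_len)
```

The resulting grid is far smaller than the raw sample stream it summarizes, which is the data reduction the image representation is intended to provide.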
- the frequency characteristic image information is an image obtained by Fourier transforming sound information.
- the image information of the frequency characteristics is, for example, image information in a two-dimensional coordinate space in which one coordinate axis is frequency and the other coordinate axis is sound pressure data, composed of pixels that are given different pixel values in one area of the coordinate space and in the remaining area.
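As a rough illustration of the two representations, the sketch below computes a magnitude spectrum (frequency characteristics) and a frame-by-frame spectrogram with a naive discrete Fourier transform. The sampling rate, frame length, hop size, and test tone are hypothetical, no window function is applied, and the mapping of magnitudes to pixel values of an image file is left out.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Magnitudes of DFT bins 0..N/2, computed with a naive O(N^2) DFT."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc))
    return spectrum


def spectrogram(samples, frame_len, hop):
    """One magnitude spectrum per frame; the outer list is the time axis."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(magnitude_spectrum(samples[start:start + frame_len]))
    return frames


# Hypothetical signal: a 30 kHz tone sampled at 200 kHz.
sample_rate = 200_000
frame_len = 200
tone = [math.sin(2 * math.pi * 30_000 * t / sample_rate) for t in range(400)]

spec = spectrogram(tone, frame_len=frame_len, hop=100)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / frame_len
print(len(spec), peak_hz)
```

Each inner list corresponds to one column of a spectrogram image; a single call to `magnitude_spectrum` over the whole signal corresponds to the frequency characteristics.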
- FIG. 4 and 5 are diagrams showing a first example of image information.
- FIG. 4 is spectrogram image information
- FIG. 5 is frequency characteristic image information.
- the image information of the first example shows the characteristics of the sound generated when a person undresses and puts on clothes.
- the clothing material is cotton.
- In FIG. 4, the horizontal axis is time (seconds), the vertical axis is frequency (Hz), and each pixel has a pixel value corresponding to sound pressure data. This also applies to FIGS. 6, 8 and 10.
- five characteristic signals (1) to (5) are detected in the frequency band of 20 kHz or higher.
- Signals (1) and (2) are above 80 kHz
- signals (3) and (4) are below 80 kHz
- signal (5) is below 70 kHz.
- the signal intensity below 50 kHz is large.
- In FIG. 5, the horizontal axis is frequency (Hz) and the vertical axis is sound pressure intensity. This also applies to FIGS. 7, 9 and 11.
- In FIG. 5, among the frequency components of 20 kHz or higher, the intensity of the components in the frequency band of 20 kHz to 50 kHz is large.
- Actions estimated from the image information in the first example are, for example, "undressing" or "changing clothes".
- FIG. 6 and 7 are diagrams showing a second example of image information.
- FIG. 6 is spectrogram image information
- FIG. 7 is frequency characteristic image information.
- the image information of the second example shows the characteristics of sounds generated when a person walks along a wooden corridor. Specifically, the image information of the second example indicates the characteristics of sounds generated when a person walks barefoot in a hallway.
- a plurality of characteristic signals are detected in the frequency band of 20 kHz to 50 kHz, especially 20 kHz to 35 kHz.
- among the frequency components above 20 kHz, the intensity of the components in the frequency band from 20 kHz to 40 kHz is large.
- the behavior estimated from the image information in the second example is, for example, "walking".
- FIG. 8 and 9 are diagrams showing a third example of image information.
- FIG. 8 is spectrogram image information
- FIG. 9 is frequency characteristic image information.
- the image information of the third example shows the characteristics of the sound generated when a small amount of water is poured from the faucet.
- signals corresponding to the sound of running water are detected between 0 and 6 seconds.
- a continuous signal is detected from around 20 kHz to around 35 kHz, and a plurality of signals exceeding 40 kHz are detected between the continuous signals.
- among the frequency components above 20 kHz, the intensity of the components in the frequency band from around 20 kHz to 35 kHz is large.
- the action estimated from the image information in the third example is, for example, "washing hands".
- FIG. 10 and 11 are diagrams showing a fourth example of image information.
- FIG. 10 is spectrogram image information
- FIG. 11 is frequency characteristic image information.
- the image information of the fourth example indicates the characteristics of sounds related to inaudible sounds generated when hair is combed.
- characteristic signals are detected in the frequency band from 20 kHz to 60 kHz.
- the intensity of the frequency components in the frequency band from 20 kHz to 50 kHz is large in the frequency components of 20 kHz or higher.
- An action that is estimated from the image information in the fourth example is, for example, "combing hair".
- the amount of data can be greatly reduced compared to the case of outputting time-series data of sound pressure.
- When time-series data of sound pressure is output, the amount of data may be on the order of tens of megabytes, whereas when image information is output, the amount of data can be reduced to several hundred kilobytes or less, that is, to the order of 1/100.
- the first estimation unit 221 stores the sound information input to the first trained model 241 in the memory 24 in association with the estimation result, and periodically re-learns the first trained model 241 using the accumulated sound information.
- the first estimation unit 221 changes the threshold so that the frequency of non-stationary sounds estimated in the first trained model 241 is equal to or lower than the reference frequency.
- the communication device 23 is a communication circuit that connects the terminal 2 to the network 5 .
- the communication device 23 transmits output sound information to the server 3 and receives determination result information, which will be described later, from the server 3 .
- the communication device 23 transmits output sound information using a predetermined communication protocol such as MQTT (Message Queueing Telemetry Transport).
- the memory 24 is, for example, a rewritable non-volatile semiconductor memory such as a flash memory, and stores the first trained model 241 and sound information estimated by the first trained model 241 .
- the above is the configuration of terminal 2. Next, the configuration of the server 3 will be explained.
- the server 3 includes a communication device 31 (an example of an acquisition unit), a second processor 32 and a memory 33 .
- a communication device 31 is a communication circuit that connects the server 3 to the network 5 .
- the communication device 31 receives output sound information from the terminal 2 and transmits determination result information, described later, to the terminal 2.
- the second processor 32 is composed of a central processing unit, for example, and includes a second estimation unit 321 (an example of a second estimator) and a determination unit 322.
- the second estimation unit 321 and the determination unit 322 are realized by executing a predetermined information processing program by the central processing unit.
- the second estimation unit 321 and the determination unit 322 may be configured by dedicated hardware circuits such as ASIC.
- the second estimation unit 321 estimates the output result obtained by inputting the output sound information to the second trained model 331 as the behavior of the user.
- the second trained model 331 is a model constructed by performing machine learning on one or more data sets consisting of pairs of output sound information and action information related to human actions corresponding to the output sound information as teacher data.
- the output sound information is the image information of the spectrogram or the image information of the frequency characteristics described above.
- An example of the data format of this image information is JPEG (Joint Photographic Experts Group) or BMP (bitmap).
- the output sound information may be sound information composed of time-series data of sound pressure having a certain time width.
- the teacher data of the second trained model 331 is one or more data sets of sound information and action information.
- An example of the data format of the sound information in this case is WAV (Waveform Audio File Format).
- An example of the second trained model 331 is a convolutional neural network, a recurrent neural network (RNN) such as a long short term memory (LSTM), or an attention mechanism.
- FIG. 12 is a diagram showing how the convolutional neural network 600 forming the second trained model 331 performs machine learning.
- Convolutional neural network 600 includes input layer 601 , convolutional layer 602 , pooling layer 603 , convolutional layer 604 , pooling layer 605 , fully connected layer 606 , and output layer 607 . Since the convolutional neural network 600 is well known, detailed description thereof will be omitted.
- Each node that configures the output layer 607 is assigned an action to be estimated, and is composed of, for example, a softmax function.
- the output sound information is converted to input data and input to the input layer.
- An example of input data is data obtained by one-dimensionally arranging each pixel value of image information of a spectrogram or frequency characteristics. Each pixel value forming the input data is input to each node forming the input layer 601 .
- Input data input to the input layer 601 is sequentially processed in each layer (602 to 607) and output from the output layer 607.
- The output result from the output layer 607 is compared with the action information serving as teacher data, the error between the output result and the teacher data is calculated using an error function, and the convolutional neural network 600 is machine-learned so that this error becomes small.
- FIG. 13 is a diagram showing how the convolutional neural network 600 making up the second trained model 331 performs estimation.
- the second estimation unit 321 converts the output sound information output from the terminal 2 into input data, and inputs the input data to each node of the input layer 601 . Input data input to the input layer 601 is sequentially processed in each layer (602 to 607) and output from the output layer 607.
- The second estimation unit 321 estimates, as the action of the user, the action assigned to the node that outputs the maximum output value among the nodes of the output layer 607. Examples of inferred actions are "undressing", "changing clothes", "walking", "washing hands", and "combing hair".
- the determination unit 322 determines whether or not the output result of the second trained model 331, that is, the behavior information indicating the behavior estimated by the second estimation unit 321, is erroneous, and inputs determination result information indicating the determination result to the second estimation unit 321.
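Selecting the action from the output layer amounts to an argmax over the node outputs. The sketch below assumes one output value per action and a fixed, hypothetical ordering of the action labels; neither the ordering nor the function name comes from the source.

```python
# Hypothetical node order; the source only lists these actions as examples.
ACTIONS = ["undressing", "changing clothes", "walking", "washing hands", "combing hair"]


def estimate_action(output_values, actions=ACTIONS):
    """Return the action assigned to the node with the maximum output value."""
    if len(output_values) != len(actions):
        raise ValueError("one output value is expected per action node")
    best_index = max(range(len(output_values)), key=lambda i: output_values[i])
    return actions[best_index]


print(estimate_action([0.05, 0.05, 0.7, 0.1, 0.1]))
```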
- the determination result information includes determination result information indicating that the estimated behavior is correct and determination result information indicating that the estimated behavior is incorrect.
- the determination unit 322 inputs a control signal for controlling the device 4 according to the behavior estimated by the second estimation unit 321 to the device 4 using the communication device 31. If a cancellation instruction for the control indicated by the control signal is obtained from the device 4 via the communication device 31 within a reference period after the input, the determination unit 322 determines that the output result is erroneous and inputs determination result information indicating the error to the second estimation unit 321.
- if the determination unit 322 does not acquire the cancellation instruction within the reference period after inputting the control signal to the device 4, it inputs determination result information indicating correctness to the second estimation unit 321.
- the content of the control indicated by the control signal output by the determination unit 322 is predetermined according to the estimated behavior.
- the second estimation unit 321 acquires the output sound information corresponding to the output result from the memory 33, and re-learns the second trained model 331 using the acquired output sound information.
- After the device 4 is operated by a control signal corresponding to the estimated behavior, if the user inputs an instruction to change the control to the device 4 within the reference period, there is a high possibility that the estimated behavior is erroneous. In this case, the device 4 outputs to the server 3 a cancellation instruction for notifying the server 3 that the control has been cancelled.
- the determination unit 322 to which this cancellation instruction is input determines that the action corresponding to the cancellation instruction is erroneous.
- the output sound information input to the server 3, the original sound information of that output sound information, the action information indicating the action estimated from the output sound information, the control signal generated according to the action information, and the cancellation instruction for the control signal are given the same identifier. This enables the determination unit 322 to identify which of these pieces of information correspond to one another.
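One way to picture the shared identifier is a record keyed by that identifier, to which the action, control signal, and cancellation are attached as they occur. The class names, field names, and identifier format below are hypothetical; the source states only that the related pieces of information carry the same identifier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EstimationRecord:
    identifier: str
    output_sound_info: bytes
    action: Optional[str] = None
    control_signal: Optional[str] = None
    cancelled: bool = False


class RecordStore:
    """Keeps every piece of information about one estimation under one identifier."""

    def __init__(self):
        self._records = {}

    def register(self, identifier, output_sound_info):
        self._records[identifier] = EstimationRecord(identifier, output_sound_info)

    def set_action(self, identifier, action, control_signal):
        record = self._records[identifier]
        record.action = action
        record.control_signal = control_signal

    def mark_cancelled(self, identifier):
        self._records[identifier].cancelled = True

    def get(self, identifier):
        return self._records[identifier]


store = RecordStore()
store.register("id-1", b"spectrogram-bytes")
store.set_action("id-1", "walking", "turn on lighting")
store.mark_cancelled("id-1")
print(store.get("id-1"))
```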
- the control of the device 4 differs depending on the type of the device 4 and the estimated behavior. For example, when the device 4 is a lighting device and the estimated behavior is "walking", control is performed to turn on the lighting device. For example, if the device 4 is a hair dryer and the estimated action is "combing hair", control is performed to operate the hair dryer. For example, if the device 4 is a lighting device in the washroom and the estimated action is "washing hands", control is performed to turn on the lighting device in the washroom. For example, if the device 4 is an air conditioner and the estimated behavior is "walking", control is performed to operate the air conditioner.
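Because the control content is predetermined per device type and estimated behavior, it can be pictured as a lookup table. The table below merely restates the four examples from the text; the key strings, the control strings, and the function name are illustrative assumptions.

```python
# (device type, estimated behavior) -> control content, restating the examples above.
CONTROL_TABLE = {
    ("lighting device", "walking"): "turn on",
    ("hair dryer", "combing hair"): "operate",
    ("washroom lighting device", "washing hands"): "turn on",
    ("air conditioner", "walking"): "operate",
}


def select_control(device_type, action):
    """Return the predetermined control, or None if no control is defined."""
    return CONTROL_TABLE.get((device_type, action))


print(select_control("hair dryer", "combing hair"))
print(select_control("hair dryer", "walking"))
```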
- the memory 33 is composed of a nonvolatile rewritable storage device such as a hard disk drive and a solid state drive, and stores the second trained model 331 and the output sound information etc. input to the second trained model 331 . Note that the output sound information is stored in association with the determination result information.
- FIG. 14 is a flowchart showing an example of processing of the information processing system 1 according to Embodiment 1 of the present disclosure. Note that the processing of the terminal 2 is repeatedly executed.
- In step S11, the first estimation unit 221 acquires sound information having a predetermined time width by AD-converting the sound signal input from the microphone 21.
- In step S12, the first estimation unit 221 inputs the sound information to the first trained model 241 and estimates whether the input sound information is stationary sound or non-stationary sound.
- Here, the first estimation unit 221 calculates the estimation error between the sound information input to the first trained model 241 and the sound information output from the first trained model 241, and estimates whether the sound is stationary or non-stationary by comparing the estimation error with the threshold.
- In step S13, if the first estimation unit 221 estimates that the input sound information is non-stationary sound (YES in step S13), it generates output sound information from the input sound information (step S14).
- If, in step S13, the input sound information is estimated to be stationary sound (NO in step S13), the process returns to step S11.
- In step S15, the first estimation unit 221 uses the communication device 23 to output the output sound information to the server 3.
- In step S21, the communication device 31 acquires the output sound information.
- In step S22, the second estimation unit 321 inputs the output sound information to the second trained model 331 to estimate the behavior of the user.
- In step S23, the determination unit 322 generates a control signal according to the action estimated by the second estimation unit 321.
- In step S24, the determination unit 322 outputs the control signal to the device 4 using the communication device 31.
- In step S31, the device 4 acquires the control signal.
- In step S32, the device 4 operates according to the control signal.
- the device 4 is controlled according to the behavior estimated by the server 3.
- FIG. 15 is a diagram showing an example of threshold setting processing used when the terminal 2 determines whether the sound is a non-stationary sound or a stationary sound. This flowchart is executed, for example, every predetermined period. Examples of the predetermined period are 1 hour, 6 hours, 1 day, etc., and are not particularly limited.
- the first estimation unit 221 calculates the frequency of outputting the output sound information.
- the first estimation unit 221 may store, in the memory 24, log information indicating whether each piece of sound information was estimated to be stationary sound or non-stationary sound, and calculate the frequency using this log information.
- the frequency is defined, for example, as the ratio of the total number of non-stationary sound information items to the total number of sound information items input to the first trained model 241 during the period from the previous frequency calculation to the present.
- the log information has, for example, a data structure in which an estimated time, an estimation result, and an identifier of sound information are associated with each other.
- In step S52, the first estimation unit 221 determines whether or not the frequency is greater than or equal to the reference frequency. If the frequency is greater than or equal to the reference frequency (YES in step S52), the first estimation unit 221 increases the threshold by a predetermined value (step S53). On the other hand, if the frequency is less than the reference frequency (NO in step S52), the process ends. The reference frequency is a predetermined value chosen in consideration of the network load. Because the threshold is increased by the predetermined value whenever the frequency is equal to or higher than the reference frequency, the number of times sound information is estimated to be non-stationary sound gradually decreases, and so does the number of times output sound information is output. As a result, the frequency gradually approaches the reference frequency.
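The threshold setting process can be sketched as follows. The log entry layout (a dictionary with a `result` field), the function names, and the numeric values are hypothetical; the logic follows the text: compute the share of non-stationary estimates since the last calculation and raise the threshold by a fixed step when that share reaches the reference frequency.

```python
def non_stationary_frequency(log_entries):
    """Share of log entries whose estimation result is non-stationary."""
    if not log_entries:
        return 0.0
    hits = sum(1 for entry in log_entries if entry["result"] == "non-stationary")
    return hits / len(log_entries)


def update_threshold(threshold, log_entries, reference_frequency, step):
    """Raise the threshold by `step` when the frequency reaches the reference."""
    if non_stationary_frequency(log_entries) >= reference_frequency:
        return threshold + step
    return threshold


# Hypothetical log covering the period since the previous calculation.
log = [
    {"time": 1, "result": "non-stationary", "sound_id": "a"},
    {"time": 2, "result": "non-stationary", "sound_id": "b"},
    {"time": 3, "result": "stationary", "sound_id": "c"},
    {"time": 4, "result": "non-stationary", "sound_id": "d"},
]
print(update_threshold(1.0, log, reference_frequency=0.5, step=0.5))
```

Run once per period, this raises the threshold step by step until fewer frames are classified as non-stationary, which is the gradual approach to the reference frequency that the text describes.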
- FIG. 16 is a flowchart showing an example of processing of the information processing system 1 when the server 3 transmits a control signal to the device 4.
- In step S71, the determination unit 322 generates a control signal according to the behavior estimated by the second estimation unit 321 and outputs the generated control signal to the device 4 using the communication device 31.
- In step S81, the device 4 acquires the control signal.
- In step S82, the device 4 executes the control indicated by the control signal.
- In step S83, the device 4 determines whether or not it has received an instruction from the user to change the control within a reference period after executing the control. If the instruction is received within the reference period (YES in step S83), the device 4 generates a cancellation instruction and outputs the generated cancellation instruction to the server 3 (step S84). On the other hand, if the instruction is not received within the reference period (NO in step S83), the process ends.
- In step S72, the determination unit 322 of the server 3 determines whether or not a cancellation instruction has been obtained within a reference period after outputting the control signal. If the cancellation instruction is acquired within the reference period (YES in step S72), the determination unit 322 generates determination result information indicating that the behavior estimated by the second estimation unit 321 is incorrect (step S73). On the other hand, if the cancellation instruction is not acquired within the reference period (NO in step S72), the determination unit 322 generates determination result information indicating that the action estimated by the second estimation unit 321 is correct (step S74).
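The decision in steps S72 to S74 reduces to checking whether a cancellation arrived within the reference period. In the sketch below, timestamps are plain numbers in seconds, a missing cancellation is represented by `None`, and the function name is hypothetical.

```python
def judge_estimation(control_sent_at, cancel_received_at, reference_period):
    """Return 'incorrect' if a cancellation arrived within the reference period.

    cancel_received_at is None when no cancellation instruction was obtained.
    """
    if cancel_received_at is None:
        return "correct"
    elapsed = cancel_received_at - control_sent_at
    if 0 <= elapsed <= reference_period:
        return "incorrect"
    return "correct"


print(judge_estimation(100.0, 130.0, reference_period=60.0))
print(judge_estimation(100.0, None, reference_period=60.0))
```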
- In step S75, the second estimation unit 321 stores the determination result information and the output sound information corresponding to the determination result information in the memory 33 in association with each other.
- In step S76, the second estimation unit 321 transmits the determination result information to the terminal 2 using the communication device 31.
- In step S61, the first estimation unit 221 of the terminal 2 acquires the determination result information using the communication device 23.
- In step S62, the first estimation unit 221 associates the determination result information with the sound information stored in the memory 24 that corresponds to the determination result information. The first estimation unit 221 thereby obtains feedback as to whether or not the user's behavior was correctly estimated from the sound information of the non-stationary sound transmitted to the server 3 as the output sound information.
- FIG. 17 is a flowchart showing an example of processing when the first trained model 241 is re-learned.
- In step S101, the first estimation unit 221 of the terminal 2 determines whether or not it is time to re-learn.
- An example of the re-learning timing is the time at which a certain period has elapsed since the previous re-learning, or the time at which the amount of sound information accumulated in the memory 24 since the previous re-learning has reached a predetermined amount.
- When re-learning is performed for the first time, an example of the re-learning timing is the time at which a certain period has elapsed since the terminal 2 started operating, or the time at which the amount of sound information accumulated in the memory 24 since the terminal 2 started operating has reached a predetermined amount.
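A minimal sketch of the timing check follows. The text gives the elapsed-time condition and the accumulated-amount condition as alternative examples; combining them with a logical OR, as done here, is an assumption, as are the parameter names and values. For the first re-learning, the reference point is the start of operation instead of the previous re-learning.

```python
def is_relearning_time(seconds_since_reference, new_items,
                       min_interval_seconds, min_new_items):
    """True when enough time has passed or enough new sound information exists.

    The reference point is the previous re-learning, or the start of
    operation when no re-learning has been performed yet.
    """
    return (seconds_since_reference >= min_interval_seconds
            or new_items >= min_new_items)


print(is_relearning_time(90_000, 10, min_interval_seconds=86_400, min_new_items=1_000))
print(is_relearning_time(3_600, 10, min_interval_seconds=86_400, min_new_items=1_000))
```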
- If it is time to re-learn (YES in step S101), the first estimation unit 221 acquires sound information to be learned from the memory 24 (step S102).
- When the first trained model 241 is the autoencoder 500, an example of the sound information to be learned is the sound information estimated to be stationary sound among the sound information newly accumulated in the memory 24 since the previous re-learning (or since the terminal 2 started operating).
- When the first trained model 241 is a convolutional neural network, examples of the sound information to be learned are, among the newly accumulated sound information, the sound information estimated to be stationary sound and the sound information that was estimated to be non-stationary sound and is associated with determination result information indicating correctness.
- If, in step S101, it is not the time to re-learn (NO in step S101), the process ends.
- In step S103, the first estimation unit 221 re-learns the first trained model 241 using the learning target sound information.
- When the first trained model 241 is the autoencoder 500, the first trained model 241 is re-learned using the sound information estimated to be stationary sound.
- When the first trained model 241 is a convolutional neural network, the sound information estimated to be stationary sound is given a stationary-sound label and re-learned, and the sound information that was estimated to be non-stationary sound and is associated with determination result information indicating correctness is given a non-stationary-sound label and re-learned.
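The selection of re-learning data for the two model types can be sketched as below. The record layout (`sound`, `estimate`, `judgement`) and the model type strings are hypothetical. The autoencoder case returns unlabeled stationary sounds; the convolutional neural network case returns labeled pairs.

```python
def select_training_data(records, model_type):
    """Pick re-learning data from newly accumulated records.

    Each record is a dict with 'sound', 'estimate' and, for sounds sent to
    the server, a 'judgement' of 'correct' or 'incorrect'.
    """
    if model_type not in ("autoencoder", "cnn"):
        raise ValueError("model_type must be 'autoencoder' or 'cnn'")
    labeled = []
    for record in records:
        if record["estimate"] == "stationary":
            labeled.append((record["sound"], "stationary"))
        elif model_type == "cnn" and record.get("judgement") == "correct":
            labeled.append((record["sound"], "non-stationary"))
    if model_type == "autoencoder":
        # The autoencoder learns to reconstruct stationary sound; no labels needed.
        return [sound for sound, _ in labeled]
    return labeled


records = [
    {"sound": "s1", "estimate": "stationary"},
    {"sound": "s2", "estimate": "non-stationary", "judgement": "correct"},
    {"sound": "s3", "estimate": "non-stationary", "judgement": "incorrect"},
]
print(select_training_data(records, "autoencoder"))
print(select_training_data(records, "cnn"))
```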
- FIG. 18 is a flowchart showing an example of processing when the second trained model 331 is re-learned.
- In step S201, the second estimation unit 321 of the server 3 determines whether or not it is time to re-learn.
- An example of the re-learning timing is the time at which a certain period has elapsed since the previous re-learning, or the time at which the amount of output sound information accumulated in the memory 33 since the previous re-learning has reached a predetermined amount.
- When re-learning is performed for the first time, an example of the re-learning timing is the time at which a certain period has elapsed since the server 3 started operating, or the time at which the amount of output sound information accumulated in the memory 33 since the server 3 started operating has reached a predetermined amount.
- If it is time to re-learn (YES in step S201), the second estimation unit 321 acquires output sound information to be learned from the memory 33 (step S202).
- An example of the output sound information to be learned is the output sound information associated with determination result information indicating correctness, among the output sound information newly accumulated in the memory 33 since the previous re-learning (or since the server 3 started operating).
- If, in step S201, it is not the time to re-learn (NO in step S201), the process ends.
- In step S203, the second estimation unit 321 re-learns the second trained model 331 using the learning target output sound information.
- the terminal 2 does not transmit all the sound information picked up by the microphone 21 to the server 3; it outputs only the sound information indicating non-stationary sound to the server 3. This reduces the amount of data flowing through the network 5 and the load on the network 5, the terminal 2, and the server 3.
- FIG. 19 is a block diagram showing an example of the configuration of an information processing system 1A according to Embodiment 2 of the present disclosure.
- the same reference numerals are assigned to the same components as those in the first embodiment, and the description thereof is omitted.
- the first processor 22A of the terminal 2A includes a first estimation section 221A and a frequency conversion section 222.
- the first estimation unit 221A extracts, from the sound information estimated to be non-stationary sound among the sound information indicating the sound picked up by the microphone 21, the sound information of a first frequency band, which is the frequency band with the maximum sound pressure level, and inputs the extracted sound information of the first frequency band to the frequency conversion unit 222.
- the first frequency band is an ultrasonic band having the highest sound pressure level among the plurality of predetermined frequency bands.
- the frequency conversion unit 222 converts the input sound information of the first frequency band into sound information of a second frequency band, which is a lower frequency band than the first frequency band, and generates the converted sound information of the second frequency band as output sound information.
- the frequency conversion unit 222 generates supplementary information indicating the range of the first frequency band and includes it in the output sound information.
- FIG. 20 is an explanatory diagram of frequency conversion processing.
- the left diagram of FIG. 20 is sound information 701 of a spectrogram before frequency conversion.
- the right diagram of FIG. 20 is the sound information 703 of the spectrogram after frequency conversion.
- the vertical axis is frequency (Hz) and the horizontal axis is time (seconds).
- the vertical width of the sound information 701 is, for example, 100 kHz, and the horizontal width is, for example, 10 seconds.
- the first estimation unit 221A divides the sound information 701 into predetermined frequency bands of 20 kHz each.
- the frequency band from 0 kHz to 100 kHz is divided into five frequency bands of 20 kHz each.
- the first estimation unit 221A identifies the frequency band with the highest sound pressure level among the four frequency bands belonging to the ultrasonic band of 20 kHz or higher.
- the sound pressure level is the total value or average value of the sound pressure in each frequency band.
- the pixel value of each pixel represents the sound pressure
- the total value or average value of the pixel values of each frequency band is calculated as the sound pressure level.
- the reason why the sound information 701 is divided into 20 kHz bands is that the audible band has a width of 20 kHz.
- the first estimation unit 221A extracts sound information 702 in the frequency band of 20 kHz to 40 kHz from the sound information 701.
- the reason why the frequency band of 0 to 20 kHz is omitted is that this frequency band is an audible band and contains a lot of unnecessary noise, which lowers the accuracy of action estimation.
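The band division and selection can be sketched on a toy spectrogram. The representation used here (a list of rows ordered from low to high frequency, each row a list of pixel values over time) and the function name are assumptions; the 100 kHz range, 20 kHz band width, exclusion of the audible band, and use of the total pixel value as the sound pressure level follow the text.

```python
def select_first_band(rows, max_frequency_hz=100_000, band_width_hz=20_000,
                      audible_limit_hz=20_000):
    """Return (low_hz, high_hz) of the ultrasonic band with the highest level.

    rows: spectrogram rows ordered from low to high frequency; each row is a
    list of pixel values over time. The level of a band is the total of its
    pixel values.
    """
    band_count = max_frequency_hz // band_width_hz
    if len(rows) % band_count != 0:
        raise ValueError("row count must be divisible by the number of bands")
    rows_per_band = len(rows) // band_count
    best_band, best_level = None, None
    for band in range(band_count):
        low_hz = band * band_width_hz
        if low_hz < audible_limit_hz:
            continue  # the audible band is excluded from the candidates
        band_rows = rows[band * rows_per_band:(band + 1) * rows_per_band]
        level = sum(sum(row) for row in band_rows)
        if best_level is None or level > best_level:
            best_band, best_level = (low_hz, low_hz + band_width_hz), level
    return best_band


# Toy spectrogram: one row per 20 kHz band, three time steps.
toy_rows = [
    [9, 9, 9],  # 0-20 kHz (audible, ignored even though it is the loudest)
    [5, 6, 7],  # 20-40 kHz
    [1, 1, 1],  # 40-60 kHz
    [0, 2, 0],  # 60-80 kHz
    [0, 0, 1],  # 80-100 kHz
]
print(select_first_band(toy_rows))
```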
- the frequency conversion unit 222 converts the sound information 702 into sound information 703 in the audible band of 0-20 kHz.
- the audible band is an example of the second frequency band.
- the sound information 703 is image information that includes the sound pressure distribution of the sound information 702 as it is.
- the sound information 703 has the same horizontal width of 10 seconds as the sound information 701, but the vertical width is compressed to 20 kHz. Therefore, it can be seen that the data amount of the sound information 703 is compressed to about one-fifth of that of the sound information 701 .
- the frequency conversion unit 222 generates supplementary information indicating "20 kHz to 40 kHz", the range of the frequency band of the sound information 702.
- the frequency conversion unit 222 transmits the sound information 703 and the incidental information to the server 3 using the communication device 23 as output sound information. Furthermore, since the sound information 703 is sound information in the audible band, the sampling rate can be made smaller than when the sound information 702 is transmitted, and the amount of data can be reduced.
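The conversion can be pictured as cutting the selected band out of the spectrogram, relabeling it as starting at 0 Hz, and attaching the original range as supplementary information. This sketch operates on the image representation only and assumes the same hypothetical row layout (low to high frequency) as above; the dictionary keys and function name are illustrative.

```python
def convert_to_second_band(rows, first_band, max_frequency_hz=100_000):
    """Cut out the rows of first_band and relabel them as 0..band width Hz.

    The pixel values (sound pressure distribution) are kept unchanged.
    """
    low_hz, high_hz = first_band
    hz_per_row = max_frequency_hz / len(rows)
    start = int(low_hz / hz_per_row)
    stop = int(high_hz / hz_per_row)
    converted_rows = [list(row) for row in rows[start:stop]]
    return {
        "sound_information": converted_rows,
        "second_band_hz": (0, high_hz - low_hz),
        "supplementary_information": {"first_band_hz": (low_hz, high_hz)},
    }


# Toy spectrogram: ten rows of 10 kHz each, two time steps.
source_rows = [[i, i + 1] for i in range(10)]
output = convert_to_second_band(source_rows, (20_000, 40_000))
ratio = len(output["sound_information"]) / len(source_rows)
print(output["second_band_hz"], ratio)
```

With a 20 kHz band taken from a 100 kHz spectrogram, the row count ratio is 0.2, which corresponds to the roughly one-fifth data amount mentioned for the sound information 703.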
- the second processor 32A of the server 3A further includes a second estimator 321A.
- Memory 33 of server 3A includes second trained model 331A.
- the second estimation unit 321A estimates the output result obtained by inputting the sound information 703 output from the terminal 2 and the incidental information to the second trained model 331A as the behavior of the user.
- the second trained model 331A is a model constructed by performing machine learning on one or more data sets consisting of pairs of incidental information and sound information 703 and actions corresponding to the sound information 703 as teacher data.
- FIG. 21 is a flowchart showing an example of details of the process of step S14 of FIG. 14 in the second embodiment of the present disclosure.
- In step S301, the first estimation unit 221A generates sound information 701 indicating the sound characteristics of the sound information estimated to be non-stationary sound.
- In step S302, the first estimation unit 221A divides the sound information 701 into multiple frequency bands.
- In step S303, the first estimation unit 221A extracts the sound information 702 of the first frequency band, which belongs to the ultrasonic band and has the highest sound pressure level among the divided frequency bands.
- In step S304, the frequency conversion unit 222 converts the sound information 702 into sound information 703 of the second frequency band (audible band).
- In step S305, the frequency conversion unit 222 generates supplementary information indicating the range of the first frequency band.
- In step S306, the frequency conversion unit 222 generates output sound information including the sound information 703 and the supplementary information.
- In step S307, the frequency conversion unit 222 uses the communication device 23 to transmit the output sound information to the server 3A.
- As described above, the sound information of the first frequency band, which is the frequency band including the non-stationary sound, is extracted, the extracted sound information is converted into sound information of a second frequency band lower than the first frequency band, and the converted sound information of the second frequency band is output from the terminal 2 to the server 3. The data amount of the sound information sent over the network 5 can therefore be greatly reduced compared with transmitting time-series data of the collected sound.
- The sound information 701 is divided into bands 20 kHz wide, but the division width is not limited to 20 kHz, and an appropriate value such as 1, 5, 10, 30, or 50 kHz may be adopted.
- The frequency range (vertical width) of the sound information 701 is 100 kHz, but this is an example, and an appropriate value such as 200, 500, or 1000 kHz may be adopted.
- The time width (horizontal width) of the sound information 701 is 10 seconds, but this is an example, and an appropriate value such as 1, 3, 5, 8, 20, or 30 seconds may be adopted.
- The frequency conversion unit 222 performs the frequency conversion on the spectrogram sound information 701, but the present disclosure is not limited to this. The frequency conversion may instead be applied to image information representing the frequency characteristics of the sound indicated by the sound information, or to those frequency characteristics themselves.
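Steps S301 to S306 can be sketched on a spectrogram array as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `extract_loudest_ultrasonic_band`, the dictionary keys, the use of the band average as the sound pressure level, and the assumption that the spectrogram rows are uniformly spaced from 0 Hz up to `max_freq_hz` are all hypothetical.

```python
import numpy as np


def extract_loudest_ultrasonic_band(spec, max_freq_hz=100_000.0,
                                    band_hz=20_000.0, audible_hz=20_000.0):
    """Sketch of S302-S306 on a (frequency x time) spectrogram array."""
    n_bands = int(round(max_freq_hz / band_hz))
    # S302: divide the spectrogram into equal-width frequency bands.
    bands = np.array_split(spec, n_bands, axis=0)

    # S303: among ultrasonic bands only, pick the one with the highest level.
    best_idx, best_level = None, -np.inf
    for i, band in enumerate(bands):
        if i * band_hz < audible_hz:
            continue  # bands starting below the audible limit are skipped
        level = float(band.mean())  # assumed level measure: band average
        if level > best_level:
            best_idx, best_level = i, level
    if best_idx is None:
        raise ValueError("no ultrasonic band available")

    # S304: in the spectrogram domain, moving the band to the audible range
    # amounts to relabelling its rows as 0 Hz .. band_hz.
    converted = bands[best_idx]
    # S305: incidental information records where the band originally was.
    incidental = {"first_band_hz": (best_idx * band_hz, (best_idx + 1) * band_hz)}
    # S306: output sound information = converted sound + incidental information.
    return {"sound": converted, "second_band_hz": (0.0, band_hz),
            "incidental": incidental}
```

With a 100-row spectrogram covering 0 to 100 kHz, the returned array has 20 rows, so only one fifth of the original rows would need to be transmitted in step S307.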
- FIG. 22 is a block diagram showing an example of a configuration of an information processing system 1B according to Embodiment 3 of the present disclosure.
- N terminals 2 (N is an integer of 2 or more), namely terminals 2_1, 2_2, . . . , 2_N, are arranged.
- The terminals 2 are placed at multiple locations within the residence 6 where activity needs to be monitored, one in each room.
- Each terminal 2 independently collects sound with a microphone 21, generates output sound information from the sound information when the collected sound is non-stationary sound, and transmits the generated output sound information to the server 3.
- the second estimation unit 321 of the server 3 inputs each piece of output sound information transmitted from each terminal 2 to the second trained model 331, and individually estimates the behavior of the user from each piece of output sound information.
- the terminal 2 has the same configuration as in the first embodiment, but may have the same configuration as in the second embodiment.
- In Embodiment 4, each terminal 2 in the configuration of the third embodiment is further provided with one or more sensors other than the microphone 21.
- FIG. 23 is a block diagram showing an example of a configuration of an information processing system 1C according to Embodiment 4 of the present disclosure.
- the same components as in Embodiments 1 to 3 are denoted by the same reference numerals, and descriptions thereof are omitted.
- Each terminal 2 further includes a sensor 25 and a sensor 26.
- Sensor 25 is a CO2 sensor that detects the concentration of carbon dioxide, a humidity sensor, or a temperature sensor.
- the sensor 26 is a sensor different from the sensor 25 among the CO2 sensor, humidity sensor, and temperature sensor.
- The sensor 25 periodically performs sensing and inputs first sensing information having a certain time width to the first estimation unit 221.
- The sensor 26 periodically performs sensing and inputs second sensing information having a certain time width to the first estimation unit 221.
- The first estimation unit 221 inputs the sound information, the first sensing information, and the second sensing information to the first trained model 241 and estimates whether the state inside the house 6 is a steady state or an unsteady state.
- The steady state refers to a state in which the user is not taking any action.
- The unsteady state refers to a state in which the user has taken some action.
- the first estimation unit 221 estimates that the state inside the house 6 is a steady state
- When the state inside the house 6 is estimated to be an unsteady state, the first estimation unit 221 transmits the sound information, the first sensing information, and the second sensing information to the server 3 as output sound information.
- When the first trained model 241 is composed of the autoencoder 500, it is constructed by machine learning using, as teacher data, one or more data sets each consisting of sound information indicating a steady sound, first sensing information indicating a steady state, and second sensing information indicating a steady state.
- When the first trained model 241 is composed of the convolutional neural network 600, it is constructed by machine learning using, as teacher data, one or more data sets each consisting of sound information, first sensing information, second sensing information, and a label indicating a steady state or an unsteady state.
- The first trained model 241 may consist of three trained models: a first trained model corresponding to sound information, a second trained model corresponding to first sensing information, and a third trained model corresponding to second sensing information. In this case, when at least one of the first to third trained models yields an estimate of non-stationary sound (or an unsteady state), the first estimation unit 221 may estimate that the state inside the house 6 is an unsteady state.
- The second trained model 331 is a model constructed by machine learning on one or more data sets each consisting of the sound information, first sensing information, and second sensing information that constitute output sound information indicating an unsteady state, together with the action corresponding to that output sound information.
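The three-model variant of the first trained model 241 combines per-modality decisions with a logical OR. The sketch below illustrates only that combination rule. Each "model" is replaced by a simple threshold on the window mean, which is a stand-in and not the autoencoder 500 or the convolutional neural network 600; the names `make_threshold_detector` and `estimate_state` and all numeric values are hypothetical.

```python
from typing import Callable, Sequence

# A detector returns True when its input window looks non-stationary.
Detector = Callable[[Sequence[float]], bool]


def make_threshold_detector(baseline: float, tolerance: float) -> Detector:
    """Stand-in for one trained model: flags windows whose mean drifts from a baseline."""
    def is_non_stationary(window: Sequence[float]) -> bool:
        mean = sum(window) / len(window)
        return abs(mean - baseline) > tolerance
    return is_non_stationary


def estimate_state(sound, first_sensing, second_sensing, detectors) -> str:
    """Return 'unsteady' if at least one of the three detectors fires."""
    inputs = (sound, first_sensing, second_sensing)
    flags = [detect(x) for detect, x in zip(detectors, inputs)]
    return "unsteady" if any(flags) else "steady"


# Hypothetical baselines: sound level, CO2 concentration (ppm), temperature (deg C).
detectors = [
    make_threshold_detector(0.0, 0.5),
    make_threshold_detector(400.0, 100.0),
    make_threshold_detector(22.0, 2.0),
]
quiet = estimate_state([0.1, 0.0], [410.0, 420.0], [22.0, 22.5], detectors)
co2_rise = estimate_state([0.1, 0.0], [900.0, 950.0], [22.0, 22.5], detectors)
```

In the second call only the CO2 window deviates, yet the overall state is reported as unsteady, which is the behavior the OR rule is meant to give.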
- The server 3 is not limited to a cloud server and may be, for example, a home server. In this case, the network 5 is a local area network.
- the terminal 2 may be mounted on the device 4.
- the first estimation unit 221A shown in FIG. 19 may extract sound information of a plurality of first frequency bands from sound information estimated as non-stationary sound.
- the frequency conversion unit 222 converts the sound information of the plurality of first frequency bands extracted by the first estimation unit 221A into the sound information of the second frequency band, which is the lowest first frequency band among the plurality of first frequency bands.
- a plurality of converted sound information in the second frequency band may be synthesized, and the synthesized sound information may be generated as output sound information.
- FIG. 24 is an explanatory diagram of Modification 3 of the present disclosure.
- the left diagram of FIG. 24 is sound information 801 of a spectrogram including non-stationary sound before frequency conversion.
- the middle diagram in FIG. 24 shows sound information 802 of a spectrogram divided into a plurality of frequency bands.
- The right diagram of FIG. 24 is the sound information 803 of the spectrogram after frequency conversion.
- the vertical axis is frequency (Hz) and the horizontal axis is time (seconds).
- the first estimation unit 221A divides the sound information 801 into predetermined frequency bands of 20 kHz each.
- the frequency band from 0 kHz to 100 kHz is divided into five frequency bands of 20 kHz each, and five pieces of sound information 8021, 8022, 8023, 8024 and 8025 are obtained.
- These five pieces of sound information 8021 to 8025 are examples of a plurality of pieces of sound information of the first frequency band.
- The frequency conversion unit 222 converts each of the sound information 8021 to 8025 into sound information in the audible band and adds the five converted pieces of sound information together to generate the sound information 803.
- Sound information 803 is an example of sound information of the second frequency band. As a result, sound information 803, in which the data amount of the sound information 801 is compressed to about 1/5, is obtained. The frequency conversion unit 222 then transmits the sound information 803 to the server 3 as output sound information using the communication device 23. Since the sound information 803 is in the audible band, the sampling rate can be lower than when transmitting the sound information 801, which reduces the amount of data.
- The second estimation unit 321A of the server 3A may estimate the user's behavior using the second trained model 331 shown in the first embodiment. That is, the second estimation unit 321A may input the sound information 803 to the second trained model 331 and take the obtained output as the estimated behavior of the user.
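The band folding of Modification 3 can be illustrated on a spectrogram array. This is a minimal sketch under assumptions: the function name `fold_bands` is hypothetical, the spectrogram is taken to be a (frequency x time) array whose rows are uniformly spaced from 0 to 100 kHz, and "converting to the audible band" is modeled as stacking each 20 kHz band onto the 0 to 20 kHz rows before summing.

```python
import numpy as np


def fold_bands(spec: np.ndarray, n_bands: int = 5) -> np.ndarray:
    """Split a spectrogram into n_bands equal bands and sum them row by row."""
    n_rows, n_frames = spec.shape
    if n_rows % n_bands != 0:
        raise ValueError("frequency rows must divide evenly into bands")
    rows_per_band = n_rows // n_bands
    # Band b occupies rows [b * rows_per_band, (b + 1) * rows_per_band).
    bands = spec.reshape(n_bands, rows_per_band, n_frames)
    # Summing over the band axis overlays every band on the lowest one.
    return bands.sum(axis=0)


spec = np.arange(400, dtype=float).reshape(100, 4)  # 100 frequency rows, 4 frames
folded = fold_bands(spec)
```

Here `folded` has 20 rows instead of 100, which corresponds to the roughly 1/5 data reduction described for sound information 803. The trade-off is that the original band of each component can no longer be told apart after the sum.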
- the first estimator 221A extracts sound information in a first frequency band that includes a non-stationary sound among a plurality of first frequency bands from sound information that is estimated to be a non-stationary sound.
- the frequency conversion unit 222 converts the sound information of the first frequency band extracted by the first estimation unit 221A into the sound information of the second frequency band, which is the lowest first frequency band among the plurality of first frequency bands.
- the converted sound information of the second frequency band may be synthesized, and the synthesized sound information may be generated as the output sound information.
- FIG. 25 is an explanatory diagram of Modification 4 of the present disclosure.
- The left diagram of FIG. 25 is sound information 901 of a spectrogram before frequency conversion.
- The middle diagram of FIG. 25 shows sound information 902 of the frequency bands containing an abnormal sound whose sound pressure level is equal to or greater than a predetermined value.
- The right diagram of FIG. 25 is sound information 903 after frequency conversion.
- the first estimation unit 221A divides the sound information 901 into predetermined frequency bands of 20 kHz each, and extracts the sound information 902 of frequency bands in which the sound pressure level is equal to or higher than a predetermined value in the divided frequency bands.
- sound information 902 including sound information 9021 in the frequency band of 20 kHz to 40 kHz and sound information 9022 in the frequency band of 40 kHz to 60 kHz is extracted.
- the sound pressure level is the total value or average value of the sound pressure in each frequency band, as in the second embodiment.
- The first estimation unit 221A generates incidental information indicating the frequency band (20 kHz to 40 kHz) of the extracted sound information 9021 and the frequency band (40 kHz to 60 kHz) of the extracted sound information 9022.
- the frequency conversion unit 222 converts each of the sound information 9021 and the sound information 9022 into sound information in an audible band of 0 to 20 kHz, and adds the converted two pieces of sound information to generate the sound information 903. Then, the frequency conversion unit 222 uses the communication device 23 to transmit the sound information 903 and the incidental information to the server 3A as output sound information.
- The second estimation unit 321A of the server 3A can estimate the user's behavior using the second trained model 331A shown in the second embodiment. That is, the second estimation unit 321A may input the sound information 903 and the incidental information to the second trained model 331A and take the obtained output as the estimated behavior of the user.
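Modification 4 keeps only the bands whose sound pressure level reaches a threshold and records which bands were kept. The following sketch assumes a (frequency x time) spectrogram with uniformly spaced rows and uses the band average as the sound pressure level; the function name `extract_bands_above_threshold` and the threshold value are hypothetical.

```python
import numpy as np


def extract_bands_above_threshold(spec, threshold, n_bands=5, band_hz=20_000):
    """Return (summed selected bands, list of (low_hz, high_hz)) for loud bands."""
    bands = np.split(spec, n_bands, axis=0)  # equal-width frequency bands
    selected = [(i, band) for i, band in enumerate(bands)
                if band.mean() >= threshold]
    # Incidental information: original frequency range of every kept band.
    incidental = [(i * band_hz, (i + 1) * band_hz) for i, _ in selected]
    if not selected:
        return None, incidental
    # Overlay the kept bands on the 0..band_hz rows and add them together.
    combined = np.sum([band for _, band in selected], axis=0)
    return combined, incidental


spec = np.zeros((100, 3))
spec[20:40] = 2.0   # 20 kHz to 40 kHz band
spec[40:60] = 3.0   # 40 kHz to 60 kHz band
combined, incidental = extract_bands_above_threshold(spec, threshold=1.0)
```

Because the summed array alone no longer shows where each component came from, the incidental list is what would let a model on the server side distinguish, for example, a 20 to 40 kHz source from a 40 to 60 kHz one.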
- The method of frequency conversion in the frequency conversion unit 222 is not particularly limited, but as an example, the product-to-sum identity that follows from the addition theorem of trigonometric functions can be used, for example in the form $\sin\alpha\cos\beta = \tfrac{1}{2}\{\sin(\alpha+\beta) + \sin(\alpha-\beta)\}$.
- For example, the frequency conversion unit 222 multiplies the sound signal in the frequency band of 20 kHz to 40 kHz ($\sin\alpha$) by a 20 kHz signal ($\cos\beta$), and frequency conversion may be performed by extracting the difference component ($\sin(\alpha-\beta)$).
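The multiply-and-extract-difference approach can be checked numerically. The sketch below mixes a 25 kHz tone with a 20 kHz cosine and keeps only the components below 20 kHz, leaving the 5 kHz difference component. The function name `shift_down`, the 200 kHz sampling rate, and the FFT-masking low-pass filter are assumptions chosen for illustration; a real device would more likely use a conventional filter.

```python
import numpy as np


def shift_down(signal, fs_hz, carrier_hz=20_000.0, cutoff_hz=20_000.0):
    """Mix with a cosine carrier, then keep only components below cutoff_hz."""
    t = np.arange(signal.size) / fs_hz
    # The product contains the sum and difference frequencies.
    mixed = signal * np.cos(2.0 * np.pi * carrier_hz * t)
    spectrum = np.fft.rfft(mixed)
    freqs = np.fft.rfftfreq(mixed.size, d=1.0 / fs_hz)
    spectrum[freqs >= cutoff_hz] = 0.0  # crude low-pass: drop the sum component
    return np.fft.irfft(spectrum, n=mixed.size)


fs_hz = 200_000.0
t = np.arange(2000) / fs_hz                 # 10 ms of samples
tone = np.sin(2.0 * np.pi * 25_000.0 * t)   # ultrasonic input at 25 kHz
shifted = shift_down(tone, fs_hz)

freqs = np.fft.rfftfreq(shifted.size, d=1.0 / fs_hz)
peak_hz = freqs[np.argmax(np.abs(np.fft.rfft(shifted)))]
```

The dominant frequency of `shifted` lands at the difference frequency (25 kHz minus 20 kHz), while the 45 kHz sum component is removed by the low-pass step.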
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2023538459A JPWO2023008260A1 | 2021-07-29 | 2022-07-19 | |
| CN202280047206.2A CN117597734A (zh) | 2021-07-29 | 2022-07-19 | Information processing system, information processing method, and information processing program |
| US18/421,511 US20240161771A1 (en) | 2021-07-29 | 2024-01-24 | Information processing system, information processing method, and non-transitory computer readable recording medium |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2021-124570 | 2021-07-29 | | |
Related Child Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US18/421,511 Continuation US20240161771A1 (en) | 2021-07-29 | 2024-01-24 | Information processing system, information processing method, and non-transitory computer readable recording medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023008260A1 (ja) | 2023-02-02 |
Family
ID=85087598
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2022/028075 Ceased WO2023008260A1 (ja) | 2021-07-29 | 2022-07-19 | Information processing system, information processing method, and information processing program |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US20240161771A1 |
| JP (1) | JPWO2023008260A1 |
| CN (1) | CN117597734A |
| WO (1) | WO2023008260A1 |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2011237865A (ja) * | 2010-05-06 | 2011-11-24 | Advanced Telecommunication Research Institute International | Living space monitoring system |
| JP2019132912A (ja) * | 2018-01-29 | 2019-08-08 | 富士通株式会社 | Living sound recording device and living sound recording method |
| KR20210133496A (ko) * | 2020-04-29 | 2021-11-08 | 주식회사 더바인코퍼레이션 | Method and apparatus for monitoring the daily life of elderly people using an artificial neural network |
Family Cites Families (14)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
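| CN104737227B (zh) * | 2012-11-05 | 2017-11-10 | 松下电器(美国)知识产权公司 | Speech/audio encoding device, speech/audio decoding device, speech/audio encoding method, and speech/audio decoding method |
| TWI618051B (zh) * | 2013-02-14 | 2018-03-11 | 杜比實驗室特許公司 | Audio signal processing method and apparatus for audio signal enhancement using estimated spatial parameters |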
| US9747926B2 (en) * | 2015-10-16 | 2017-08-29 | Google Inc. | Hotword recognition |
| WO2017080835A1 (en) * | 2015-11-10 | 2017-05-18 | Dolby International Ab | Signal-dependent companding system and method to reduce quantization noise |
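| CN110276235B (zh) * | 2018-01-25 | 2023-06-16 | 意法半导体公司 | Context awareness of a smart device through sensing of transient and continuous events |
| KR102623998B1 (ko) * | 2018-07-17 | 2024-01-12 | 삼성전자주식회사 | Electronic device for speech recognition and control method thereof |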
| ES3021337T3 (en) * | 2019-02-21 | 2025-05-26 | Ericsson Telefon Ab L M | Spectral shape estimation from mdct coefficients |
| US12289583B2 (en) * | 2019-09-09 | 2025-04-29 | Nippon Telegraph And Telephone Corporation | Sound collection and emission apparatus, sound collection and emission method, and program |
| CN110838299B (zh) * | 2019-11-13 | 2022-03-25 | 腾讯音乐娱乐科技(深圳)有限公司 | Transient noise detection method, apparatus, and device |
| US11282527B2 (en) * | 2020-02-28 | 2022-03-22 | Synaptics Incorporated | Subaudible tones to validate audio signals |
| CA3115423A1 (en) * | 2020-05-01 | 2021-11-01 | Systemes De Controle Actif Soft Db Inc. | A system and a method for sound recognition |
| US11443760B2 (en) * | 2020-05-08 | 2022-09-13 | DTEN, Inc. | Active sound control |
| EP3944240A1 (en) * | 2020-07-20 | 2022-01-26 | Nederlandse Organisatie voor toegepast- natuurwetenschappelijk Onderzoek TNO | Method of determining a perceptual impact of reverberation on a perceived quality of a signal, as well as computer program product |
| CN111968662B (zh) * | 2020-08-10 | 2024-09-03 | 北京小米松果电子有限公司 | Audio signal processing method and apparatus, and storage medium |
2022
- 2022-07-19 CN CN202280047206.2A patent/CN117597734A/zh active Pending
- 2022-07-19 WO PCT/JP2022/028075 patent/WO2023008260A1/ja not_active Ceased
- 2022-07-19 JP JP2023538459A patent/JPWO2023008260A1/ja active Pending
2024
- 2024-01-24 US US18/421,511 patent/US20240161771A1/en active Pending
Non-Patent Citations (2)
| Title |
|---|
| NAKAO, TATSUYA; HIGASHIDE, TAICHI; YANOKURA, IORI; KAKIUCHI, YOHEI; OKADA, KEI; INABA, MASAYUKI: "1P1-D15 Life support behavior based on understanding the relationship between situation change and sound using look-around motion for unknown sound", PREPRINTS OF THE 2020 JSME CONFERENCE ON ROBOTICS AND MECHATRONICS, JAPAN SOCIETY OF MECHANICAL ENGINEERS, JP, 30 April 2020 (2020-04-30) - 30 May 2020 (2020-05-30), JP, pages 1 - 4, XP009542997, DOI: 10.1299/jsmermd.2020.1P1-D15 * |
| SARUDATE, ASHITA. ITOH, KENZO: "K-021 The Living Sound Identification System with the Mail Function of Cellular Phone", PROCEEDINGS OF THE 8TH FORUM ON INFORMATION TECHNOLOGY (FIT2009); TOHOKU, JAPAN; SEPTEMBER 2-4, 2009, vol. 18, no. 3, 31 July 2009 (2009-07-31) - 4 September 2009 (2009-09-04), pages 569 - 574, XP009542996 * |
Also Published As
| Publication number | Publication date |
|---|---|
| US20240161771A1 (en) | 2024-05-16 |
| JPWO2023008260A1 | 2023-02-02 |
| CN117597734A (zh) | 2024-02-23 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 22849323; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | WIPO information: entry into national phase | Ref document number: 202280047206.2; Country of ref document: CN |
| | WWE | WIPO information: entry into national phase | Ref document number: 2023538459; Country of ref document: JP |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | EP: PCT application non-entry in European phase | Ref document number: 22849323; Country of ref document: EP; Kind code of ref document: A1 |