EP3193317A1 - Activity classification from audio

Publication number
EP3193317A1
Authority
EP
European Patent Office
Prior art keywords
audio
activity
sensor
segments
sensors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP16305034.7A
Other languages
German (de)
French (fr)
Inventor
Brian ERIKSSON
Carole Le Goff
Martin May
Victoria COLEMAN
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Thomson Licensing SAS
Original Assignee
Thomson Licensing SAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Thomson Licensing SAS
Priority to EP16305034.7A
Publication of EP3193317A1

Classifications

    • G PHYSICS
    • G08 SIGNALLING
    • G08B SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
    • G08B21/00 Alarms responsive to a single specified undesired or abnormal condition and not otherwise provided for
    • G08B21/02 Alarms for ensuring the safety of persons
    • G08B21/04 Alarms for ensuring the safety of persons responsive to non-activity, e.g. of elderly persons
    • G08B21/0438 Sensor means for detecting
    • G08B21/0469 Presence detectors to detect unsafe condition, e.g. infrared sensor, microphone
    • G PHYSICS
    • G08 SIGNALLING
    • G08B SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
    • G08B21/00 Alarms responsive to a single specified undesired or abnormal condition and not otherwise provided for
    • G08B21/02 Alarms for ensuring the safety of persons
    • G08B21/04 Alarms for ensuring the safety of persons responsive to non-activity, e.g. of elderly persons
    • G08B21/0407 Alarms for ensuring the safety of persons responsive to non-activity, e.g. of elderly persons based on behaviour analysis
    • G08B21/0423 Alarms for ensuring the safety of persons responsive to non-activity, e.g. of elderly persons based on behaviour analysis detecting deviation from an expected pattern of behaviour or schedule
    • G PHYSICS
    • G08 SIGNALLING
    • G08B SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
    • G08B21/00 Alarms responsive to a single specified undesired or abnormal condition and not otherwise provided for
    • G08B21/18 Status alarms
    • G08B21/22 Status alarms responsive to presence or absence of persons

Definitions

  • the proposed method and apparatus relates to predicting activities engaged in by an individual based on audio sensor data.
  • the ability to detect user activity and context in an environment such as a home or residence has applications for handicapped or elderly care, home health monitoring, and quantified self (to name only a few).
  • the proposed method and apparatus could also be used to monitor individuals on house arrest such as the Blade Runner.
  • the proposed method and apparatus also has applications in other restricted environments such as prisons or schools.
  • accurate detection of the activity of individuals in any particular environment requires new sensors to be introduced that are powerful enough to distinguish between common activities. Activities could be common household tasks such as doing dishes, preparing a meal, etc. or could include social activities such as watching TV, listening to the radio, Skyping (Face Timing), entertaining guests, etc.
  • the present proposed method and apparatus uses a new framework that combines an array of audio sensors in the environment with machine learning to classify household activities and provide insights for care givers (workers) or people charged with monitoring the activities of one or more individuals.
  • Household activities will be the initial focus. Such activities include cooking (running the sink, cooking food, etc.), cleaning (running the vacuum cleaner, dishwasher operating, etc.), and social activities (detecting multiple people in the home, TV operating, etc.).
  • Use cases initially are focused on handicapped care, elder care, home health monitoring and the like. Both the types of activities and the use of the proposed method and apparatus go beyond these initial activities and uses to other restricted environments.
  • a method and apparatus for predicting an activity including acquiring segments of audio from audio sensors in an audio sensor array, extracting sensor metadata from the audio sensor array, de-noising the segments of audio to remove artifacts and noise, converting the de-noised segments of audio to a feature vector, determining the activity using the feature vector and the sensor metadata and a training database, the training database including samples of audio of known activities and metadata and calculating a confidence score.
  • the calculated confidence score is used to identify (predict) an activity engaged in by an individual being monitored, the activity being associated with said segments of audio acquired from said audio sensors.
  • the identified activity can be used to alert care givers or others monitoring individuals to abnormal behavior.
  • explicit use of the term "processor" or "controller" should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, read only memory (ROM) for storing software, random access memory (RAM), and nonvolatile storage.
  • any switches shown in the figures are conceptual only. Their function may be carried out through the operation of program logic, through dedicated logic, through the interaction of program control and dedicated logic, or even manually, the particular technique being selectable by the implementer as more specifically understood from the context.
  • any element expressed as a means for performing a specified function is intended to encompass any way of performing that function including, for example, a) a combination of circuit elements that performs that function or b) software in any form, including, therefore, firmware, microcode or the like, combined with appropriate circuitry for executing that software to perform the function.
  • the disclosure as defined by such claims resides in the fact that the functionalities provided by the various recited means are combined and brought together in the manner which the claims call for. It is thus regarded that any means that can provide those functionalities are equivalent to those shown herein.
  • an array of sensors 105 including microphones, vibration detectors/sensors and the like provides raw input to the proposed method and apparatus.
  • the physical component of the sensor array of the proposed method and apparatus is a series of audio sensors. These audio sensors can be microphones, vibration sensors, or any other sensor that could sense audio from the environment. To acquire segments of audio that can be processed, the audio signal will be time-windowed. Initially, the size of this window will be, for example, 10 seconds. This value may change depending upon the deployment environment. The output of this portion of the system is a series of 10-second audio clips recorded from the audio sensor.
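  • The time-windowing step described above can be sketched as follows. This is a minimal illustration, not the applicants' implementation: the function name `window_audio`, the use of non-overlapping windows, and the choice to drop a trailing partial window are all assumptions, since the text only gives the example window length of 10 seconds.

```python
from typing import List, Sequence


def window_audio(samples: Sequence[float], sample_rate_hz: int,
                 window_seconds: float = 10.0) -> List[List[float]]:
    """Split a stream of samples into consecutive, non-overlapping windows.

    Assumption: windows do not overlap and an incomplete final window is
    discarded; the source text does not specify either behaviour.
    """
    window_len = int(sample_rate_hz * window_seconds)
    if window_len <= 0:
        raise ValueError("window must contain at least one sample")
    return [list(samples[start:start + window_len])
            for start in range(0, len(samples) - window_len + 1, window_len)]


# 25 seconds of (silent) audio at a toy rate of 1 sample per second
clips = window_audio([0.0] * 25, sample_rate_hz=1, window_seconds=10.0)
```

  • In this toy call the 25 samples yield two complete 10-sample clips and the last 5 samples are left for the next acquisition cycle.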
  • the raw time-windowed audio sensor data is provided to the audio pre-processing module 110 which performs de-noising of the audio signals and feature extraction.
  • the sensor metadata 115 portion of the present proposed method and apparatus extracts (acquires, retrieves) information (data) from the array of audio sensors.
  • the metadata (extracted information) is related to the placement of each of the audio sensors and is stored in memory. Therefore, for each acquired audio signal, the location in the environment of each audio sensor, the sensitivity of the sensor, what type of sensor (e.g., microphone vs. vibration), and the height of the audio sensor placement in the environment are all known.
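  • One way to picture the stored metadata is as a small record per sensor. The sketch below is illustrative only: the class and field names, the sensor identifier, and the units (metres for height, an unspecified scale for sensitivity) are assumptions that do not appear in the text.

```python
from dataclasses import dataclass
from enum import Enum


class SensorType(Enum):
    MICROPHONE = "microphone"
    VIBRATION = "vibration"


@dataclass(frozen=True)
class SensorMetadata:
    sensor_id: str           # hypothetical identifier, not named in the text
    location: str            # location in the environment, e.g. "kitchen"
    sensitivity: float       # sensor sensitivity; units are an assumption
    sensor_type: SensorType  # microphone vs. vibration
    height_m: float          # placement height; metres are an assumption


kitchen_mic = SensorMetadata("mic-01", "kitchen", -38.0,
                             SensorType.MICROPHONE, 1.5)
```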
  • the sensor metadata is provided to the audio-to-activity learning engine (module) 120, which identifies the activity associated with a particular audio signal with a prescribed confidence level.
  • Training data is stored in a database 125.
  • the training data includes audio features of known activities.
  • the training data is necessary to learn models of audio to household activity and is a set of training samples of audio with known activities and metadata.
  • the training data is acquired (through internal tests or purchased through a third party) and includes a set of audio signals with associated activities and details on the sensor setup (the metadata of type of sensor, location, etc.).
  • the training data is provided to the audio-to-activity learning engine module.
  • the audio pre-processing module converts the raw windowed audio signal to de-noised relevant features for detection of activities.
  • the audio signal must be de-noised to remove artifacts and measured noise.
  • the proposed method and apparatus performs wavelet de-noising using a hard threshold.
  • a discrete Haar wavelet transform is performed on the raw time-windowed audio signal, transforming the audio amplitude signal to a series of coefficients measuring the energy of a Haar wavelet basis. Then, all coefficients with amplitude below a threshold are set to zero (removing small coefficient values related to noise).
  • the inverse wavelet transform is performed to convert the de-noised signal back to the time-series audio domain.
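  • The three de-noising steps above (Haar transform, hard threshold, inverse transform) can be sketched in plain Python. Several details are assumptions made for the sake of a runnable example: the orthonormal scaling, the full-depth decomposition, the requirement that the window length be a power of two, and the threshold value, none of which the text specifies.

```python
import math
from typing import List


def haar_forward(signal: List[float]) -> List[float]:
    """Full multi-level orthonormal Haar transform (power-of-two length)."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    coeffs = list(signal)
    s = math.sqrt(2.0)
    length = n
    while length > 1:
        half = length // 2
        approx = [(coeffs[2 * i] + coeffs[2 * i + 1]) / s for i in range(half)]
        detail = [(coeffs[2 * i] - coeffs[2 * i + 1]) / s for i in range(half)]
        coeffs[:length] = approx + detail
        length = half
    return coeffs


def haar_inverse(coeffs: List[float]) -> List[float]:
    """Invert haar_forward, rebuilding the time-domain signal."""
    out = list(coeffs)
    s = math.sqrt(2.0)
    length = 1
    while length < len(out):
        merged = []
        for a, d in zip(out[:length], out[length:2 * length]):
            merged.append((a + d) / s)
            merged.append((a - d) / s)
        out[:2 * length] = merged
        length *= 2
    return out


def denoise_hard(signal: List[float], threshold: float) -> List[float]:
    """Zero every coefficient whose amplitude is below the threshold."""
    coeffs = haar_forward(signal)
    kept = [c if abs(c) >= threshold else 0.0 for c in coeffs]
    return haar_inverse(kept)


# A constant level of 5.0 with small perturbations is flattened back to 5.0.
cleaned = denoise_hard([5.0, 5.1, 5.0, 4.9], threshold=0.5)
```

  • With a threshold of zero the function returns the input unchanged (up to rounding), which is a quick check that the forward and inverse transforms match.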
  • the de-noised audio signal is converted to a feature vector that can be used as the input of a machine learning algorithm in the audio-to-activity learning module.
  • the proposed method and apparatus uses a Fourier Transform to convert the de-noised audio signal to a series of frequency domain coefficients. These coefficients are binned into segments of 10 Hz ranging from 0 Hz to 2000 Hz (where the magnitudes of the coefficients between 0 Hz and 10 Hz are added to the first bin, those between 10 Hz and 20 Hz to the second bin, etc.).
  • the output of this module is a feature vector of size 200 containing the aggregated energy for segments of the frequency spectrum of the de-noised audio signal.
  • Depending on performance, more sophisticated feature extraction (e.g., DFCC extraction, deep learning) may be performed in this step.
  • the feature vector is provided to the audio-to-activity learning engine module.
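  • The feature extraction described above can be sketched as a Fourier transform followed by summing coefficient magnitudes into 10 Hz bins. This is only an illustration: the recursive radix-2 FFT (which needs a power-of-two length), the function names, and the treatment of a coefficient that falls exactly on a bin boundary (assigned to the higher bin) are assumptions.

```python
import cmath
import math
from typing import List, Sequence


def fft(x: Sequence[complex]) -> List[complex]:
    """Recursive radix-2 Cooley-Tukey FFT (length must be a power of two)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def spectral_features(samples: Sequence[float], sample_rate_hz: float,
                      bin_width_hz: float = 10.0,
                      max_freq_hz: float = 2000.0) -> List[float]:
    """Sum FFT magnitudes into fixed-width bins from 0 Hz up to max_freq_hz."""
    n = len(samples)
    spectrum = fft(samples)
    features = [0.0] * int(max_freq_hz / bin_width_hz)
    for k in range(n // 2 + 1):          # non-negative frequencies only
        freq = k * sample_rate_hz / n
        if freq >= max_freq_hz:
            break
        features[int(freq // bin_width_hz)] += abs(spectrum[k])
    return features


# One second of a 445 Hz tone sampled at 4096 Hz (toy values)
rate = 4096
tone = [math.sin(2 * math.pi * 445 * i / rate) for i in range(rate)]
vector = spectral_features(tone, rate)
```

  • The result has 200 entries, and for this tone the largest entry is the bin covering 440 Hz to 450 Hz (index 44).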
  • the audio-to-activity learning engine module classifies what activity is currently being performed.
  • a machine learning classifier is built for each desired activity and for each type of sensor and placement.
  • An example of this is a single learning classifier which detects from microphone-based audio associated with the kitchen if the water in the sink is running.
  • Another sensor modality (such as vibration sensors) would require a different learning model, as would a different activity to classify (such as water boiling or the garbage disposal running).
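  • Because a separate classifier is built per activity, sensor type and placement, the set of models can be pictured as a lookup table keyed by that triple. The sketch below is hypothetical: the names `register` and `lookup`, the string keys, and the placeholder classifier are all invented for illustration.

```python
from typing import Callable, Dict, Sequence, Tuple

ModelKey = Tuple[str, str, str]            # (activity, sensor type, placement)
Classifier = Callable[[Sequence[float]], bool]

_registry: Dict[ModelKey, Classifier] = {}


def register(activity: str, sensor_type: str, placement: str,
             model: Classifier) -> None:
    """Store the model trained for one activity/sensor/placement combination."""
    _registry[(activity, sensor_type, placement)] = model


def lookup(activity: str, sensor_type: str, placement: str) -> Classifier:
    """Return the matching model; raises KeyError if none was trained."""
    return _registry[(activity, sensor_type, placement)]


# Placeholder standing in for a trained model (not a real classifier).
register("water running", "microphone", "kitchen",
         lambda features: sum(features) > 1.0)

detected = lookup("water running", "microphone", "kitchen")([0.6, 0.7])
```

  • Asking for ("water running", "vibration", "kitchen") fails in this sketch, reflecting the point that a vibration sensor would need its own model.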
  • the proposed method and apparatus uses classification trees.
  • the audio-to-activity learning engine module uses bootstrap classification (learning multiple classification models by resampling the training set) to calculate a confidence level of the activity classification.
  • the output of the audio-to-activity learning engine module is the audio signal, the classification of the activity and the confidence level of the activity classification.
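  • Bootstrap classification can be sketched as follows: resample the training set with replacement, fit one model per resample, and report the share of models that agree with the majority label. Two simplifications are assumed here and are not taken from the text: each model is a depth-one classification tree (a decision stump) rather than a full tree, and the confidence level is defined as the vote fraction.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]       # (feature vector, label 0 or 1)
Stump = Tuple[int, float, int]             # (feature index, threshold, label if above)


def train_stump(data: List[Sample]) -> Stump:
    """Fit a depth-one tree by exhaustive search over features and thresholds."""
    best, best_errors = (0, 0.0, 1), len(data) + 1
    for j in range(len(data[0][0])):
        for x, _ in data:
            for above in (0, 1):
                errors = sum(1 for xi, yi in data
                             if (above if xi[j] > x[j] else 1 - above) != yi)
                if errors < best_errors:
                    best, best_errors = (j, x[j], above), errors
    return best


def predict_stump(stump: Stump, x: Sequence[float]) -> int:
    j, threshold, above = stump
    return above if x[j] > threshold else 1 - above


def bootstrap_classify(train: List[Sample], x: Sequence[float],
                       n_models: int = 25, seed: int = 0) -> Tuple[int, float]:
    """Return (majority label, fraction of bootstrap models voting for it)."""
    rng = random.Random(seed)
    votes: Counter = Counter()
    for _ in range(n_models):
        resample = [rng.choice(train) for _ in train]   # sample with replacement
        votes[predict_stump(train_stump(resample), x)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n_models


# Toy one-dimensional training set: low energy = 0 (quiet), high energy = 1
training = [([0.1], 0), ([0.2], 0), ([0.3], 0),
            ([0.9], 1), ([1.0], 1), ([1.1], 1)]
label, confidence = bootstrap_classify(training, [1.05])
```

  • A confidence close to 1.0 means nearly every resampled model agrees; a value near 0.5 would suggest the classification is unstable under resampling.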
  • Fig. 2 is a flowchart of an exemplary embodiment of the proposed method in accordance with the principles described herein.
  • the raw time-windowed sensor data is acquired from the sensor array.
  • the sensor array includes microphones, vibration detectors/sensors or any other sensor that could sense audio from the environment.
  • the audio signal is time-windowed. Initially, the size of this window will be, for example, 10 seconds. This value may change depending upon the deployment environment.
  • the output of this portion of the system is a series of 10-second audio clips recorded from the audio sensor.
  • the raw time-windowed audio signals are de-noised, which removes artifacts and measured noise. This is accomplished by performing wavelet de-noising using a hard threshold.
  • a discrete Haar wavelet transform is performed on the raw time-windowed audio signal, transforming the audio amplitude signal to a series of coefficients measuring the energy of a Haar wavelet basis. Then, all coefficients with amplitude below a threshold are set to zero (removing small coefficient values related to noise). Finally, the inverse wavelet transform is applied to convert the de-noised signal back to the time-series audio domain.
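  • The hard-threshold rule can be written compactly. The symbols are introduced here for illustration and do not appear in the text: $x$ is the raw windowed signal, $W$ the discrete Haar transform, $w_k$ its coefficients, and $\tau$ the threshold.

```latex
w = W x, \qquad
\hat{w}_k =
\begin{cases}
  w_k, & |w_k| \ge \tau \\
  0,   & |w_k| < \tau
\end{cases}
\qquad
\hat{x} = W^{-1} \hat{w}
```

  • Here $\hat{x}$ is the de-noised signal returned to the time-series audio domain.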
  • the de-noised audio signal is converted to a feature vector.
  • the proposed method and apparatus uses a Fourier Transform to convert the de-noised audio signal to a series of frequency domain coefficients. These coefficients are binned into segments of 10 Hz ranging from 0 Hz to 2000 Hz (where the magnitudes of the coefficients between 0 Hz and 10 Hz are added to the first bin, those between 10 Hz and 20 Hz to the second bin, etc.).
  • the output of this module is a feature vector of size 200 containing the aggregated energy for segments of the frequency spectrum of the de-noised audio signal.
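  • As a formula, the binned feature vector can be expressed as below. The notation is an assumption for illustration: $X_k$ is the $k$-th Fourier coefficient of an $N$-sample window recorded at sampling rate $f_s$, and $f_k = k f_s / N$ is its frequency. The half-open intervals are also an assumption, since the text does not say which bin a coefficient lying exactly on a boundary belongs to.

```latex
v_j = \sum_{k \,:\, 10(j-1) \,\le\, f_k \,<\, 10 j} |X_k| ,
\qquad j = 1, \dots, 200
```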
  • Depending on performance, more sophisticated feature extraction (e.g., DFCC extraction, deep learning) may be performed in this step.
  • the sensor metadata is extracted (acquired, retrieved) from the sensor array.
  • the metadata is related to the placement of each of the audio sensors and is stored in memory. Therefore, for each acquired audio signal, the location in the environment of each audio sensor, the sensitivity of the sensor, what type of sensor (e.g., microphone vs. vibration), and the height of the audio sensor placement in the environment are all known.
  • the classification model is retrieved (acquired).
  • the classification model is learned. For example, the coefficients for a Support Vector Machine Classifier are determined using the training data to classify a particular activity.
  • the audio signal is classified (determined) as to the activity being performed.
  • a machine learning classifier is built for each desired activity and for each type of sensor and placement.
  • Training data is stored in a database.
  • the training data includes audio features of known activities.
  • the training data is necessary to learn models of audio-to-activity (e.g., household activity) and is a set of training samples of audio with known activities and metadata.
  • the training data is acquired (through internal tests or purchased through a third party) and includes a set of audio signals with associated activities and details on the sensor setup (the metadata of type of sensor, location, etc.).
  • An example of this is a single learning classifier which detects from microphone-based audio associated with the kitchen if the water in the sink is running.
  • Another sensor modality (such as vibration sensors) would require a different learning model.
  • Classification trees are used to classify the audio signal.
  • Other algorithms could be used, such as boosted decision trees or kernel-based Support Vector Machines (SVMs).
  • Bootstrap classification (learning multiple classification models by resampling the training set) is used to calculate a confidence level of the activity classification.
  • the output of the audio-to-activity learning engine module is the audio signal, the classification (determination) of the activity and the confidence level of the activity classification.
  • Fig. 3 is a block diagram of an exemplary embodiment of the proposed system in accordance with the principles described herein.
  • a sensor array (305) includes microphones, vibration detectors/sensors or any other sensor that could sense audio from the environment.
  • the audio signal will be time-windowed. Initially, the size of this window will be, for example, 10 seconds. This value may change depending upon the deployment environment.
  • the output of this portion of the system is a series of 10-second audio clips recorded from the audio sensor.
  • the raw time-windowed audio sensor data is provided to the audio pre-processing module which performs de-noising of the audio signals and feature extraction.
  • the sensor metadata is extracted (acquired, retrieved) from the array of audio sensors.
  • the metadata (extracted information) is related to the placement of each of the audio sensors and is stored in memory. Therefore, for each acquired audio signal, the location in the environment of each audio sensor, the sensitivity of the sensor, what type of sensor (e.g., microphone vs. vibration), and the height of the audio sensor placement in the environment are all known.
  • the sensor metadata is provided to the audio-to-activity learning engine module, which identifies the activity associated with a particular audio signal with a prescribed confidence level.
  • Training data is stored in a database (345).
  • the training data includes audio features of known activities.
  • the training data is necessary to learn models of audio to household activity and is a set of training samples of audio with known activities and metadata.
  • the training data is acquired (through internal tests or purchased through a third party) and includes a set of audio signals with associated activities and details on the sensor setup (the metadata of type of sensor, location, etc.).
  • the training data is provided to the audio-to-activity learning engine module.
  • the audio pre-processing module (315) converts the raw windowed audio signal to de-noised relevant features for detection of activities.
  • the audio pre-processing module has two modules.
  • the first module is the de-noising module (320), which de-noises the raw time-windowed audio signals. De-noising removes artifacts and measured noise.
  • the de-noising module performs wavelet de-noising using a hard threshold.
  • a discrete Haar wavelet transform is performed on the raw time-windowed audio signal, transforming the audio amplitude signal to a series of coefficients measuring the energy of a Haar wavelet basis. Then, all coefficients with amplitude below a threshold are set to zero (removing small coefficient values related to noise).
  • the inverse wavelet transform is performed to convert the de-noised signal back to the time-series audio domain.
  • the second module of the audio pre-processing module is the feature extraction module (325).
  • the feature extraction module converts the de-noised audio signal to a feature vector that can be used as the input of a machine learning algorithm in the audio-to-activity learning module.
  • the feature extraction module uses a Fourier Transform to convert the de-noised audio signal to a series of frequency domain coefficients. These coefficients are binned into segments of 10 Hz ranging from 0 Hz to 2000 Hz (where the magnitudes of the coefficients between 0 Hz and 10 Hz are added to the first bin, those between 10 Hz and 20 Hz to the second bin, etc.).
  • the output of this module is a feature vector of size 200 containing the aggregated energy for segments of the frequency spectrum of the de-noised audio signal. Depending on performance, more sophisticated feature extraction (e.g., DFCC extraction, deep learning) may be performed in this step.
  • the feature vector is also provided to the audio-to-activity learning engine module.
  • the audio-to-activity learning engine (330) has two modules. The first is the classify activity module (335). The classify activity module retrieves a classification model. The classify activity module also extracts sensor metadata from the sensor array. Using the classification model, the feature vector of the audio signal and the sensor metadata, the classify activity module of the audio-to-activity learning engine module classifies what activity is currently being performed.
  • the classify activity module is a machine learning algorithm. Using the training set of previously acquired and processed audio features (the length-200 vector of aggregated frequency energy) of known activities, a machine learning classifier is built for each desired activity and for each type of sensor and placement. An example of this is a single learning classifier which detects from microphone-based audio associated with the kitchen if the water in the sink is running.
  • the classify activity module uses classification trees. Other algorithms could be used, such as boosted decision trees or kernel-based SVMs.
  • the second module of the audio-to-activity learning engine is the calculate confidence level (score) module (340).
  • the calculate confidence level (score) module uses bootstrap classification (learning multiple classification models by resampling the training set), to calculate a confidence level (score) of the activity classification.
  • the output of the audio-to-activity learning engine module is the audio signal, the classification of the activity and the confidence level of the activity classification.
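  • The flow through the blocks of Fig. 3 can be sketched as a composition of three stages that ends in the three outputs listed above. Everything named here is hypothetical: `ActivityPrediction`, `run_pipeline`, the dictionary used for metadata, and the trivial stand-in stages. Returning the raw rather than the de-noised audio is also an assumption, as the text only says "the audio signal".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass
class ActivityPrediction:
    audio: List[float]     # the audio window that was classified
    activity: str          # classification of the activity
    confidence: float      # confidence level of the classification


def run_pipeline(
    window: Sequence[float],
    metadata: Dict[str, str],
    denoise: Callable[[Sequence[float]], Sequence[float]],
    extract_features: Callable[[Sequence[float]], Sequence[float]],
    classify: Callable[[Sequence[float], Dict[str, str]], Tuple[str, float]],
) -> ActivityPrediction:
    """De-noise, extract features, then classify using features and metadata."""
    clean = denoise(window)
    features = extract_features(clean)
    activity, confidence = classify(features, metadata)
    return ActivityPrediction(list(window), activity, confidence)


# Trivial stand-ins, present only to show the call shape.
result = run_pipeline(
    window=[0.0, 1.0, 0.0, -1.0],
    metadata={"location": "kitchen", "sensor_type": "microphone"},
    denoise=lambda w: list(w),
    extract_features=lambda w: [sum(abs(v) for v in w)],
    classify=lambda f, m: ("water running", 0.8) if f[0] > 1.0
    else ("no activity", 0.8),
)
```

  • Passing the stages in as callables keeps the sketch independent of any particular de-noising, feature extraction or classification choice.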
  • the proposed method and apparatus may be implemented in various forms of hardware, software, firmware, special purpose processors, or a combination thereof.
  • Special purpose processors may include application specific integrated circuits (ASICs), reduced instruction set computers (RISCs) and/or field programmable gate arrays (FPGAs).
  • the proposed method and apparatus is implemented as a combination of hardware and software.
  • the software is preferably implemented as an application program tangibly embodied on a program storage device.
  • the application program may be uploaded to, and executed by, a machine comprising any suitable architecture.
  • the machine is implemented on a computer platform having hardware such as one or more central processing units (CPU), a random access memory (RAM), and input/output (I/O) interface(s).
  • the computer platform also includes an operating system and microinstruction code.
  • the various processes and functions described herein may either be part of the microinstruction code or part of the application program (or a combination thereof), which is executed via the operating system.
  • various other peripheral devices may be connected to the computer platform such as an additional data storage device and a printing device.
  • the elements shown in the figures may be implemented in various forms of hardware, software or combinations thereof. Preferably, these elements are implemented in a combination of hardware and software on one or more appropriately programmed general-purpose devices, which may include a processor, memory and input/output interfaces.
  • the phrase "coupled" is defined to mean directly connected to or indirectly connected with through one or more intermediate components. Such intermediate components may include both hardware and software based components.

Abstract

A method and apparatus for predicting household activity are described including acquiring segments of audio from audio sensors in an audio sensor array, extracting sensor metadata from the audio sensor array, de-noising the segments of audio to remove artifacts and noise, converting the de-noised segments of audio to a feature vector, determining activity using the feature vector and the sensor metadata and a training database and calculating a confidence score.

Description

    FIELD
  • The proposed method and apparatus relates to predicting activities engaged in by an individual based on audio sensor data.
  • BACKGROUND
  • This section is intended to introduce the reader to various aspects of art, which may be related to the present embodiments that are described below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present disclosure. Accordingly, it should be understood that these statements are to be read in this light.
  • Other methods have been employed to detect and/or predict activities such as using more intrusive handheld sensors, using position data from motion sensors, signal isolation to determine speech attributes, and intrusion detection (such as a security system). In contrast to such detection/prediction methods, the present proposed method and apparatus focuses on audio signal detection and identification of an activity associated with the detected audio signal.
  • SUMMARY
  • The ability to detect user activity and context in an environment such as a home or residence has applications for handicapped or elderly care, home health monitoring, and quantified self (to name only a few). In the home or residential environment the proposed method and apparatus could also be used to monitor individuals on house arrest such as the Blade Runner. The proposed method and apparatus also has applications in other restricted environments such as prisons or schools. Unfortunately, accurate detection of the activity of individuals in any particular environment requires new sensors to be introduced that are powerful enough to distinguish between common activities. Activities could be common household tasks such as doing dishes, preparing a meal, etc. or could include social activities such as watching TV, listening to the radio, Skyping (Face Timing), entertaining guests, etc.
  • The present proposed method and apparatus uses a new framework that combines an array of audio sensors in the environment with machine learning to classify household activities and provide insights for care givers (workers) or people charged with monitoring the activities of one or more individuals. Household activities will be the initial focus. Such activities include cooking (running the sink, cooking food, etc.), cleaning (running the vacuum cleaner, dishwasher operating, etc.), and social activities (detecting multiple people in the home, TV operating, etc.). Use cases initially are focused on handicapped care, elder care, home health monitoring and the like. Both the types of activities and the use of the proposed method and apparatus go beyond these initial activities and uses to other restricted environments.
  • While the present proposed method and apparatus are described in terms of household activities, it should be noted that the proposed method and apparatus are not so limited and include other restricted environments.
  • A method and apparatus for predicting an activity are described including acquiring segments of audio from audio sensors in an audio sensor array, extracting sensor metadata from the audio sensor array, de-noising the segments of audio to remove artifacts and noise, converting the de-noised segments of audio to a feature vector, determining the activity using the feature vector and the sensor metadata and a training database, the training database including samples of audio of known activities and metadata and calculating a confidence score. The calculated confidence score is used to identify (predict) an activity engaged in by an individual being monitored, the activity being associated with said segments of audio acquired from said audio sensors. The identified activity can be used to alert care givers or others monitoring individuals to abnormal behavior.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The proposed method and apparatus is best understood from the following detailed description when read in conjunction with the accompanying drawings. The drawings include the following figures briefly described below:
    • Fig. 1 is a schematic diagram of an exemplary system in accordance with the principles of the proposed method and apparatus.
    • Fig. 2 is a flowchart of an exemplary embodiment of the proposed method in accordance with the principles described herein.
    • Fig. 3 is a block diagram of an exemplary embodiment of the proposed system in accordance with the principles described herein.
  • It should be understood that the drawing(s) are for purposes of illustrating the concepts of the disclosure and are not necessarily the only possible configuration for illustrating the disclosure.
  • DETAILED DESCRIPTION
  • The present description illustrates the principles of the present disclosure. It will thus be appreciated that those skilled in the art will be able to devise various arrangements that, although not explicitly described or shown herein, embody the principles of the disclosure and are included within its scope.
  • All examples and conditional language recited herein are intended for educational purposes to aid the reader in understanding the principles of the disclosure and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions.
  • Moreover, all statements herein reciting principles, aspects, and embodiments of the disclosure, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.
  • Thus, for example, it will be appreciated by those skilled in the art that the block diagrams presented herein represent conceptual views of illustrative circuitry embodying the principles of the disclosure. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudocode, and the like represent various processes which may be substantially represented in computer readable media and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
  • The functions of the various elements shown in the figures may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term "processor" or "controller" should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, read only memory (ROM) for storing software, random access memory (RAM), and nonvolatile storage.
  • Other hardware, conventional and/or custom, may also be included. Similarly, any switches shown in the figures are conceptual only. Their function may be carried out through the operation of program logic, through dedicated logic, through the interaction of program control and dedicated logic, or even manually, the particular technique being selectable by the implementer as more specifically understood from the context.
  • In the claims hereof, any element expressed as a means for performing a specified function is intended to encompass any way of performing that function including, for example, a) a combination of circuit elements that performs that function or b) software in any form, including, therefore, firmware, microcode or the like, combined with appropriate circuitry for executing that software to perform the function. The disclosure as defined by such claims resides in the fact that the functionalities provided by the various recited means are combined and brought together in the manner which the claims call for. It is thus regarded that any means that can provide those functionalities are equivalent to those shown herein.
  • Referring to Fig. 1, which is a schematic diagram of an exemplary system in accordance with the principles of the proposed method and apparatus, an array of sensors 105 including microphones, vibration detectors/sensors and the like provides raw input to the proposed method and apparatus. The physical component of the sensor array of the proposed method and apparatus is a series of audio sensors. These audio sensors can be microphones, vibration sensors, or any other sensor that could sense audio from the environment. To acquire segments of audio that can be processed, the audio signal is time-windowed. Initially, the size of this window is, for example, 10 seconds. This value may change depending upon the deployment environment. The output of this portion of the system is a series of 10-second recorded audio clips from the audio sensor. The raw time-windowed audio sensor data is provided to the audio pre-processing module 110, which performs de-noising of the audio signals and feature extraction.
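  • The time-windowing described above can be sketched as follows. This is a minimal illustration, not the implementation of the disclosure: the function name window_signal, its parameters, the use of non-overlapping windows, and the choice to discard a trailing partial window are all assumptions.

```python
from typing import List, Sequence


def window_signal(samples: Sequence[float], sample_rate_hz: int,
                  window_seconds: float = 10.0) -> List[List[float]]:
    """Split a sampled signal into consecutive, non-overlapping windows."""
    if sample_rate_hz <= 0 or window_seconds <= 0:
        raise ValueError("sample rate and window length must be positive")
    window_len = int(round(sample_rate_hz * window_seconds))
    if window_len == 0:
        raise ValueError("window is shorter than one sample")
    # Assumption: a trailing partial window is discarded rather than padded.
    n_windows = len(samples) // window_len
    return [list(samples[i * window_len:(i + 1) * window_len])
            for i in range(n_windows)]
```

  • For example, 25 samples taken at 1 Hz with the default 10-second window yield two clips of 10 samples each, and the last 5 samples are dropped.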
  • The sensor metadata 115 portion of the present proposed method and apparatus extracts (acquires, retrieves) information (data) from the array of audio sensors. The metadata (extracted information) is related to the placement of each of the audio sensors and is stored in memory. Therefore, for each acquired audio signal, the location in the environment of each audio sensor, the sensitivity of the sensor, what type of sensor (e.g., microphone vs. vibration), and the height of the audio sensor placement in the environment are all known. The sensor metadata is provided to the audio-to-activity learning engine (module) 120, which identifies the activity associated with a particular audio signal with a prescribed confidence level.
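  • One possible in-memory representation of the sensor metadata is sketched below. The field names, the sensor_id key, the units, and the MetadataStore class are hypothetical; the disclosure only states that location, sensitivity, sensor type, and placement height are known for each sensor and stored in memory.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class SensorType(Enum):
    MICROPHONE = "microphone"
    VIBRATION = "vibration"


@dataclass(frozen=True)
class SensorMetadata:
    sensor_id: str            # hypothetical identifier, not named in the source
    location: str             # e.g. "kitchen"
    sensor_type: SensorType
    sensitivity: float        # units are an assumption (e.g. dBV/Pa)
    height_m: float           # placement height; metres are an assumption


class MetadataStore:
    """Keeps the metadata of every sensor in the array, keyed by sensor id."""

    def __init__(self) -> None:
        self._by_id: Dict[str, SensorMetadata] = {}

    def register(self, meta: SensorMetadata) -> None:
        self._by_id[meta.sensor_id] = meta

    def lookup(self, sensor_id: str) -> SensorMetadata:
        return self._by_id[sensor_id]
```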
  • Training data is stored in a database 125. The training data includes audio features of known activities. The training data is necessary to learn models that map audio to household activity, and is a set of training samples of audio with known activities and metadata. The training data is acquired (through internal tests or purchased from a third party) and includes a set of audio signals with associated activities and details on the sensor setup (the metadata of type of sensor, location, etc.). The training data is provided to the audio-to-activity learning engine module.
  • The audio pre-processing module converts the raw windowed audio signal to de-noised relevant features for detection of activities. First, the audio signal must be de-noised to remove artifacts and measured noise. The proposed method and apparatus performs wavelet de-noising using a hard threshold. A discrete Haar wavelet transform is performed on the raw time-windowed audio signal, transforming the audio amplitude signal to a series of coefficients measuring the energy of a Haar wavelet basis. Then, all coefficients with amplitude below a threshold are set to zero (removing small coefficient values related to noise). Finally, the inverse wavelet transform is performed to convert the de-noised signal back to the time-series audio domain.
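  • The three de-noising steps (forward Haar transform, hard thresholding, inverse transform) can be sketched in plain Python as follows. The function names, the orthonormal scaling, the full-depth decomposition, and the power-of-two length requirement are assumptions made to keep the example short; the disclosure does not specify them, nor how the threshold is chosen.

```python
import math
from typing import List, Sequence

_SQRT2 = math.sqrt(2.0)


def haar_forward(signal: Sequence[float]) -> List[float]:
    """Full-depth orthonormal Haar DWT; the length must be a power of two."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    coeffs = list(signal)
    length = n
    while length > 1:
        half = length // 2
        approx = [(coeffs[2 * i] + coeffs[2 * i + 1]) / _SQRT2 for i in range(half)]
        detail = [(coeffs[2 * i] - coeffs[2 * i + 1]) / _SQRT2 for i in range(half)]
        coeffs[:length] = approx + detail   # approximations first, details after
        length = half
    return coeffs


def haar_inverse(coeffs: Sequence[float]) -> List[float]:
    """Invert haar_forward, rebuilding the signal level by level."""
    out = list(coeffs)
    length = 1
    while length < len(out):
        merged: List[float] = []
        for a, d in zip(out[:length], out[length:2 * length]):
            merged.append((a + d) / _SQRT2)
            merged.append((a - d) / _SQRT2)
        out[:2 * length] = merged
        length *= 2
    return out


def denoise_hard(signal: Sequence[float], threshold: float) -> List[float]:
    """Zero every coefficient whose magnitude is below the threshold, then invert."""
    kept = [c if abs(c) >= threshold else 0.0 for c in haar_forward(signal)]
    return haar_inverse(kept)
```

  • With a threshold of zero the signal is reconstructed unchanged; a larger threshold removes the small detail coefficients that the text attributes to noise.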
  • Next, the de-noised audio signal is converted to a feature vector that can be used as the input of a machine learning algorithm in the audio-to-activity learning module. The proposed method and apparatus uses a Fourier Transform to convert the de-noised audio signal to a series of frequency domain coefficients. These coefficients are binned into segments of 10 Hz ranging from 0 Hz to 2000 Hz (where the magnitudes of the coefficients between 0 and 10 Hz are added to the first bin, those of the coefficients from 11 Hz to 20 Hz to the second bin, etc.). The output of this module is a feature vector of size 200 containing the aggregated energy for segments of the frequency spectrum of the de-noised audio signal. Depending on performance, more sophisticated feature extraction (e.g., MFCC extraction, deep learning) may be performed in this step. The feature vector is provided to the audio-to-activity learning engine module.
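  • The binning of frequency coefficients into a length-200 feature vector can be illustrated as below. This sketch uses a naive discrete Fourier transform instead of a fast Fourier transform so that it needs no external library; a practical system would use an FFT routine. The function name, the parameters, and the half-open bin edges [10k, 10(k+1)) Hz are assumptions, since the text leaves the exact bin boundaries open.

```python
import cmath
import math
from typing import List, Sequence


def frequency_features(signal: Sequence[float], sample_rate_hz: float,
                       bin_width_hz: float = 10.0,
                       max_freq_hz: float = 2000.0) -> List[float]:
    """Sum DFT coefficient magnitudes into fixed-width frequency bins."""
    n = len(signal)
    features = [0.0] * int(max_freq_hz / bin_width_hz)
    if n == 0:
        return features
    resolution_hz = sample_rate_hz / n          # spacing between DFT coefficients
    for k in range(n // 2 + 1):                 # non-negative frequencies only
        freq = k * resolution_hz
        if freq >= max_freq_hz:
            break
        # Naive O(n) evaluation of the k-th DFT coefficient.
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        features[int(freq // bin_width_hz)] += abs(coeff)
    return features
```

  • Note that the sample rate must be at least 4000 Hz for the bins up to 2000 Hz to be populated, and the window must be long enough that the coefficient spacing does not exceed the 10 Hz bin width.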
  • Using the feature vector of the audio signal and the sensor metadata, the audio-to-activity learning engine module classifies what activity is currently being performed. Using the training set of previously acquired/processed audio features (the length-200 vector of aggregated frequency energy) of known activities, a machine learning classifier is built for each desired activity and for each type of sensor and placement. An example of this is a single learning classifier which detects, from microphone-based audio associated with the kitchen, whether the water in the sink is running. Another sensor modality (such as vibration sensors) would require a different learning model, as would a different activity to classify (such as water boiling, the garbage disposal running, etc.). The proposed method and apparatus uses classification trees. Other algorithms could be used, such as boosted decision trees or kernel-based Support Vector Machines (SVMs). Using bootstrap classification (learning multiple classification models by resampling the training set), the audio-to-activity learning engine module calculates a confidence level of the activity classification. Thus, the output of the audio-to-activity learning engine module is the audio signal, the classification of the activity and the confidence level of the activity classification.
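  • A minimal sketch of bootstrap classification follows. Two simplifications are assumed and are not taken from the disclosure: a depth-1 classification tree (a decision stump) stands in for a full classification tree, and the confidence level is taken to be the fraction of bootstrap models that agree with the majority label. All function names and labels are hypothetical.

```python
import random
from collections import Counter
from typing import Callable, Sequence, Tuple

Sample = Tuple[Sequence[float], str]        # (feature vector, activity label)
Model = Callable[[Sequence[float]], str]


def train_stump(data: Sequence[Sample]) -> Model:
    """Fit a depth-1 classification tree by minimising training errors."""
    majority = Counter(label for _, label in data).most_common(1)[0][0]
    best = None  # (errors, feature index, threshold, left label, right label)
    for j in range(len(data[0][0])):
        values = sorted({x[j] for x, _ in data})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = Counter(y for x, y in data if x[j] <= thr)
            right = Counter(y for x, y in data if x[j] > thr)
            if not left or not right:
                continue
            l_lab = left.most_common(1)[0][0]
            r_lab = right.most_common(1)[0][0]
            errors = len(data) - left[l_lab] - right[r_lab]
            if best is None or errors < best[0]:
                best = (errors, j, thr, l_lab, r_lab)
    if best is None:                        # no usable split: predict the majority
        return lambda x: majority
    _, j, thr, l_lab, r_lab = best
    return lambda x: l_lab if x[j] <= thr else r_lab


def bootstrap_classify(train: Sequence[Sample], x: Sequence[float],
                       n_models: int = 25, seed: int = 0) -> Tuple[str, float]:
    """Train one model per bootstrap resample and report the vote share."""
    rng = random.Random(seed)
    votes: Counter = Counter()
    for _ in range(n_models):
        resample = [rng.choice(train) for _ in train]   # sample with replacement
        votes[train_stump(resample)(x)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n_models
```

  • The returned pair corresponds to the classification and confidence level that the learning engine outputs alongside the audio signal.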
  • Fig. 2 is a flowchart of an exemplary embodiment of the proposed method in accordance with the principles described herein. At 205 the raw time-windowed sensor data is acquired from the sensor array. The sensor array includes microphones, vibration detectors/sensors or any other sensor that could sense audio from the environment. To acquire segments of audio signals that can be processed, the audio signal is time-windowed. Initially, the size of this window is, for example, 10 seconds. This value may change depending upon the deployment environment. The output of this portion of the system is a series of 10-second recorded audio clips from the audio sensor. At 210 the raw time-windowed audio signals are de-noised, which removes artifacts and measured noise. This is accomplished by performing wavelet de-noising using a hard threshold. A discrete Haar wavelet transform is performed on the raw time-windowed audio signal, transforming the audio amplitude signal to a series of coefficients measuring the energy of a Haar wavelet basis. Then, all coefficients with amplitude below a threshold are set to zero (removing small coefficient values related to noise). Finally, the inverse wavelet transform is applied to convert the de-noised signal back to the time-series audio domain.
  • At 215 the de-noised audio signal is converted to a feature vector. The proposed method and apparatus uses a Fourier Transform to convert the de-noised audio signal to a series of frequency domain coefficients. These coefficients are binned into segments of 10 Hz ranging from 0 Hz to 2000 Hz (where the magnitudes of the coefficients between 0 and 10 Hz are added to the first bin, those of the coefficients from 11 Hz to 20 Hz to the second bin, etc.). The output of this step is a feature vector of size 200 containing the aggregated energy for segments of the frequency spectrum of the de-noised audio signal. Depending on performance, more sophisticated feature extraction (e.g., MFCC extraction, deep learning) may be performed in this step.
  • At 220 the sensor metadata is extracted (acquired, retrieved) from the sensor array. The metadata (extracted information) is related to the placement of each of the audio sensors and is stored in memory. Therefore, for each acquired audio signal, the location in the environment of each audio sensor, the sensitivity of the sensor, what type of sensor (e.g., microphone vs. vibration), and the height of the audio sensor placement in the environment are all known.
  • At 225 the classification model is retrieved (acquired). The classification model is learned from the training data. For example, the coefficients of a Support Vector Machine classifier are determined using the training data so that the classifier can identify a particular activity.
  • At 230, using the feature vector of the audio signal, the training data and the sensor metadata, the audio signal is classified (determined) as to the activity being performed. Using the training set of previously acquired/processed audio features (the length-200 vector of aggregated frequency energy) of known activities, a machine learning classifier is built for each desired activity and for each type of sensor and placement. Training data is stored in a database. The training data includes audio features of known activities. The training data is necessary to learn models of audio-to-activity (e.g., household activity) and is a set of training samples of audio with known activities and metadata. The training data is acquired (through internal tests or purchased from a third party) and includes a set of audio signals with associated activities and details on the sensor setup (the metadata of type of sensor, location, etc.). An example of this is a single learning classifier which detects, from microphone-based audio associated with the kitchen, whether the water in the sink is running. Another sensor modality (such as vibration sensors) would require a different learning model, as would a different activity to classify (such as water boiling, the garbage disposal running, etc.). Classification trees are used to classify the audio signal. Other algorithms could be used, such as boosted decision trees or kernel-based Support Vector Machines (SVMs).
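  • Because a separate classifier is built for each activity, sensor type and placement, retrieving the classification model amounts to a keyed lookup. The registry below is a hypothetical sketch of that idea; the class, its method names, and the string keys are illustrative only and do not appear in the disclosure.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Key: (activity, sensor type, placement). All three strings are illustrative.
ModelKey = Tuple[str, str, str]
Classifier = Callable[[Sequence[float]], bool]   # True if the activity is detected


class ModelRegistry:
    """Holds one trained binary classifier per activity, sensor type and placement."""

    def __init__(self) -> None:
        self._models: Dict[ModelKey, Classifier] = {}

    def register(self, activity: str, sensor_type: str, placement: str,
                 model: Classifier) -> None:
        self._models[(activity, sensor_type, placement)] = model

    def retrieve(self, activity: str, sensor_type: str, placement: str) -> Classifier:
        key = (activity, sensor_type, placement)
        try:
            return self._models[key]
        except KeyError:
            raise LookupError(f"no classifier trained for {key}") from None

    def activities_for(self, sensor_type: str, placement: str) -> List[str]:
        """List the activities that can be detected by a given sensor setup."""
        return sorted(a for (a, s, p) in self._models
                      if s == sensor_type and p == placement)
```

  • A model trained on kitchen microphone audio is therefore not returned for a kitchen vibration sensor, which reflects the statement that another sensor modality requires a different learning model.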
  • At 235 bootstrap classification (learning multiple classification models by resampling the training set) is used to calculate a confidence level of the activity classification. Thus, the output of the audio-to-activity learning engine module is the audio signal, the classification (determination) of the activity and the confidence level of the activity classification.
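  • The flow of steps 205 to 235 can be summarised as a small orchestration function. In this sketch every processing stage is passed in as a callable, so the code shows only the ordering and the shape of the output; the names run_pipeline and ActivityResult, and the signatures of the callables, are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

Signal = Sequence[float]
Features = Sequence[float]
# A model maps (features, metadata) to (activity, confidence).
Model = Callable[[Features, Any], Tuple[str, float]]


@dataclass
class ActivityResult:
    audio: Signal          # the time-windowed clip that was classified
    activity: str          # classification of the activity
    confidence: float      # confidence level of that classification


def run_pipeline(clips: Sequence[Signal],
                 denoise: Callable[[Signal], Signal],
                 extract_features: Callable[[Signal], Features],
                 metadata: Any,
                 retrieve_model: Callable[[Any], Model]) -> List[ActivityResult]:
    """Apply steps 205-235 to each acquired clip, in order."""
    model = retrieve_model(metadata)                       # 220, 225
    results: List[ActivityResult] = []
    for clip in clips:                                     # 205: acquired clips
        clean = denoise(clip)                              # 210: de-noise
        features = extract_features(clean)                 # 215: feature vector
        activity, confidence = model(features, metadata)   # 230, 235
        results.append(ActivityResult(clip, activity, confidence))
    return results
```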
  • Fig. 3 is a block diagram of an exemplary embodiment of the proposed system in accordance with the principles described herein. A sensor array (305) includes microphones, vibration detectors/sensors or any other sensor that could sense audio from the environment. To acquire segments of audio that can be processed, the audio signal is time-windowed. Initially, the size of this window is, for example, 10 seconds. This value may change depending upon the deployment environment. The output of this portion of the system is a series of 10-second recorded audio clips from the audio sensor. The raw time-windowed audio sensor data is provided to the audio pre-processing module, which performs de-noising of the audio signals and feature extraction.
  • The sensor metadata is extracted (acquired, retrieved) from the array of audio sensors. The metadata (extracted information) is related to the placement of each of the audio sensors and is stored in memory. Therefore, for each acquired audio signal, the location in the environment of each audio sensor, the sensitivity of the sensor, what type of sensor (e.g., microphone vs. vibration), and the height of the audio sensor placement in the environment are all known. The sensor metadata is provided to the audio-to-activity learning engine module, which identifies the activity associated with a particular audio signal with a prescribed confidence level.
  • Training data is stored in a database (345). The training data includes audio features of known activities. The training data is necessary to learn models that map audio to household activity, and is a set of training samples of audio with known activities and metadata. The training data is acquired (through internal tests or purchased from a third party) and includes a set of audio signals with associated activities and details on the sensor setup (the metadata of type of sensor, location, etc.). The training data is provided to the audio-to-activity learning engine module.
  • The audio pre-processing module (315) converts the raw windowed audio signal to de-noised relevant features for detection of activities. The audio pre-processing module has two modules. The first module is the de-noising module (320), which de-noises the raw time-windowed audio signals. De-noising removes artifacts and measured noise. The de-noising module performs wavelet de-noising using a hard threshold. A discrete Haar wavelet transform is performed on the raw time-windowed audio signal, transforming the audio amplitude signal to a series of coefficients measuring the energy of a Haar wavelet basis. Then, all coefficients with amplitude below a threshold are set to zero (removing small coefficient values related to noise). Finally, the inverse wavelet transform is performed to convert the de-noised signal back to the time-series audio domain.
  • The second module of the audio pre-processing module is the feature extraction module (325). The feature extraction module converts the de-noised audio signal to a feature vector that can be used as the input of a machine learning algorithm in the audio-to-activity learning module. The feature extraction module uses a Fourier Transform to convert the de-noised audio signal to a series of frequency domain coefficients. These coefficients are binned into segments of 10 Hz ranging from 0 Hz to 2000 Hz (where the magnitudes of the coefficients between 0 and 10 Hz are added to the first bin, those of the coefficients from 11 Hz to 20 Hz to the second bin, etc.). The output of this module is a feature vector of size 200 containing the aggregated energy for segments of the frequency spectrum of the de-noised audio signal. Depending on performance, more sophisticated feature extraction (e.g., MFCC extraction, deep learning) may be performed in this step. The feature vector is also provided to the audio-to-activity learning engine module.
  • The audio-to-activity learning engine (330) has two modules. The first is the classify activity module (335). The classify activity module retrieves a classification model. The classify activity module also extracts sensor metadata from the sensor array. Using the classification model, the feature vector of the audio signal and the sensor metadata, the classify activity module of the audio-to-activity learning engine module classifies what activity is currently being performed. The classify activity module is a machine learning algorithm. Using the training set of previously acquired/processed audio features (the length-200 vector of aggregated frequency energy) of known activities, a machine learning classifier is built for each desired activity and for each type of sensor and placement. An example of this is a single learning classifier which detects, from microphone-based audio associated with the kitchen, whether the water in the sink is running. Another sensor modality (such as vibration sensors) would require a different learning model, as would a different activity to classify (such as water boiling, the garbage disposal running, etc.). The classify activity module uses classification trees. Other algorithms could be used, such as boosted decision trees or kernel-based SVMs. The second module of the audio-to-activity learning engine is the calculate confidence level (score) module (340). The calculate confidence level (score) module uses bootstrap classification (learning multiple classification models by resampling the training set) to calculate a confidence level (score) of the activity classification. Thus, the output of the audio-to-activity learning engine module is the audio signal, the classification of the activity and the confidence level of the activity classification.
  • It is to be understood that the proposed method and apparatus may be implemented in various forms of hardware, software, firmware, special purpose processors, or a combination thereof. Special purpose processors may include application specific integrated circuits (ASICs), reduced instruction set computers (RISCs) and/or field programmable gate arrays (FPGAs). Preferably, the proposed method and apparatus is implemented as a combination of hardware and software. Moreover, the software is preferably implemented as an application program tangibly embodied on a program storage device. The application program may be uploaded to, and executed by, a machine comprising any suitable architecture. Preferably, the machine is implemented on a computer platform having hardware such as one or more central processing units (CPU), a random access memory (RAM), and input/output (I/O) interface(s). The computer platform also includes an operating system and microinstruction code. The various processes and functions described herein may either be part of the microinstruction code or part of the application program (or a combination thereof), which is executed via the operating system. In addition, various other peripheral devices may be connected to the computer platform such as an additional data storage device and a printing device.
  • It should be understood that the elements shown in the figures may be implemented in various forms of hardware, software or combinations thereof. Preferably, these elements are implemented in a combination of hardware and software on one or more appropriately programmed general-purpose devices, which may include a processor, memory and input/output interfaces. Herein, the phrase "coupled" is defined to mean directly connected to or indirectly connected with through one or more intermediate components. Such intermediate components may include both hardware and software based components.
  • It is to be further understood that, because some of the constituent system components and method steps depicted in the accompanying figures are preferably implemented in software, the actual connections between the system components (or the process steps) may differ depending upon the manner in which the proposed method and apparatus is programmed. Given the teachings herein, one of ordinary skill in the related art will be able to contemplate these and similar implementations or configurations of the proposed method and apparatus.

Claims (14)

  1. A method for identifying an activity, said method comprising:
    acquiring (205) segments of audio signals from audio sensors in an audio sensor array;
    de-noising (210) said segments of audio to remove artifacts and noise;
    converting (215) said de-noised segments of audio to a feature vector;
    extracting (220) sensor metadata from said audio sensor array;
    retrieving (225) a classification model;
    classifying (230) said activity using said feature vector and said sensor metadata and a training database; and
    calculating (235) a confidence score.
  2. The method according to claim 1, wherein said training database includes samples of audio of known activities and metadata.
  3. The method according to claim 1, wherein said confidence score is used to classify said activity engaged in by an individual being monitored, said activity being associated with said segments of audio signals acquired from said audio sensors.
  4. The method according to claim 3, wherein said activity is a household activity.
  5. The method according to claim 1, wherein said segments of said audio signals are time-windowed.
  6. The method according to claim 1, wherein said sensor metadata includes information regarding placement of said sensors, height of each sensor, sensitivity of each of said sensors and type of sensor.
  7. An apparatus for predicting household activity, comprising:
    means for acquiring (305) segments of audio from audio sensors in an audio sensor array;
    means for de-noising (320) said segments of audio to remove artifacts and noise;
    means for converting (325) said de-noised segments of audio to a feature vector;
    means for extracting (335) sensor metadata from said audio sensor array;
    means for retrieving (335) a classification model;
    means for classifying (335) said activity using said classification model, said feature vector, said sensor metadata and a training database, said training database including samples of audio of known activities and metadata; and
    means for calculating (340) a confidence score.
  8. The apparatus according to claim 7, wherein said training database includes samples of audio of known activities and metadata.
  9. The apparatus according to claim 8, wherein said confidence score is used to identify said activity engaged in by an individual being monitored, said activity being associated with said segments of audio signals acquired from said audio sensors.
  10. The apparatus according to claim 9, wherein said activity is a household activity.
  11. The apparatus according to claim 7, wherein said array of sensors includes microphones and vibration sensors.
  12. The apparatus according to claim 7, wherein said segments of said audio signals are time-windowed.
  13. The apparatus according to claim 7, wherein said sensor metadata includes information regarding placement of said sensors, height of each sensor, sensitivity of each of said sensors and type of sensor.
  14. Computer program comprising program code instructions executable by a processor for implementing the steps of a method according to at least one of claims 1 to 6.
EP16305034.7A 2016-01-15 2016-01-15 Activity classification from audio Withdrawn EP3193317A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP16305034.7A EP3193317A1 (en) 2016-01-15 2016-01-15 Activity classification from audio

Publications (1)

Publication Number Publication Date
EP3193317A1 (en) 2017-07-19

Family

ID=55315367

Family Applications (1)

Application Number Status Publication Number Priority Date Filing Date Title
EP16305034.7A Withdrawn EP3193317A1 (en) 2016-01-15 2016-01-15 Activity classification from audio

Country Status (1)

Country Link
EP (1) EP3193317A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2963628A1 (en) * 2013-02-26 2016-01-06 Hitachi, Ltd. Monitoring system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DAN ISTRATE ET AL: "Information Extraction From Sound for Medical Telemonitoring", IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, IEEE SERVICE CENTER, LOS ALAMITOS, CA, US, vol. 10, no. 2, 1 April 2006 (2006-04-01), pages 264 - 274, XP007908807, ISSN: 1089-7771, DOI: 10.1109/TITB.2005.859889 *
SIANTIKOS GIORGOS ET AL: "Fusing multiple audio sensors for acoustic event detection", 2015 9TH INTERNATIONAL SYMPOSIUM ON IMAGE AND SIGNAL PROCESSING AND ANALYSIS (ISPA), IEEE, 7 September 2015 (2015-09-07), pages 265 - 269, XP032798430, DOI: 10.1109/ISPA.2015.7306070 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10475468B1 (en) 2018-07-12 2019-11-12 Honeywell International Inc. Monitoring industrial equipment using audio
US10867622B2 (en) 2018-07-12 2020-12-15 Honeywell International Inc. Monitoring industrial equipment using audio
US11348598B2 (en) 2018-07-12 2022-05-31 Honeywell International Inc. Monitoring industrial equipment using audio
CN109246570A (en) * 2018-08-29 2019-01-18 北京声智科技有限公司 The device and method of microphone quality inspection
US11450340B2 (en) 2020-12-07 2022-09-20 Honeywell International Inc. Methods and systems for human activity tracking
US11804240B2 (en) 2020-12-07 2023-10-31 Honeywell International Inc. Methods and systems for human activity tracking
US11620827B2 (en) 2021-03-22 2023-04-04 Honeywell International Inc. System and method for identifying activity in an area using a video camera and an audio sensor
CN113703568A (en) * 2021-07-12 2021-11-26 中国科学院深圳先进技术研究院 Gesture recognition method, gesture recognition device, gesture recognition system, and storage medium
US11836982B2 (en) 2021-12-15 2023-12-05 Honeywell International Inc. Security camera with video analytics and direct network communication with neighboring cameras

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20180120