CN112420078B - Monitoring method, device, storage medium and electronic equipment - Google Patents
- Publication number
- CN112420078B (application CN202011296487.4A)
- Authority
- CN
- China
- Prior art keywords
- target
- audio signal
- audio
- signal
- type information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Quality & Reliability (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Child & Adolescent Psychology (AREA)
- General Health & Medical Sciences (AREA)
- Hospice & Palliative Care (AREA)
- Psychiatry (AREA)
- Circuit For Audible Band Transducer (AREA)
- Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
Abstract
The embodiment of the invention provides a monitoring method, a monitoring device, a storage medium and electronic equipment, wherein the method comprises the following steps: monitoring an audio signal in real time; performing first detection on the monitored audio signal to acquire a target audio signal which is included in the audio signal and is emitted by a target object; performing second detection on the target audio signal to acquire sound type information corresponding to the target audio signal; and sending the target prompt information to the terminal according to the sound type information. The invention solves the problems that the monitoring field is narrow and the specific object cannot be effectively monitored, thereby achieving the effect of enlarging the monitoring application range.
Description
Technical Field
The embodiment of the invention relates to the field of acoustic event monitoring, in particular to a monitoring method, a monitoring device, a storage medium and electronic equipment.
Background
With the continuous development of computational auditory scene analysis, acoustic events can be classified for acoustic scene and event detection. At present, audio scene and event classification and detection are widely applied, for example in smart homes, autonomous driving, and other more complex scenarios.
In a home environment, there are many specific categories of events, such as a baby crying, a smoke alarm sounding, a dog barking, or a water pipe bursting in winter. When such an event occurs, the user wishes to be notified by a scene monitoring method so as to learn the situation at home and respond appropriately.
At present, monitoring methods for the home environment mainly rely on image recognition. Such a method monitors a specified condition by collecting trajectory information of human body activity and, after a judgment based on an image recognition strategy, performs voice analysis on the specified person.
However, the above monitoring method is only suitable for monitoring objects with motion trajectories, i.e. objects capable of freely moving, but in practical applications, there are some objects that cannot generate motion trajectories, such as infants, patients unable to autonomously move, elderly people lying in bed, and so on, and for such objects, the monitoring method in the prior art cannot be used to realize effective monitoring.
Therefore, the related art has the problems that the monitoring field is narrow and the specific object cannot be effectively monitored.
Disclosure of Invention
The embodiment of the invention provides a monitoring method, a monitoring device, a storage medium and electronic equipment, which are used for at least solving the problems that the monitoring field is narrow and the specific object cannot be effectively monitored in the related technology.
According to an embodiment of the present invention, there is provided a monitoring method including:
monitoring an audio signal;
performing first detection on the monitored audio signal to acquire a target audio signal sent by a target object and included in the audio signal;
performing second detection on the target audio signal to acquire sound type information corresponding to the target audio signal;
sending target prompt information to the terminal according to the sound type information.
In an exemplary embodiment, the second detecting the target audio signal to obtain the sound type information corresponding to the target audio signal includes:
extracting audio features of the target audio signal, and enhancing the audio features to obtain target audio features;
inputting the target audio features into a target classification model to determine initial sound type information corresponding to the target audio signals;
under the condition that the initial sound type information indicates that the target audio signal is an audio signal of a target type, determining a pitch level of the target audio signal according to a detection result of the first detection and the initial sound type information, wherein the detection result of the first detection comprises a signal-to-noise ratio of the target audio signal;
determining the pitch level and the initial sound type information as the sound type information.
In one exemplary embodiment, listening for the audio signal comprises:
listening for a continuous audio signal in a target environment, wherein the continuous audio signal is listened for in units of frames.
In one exemplary embodiment, listening for the audio signal further comprises:
detecting whether a first object exists in a target environment, wherein the target environment is the environment where the target object is located;
listening for the audio signal if it is determined that the first object is not present in the target environment.
In one exemplary embodiment, the extracting the audio feature of the target audio signal includes:
inputting the target audio signal to a target signal channel;
and performing target operation on the target audio signal input into the target signal channel to obtain the audio characteristic of the target audio signal in the target signal channel.
In an exemplary embodiment, the performing the target operation on the target audio signal input into the target signal channel to obtain the audio feature of the target audio signal in the target signal channel includes:
carrying out averaging operation on the target audio signal in the target signal channel to obtain similar characteristics of the target audio signal in the target signal channel;
and carrying out difference operation on the target audio signal in the target signal channel to obtain the difference characteristic of the target audio signal in the target signal channel.
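As a non-limiting illustration, the averaging and difference operations can be sketched as follows. The sketch assumes that the target signal channel carries a two-channel (left/right) recording; that assumption, and the names `channel_features`, `left`, and `right`, are hypothetical and are not taken from the text. Under this assumption the per-sample mean captures what the two channels share (the similar feature) and the per-sample difference captures where they diverge (the difference feature).

```python
from typing import List, Tuple


def channel_features(left: List[float], right: List[float]) -> Tuple[List[float], List[float]]:
    """Return (similar, difference) features for two equally long channels.

    Assumption: the "target signal channel" holds a left/right pair.
    """
    if len(left) != len(right):
        raise ValueError("channels must have the same number of samples")
    # Averaging operation: content common to both channels.
    similar = [(l + r) / 2.0 for l, r in zip(left, right)]
    # Difference operation: content that differs between the channels.
    difference = [l - r for l, r in zip(left, right)]
    return similar, difference


similar, difference = channel_features([1.0, 0.5, -0.5], [1.0, -0.5, 0.5])
```

Other readings of the two operations (for example, averaging over time within a single channel) would require a different sketch.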
According to another embodiment of the present invention, there is provided a listening apparatus including:
the audio monitoring module is used for monitoring audio signals;
the first detection module is used for carrying out first detection on the monitored audio signals so as to acquire target audio signals which are included in the audio signals and are emitted by target objects;
the second detection module is used for carrying out second detection on the target audio signal so as to acquire sound type information corresponding to the target audio signal;
and the information sending module is used for sending the target prompt information to the terminal according to the sound type information.
In one exemplary embodiment, the second detection module comprises:
the characteristic extraction unit is used for extracting the audio characteristic of the target audio signal and enhancing the audio characteristic to obtain a target audio characteristic;
the characteristic classification unit is used for inputting the target audio characteristic into a target classification model so as to determine initial sound type information corresponding to the target audio signal;
a pitch determining unit, configured to determine a pitch level of the target audio signal according to the detection result of the first detection and the initial sound type information when the initial sound type information indicates that the target audio signal is an audio signal of a target type, where the detection result of the first detection includes a signal-to-noise ratio of the target audio signal;
a type determining unit for determining the pitch level and the initial sound type information as the sound type information.
According to a further embodiment of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when executed.
According to yet another embodiment of the present invention, there is also provided an electronic device, comprising a memory in which a computer program is stored and a processor configured to run the computer program to perform the steps of any of the method embodiments described above.
According to the invention, the target object that produced an audio clip and the sound type information of the audio clip are determined, and it is thereby determined whether the audio clip is the target audio emitted by the target object, so that the target object and the target audio are monitored. This solves the problems in the related art that the monitoring field is narrow and a specific object cannot be effectively monitored, and achieves the effect of expanding the monitoring application range.
Drawings
FIG. 1 is a block diagram of a hardware structure of a mobile terminal for a monitoring method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a monitoring method according to an embodiment of the present invention;
FIG. 3 is a block diagram of a monitoring apparatus according to an embodiment of the present invention;
FIG. 4 is a flow chart of an embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The method embodiments provided in the embodiments of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking the example of operating on a mobile terminal, fig. 1 is a block diagram of a hardware structure of the mobile terminal of a monitoring method according to an embodiment of the present invention. As shown in fig. 1, the mobile terminal may include one or more (only one shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), and a memory 104 for storing data, wherein the mobile terminal may further include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration, and does not limit the structure of the mobile terminal. For example, the mobile terminal may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store a computer program, for example, a software program and a module of application software, such as a computer program corresponding to a listening method in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, so as to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal. In one example, the transmission device 106 includes a Network adapter (NIC), which can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
In this embodiment, a listening method is provided, and fig. 2 is a flowchart according to an embodiment of the present invention, as shown in fig. 2, the flowchart includes the following steps:
Step S202, monitoring an audio signal;
in this embodiment, monitoring the audio signal in the target environment increases the probability of identifying the target object in the target environment. The audio signal may be, but is not limited to being, monitored in real time, which avoids identification errors caused by missed audio. Alternatively, to save energy, the audio signal in the target environment may be monitored intermittently per unit time, switching to real-time monitoring once the target audio signal is detected, or the audio signal may be acquired from the target environment at a fixed period.
For example, when the target environment is indoors and an animal (such as a pet dog or a pet cat) or a household device (such as an alarm clock) capable of making a sound exists in the room, since the sounds of the animal and the household device stop after a certain period of time, monitoring the audio signal in the room in real time can facilitate distinguishing the sound of the target object (such as crying sound of an infant) from other noises, thereby improving the recognition probability of the target object; meanwhile, because the frequency, amplitude and other parameters of the target object and other noises are different, the target object can be distinguished through real-time monitoring; further, in the case of acquiring an audio signal in a room, the target object may be (but is not limited to) analyzed in combination with the video signal, thereby further improving the recognition probability of the sound of the target object.
The acquisition of the audio signal can be performed by audio acquisition equipment such as a microphone and an audio acquisition module, or by equipment or devices such as a voiceprint acquisition equipment and a sound card. In order to realize the function of real-time acquisition, a timing module or timing equipment can be arranged on the audio signal acquisition equipment so that the audio signal acquisition equipment triggers the real-time acquisition of the audio signal under the condition of reaching a target time point; the audio signal acquisition equipment can also be connected with the environment judgment equipment so that the audio signal acquisition equipment triggers the real-time acquisition of the audio signal under the condition that the target environment meets the preset condition (such as no people indoors); the audio signal acquisition device can also be used for acquiring the audio signal in real time in other ways.
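As a non-limiting illustration, the two triggers described above (reaching a target time point, or the target environment meeting a preset condition such as nobody being indoors) can be sketched as a single check. The function name `should_start_monitoring` and its parameters are hypothetical, and combining the two triggers with a logical OR is an assumption; an implementation could equally use only one of them.

```python
from datetime import datetime, time


def should_start_monitoring(now: datetime, start_at: time, room_occupied: bool) -> bool:
    """Decide whether real-time acquisition should be triggered.

    Acquisition starts when the target time point has been reached,
    or when the preset environment condition (room empty) is met.
    """
    reached_target_time = now.time() >= start_at
    preset_condition_met = not room_occupied
    return reached_target_time or preset_condition_met


# Example: 22:00 with a 21:00 target time triggers acquisition even if occupied.
triggered = should_start_monitoring(datetime(2020, 1, 1, 22, 0), time(21, 0), True)
```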
It should be noted that real-time monitoring of the target environment reduces cases in which the audio signal is misidentified or cannot be identified because part of it was missed. A continuous audio signal can thus be obtained in the shortest time during monitoring, which reduces misidentification and improves the efficiency of audio signal identification.
For example, if the indoor environment is monitored in real time, when an audio signal of crying of the baby appears, the audio segment can be continuously collected in time, and the omission of monitoring the audio segment is reduced.
Step S204, performing first detection on the monitored audio signal to acquire a target audio signal that is included in the audio signal and emitted by a target object;
in this embodiment, when there are many sound-generating objects in the target environment, the monitored audio signal often contains various noises, and therefore the target audio signal needs to be separated from the monitored audio signal to reduce the interference on the identification of the target object.
For example, when the target environment is indoors and animals (such as pet dogs and pet cats) or household equipment (such as alarm clocks) capable of making sounds are present in the room, signal processing such as demodulation and filtering is performed, and parameters such as signal frequency in the monitored audio signals are compared, so that audio signals close to or identical to parameters (such as frequency and amplitude) of the target audio signals (such as sounds of infants) are separated from the monitored audio signals and are used as the target audio signals, and interference of other noise signals is reduced.
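As a non-limiting illustration, the parameter comparison described above can be sketched by estimating a frame's frequency from its zero-crossing count and its amplitude from its peak value, and keeping the frame only if both fall within the range expected for the target. This is a deliberately crude stand-in for the demodulation and filtering mentioned in the text; the function names, the 300 to 600 Hz range, and the 0.1 amplitude floor are hypothetical placeholders rather than values from the source.

```python
import math
from typing import List, Tuple


def dominant_frequency(frame: List[float], sample_rate: int) -> float:
    """Rough frequency estimate: a periodic signal crosses zero twice per cycle."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings * sample_rate / (2.0 * len(frame))


def is_target_frame(frame: List[float], sample_rate: int,
                    freq_range: Tuple[float, float] = (300.0, 600.0),
                    min_amplitude: float = 0.1) -> bool:
    """Keep a frame whose frequency and amplitude are close to the target's."""
    if not frame:
        return False
    peak = max(abs(s) for s in frame)
    low, high = freq_range
    return peak >= min_amplitude and low <= dominant_frequency(frame, sample_rate) <= high


def sine(freq: float, amplitude: float, sample_rate: int = 8000, n: int = 800) -> List[float]:
    """Synthetic test tone; the 0.5 rad phase avoids samples lying exactly on zero."""
    return [amplitude * math.sin(2 * math.pi * freq * k / sample_rate + 0.5) for k in range(n)]


keep = is_target_frame(sine(400.0, 0.5), 8000)      # in range, loud enough
reject = is_target_frame(sine(100.0, 0.5), 8000)    # frequency too low
```

A zero-crossing estimate is only meaningful for roughly single-tone frames; a spectral method would be more robust when several sources overlap.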
The first detection of the monitored audio signal may be performed by an audio processing module, such as an FPGA or a single chip microcomputer with an audio processing function, or may be performed by audio processing software, or may be performed by other methods.
It should be noted that the monitored audio signal may also include electromagnetic noise produced by current flowing while the device is operating, so this electromagnetic noise may be separated out during the first detection to further reduce interference with the identification of the target audio signal; the separation of the electromagnetic noise can also be realized by audio processing software or an audio processing module.
Step S206, carrying out second detection on the target audio signal to obtain sound type information corresponding to the target audio signal;
in this embodiment, the sound type information corresponding to the target audio signal is acquired in order to determine the action the target object is performing when it emits the target audio signal, so as to obtain state information of the target object and facilitate subsequent execution of a corresponding action according to the state of the target object; the sound type information includes information such as the action state of the target object represented by the target audio signal, the effective time of the target audio signal, the audio parameters of the target audio signal, the position of the target object, and the pitch level.
For example, when the sound type information corresponding to the acquired target audio signal indicates crying of the infant, the infant can be soothed according to the sound type information; or when the sound type information indicates laughter of the infant, the infant can be interacted with according to the sound type information, and so on.
The obtaining of the sound type information corresponding to the target audio signal may be storing the sound type information corresponding to the target audio signal in advance, and comparing the detection result with the sound type information when receiving the detection result of the second detection, so as to obtain the sound type information corresponding to the target audio signal; or directly classify the target audio signal in the second detection process.
It should be noted that the device or apparatus for performing the second detection may be a processing module such as a single chip microcomputer or an FPGA, a cloud processing device having a computing and storing function, or other devices or apparatuses having an audio signal processing function.
And step S208, sending the target prompt information to the terminal according to the sound type information.
In this embodiment, after the sound type information is determined, the corresponding target prompt information is sent to the terminal, so that the terminal can execute a corresponding action according to the sound type information, thereby reducing loss caused by inaccurate identification or failure to receive the sound type information in time.
The target prompt information may be audio information, such as sounds corresponding to different sound type information or sound clips that serve only as a prompt; or text information, such as text corresponding to different sound type information or text that serves only as a prompt; or an action signal that causes the terminal to vibrate, such as a driving signal that drives a built-in micro motor in the terminal to rotate; or a combination of the three. The target prompt information may be sent to the terminal by first sending the sound type information to a cloud or a management platform through the communication module and then forwarding it from the cloud or management platform to the terminal; the target prompt information may also be sent directly to the terminal through the communication module, or transmitted in other ways.
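As a non-limiting illustration, the mapping from sound type information to target prompt information can be sketched as a lookup table whose entries combine text, an audio clip, and a vibration flag. The `Prompt` class, the sound type keys, the message texts, and the file name `alert_cry.wav` are all hypothetical; the text does not specify a message format.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Prompt:
    """One prompt; any combination of the three kinds may be set."""
    text: Optional[str] = None        # text information
    audio_clip: Optional[str] = None  # audio information (clip to play)
    vibrate: bool = False             # action signal driving the vibration motor


# Hypothetical table keyed by sound type.
PROMPTS: Dict[str, Prompt] = {
    "infant_crying": Prompt(text="The infant is crying.", audio_clip="alert_cry.wav", vibrate=True),
    "infant_laughing": Prompt(text="The infant is laughing."),
}


def build_prompt(sound_type: str) -> Prompt:
    """Return the prompt for a sound type, or a generic text prompt if unknown."""
    return PROMPTS.get(sound_type, Prompt(text=f"Sound detected: {sound_type}"))


prompt = build_prompt("infant_crying")
```

Whether the prompt travels through a cloud or management platform or goes directly to the terminal does not affect this mapping; only the transport differs.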
For example, after the terminal receives target prompt information indicating that the infant is crying, a parent can soothe the infant in time, or the infant can be soothed through an audio playing device or other devices (such as a device capable of patting the infant or a device capable of feeding the infant); the parents therefore need not accompany the infant constantly, which saves their energy.
Through the above steps, the audio signal of the target object is collected and classified to determine whether an audio clip is the target audio emitted by the target object, and the target object and the target audio are then monitored according to the different audio objects, so that the problems in the related art that the monitoring field is narrow and a specific object cannot be effectively monitored are solved, and the monitoring application range is expanded.
The execution subject of the above steps may be a base station, a terminal, or the like, but is not limited thereto.
In an optional embodiment, the second detecting the target audio signal to obtain the sound type information corresponding to the target audio signal includes:
Step S2062, extracting the audio features of the target audio signal, and enhancing the audio features to obtain the target audio features;
in this embodiment, the audio features of the target audio signal are extracted to facilitate identification and classification of the target audio signal, and the audio features are enhanced so that they can be identified quickly, which improves the efficiency of identification and classification; the device for extracting the features of the target audio signal can be a feature extraction module, or other devices or apparatuses; the audio features may include, but are not limited to, parameters of the target audio signal such as signal strength, frequency, and amplitude; the enhancement may be performed by harmonic-percussive source separation or by other means.
Step S2064, inputting the target audio features into the target classification model to determine the initial sound type information corresponding to the target audio signal;
in this embodiment, the target audio features are input to the target classification model, which performs calculation on them so that the initial sound type information corresponding to the target audio signal can be determined; the target classification model may be, but is not limited to, a Long Short-Term Memory (LSTM) model, and may also be another classification model; the initial sound type information includes the type of the audio signal, target object information, and the like.
For example, three input quantities of the LSTM model are set respectively: input value x of network at present t Last time LSTM output value h t-1 And cell state C at the previous time t-1 ;
And two outputs of the LSTM model: current time LSTM output value h t And cell state C at the current time t 。
The key of the LSTM is to control the long-term state C, so that three control switches are arranged, the first switch is responsible for controlling to continuously store the long-term state, the second switch is responsible for controlling to input the instant state into the long-term state C, and the third switch is responsible for controlling whether the long-term state C is used as the output of the current LSTM.
It should be noted that the LSTM model mainly includes a forget gate, an input gate, and an output gate, defined as follows:
Forget gate: determines how much of the cell state $C_{t-1}$ at the previous time is retained in the cell state $C_t$ at the current time. The specific formula is as follows:
$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$ (Formula 1)
Input gate: determines how much of the network input $x_t$ at the current time is saved to the cell state $C_t$. The specific formula is as follows:
$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$ (Formula 2)
Output gate: controls how much of the cell state $C_t$ is output to the current LSTM output value $h_t$. The specific formula is as follows:
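The output-gate formula itself does not appear in this text. Assuming the model follows the standard LSTM formulation and the notation of Formulas 1 and 2, the output gate and the resulting state updates would read as below. The weights $W_o$, $W_c$, the biases $b_o$, $b_c$, and the candidate state $\tilde{C}_t$ are names introduced here for illustration and are not taken from the description.

```latex
\begin{aligned}
o_t         &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \\
\tilde{C}_t &= \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \\
C_t         &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
h_t         &= o_t \odot \tanh(C_t)
\end{aligned}
```

Under this reading, $f_t$ scales how much of $C_{t-1}$ survives, $i_t$ scales how much of the candidate state is added, and $o_t$ scales how much of $C_t$ reaches the output $h_t$, which matches the three switches described above.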
Step S2066, in the case that the initial sound type information indicates that the target audio signal is an audio signal of the target type, determining the pitch level of the target audio signal according to the detection result of the first detection and the initial sound type information, wherein the detection result of the first detection includes the signal-to-noise ratio of the target audio signal;
In this embodiment, the state of the target object is determined from the pitch level of the target audio signal, so that a corresponding action can be performed according to that state; the signal-to-noise ratio obtained by the first detection is used to ensure the accuracy of the pitch level. The first detection may input the monitored audio signal to a Voice Activity Detection (VAD) module for endpoint detection, or may use another detection method.
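As a rough illustration of how a first detection could yield a signal-to-noise ratio that later feeds a pitch level, the sketch below uses a simple energy threshold in place of a real VAD module. The description does not specify the VAD algorithm or how the signal-to-noise ratio maps to a level, so the functions `detect_activity` and `pitch_level`, the threshold, and the level boundaries (5 dB and 15 dB) are all hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame (0.0 for an empty frame)."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0


def detect_activity(frames: Sequence[Sequence[float]], threshold: float) -> Tuple[List[bool], float]:
    """Flag frames whose energy exceeds `threshold` and estimate the SNR in dB.

    Flagged frames count as signal and the rest as noise; this is a simple
    stand-in for a VAD module, not the method of the description.
    """
    energies = [frame_energy(f) for f in frames]
    active = [e > threshold for e in energies]
    signal = [e for e, a in zip(energies, active) if a]
    noise = [e for e, a in zip(energies, active) if not a]
    if not signal:
        return active, float("-inf")  # nothing detected
    if not noise or sum(noise) == 0.0:
        return active, float("inf")   # no measurable noise floor
    ratio = (sum(signal) / len(signal)) / (sum(noise) / len(noise))
    return active, 10.0 * math.log10(ratio)


def pitch_level(snr_db: float, is_target_type: bool,
                bounds: Tuple[float, float] = (5.0, 15.0)) -> Optional[int]:
    """Map an SNR to level 1, 2, or 3; return None when the sound is not the target type.

    The boundaries are hypothetical placeholders.
    """
    if not is_target_type:
        return None
    low, high = bounds
    if snr_db < low:
        return 1
    return 2 if snr_db < high else 3
```

The level is computed only when the classifier has reported the target type, which mirrors the condition stated in step S2066.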
In step S2068, the pitch level and the initial sound type information are determined as the sound type information.
In this embodiment, the pitch level and the initial sound type information are determined as part of the sound type information, so that the terminal can quickly execute a corresponding action according to the sound type information after receiving the sound type information, thereby reducing action delay caused by information identification and improving the sensitivity of action response.
In an alternative embodiment, listening for the audio signal comprises:
Step S2022, monitoring a continuous audio signal in the target environment, wherein the continuous audio signal is monitored in units of frames.
In this embodiment, monitoring a continuous audio signal avoids omissions, and monitoring in units of frames facilitates subsequent inspection of the monitored audio signal.
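Frame-based monitoring can be sketched as follows. The function name `iter_frames` and the parameters `frame_len` and `hop_len` are assumptions for illustration; the description does not state a frame length, an overlap, or how a trailing partial frame is handled.

```python
from typing import Iterator, List, Sequence


def iter_frames(samples: Sequence[float], frame_len: int, hop_len: int) -> Iterator[List[float]]:
    """Yield consecutive frames of `frame_len` samples, advancing by `hop_len`.

    A trailing partial frame is zero-padded so that no samples are dropped.
    """
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame_len and hop_len must be positive")
    start = 0
    while start < len(samples):
        frame = list(samples[start:start + frame_len])
        frame.extend([0.0] * (frame_len - len(frame)))  # pad the last frame
        yield frame
        start += hop_len
```

Setting `hop_len` smaller than `frame_len` produces overlapping frames, so an event that straddles a frame boundary still falls wholly inside some frame.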
In an optional embodiment, listening for the audio signal further comprises:
step S2024, detecting whether a first object exists in a target environment, where the target environment is an environment where the target object is located;
Step S2026, monitoring the audio signal in the case where it is determined that the first object is not present in the target environment.
In this embodiment, detecting that the first object is not present in the target environment indicates that the target environment is unattended, and the monitoring action is then triggered; this saves energy and reduces interference from audio signals similar to the target audio signal.
The first object may be detected by image tracking recognition, infrared temperature tracking detection, or other methods.
For example, when image tracking recognition detects that no adult who can take care of the infant is in the room, the room is determined to be unattended, and real-time audio monitoring of the indoor environment is triggered to ensure the safety of the infant.
In an alternative embodiment, extracting the audio feature of the target audio signal comprises:
step S20622, inputting a target audio signal to the target signal channel;
in step S20624, a target operation is performed on the target audio signal input into the target signal channel to obtain the audio feature of the target audio signal in the target signal channel.
In this embodiment, the channel selection module selects the audio segment in the signal channel with the strongest signal energy, so as to ensure that the result of the target operation meets the requirement and reduce the interference of other noises; the number of the target signal channels may be one or more.
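Selecting the channel with the strongest signal energy could look like the sketch below. The energy measure (mean squared amplitude) and the names `channel_energy` and `select_strongest_channels` are assumptions; the description says only that the strongest channel is chosen and that one or more target channels may be used.

```python
from typing import List, Sequence


def channel_energy(segment: Sequence[float]) -> float:
    """Mean squared amplitude of one channel's audio segment (0.0 if empty)."""
    if not segment:
        return 0.0
    return sum(s * s for s in segment) / len(segment)


def select_strongest_channels(channels: Sequence[Sequence[float]], count: int = 1) -> List[int]:
    """Return the indices of the `count` highest-energy channels, strongest first."""
    if count <= 0:
        raise ValueError("count must be positive")
    ranked = sorted(range(len(channels)), key=lambda i: channel_energy(channels[i]), reverse=True)
    return ranked[:count]
```

With `count=2` the same function returns two target channels, which fits the two-channel averaging and difference calculation used in the specific example of Fig. 4.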
In an alternative embodiment, performing the target operation on the target audio signal in the target signal channel to obtain the audio feature of the target audio signal in the target signal channel includes:
step S206242, performing an averaging operation on the target audio signal in the target signal channel to obtain similar features of the target audio signal in the target signal channel;
step S206244, performing a difference operation on the target audio signal in the target signal channel to obtain a difference feature of the target audio signal in the target signal channel.
In this embodiment, similar features are obtained by performing a mean operation, and difference features are obtained by a difference operation, so that the target audio signal is conveniently compared with a preset target audio signal, and the target audio signal can be rapidly classified according to the features.
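One possible reading of the mean and difference operations, consistent with the two-channel example of Fig. 4, is an element-wise average and an element-wise subtraction of two channels' feature vectors. The sketch below follows that reading; the function name `mean_and_difference` and the representation of features as lists of floats are hypothetical.

```python
from typing import List, Sequence, Tuple


def mean_and_difference(
    channel_a: Sequence[float], channel_b: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Return (similar, difference) features for two equally long feature vectors.

    similar[i]    = (a[i] + b[i]) / 2  -> what the two channels have in common
    difference[i] = a[i] - b[i]        -> where the two channels disagree
    """
    if len(channel_a) != len(channel_b):
        raise ValueError("both channels must have the same number of features")
    similar = [(a + b) / 2.0 for a, b in zip(channel_a, channel_b)]
    difference = [a - b for a, b in zip(channel_a, channel_b)]
    return similar, difference
```

For instance, `mean_and_difference([1.0, 4.0], [3.0, 2.0])` gives `([2.0, 3.0], [-2.0, 2.0])`: content present in both channels is preserved in the mean, while channel-specific content shows up in the difference.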
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention or portions thereof contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (which may be a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
In this embodiment, a monitoring apparatus is further provided. The apparatus is used to implement the foregoing embodiments and preferred embodiments, and what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the embodiments below is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 3 is a block diagram of a listening device according to an embodiment of the present invention, and as shown in fig. 3, the listening device includes:
the audio monitoring module 32 is used for monitoring audio signals;
a first detection module 34, configured to perform first detection on the monitored audio signal to obtain a target audio signal emitted by a target object included in the audio signal;
the second detection module 36 is configured to perform second detection on the target audio signal to obtain sound type information corresponding to the target audio signal;
and the information sending module 38 is used for sending the target prompt information to the terminal according to the sound type information.
In an alternative embodiment, the second detection module 36 includes:
a feature extraction unit 362, configured to extract an audio feature of the target audio signal, and enhance the audio feature to obtain a target audio feature;
a feature classification unit 364, configured to input a target audio feature into a target classification model to determine initial sound type information corresponding to a target audio signal;
a pitch determining unit 366, configured to determine a pitch level of the target audio signal according to a detection result of the first detection and the initial sound type information when the initial sound type information indicates that the target audio signal is an audio signal of the target type, where the detection result of the first detection includes a signal-to-noise ratio of the target audio signal;
a type determining unit 368 is configured to determine a pitch level and the initial sound type information as the sound type information.
In an alternative embodiment, the audio listening module 32 comprises:
a continuous signal listening unit 322 for listening a continuous audio signal in a target environment, wherein the continuous audio signal is listened for in units of frames.
In an optional embodiment, the audio listening module 32 further comprises:
a track detection unit 324, configured to detect whether a first object exists in a target environment, where the target environment is an environment where the target object is located;
a listening triggering unit 326 for listening to the audio signal if it is determined that the first object is not present in the target environment.
In an alternative embodiment, the feature extraction unit 362 includes:
an audio input subunit 3622 configured to input audio segments of the target group to the target signal channel;
the audio operation subunit 3624 is configured to perform target operation on the audio segment input into the target signal channel to obtain an audio feature of the audio segment in the target signal channel.
In an exemplary embodiment, the audio operation subunit 3624 includes:
the mean value operation subunit 36242 is configured to perform mean value operation on the audio segments in the target signal channel to obtain similar features of the audio segments in the target signal channel;
the difference calculating subunit 36244 is configured to perform difference calculating on the audio segment in the target signal channel to obtain a difference feature of the audio segment in the target signal channel.
It should be noted that the above modules may be implemented by software or hardware. In the latter case, the implementation may be, but is not limited to, the following: the modules are all located in the same processor; or the modules are located in different processors in any combination.
The following description will be given with reference to specific examples.
As shown in Fig. 4, the terminal device that collects the audio signal acquires a continuous speech signal in real time, in units of frames (corresponding to step S401 in Fig. 4). The audio signal is then input to the VAD module for endpoint detection; the audio is determined to be an audio segment consistent with the sound of an infant, and signal-to-noise-ratio-related information is obtained at the same time (corresponding to step S402 in Fig. 4). The audio signal is then input to a feature extraction module (corresponding to step S403 in Fig. 4), which performs averaging and difference calculations on the features of the audio data of the two target signal channels to obtain the similarities and differences between the features of the two channels, and obtains enhanced audio features by harmonic impulse source separation.
The enhanced audio features are then passed to an offline acoustic event detection classification model, which outputs a result indicating whether the infant is crying (corresponding to step S404 in Fig. 4). If the acoustic event detection classification model determines that the scene is an infant crying scene, a crying pitch level is given in combination with the signal-to-noise-ratio-related information obtained by the VAD. The infant crying result and the crying pitch level are then transmitted to the cloud (corresponding to step S405 in Fig. 4), and the cloud sends the obtained information to the app on the user's mobile phone (corresponding to step S406 in Fig. 4).
An embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to perform the steps in any of the above method embodiments when executed.
In an exemplary embodiment, the computer-readable storage medium may include, but is not limited to, various media capable of storing computer programs, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
In an exemplary embodiment, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
For specific examples in this embodiment, reference may be made to the examples described in the above embodiments and exemplary embodiments, and details of this embodiment are not repeated herein.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented in a general purpose computing device, they may be centralized in a single computing device or distributed across a network of multiple computing devices, and they may be implemented in program code that is executable by a computing device, such that they may be stored in a memory device and executed by a computing device, and in some cases, the steps shown or described may be executed in an order different from that shown or described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps therein may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.
Claims (8)
1. A listening method, comprising:
monitoring an audio signal;
performing first detection on the monitored audio signal to acquire a target audio signal sent by a target object and included in the audio signal;
performing second detection on the target audio signal to acquire sound type information corresponding to the target audio signal;
sending target prompt information to a terminal according to the sound type information;
the second detecting the target audio signal to obtain the sound type information corresponding to the target audio signal includes: extracting audio features of the target audio signal, and enhancing the audio features to obtain target audio features; inputting the target audio features into a target classification model to determine initial sound type information corresponding to the target audio signals; determining a pitch level of the target audio signal according to the detection result of the first detection and the initial sound type information under the condition that the initial sound type information indicates that the target audio signal is an audio signal of a target type, wherein the detection result of the first detection comprises a signal-to-noise ratio of the target audio signal; determining the pitch level and the initial sound type information as the sound type information;
extracting audio features of the target audio signal, wherein enhancing the audio features comprises: and carrying out averaging and difference calculation on the audio features of the target audio signals of the two signal channels to obtain similar points and difference points of the audio features of the two signal channels, and enhancing the audio features through a harmonic impact source separation mode based on the similar points and the difference points.
2. The method of claim 1, wherein listening for the audio signal comprises:
listening for a continuous audio signal in a target environment, wherein the continuous audio signal is listened for in units of frames.
3. The method of claim 1, wherein listening for the audio signal further comprises:
detecting whether a first object exists in a target environment, wherein the target environment is the environment where the target object is located;
listening, in real-time, for the audio signal if it is determined that the first object is not present in the target environment.
4. The method of claim 1, wherein the extracting the audio feature of the target audio signal comprises:
inputting the target audio signal to a target signal channel;
and performing target operation on the target audio signal input into the target signal channel to obtain the audio characteristics of the target audio signal in the target signal channel.
5. The method of claim 4, wherein the performing a target operation on the target audio signal input into the target signal channel to obtain the audio feature of the audio segment in the target signal channel comprises:
carrying out averaging operation on the target audio signal in the target signal channel to obtain similar characteristics of the target audio signal in the target signal channel;
and carrying out difference operation on the target audio signal in the target signal channel to obtain the difference characteristic of the target audio signal in the target signal channel.
6. A listening device, comprising:
the audio monitoring module is used for monitoring audio signals;
the first detection module is used for carrying out first detection on the monitored audio signals so as to acquire target audio signals which are included in the audio signals and are emitted by target objects;
the second detection module is used for carrying out second detection on the target audio signal so as to acquire sound type information corresponding to the target audio signal;
the information sending module is used for sending target prompt information to the terminal according to the sound type information;
the second detection module includes: the characteristic extraction unit is used for extracting the audio characteristic of the target audio signal and enhancing the audio characteristic to obtain a target audio characteristic; the characteristic classification unit is used for inputting the target audio characteristic into a target classification model so as to determine initial sound type information corresponding to the target audio signal; a pitch determining unit, configured to determine a pitch level of the target audio signal according to the detection result of the first detection and the initial sound type information when the initial sound type information indicates that the target audio signal is an audio signal of a target type, where the detection result of the first detection includes a signal-to-noise ratio of the target audio signal; a type determining unit for determining the pitch class and the initial sound type information as the sound type information;
the feature extraction unit extracts the audio features of the target audio signal and enhances the audio features by: and carrying out averaging and difference calculation on the audio features of the target audio signals of the two signal channels to obtain similar points and difference points of the audio features of the two signal channels, and enhancing the audio features through a harmonic impact source separation mode based on the similar points and the difference points.
7. A storage medium, in which a computer program is stored, wherein the computer program is arranged to perform the method of any of claims 1 to 5 when executed.
8. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011296487.4A CN112420078B (en) | 2020-11-18 | 2020-11-18 | Monitoring method, device, storage medium and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011296487.4A CN112420078B (en) | 2020-11-18 | 2020-11-18 | Monitoring method, device, storage medium and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112420078A CN112420078A (en) | 2021-02-26 |
CN112420078B true CN112420078B (en) | 2022-12-30 |
Family
ID=74773021
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011296487.4A Active CN112420078B (en) | 2020-11-18 | 2020-11-18 | Monitoring method, device, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112420078B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106941007A (en) * | 2017-05-12 | 2017-07-11 | 北京理工大学 | A kind of audio event model composite channel adaptive approach |
CN107978311A (en) * | 2017-11-24 | 2018-05-01 | 腾讯科技(深圳)有限公司 | A kind of voice data processing method, device and interactive voice equipment |
Family Cites Families (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8990074B2 (en) * | 2011-05-24 | 2015-03-24 | Qualcomm Incorporated | Noise-robust speech coding mode classification |
US9536540B2 (en) * | 2013-07-19 | 2017-01-03 | Knowles Electronics, Llc | Speech signal separation and synthesis based on auditory scene analysis and speech modeling |
GB2521175A (en) * | 2013-12-11 | 2015-06-17 | Nokia Technologies Oy | Spatial audio processing apparatus |
CN105799584B (en) * | 2014-12-29 | 2019-03-15 | 博世汽车部件(苏州)有限公司 | Surrounding vehicles whistle sound microprocessor, suggestion device and automated driving system |
CN106971714A (en) * | 2016-01-14 | 2017-07-21 | 芋头科技(杭州)有限公司 | A kind of speech de-noising recognition methods and device applied to robot |
CN106095387B (en) * | 2016-06-16 | 2019-06-25 | Oppo广东移动通信有限公司 | A kind of the audio setting method and terminal of terminal |
CN107331405A (en) * | 2017-06-30 | 2017-11-07 | 深圳市金立通信设备有限公司 | A kind of voice information processing method and server |
WO2019008580A1 (en) * | 2017-07-03 | 2019-01-10 | Yissum Research Development Company Of The Hebrew University Of Jerusalem Ltd. | Method and system for enhancing a speech signal of a human speaker in a video using visual information |
CN108109622A (en) * | 2017-12-28 | 2018-06-01 | 武汉蛋玩科技有限公司 | A kind of early education robot voice interactive education system and method |
CN108647005A (en) * | 2018-05-15 | 2018-10-12 | 努比亚技术有限公司 | Audio frequency playing method, mobile terminal and computer readable storage medium |
CN110989900B (en) * | 2019-11-28 | 2021-11-05 | 北京市商汤科技开发有限公司 | Interactive object driving method, device, equipment and storage medium |
CN111383669B (en) * | 2020-03-19 | 2022-02-18 | 杭州网易云音乐科技有限公司 | Multimedia file uploading method, device, equipment and computer readable storage medium |
CN111491258B (en) * | 2020-03-26 | 2022-07-12 | 微民保险代理有限公司 | Object type detection method and device |
CN111586515A (en) * | 2020-04-30 | 2020-08-25 | 歌尔科技有限公司 | Sound monitoring method, equipment and storage medium based on wireless earphone |
CN111681672A (en) * | 2020-05-26 | 2020-09-18 | 深圳壹账通智能科技有限公司 | Voice data detection method and device, computer equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN112420078A (en) | 2021-02-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10785643B2 (en) | Deep learning neural network based security system and control method therefor | |
US10074383B2 (en) | Sound event detection | |
CN106024016B (en) | Child nursing type robot and method for identifying crying of child | |
US9020622B2 (en) | Audio monitoring system and method of use | |
EP3940698A1 (en) | A computer-implemented method of providing data for an automated baby cry assessment | |
US12014732B2 (en) | Energy efficient custom deep learning circuits for always-on embedded applications | |
CN106214436A (en) | A kind of intelligent blind guiding system based on mobile phone terminal and blind-guiding method thereof | |
CN109595757B (en) | Control method and device of air conditioner and air conditioner with control device | |
CN108597164B (en) | Anti-theft method, anti-theft device, anti-theft terminal and computer readable medium | |
CN112069949A (en) | Artificial intelligence-based infant sleep monitoring system and monitoring method | |
US20210098005A1 (en) | Device, system and method for identifying a scene based on an ordered sequence of sounds captured in an environment | |
JP2020524300A (en) | Method and device for obtaining event designations based on audio data | |
CN112420078B (en) | Monitoring method, device, storage medium and electronic equipment | |
JP6861398B2 (en) | Information processing method and information processing device | |
CN106125566A (en) | A kind of household background music control system | |
CN111371894B (en) | Intelligent infant monitoring method and system based on Internet of things and storage medium | |
US20210104255A1 (en) | Assistive technology | |
CN110958348B (en) | Voice processing method and device, user equipment and intelligent sound box | |
KR20140136332A (en) | An acoustic feature extraction method for target acoustic recognition, apparatus for controlling objects by target acoustic recognition and method thereof | |
Rodriguez et al. | Waah: Infants cry classification of physiological state based on audio features | |
CN108520755B (en) | Detection method and device | |
WO2018039934A1 (en) | Pet placation method, apparatus and system | |
CN112185364A (en) | Method and device for detecting baby crying | |
CN112741557B (en) | Child state monitoring method and device based on sweeping robot | |
Busch et al. | Signal processing and behaviour recognition in animal welfare monitoring system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |