WO2023137908A1 - Sound recognition method, apparatus, medium, device, program product and vehicle - Google Patents

Sound recognition method, apparatus, medium, device, program product and vehicle

Info

Publication number
WO2023137908A1
WO2023137908A1 PCT/CN2022/090554 CN2022090554W WO2023137908A1 WO 2023137908 A1 WO2023137908 A1 WO 2023137908A1 CN 2022090554 W CN2022090554 W CN 2022090554W WO 2023137908 A1 WO2023137908 A1 WO 2023137908A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
category
sample
sound
model
Prior art date
Application number
PCT/CN2022/090554
Other languages
English (en)
French (fr)
Inventor
闫志勇
丁翰林
王永庆
张俊博
王育军
Original Assignee
小米汽车科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 小米汽车科技有限公司
Publication of WO2023137908A1

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/14: Digital output to display device; Cooperation and interconnection of the display device with other functional units
    • G06F3/1407: General aspects irrespective of display type, e.g. determination of decimal point position, display with fixed or driving decimal point, suppression of non-significant zeros
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Definitions

  • the present disclosure relates to the technical field of artificial intelligence, and in particular, to a sound recognition method, apparatus, medium, device, program product and vehicle.
  • In the related art, radar or cameras are mainly used to obtain information about surrounding objects, and automatic driving or assisted driving is realized through image processing. Both radar and cameras have detection blind spots, and in some specific scenarios the reliability of environmental detection is low.
  • the present disclosure provides a sound recognition method, apparatus, medium, device, program product and vehicle.
  • a sound recognition method, comprising:
  • the target category is displayed by means of a display device.
  • displaying the target category through a display device includes:
  • the target image is displayed by the display device.
  • the determining the target image corresponding to the target category includes:
  • a target image corresponding to the target category is determined according to a category image correspondence, where the category image correspondence includes a correspondence between the target category and the target image.
  • displaying the target image through the display device includes:
  • the target image is displayed in a preset area of the display device.
  • the display device includes one or more of an exterior rearview mirror, an interior rearview mirror and a central control screen of the vehicle.
  • displaying the target image through the display device includes:
  • the target image is displayed through the exterior mirrors on both sides of the vehicle; or,
  • the target image is displayed through the exterior mirror on the driver's side of the vehicle.
  • the sound detection devices are arranged in the exterior rearview mirrors on any one side or multiple sides of the vehicle, and the ambient sound is the ambient sound around the vehicle.
  • the multiple sound detection devices are respectively arranged in the left exterior rearview mirror and the right exterior rearview mirror of the vehicle.
  • performing classification processing on the environmental sound according to the target sound classification model to obtain the target category corresponding to the environmental sound includes:
  • the ambient sound is input into the target sound classification model to obtain one or more first candidate categories and a first target similarity between the ambient sound and each first candidate category;
  • the target category is determined from the first candidate category according to the first target similarity.
  • the determining the target category from the first candidate category according to the first target similarity includes:
  • the target category is determined according to the second candidate category.
  • the determining the target category according to the second candidate categories includes:
  • the category relationship between each second candidate category and other second candidate categories is determined;
  • the preset category correspondence includes a category relationship between any two second candidate categories, and the category relationship includes a confusion relationship and a similar relationship;
  • the target category is determined according to the second candidate category and the category relationship.
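The candidate-selection logic in the bullets above (keep the first candidate categories that are both in the top N by first target similarity and at or above a similarity threshold, then resolve category relationships among the resulting second candidate categories) can be sketched as follows. This is an illustrative reading, not the claimed implementation: the function name, defaults, and the rule of preferring the higher-similarity member of a confusable pair are assumptions.

```python
def select_target_category(candidates, top_n=3, threshold=0.6, relations=None):
    """Pick a target category from (category, similarity) pairs.

    Second candidates = top-N by similarity AND similarity >= threshold.
    `relations` maps a frozenset of two categories to "confusion" or
    "similar" (an illustrative encoding of the preset correspondence).
    """
    relations = relations or {}
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    second = [(cat, sim) for cat, sim in ranked[:top_n] if sim >= threshold]
    if not second:
        return None
    # Among mutually confusable categories, keep only the most similar one
    # (an assumed resolution rule, used here for illustration).
    kept = []
    for cat, sim in second:
        confused = any(
            relations.get(frozenset((cat, other))) == "confusion" and osim > sim
            for other, osim in second if other != cat
        )
        if not confused:
            kept.append(cat)
    return kept[0] if kept else second[0][0]
```

For example, `select_target_category([("siren", 0.92), ("horn", 0.81), ("wind", 0.40)])` keeps "siren" and "horn" as second candidates and returns the most similar one.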
  • the target sound classification model is obtained by training a target neural network model;
  • the target neural network model is obtained by training a preset neural network model and then performing model compression on the trained preset neural network model.
  • the target sound classification model is obtained by training in the following manner:
  • the target sound classification model is determined.
  • performing model compression on the first undetermined model to obtain a target neural network model includes:
  • the target neural network model is obtained according to a preset number of convolutional layers of the first undetermined model; wherein the preset number is smaller than the total number of convolutional layers of the first undetermined model.
  • the preset training steps include:
  • the model training steps include:
  • When it is determined, according to the sample category and the predicted category, that the trained target model does not meet the preset stop-iteration condition, a target loss value is determined according to the sample category and the predicted category, the parameters of the target model are updated according to the target loss value to obtain a trained target model, and the trained target model is used as a new target model.
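The iterative step above has the familiar shape of a train-until-stop loop: predict, check the stop condition against the sample categories, otherwise compute a loss and update the parameters. A generic sketch, where `predict`, `loss_fn`, `update`, and the all-correct stop condition are placeholders for the disclosure's unspecified choices:

```python
def train_until_stop(model_params, predict, loss_fn, update, samples, max_iters=100):
    """Repeat the model training step until the preset stop-iteration
    condition is met (here assumed to be: every prediction matches its
    sample category) or the iteration budget runs out."""
    for _ in range(max_iters):
        preds = [predict(model_params, s["sound"]) for s in samples]
        labels = [s["category"] for s in samples]
        if all(p == y for p, y in zip(preds, labels)):  # assumed stop condition
            break
        loss = loss_fn(preds, labels)          # target loss value
        model_params = update(model_params, loss)  # trained target model
    return model_params
```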
  • the obtaining the first sample similarity between the sample sound and a plurality of the sample categories includes:
  • the obtaining the first sample similarity between the sample sound and multiple sample categories according to the sample characteristics of multiple periods includes:
  • the first sample similarity between the sample sound and the multiple sample categories is calculated according to the second sample similarities of the multiple sample features.
  • the sample category includes the target category and non-target category.
  • the determining the target sound classification model according to the second undetermined model includes:
  • Model quantization processing is performed on the model parameters of the second undetermined model to obtain the target sound classification model.
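Model quantization of the second undetermined model's parameters typically maps float32 weights to low-bit integers. A minimal symmetric per-tensor int8 sketch (this scale convention is common practice and an assumption here, not taken from the disclosure):

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q."""
    scale = np.max(np.abs(weights)) / 127.0 if np.any(weights) else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from the quantized one."""
    return q.astype(np.float32) * scale
```

Storing int8 weights plus one float scale per tensor shrinks the model roughly fourfold, which supports the vehicle-side deployment goal described below.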
  • a sound recognition apparatus, comprising:
  • a sound collection module configured to collect ambient sound through a sound detection device;
  • the sound classification module is configured to classify the environmental sound according to the target sound classification model to obtain the target category corresponding to the environmental sound;
  • a display module configured to display the target category through a display device.
  • the display module is configured to determine a target image corresponding to the target category; and display the target image through the display device.
  • the display module is configured to determine the target image corresponding to the target category according to the category image correspondence, where the category image correspondence includes the correspondence between the target category and the target image.
  • the display module is configured to display the target image in a preset area of the display device.
  • the display device includes one or more of an exterior rearview mirror, an interior rearview mirror and a central control screen of the vehicle.
  • the display module is configured to display the target image through the exterior rearview mirrors on both sides of the vehicle when a passenger is detected in the passenger seat; or, when no passenger is detected in the passenger seat, display the target image through the exterior rearview mirror on the driver's side of the vehicle.
  • the sound detection devices are arranged in the exterior rearview mirrors on any one side or multiple sides of the vehicle, and the ambient sound is the ambient sound around the vehicle.
  • the multiple sound detection devices are respectively arranged in the left exterior rearview mirror and the right exterior rearview mirror of the vehicle.
  • the sound classification module is configured to input the environmental sound into the target sound classification model to obtain one or more first candidate categories, and a first target similarity between the environmental sound and each first candidate category; and determine the target category from the first candidate categories according to the first target similarity.
  • the sound classification module is configured to take, as second candidate categories, the first candidate categories whose first target similarity both ranks in the top N when sorted from largest to smallest and is greater than or equal to a preset similarity threshold; and determine the target category according to the second candidate categories.
  • the sound classification module is configured to determine the category relationship between each second candidate category and other second candidate categories according to the preset category correspondence; the preset category correspondence includes category relationships between any two second candidate categories, and the category relationships include confusion relationships and similar relationships; according to the second candidate categories and category relationships, determine the target category.
  • the target sound classification model is obtained by training a target neural network model;
  • the target neural network model is obtained by training a preset neural network model and then performing model compression on the trained preset neural network model.
  • the device further includes a model training module; the model training module is configured to:
  • the target sound classification model is determined.
  • the model training module is configured to acquire the target neural network model according to a preset number of convolutional layers of the first undetermined model; wherein the preset number is less than the total number of convolutional layers of the first undetermined model.
  • the preset training steps include:
  • the model training steps include:
  • When it is determined, according to the sample category and the predicted category, that the trained target model does not meet the preset stop-iteration condition, a target loss value is determined according to the sample category and the predicted category, the parameters of the target model are updated according to the target loss value to obtain a trained target model, and the trained target model is used as a new target model.
  • the model training module is configured to perform feature extraction on the sample sound according to a preset cycle to obtain sample features of multiple cycles; according to the sample features of multiple cycles, obtain the first sample similarity between the sample sound and multiple sample categories.
  • the model training module is configured to obtain a first feature code corresponding to the sample feature for each period of the sample feature; and obtain a second sample similarity between the sample feature and multiple sample categories according to the first feature code; and calculate a first sample similarity between the sample sound and multiple sample categories according to the second sample similarity of multiple sample features.
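The per-period aggregation described above can be sketched as follows: each period's sample feature is encoded to a first feature code, compared against category representations to get second sample similarities, and the per-period results are aggregated into the clip-level first sample similarity. Cosine similarity, mean aggregation, and the category-embedding representation are illustrative assumptions.

```python
import numpy as np

def first_sample_similarity(period_features, category_embeddings, encode):
    """Aggregate per-period (second) similarities into the clip-level
    (first) similarity vector over the sample categories."""
    sims = []
    for feat in period_features:
        code = encode(feat)  # first feature code for this period
        per_cat = [np.dot(code, e) / (np.linalg.norm(code) * np.linalg.norm(e) + 1e-10)
                   for e in category_embeddings]
        sims.append(per_cat)                 # second sample similarities
    return np.mean(np.array(sims), axis=0)   # first sample similarity
```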
  • the sample category includes the target category and non-target category.
  • the model training module is configured to perform model quantization processing on model parameters of the second undetermined model to obtain the target sound classification model.
  • an electronic device including:
  • memory for storing processor-executable instructions
  • the processor is configured to execute the steps of the sound recognition method provided in the first aspect of the present disclosure.
  • a non-transitory computer-readable storage medium on which computer program instructions are stored, where the program instructions, when executed by a processor, implement the steps of the sound recognition method provided in the first aspect of the present disclosure.
  • a vehicle is provided, and the vehicle includes the electronic device provided in the third aspect of the present disclosure.
  • a computer program product includes a computer program executable by a programmable device, and the computer program has a code portion for executing the steps of the sound recognition method provided in the first aspect of the present disclosure when executed by the programmable device.
  • the technical solutions provided by the embodiments of the present disclosure may include the following beneficial effects: collecting environmental sounds through the sound detection device; classifying the environmental sounds according to the target sound classification model to obtain the target category corresponding to the environmental sound; displaying the target category through the display device.
  • comprehensive and accurate recognition and classification of surrounding environmental objects can be realized through sound recognition of environmental sounds, thereby solving the problem of blind spots in camera or radar detection, and improving the reliability of object detection.
  • the complexity of the target sound classification model can be reduced through model compression, and at the same time, the sound classification accuracy of the trained target sound classification model can be guaranteed through two training sessions, so that the target sound classification model can be deployed on the vehicle side or device side, improving the timeliness of sound recognition and classification.
  • Fig. 1 is a flowchart of a sound recognition method according to an exemplary embodiment.
  • Fig. 2 is a schematic diagram showing a sound detection device arranged on a vehicle exterior mirror according to an exemplary embodiment.
  • Fig. 3 is a flowchart showing a method for training a target sound classification model according to an exemplary embodiment.
  • Fig. 4 is a flowchart of step S102 according to the embodiment shown in Fig. 1.
  • Fig. 5 is a block diagram of a sound recognition apparatus according to an exemplary embodiment.
  • Fig. 6 is a block diagram of another sound recognition apparatus according to an exemplary embodiment.
  • Fig. 7 is a block diagram of an electronic device according to an exemplary embodiment.
  • Fig. 8 is a block diagram of a vehicle according to an exemplary embodiment.
  • the present disclosure can be applied to sound recognition scenarios, such as automatic driving or assisted driving of vehicles based on sound recognition, smart home monitoring, health detection, screening of defective products on machine production lines, fault detection of industrial equipment, and other scenarios.
  • In the related art, radar or cameras are mainly used to obtain information about surrounding objects, and automatic driving or assisted driving is realized through image processing. Both radar and cameras have detection blind spots.
  • For example, lidar can locate objects within a few meters of the vehicle body, but cannot locate distant moving objects beyond that range. When body-mounted cameras are used for visual recognition, there are also blind spots: distant video may be too blurry to recognize, and when a camera is blocked, it is difficult to accurately identify surrounding objects through the camera.
  • the present disclosure takes the application scenario of vehicle automatic driving as an example, but is not limited to this application scenario.
  • the method provided by the present disclosure can be used in scenarios such as sound recognition-based smart home monitoring, health detection, defective product screening on machine production lines, and industrial equipment fault detection.
  • the present disclosure provides a sound recognition method, apparatus, medium, device, program product, and vehicle, which can collect environmental sounds through a sound detection device and classify the environmental sounds according to the target sound classification model to obtain the target category corresponding to the environmental sound, thereby solving the problem of blind spots in camera or radar detection and improving the reliability of object detection.
  • Fig. 1 is a flowchart of a sound recognition method according to an exemplary embodiment. As shown in Fig. 1, the method may include:
  • S101 Collect ambient sound through a sound detection device.
  • the sound detection device may include one or more sound sensors, such as an electrodynamic microphone, a condenser microphone, or a MEMS (Micro-Electro-Mechanical System) microphone.
  • the installation position of the sound detection device can be different.
  • In the scenario of vehicle automatic driving or assisted driving, the sound detection device can be installed at any one or more positions outside the vehicle body, such as on both sides of the body, at the windows on both sides, at the front or rear of the vehicle, on the roof, or in the exterior rearview mirrors of the vehicle.
  • the sound detection device can collect the ambient sound around the vehicle.
  • the sound detection device can be installed in each room in the family, and the sound detection device can collect the ambient sound of each room.
  • the target sound classification model may be obtained after training a general sound classification model according to sample sounds.
  • the display device may include an image display device (such as a display screen), and a sound display device (such as a buzzer or a sounder).
  • the display device may include one or more of the vehicle's exterior rearview mirror, interior rearview mirror, and central control screen, through which target images corresponding to the target category can be displayed, so as to prompt the user that the target category appears in the environment.
  • the exterior mirrors may include two exterior mirrors on both sides of the vehicle.
  • the display device may also include a car audio device, through which a target sound corresponding to the target category can be played.
  • the environmental sound is collected by the sound detection device; the environmental sound is classified according to the target sound classification model to obtain the target category corresponding to the environmental sound, and the target category is displayed by the display device.
  • comprehensive and accurate recognition and classification of surrounding environmental objects can be realized through sound recognition of environmental sounds, thereby solving the problem of blind spots in camera or radar detection, and improving the reliability of object detection.
  • step S103 may display the target category in the following manner:
  • the target image corresponding to the target category may be determined according to the category image correspondence, where the category image correspondence includes the correspondence between the target category and the target image.
  • the target image corresponding to the target category "person" is a "humanoid image";
  • the target image corresponding to the target category "animal" is a "quadruped image";
  • the target image corresponding to the target category "ambulance" is an "ambulance vehicle image"; and so on.
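The category image correspondence amounts to a lookup table. A minimal sketch using the example pairs from the text (the identifier names and the fallback image are assumptions):

```python
# Category image correspondence: target category -> target image.
CATEGORY_IMAGE_MAP = {
    "person": "humanoid_image",
    "animal": "quadruped_image",
    "ambulance": "ambulance_vehicle_image",
}

def target_image_for(category, fallback="generic_alert_image"):
    """Resolve the display image for a recognized target category."""
    return CATEGORY_IMAGE_MAP.get(category, fallback)
```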
  • the target image may be displayed in a preset area of the display device.
  • the target image can be displayed in preset areas of the exterior rearview mirrors on any one or multiple sides.
  • the preset area may be a side area, for example, the preset area may be one or more of the upper side area, lower side area, left side area or right side area of the exterior mirror.
  • the target image corresponding to the target category can be displayed, thereby accurately prompting the user that the target category appears, so as to assist the user to carry out corresponding emergency treatment.
  • the display device includes multiple exterior mirrors, interior mirrors and central control screens of the vehicle, the multiple display devices can serve as backups for each other to improve the reliability of image display and avoid failure to display the target image due to a failure of a certain display device.
  • the target image can be displayed in the following manner:
  • the target image is displayed through the exterior mirror on the driver's side of the vehicle.
  • the target image can be displayed on the corresponding exterior mirror according to the vehicle passenger condition, so as to prompt the user to detect the corresponding target category.
  • There may be one or more sound detection devices, and the sound detection devices may be installed in the exterior rearview mirrors on any one side or multiple sides of the vehicle; the environmental sound is the environmental sound around the vehicle.
  • the sound detection device may include one or more sound sensors, such as electrodynamic microphones, condenser microphones, or MEMS microphones.
  • In this way, the ambient sound outside the vehicle can be collected conveniently without affecting the appearance of the vehicle; at the same time, the sound detection device can be protected from sun and rain damage, extending its service life.
  • the multiple sound detection devices can be respectively arranged in the left side exterior mirror and the right side exterior mirror of the vehicle.
  • the left exterior mirror can be the exterior mirror on the driver's side, and the right exterior mirror can be the exterior mirror on the passenger's seat side; or vice versa, the left exterior mirror can be the exterior mirror on the passenger's seat side, and the right exterior mirror can be the exterior mirror on the driver's side.
  • The present disclosure does not limit this.
  • the sound intensities on both sides of the vehicle can be respectively detected by a plurality of sound detection devices, so that the source direction of the ambient sound can be determined.
  • the target mirror can also be determined according to the source direction, and the target image can be displayed through the target mirror.
  • the sound intensity detected by the sound detection device of the left exterior mirror is greater than the sound intensity detected by the sound detection device of the right exterior mirror, it can be determined that the source direction of the ambient sound is the left side of the vehicle, the left exterior mirror is used as the target mirror, and the target image is displayed on the target mirror.
  • the sound intensity detected by the sound detection device of the right exterior mirror is greater than the sound intensity detected by the sound detection device of the left exterior mirror, it can be determined that the source direction of the ambient sound is the right side of the vehicle, and the right exterior mirror is used as the target mirror, and the target image is displayed on the target mirror.
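The left/right intensity comparison in the two bullets above reduces to a simple rule for choosing the target mirror. A sketch (the behavior for equal intensities is an assumption, since the text does not specify it):

```python
def sound_source_side(left_intensity, right_intensity):
    """Choose the mirror that should display the alert, based on which
    side's sound detection device measures the higher intensity."""
    if left_intensity > right_intensity:
        return "left"
    if right_intensity > left_intensity:
        return "right"
    return "both"  # equal intensities: assumed to display on both mirrors
```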
  • In this way, the exterior mirror that displays the target image can prompt the user about the direction from which the target category may appear.
  • a plurality of sound detection devices may be respectively arranged in the left exterior rearview mirror and the right exterior rearview mirror of the vehicle.
  • Shockproof treatment and/or waterproof treatment may be performed on each sound detection device.
  • For example, rubber sleeves can be wrapped around each sound detection device for waterproofing and shock absorption, and to reduce wind noise.
  • the sound detection device may include a data interface, which may be connected to the vehicle-mounted sound module of the vehicle through a cable, so as to transmit the detected environmental sound to the vehicle-mounted sound module through the data interface, so that the vehicle-mounted sound module can classify the environmental sound.
  • the sound detection device may further include a power interface and a clock interface, and the power interface and clock interface may also be connected to an in-vehicle module of the vehicle through a cable, so as to provide power and a clock to the sound detection device.
  • the above-mentioned target sound classification model may be obtained after training according to the target neural network model.
  • the target neural network model is a model obtained by training a preset neural network model and performing model compression on the trained preset neural network model.
  • In the related art, artificial intelligence models for classifying and recognizing sounds may use large-scale, complex neural network models.
  • Such a large-scale model needs to run on a server and places high hardware requirements on it. It can be deployed on a cloud server, but it is difficult to deploy on the vehicle or device side.
  • Moreover, the real-time performance of cloud-based sound recognition is insufficient, which affects the reliability and timeliness of sound-assisted automatic driving functions.
  • the target sound classification model is obtained after training according to a target neural network model, and the target neural network model is a model obtained by training a preset neural network model and performing model compression on the trained preset neural network model. In this way, the complexity of the target sound classification model can be reduced, and the dependence of the model on hardware can be reduced, so that the target sound classification model can be deployed to the vehicle end or the device end.
  • the environmental sound is collected by the sound detection device; the environmental sound is classified and processed according to the target sound classification model, and the target category corresponding to the environmental sound is obtained; wherein, the target sound classification model is obtained after training according to the target neural network model, and the target neural network model is a model obtained by training a preset neural network model and performing model compression on the trained preset neural network model.
  • the complexity of the target sound classification model can be reduced through model compression, and at the same time, the sound classification accuracy of the trained target sound classification model can be guaranteed through two training sessions, so that the target sound classification model can be deployed on the vehicle side or device side, improving the timeliness of sound recognition and classification.
  • Fig. 3 is a flowchart of a method for training a target sound classification model according to an exemplary embodiment. As shown in Fig. 3, the method for training may include:
  • Sample sounds for training can be obtained from a public sound database, and each sample sound can be labeled with a sample category; video data can also be obtained from a video database, audio data can be extracted from the video data as sample sounds, and each sample sound can then be labeled with a sample category.
  • the sample category may be a category of sounds, such as alarm sounds, human voices, crying sounds, vehicle horn sounds, vehicle emergency brake sounds, and the like.
  • the sample category may be a strong label or a weak label of the sample sound.
  • When the sample category is a strong label, both the sample category appearing in the sample sound and its start and end times need to be labeled; when the sample category is a weak label, only the sample categories appearing in the sample sound need to be labeled, without specifying start and end times.
  • weak labels can reduce the workload of manual labeling and improve the efficiency of sample acquisition.
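The difference between strong and weak labels can be made concrete with two example records (the field names and clip names are purely illustrative):

```python
# Strong label: category plus start/end time within the clip (seconds).
strong_label = {
    "clip": "clip_001.wav",
    "events": [{"category": "siren", "onset": 1.2, "offset": 4.7}],
}

# Weak label: only the categories present, with no timing information.
weak_label = {"clip": "clip_002.wav", "categories": ["horn", "crying"]}
```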
  • the above sample categories may include target categories and non-target categories.
  • the target categories include the multiple target categories obtained by processing the ambient sound through the target sound classification model, and the non-target categories represent categories other than the above target categories, that is, categories not output by the target sound classification model.
  • the target sound classification model is used in a vehicle automatic driving scene. It is expected that the output target categories after classifying environmental sounds may include "alarm sounds, human voices, crying sounds, vehicle horn sounds, and vehicle emergency braking sounds".
  • Without non-target-category samples, the final target sound classification model may, for example, misidentify the sound of wind as a crying sound.
  • By adding sample sounds of non-target categories to training, sounds outside the target categories can be absorbed, so that the trained model learns finer features, the abstraction ability of the model is improved, and the distinction between target categories and non-target categories is sharpened, which also improves the accuracy of sound recognition.
  • sample audio features may include FBANK features, MFCC features, or PNCC features.
  • a 1024-point Fourier transform can be calculated every 20ms, with a window length of 64ms, and then 64 mel filter banks can be used to obtain 64-dimensional FBANK features.
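As a sketch of the FBANK extraction described above, assuming 16 kHz audio (so a 64 ms window is 1024 samples and a 20 ms hop is 320 samples); the triangular mel filter construction here is a standard textbook variant, not taken from the disclosure:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def fbank(signal, sr=16000, n_fft=1024, win_ms=64, hop_ms=20, n_mels=64):
    win = int(sr * win_ms / 1000)   # 64 ms window -> 1024 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)   # 20 ms hop -> 320 samples at 16 kHz
    # Frame the signal and apply a Hann window.
    n_frames = 1 + max(0, len(signal) - win) // hop
    frames = np.stack([signal[i * hop:i * hop + win] for i in range(n_frames)])
    frames = frames * np.hanning(win)
    # 1024-point FFT every 20 ms -> power spectrum (513 bins).
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    # Build 64 triangular mel filters spanning 0 .. sr/2.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fb[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fb[m - 1, k] = (hi - k) / max(hi - c, 1)
    # Log-compress -> one 64-dimensional FBANK vector per 20 ms frame.
    return np.log(power @ fb.T + 1e-10)

feats = fbank(np.random.randn(16000))  # 1 second of audio -> (47, 64) features
```

With 1 second of 16 kHz audio, 47 frames of 64-dimensional FBANK features are produced.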
  • the sample audio features can be input into a preset neural network model to perform a preset training step.
  • the preset neural network model can be a convolutional neural network in the related art
  • the preset training step can be a convolutional neural network training step in the related art.
  • the preset neural network model may be a mobile convolutional neural network model, such as MobileNet, etc.
  • the mobile convolutional neural network may include N layers of convolutional layers, and N may be any positive integer greater than or equal to 5, for example, N may be 10 or 16.
  • the target neural network model can be obtained according to a preset number of convolutional layers of the first undetermined model; wherein, the preset number is less than the total number of convolutional layers of the first undetermined model.
  • assuming the total number of convolutional layers of the first undetermined model is N and the preset number is M, then M is smaller than N; for example, N may be 10 and M may be 5.
  • the parameters of the first M convolutional layers of the trained first undetermined model are used as the initialization parameters of the convolutional layers of the target neural network model; the above preset training step is then executed on the target neural network model to obtain the second undetermined model.
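The truncation-based compression just described (reusing the first M trained convolutional layers as initialization for the smaller target model) can be sketched in a framework-agnostic way; the `convI.weight` parameter-naming scheme here is hypothetical:

```python
def init_from_truncated(first_model_params, n_keep):
    """Reuse the parameters of the first n_keep (M) convolutional layers of
    the trained first undetermined model as initialization for the smaller
    target neural network model; deeper layers are simply dropped."""
    init = {}
    for name, weights in first_model_params.items():
        layer_idx = int(name.split(".")[0].removeprefix("conv"))
        if layer_idx < n_keep:  # only the first M layers carry over
            init[name] = weights
    return init

# N = 10 trained layers; the compressed target model keeps M = 5 of them.
trained = {f"conv{i}.weight": [i] for i in range(10)}
target_init = init_from_truncated(trained, n_keep=5)
```

In a deep-learning framework the same effect is usually obtained by loading a truncated state dictionary non-strictly into the smaller model.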
  • the second undetermined model may be used as the target sound classification model.
  • a preset training step is performed on the preset neural network model to obtain a first undetermined model, and model compression is performed on the first undetermined model to obtain a target neural network model; the preset training step is performed on the target neural network model according to the sample sounds and sample categories to obtain a second undetermined model; and the target sound classification model is determined according to the second undetermined model.
  • the complexity of the target sound classification model obtained after training can be reduced, and the accuracy of the target sound classification model for sound classification can be ensured, so that the obtained target sound classification model is more streamlined and efficient, and the dependence on hardware is reduced, thereby reducing the difficulty of vehicle-side or device-side deployment.
  • the target sound classification model may be obtained after model quantization processing is performed on the model parameters of the second undetermined model.
  • the model quantization process may include model parameter compression.
  • the parameters of the model may be quantized to a preset number of digits, which may be 8 bits or 16 bits.
  • all floating point parameters are quantized and compressed to integer parameters. In this way, the size of the model can be further reduced, and the computing power consumption of the model can be reduced while ensuring that the performance of the model is basically unchanged.
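A minimal sketch of this quantization step, assuming simple uniform single-scale int8 quantization; a production deployment would normally use a framework's quantization toolkit rather than hand-rolled code:

```python
import numpy as np

def quantize_int8(weights):
    # Map float weights onto the int8 grid [-127, 127] with a single scale;
    # the scale is kept so the integers can be dequantized at inference time.
    scale = max(float(np.abs(weights).max()) / 127.0, 1e-12)
    q = np.round(weights / scale).clip(-127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(64, 64).astype(np.float32)  # floating-point parameters
q, scale = quantize_int8(w)                     # 8-bit integer parameters
w_hat = q.astype(np.float32) * scale            # dequantized approximation
```

The rounding error per weight is at most half the scale, which is why performance stays basically unchanged while storage drops to a quarter of 32-bit floats.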
  • the above preset training steps may include the following methods:
  • the model training step is executed cyclically to train the target model until it is determined, according to the sample category and the predicted category, that the trained target model meets the preset iteration-stop condition; the target model includes the preset neural network model or the target neural network model, and the predicted category is the category output after the sample sound is input into the trained target model.
  • the preset iteration stop condition mentioned above may be a common stop iteration condition in the prior art, such as the condition that the similarity difference between the sample category and the predicted category is smaller than the preset similarity difference threshold, which is not limited in the present disclosure.
  • the above model training steps include:
  • feature extraction is performed on the sample sound according to a preset period to obtain sample features for multiple periods; then, according to the sample features of the multiple periods, the first sample similarities between the sample sound and the multiple sample categories are obtained.
  • the above-mentioned sample sound can be any sample audio data longer than 5 seconds
  • the above-mentioned preset period can be any time between 20 milliseconds and 2 seconds.
  • the preset period can be 1 second or 500 milliseconds.
  • the sample audio data can be divided according to the preset period, and feature extraction is performed on the divided audio segments to obtain the sample features of each divided audio segment.
  • obtaining the first sample similarities between the sample sound and the multiple sample categories may use either of the following similarity acquisition method 1 and similarity acquisition method 2, where:
  • the first method for obtaining the similarity may include the following steps:
  • for the sample feature of each period, the first feature code corresponding to the sample feature is acquired.
  • a second feature code of the sample sound is calculated according to the first feature codes of the plurality of sample features.
  • the average of the multiple first feature codes can be used as the second feature code, that is, the embedding layer (Embedding) before the output layer is averaged; then, according to the averaged second feature code, the first sample similarities between the sample sound and the multiple sample categories are obtained.
  • the similarity between the second feature code and the sample feature code corresponding to each sample category may be calculated, and the similarity may be used as the first sample similarity between the sample sound and each sample category.
  • with this method, recognition accuracy is relatively high for environmental sounds whose duration is close to that of the sample sounds.
  • for a model trained with similarity acquisition method 1, sample sounds of a shorter duration can be obtained for training.
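Similarity acquisition method 1 (average the per-period first feature codes into one clip-level second feature code, then score it against each category) can be sketched as follows; the per-category reference codes and cosine scoring are illustrative assumptions:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))

def clip_similarities(first_codes, class_codes):
    """Average the per-period first feature codes into a single second
    feature code for the whole sample sound, then compute its similarity
    to each sample category's reference code."""
    second_code = first_codes.mean(axis=0)  # average the embedding layer
    return {name: cosine(second_code, ref) for name, ref in class_codes.items()}

codes = np.random.randn(5, 128)  # 5 periods of 128-dim first feature codes
refs = {"vehicle horn": np.random.randn(128), "crying": np.random.randn(128)}
sims = clip_similarities(codes, refs)
```

The category with the highest score would then be taken as the first sample similarity winner for this clip.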
  • the second method for obtaining the similarity may include the following steps:
  • a first feature code corresponding to the sample feature is obtained; and according to the first feature code, a second sample similarity between the sample feature and multiple sample categories is obtained.
  • according to the second sample similarities of the multiple sample features, the first sample similarities between the sample sound and the multiple sample categories are calculated.
  • the sample feature of each cycle can be input into the convolutional layer to obtain the first feature code corresponding to the sample feature, calculate the similarity between the first feature code and the sample feature code corresponding to each sample category, and use the similarity as the second sample similarity between the sample feature and each sample category.
  • in this way, the target model can be trained, and the accuracy of the target category obtained after the trained target model recognizes the environmental sound can be improved.
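The cyclic training step above (predict, compare with the sample category, compute a target loss value, update the parameters, repeat until the stop condition) can be sketched with a toy linear classifier standing in for the neural network; the data, model, and learning rate here are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                        # sample features
y = np.argmax(X @ rng.normal(size=(8, 3)), axis=1)   # sample categories (3 classes)

W = np.zeros((8, 3))                                 # toy "target model" parameters
for step in range(500):                              # cyclically execute the training step
    logits = X @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)                # similarities over the sample categories
    loss = -np.log(p[np.arange(len(y)), y] + 1e-12).mean()  # target loss value
    if loss < 0.05:                                  # preset iteration-stop condition
        break
    grad = p
    grad[np.arange(len(y)), y] -= 1.0
    W -= 0.5 * (X.T @ grad) / len(y)                 # update the target model parameters

predicted = np.argmax(X @ W, axis=1)                 # predicted categories
```

The loop structure (loss-based stop condition, gradient update, trained model carried into the next iteration) mirrors the preset training step; a real implementation would use a deep-learning framework and mini-batches.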
  • Fig. 4 is a flowchart of step S102 according to the embodiment shown in Fig. 1. As shown in Fig. 4, the above step S102 may include:
  • the first candidate categories whose first target similarity ranks in the top N in descending order can be used as the target category; the first candidate categories whose first target similarity is greater than or equal to a preset similarity threshold can also be used as the target category; or the first candidate categories that both rank in the top N by first target similarity in descending order and have a first target similarity greater than or equal to the preset similarity threshold can be used as the target category.
  • step S1022 may include:
  • the first candidate categories whose first target similarity ranks in the top N in descending order and is greater than or equal to the preset similarity threshold are used as second candidate categories.
  • the target category is determined according to the second candidate category.
  • the second candidate category may be one or more.
  • the second candidate category may be directly used as the target category.
  • the multiple second candidate categories may be directly used as target categories; or the second candidate category with the highest first target similarity may be used as the target category.
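The combined top-N-plus-threshold selection described above can be sketched as follows; the category names and similarity values are illustrative:

```python
def select_second_candidates(similarities, n=3, threshold=0.5):
    """Keep the first candidate categories that are both in the top N by
    first target similarity (descending) and at or above the preset
    similarity threshold."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, sim in ranked[:n] if sim >= threshold]

sims = {"vehicle horn": 0.92, "crying": 0.61, "wind": 0.40, "human voice": 0.55}
second = select_second_candidates(sims)  # ['vehicle horn', 'crying', 'human voice']
```

Here "wind" is dropped twice over: it is outside the top 3 and below the 0.5 threshold.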
  • the target category can also be determined in the following manner:
  • according to a preset category correspondence, the category relationship between each second candidate category and the other second candidate categories is determined.
  • the preset category correspondence relationship includes a category relationship between any two second candidate categories, and the category relationship includes a confusion relationship and a similar relationship.
  • the confusion relationship characterizes confusable categories that are not of the same kind, such as "wind" and "crying"; the similar relationship characterizes categories belonging to the same scene, such as "crying" and "human voice".
  • the target category is determined according to the second candidate category and the category relationship.
  • the multiple second candidate categories may be directly used as target categories, or the second candidate category with the highest first target similarity may be used as the target category.
  • alternatively, the confusion coefficient of the multiple second candidate categories can be calculated; when the confusion coefficient is less than or equal to a preset confusion threshold, the multiple second candidate categories are used as target categories, or the second candidate category with the highest first target similarity is used as the target category; when the confusion coefficient is greater than the preset confusion threshold, no target category may be output.
  • the confusion coefficient may represent the proportion of confusion relationships among the total number of category relationships of the multiple second candidate categories.
  • for example, if there are four second candidate categories and there is a category relationship between every two of them, the total number of category relationships is 6; if 3 of those category relationships are confusion relationships, the confusion coefficient is 0.5.
  • the aforementioned preset confusion threshold may be 0.7; in that case the confusion coefficient is smaller than the preset confusion threshold, so the multiple second candidate categories may be used as target categories, or the second candidate category with the highest first target similarity may be used as the target category.
  • the recognition accuracy of the model can be determined according to the confusion relationship of the identified candidate categories, and when the confusion relationship satisfies the preset condition (the confusion coefficient is less than or equal to the preset confusion threshold), it is determined that the recognition accuracy of the model meets the condition, so that the obtained target category is more accurate.
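The confusion-coefficient check above can be sketched as follows; the set of confusion pairs stands in for the preset category correspondence and is hypothetical:

```python
from itertools import combinations

def confusion_coefficient(candidates, confusion_pairs):
    """Proportion of pairwise category relationships, among all second
    candidate categories, that are confusion relationships."""
    pairs = list(combinations(sorted(candidates), 2))
    confused = sum(1 for pair in pairs if frozenset(pair) in confusion_pairs)
    return confused / len(pairs)

confusions = {frozenset(p) for p in [("wind", "crying"), ("wind", "human voice")]}
coef = confusion_coefficient(["wind", "crying", "human voice", "vehicle horn"], confusions)
# 4 categories -> 6 pairwise relationships, 2 of which are confusion relationships
```

If `coef` is at or below the preset confusion threshold (e.g. 0.7), the candidates may be output as target categories; otherwise no target category is output.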
  • Fig. 5 is a block diagram of a sound recognition device 500 according to an exemplary embodiment. As shown in Fig. 5, the device 500 may include:
  • the sound collection module 501 is configured to collect environmental sounds through the sound detection device
  • the sound classification module 502 is configured to classify the environmental sound according to the target sound classification model to obtain the target category corresponding to the environmental sound;
  • the display module 503 is configured to display the target category through a display device.
  • the display module 503 is configured to determine the target image corresponding to the target category; and display the target image through the display device.
  • the display module 503 is configured to determine the target image corresponding to the target category according to a category-image correspondence, where the category-image correspondence includes the correspondence between the target category and the target image.
  • the display module 503 is configured to display the target image in a preset area of the display device.
  • the display device includes one or more of the vehicle's exterior reflector, interior reflector and central control screen.
  • the display module 503 is configured to display the target image through the exterior reflectors on both sides of the vehicle when a passenger is detected in the front passenger seat; or to display the target image through the exterior reflector on the driver's side of the vehicle when no passenger is detected in the front passenger seat.
  • the sound detection devices are arranged in the exterior reflectors on any one side or multiple sides of the vehicle, and the ambient sound is the ambient sound around the vehicle.
  • the multiple sound detection devices are respectively arranged in the left exterior mirror and the right exterior mirror of the vehicle.
  • the sound classification module 502 is configured to input the environmental sound into the target sound classification model to obtain one or more first candidate categories, and a first target similarity between the environmental sound and each first candidate category; and determine the target category from the first candidate categories according to the first target similarity.
  • the sound classification module 502 is configured to take, as second candidate categories, the first candidate categories whose first target similarity ranks in the top N in descending order and is greater than or equal to a preset similarity threshold; and to determine the target category according to the second candidate categories.
  • the sound classification module 502 is configured to determine a category relationship between each second candidate category and other second candidate categories according to a preset category correspondence; the preset category correspondence includes a category relationship between any two second candidate categories, and the category relationship includes a confusion relationship and a similar relationship; according to the second candidate category and the category relationship, determine the target category.
  • the target sound classification model is obtained after training according to a target neural network model
  • the target neural network model is a model obtained by training a preset neural network model and performing model compression on the trained preset neural network model.
  • Fig. 6 is a block diagram of another sound recognition device according to an exemplary embodiment. As shown in Fig. 6, the device may further include a model training module 601, and the model training module 601 is configured to:
  • acquire multiple sample sounds for training and the sample category corresponding to each sample sound; perform the preset training step on a preset neural network model according to the sample sounds and sample categories to obtain a first undetermined model; perform model compression on the first undetermined model to obtain a target neural network model; perform the preset training step on the target neural network model according to the sample sounds and sample categories to obtain a second undetermined model; and determine the target sound classification model according to the second undetermined model.
  • the model training module 601 is configured to acquire the target neural network model according to a preset number of convolutional layers of the first undetermined model; wherein the preset number is less than the total number of convolutional layers of the first undetermined model.
  • the model training module 601 is configured to cyclically execute the model training step to train the target model until it is determined according to the sample category and the predicted category that the trained target model meets the preset stop iteration condition, the target model includes a preset neural network model or the target neural network model, and the predicted category is the output category of the sample sound input to the trained target model;
  • the model training steps include:
  • obtaining first sample similarities between the sample sound and the multiple sample categories;
  • determining, according to the first sample similarities, the predicted category corresponding to the sample sound from the multiple sample categories; and
  • when it is determined according to the sample category and the predicted category that the trained target model does not satisfy the preset iteration-stop condition, determining a target loss value according to the sample category and the predicted category, updating the parameters of the target model according to the target loss value to obtain a trained target model, and taking the trained target model as a new target model.
  • the model training module 601 is configured to perform feature extraction on the sample sound according to a preset cycle to obtain sample features of multiple cycles; according to the sample features of multiple cycles, obtain the similarity between the sample sound and multiple first samples of the sample category.
  • the model training module 601 is configured to obtain a first feature code corresponding to the sample feature for each period of the sample feature; and according to the first feature code, obtain a second sample similarity between the sample feature and a plurality of sample categories; and calculate and obtain a first sample similarity between the sample sound and multiple sample categories according to the second sample similarity of the multiple sample features.
  • the sample category includes the target category and non-target categories.
  • model training module 601 is configured to perform model quantization processing on model parameters of the second undetermined model to obtain the target sound classification model.
  • the device in the above embodiments of the present disclosure is used to collect environmental sounds through the sound detection device; classify the environmental sounds according to the target sound classification model to obtain the target category corresponding to the environmental sound; wherein, the target sound classification model is obtained after training according to the target neural network model, and the target neural network model is a model obtained by training a preset neural network model and performing model compression on the trained preset neural network model.
  • the target sound classification model is obtained after training according to the target neural network model
  • the target neural network model is a model obtained by training a preset neural network model and performing model compression on the trained preset neural network model.
  • the complexity of the target sound classification model can be reduced through model compression, and at the same time, the sound classification accuracy of the trained target sound classification model can be guaranteed through two training sessions, so that the target sound classification model can be deployed on the vehicle side or device side, improving the timeliness of sound recognition and classification.
  • the present disclosure also provides a computer-readable storage medium on which computer program instructions are stored; when the program instructions are executed by a processor, the steps of the sound recognition method provided in the present disclosure are implemented.
  • Fig. 7 is a block diagram of an electronic device 900 according to an exemplary embodiment.
  • the electronic device 900 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, a router, a vehicle terminal, and the like.
  • electronic device 900 may include one or more of the following components: processing component 902, memory 904, power component 906, multimedia component 908, audio component 910, input/output (I/O) interface 912, sensor component 914, and communication component 916.
  • the processing component 902 generally controls the overall operations of the electronic device 900, such as those associated with display, telephone calls, data communications, camera operations, and recording operations.
  • the processing component 902 may include one or more processors 920 to execute instructions to complete all or part of the steps of the above sound recognition method.
  • processing component 902 may include one or more modules that facilitate interaction between processing component 902 and other components.
  • the processing component 902 may include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
  • the memory 904 is configured to store various types of data to support operations at the electronic device 900 . Examples of such data include instructions for any application or method operating on the electronic device 900, contact data, phonebook data, messages, pictures, videos, and the like.
  • Memory 904 may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk.
  • the power component 906 provides power to various components of the electronic device 900 .
  • Power components 906 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for electronic device 900 .
  • the multimedia component 908 includes a screen providing an output interface between the electronic device 900 and the user.
  • the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user.
  • the touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense a boundary of a touch or swipe action, but also detect duration and pressure associated with the touch or swipe action.
  • the multimedia component 908 includes a front camera and/or a rear camera. When the electronic device 900 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera can be a fixed optical lens system or have focal length and optical zoom capability.
  • the audio component 910 is configured to output and/or input audio signals.
  • the audio component 910 includes a microphone (MIC), which is configured to receive external audio signals when the electronic device 900 is in operation modes, such as call mode, recording mode and voice recognition mode. Received audio signals may be further stored in memory 904 or sent via communication component 916 .
  • the audio component 910 also includes a speaker for outputting audio signals.
  • the I/O interface 912 provides an interface between the processing component 902 and a peripheral interface module.
  • the peripheral interface module may be a keyboard, a click wheel, a button, and the like. These buttons may include, but are not limited to: a home button, volume buttons, start button, and lock button.
  • Sensor assembly 914 includes one or more sensors for providing various aspects of status assessment for electronic device 900 .
  • the sensor component 914 can detect the open/closed state of the electronic device 900 and the relative positioning of components, such as the display and keypad of the electronic device 900; the sensor component 914 can also detect a position change of the electronic device 900 or one of its components, the presence or absence of user contact with the electronic device 900, the orientation or acceleration/deceleration of the electronic device 900, and temperature changes of the electronic device 900.
  • Sensor assembly 914 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact.
  • Sensor assembly 914 may also include an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
  • the sensor component 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
  • the communication component 916 is configured to facilitate wired or wireless communication between the electronic device 900 and other devices.
  • the electronic device 900 can access a wireless network based on a communication standard, such as Wi-Fi, 2G, 3G, 4G, 5G, NB-IoT, eMTC, or other standards such as 6G, or a combination thereof.
  • the communication component 916 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 916 also includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
  • the electronic device 900 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the above sound recognition method.
  • a non-transitory computer-readable storage medium including instructions is also provided, such as the memory 904 including instructions, which can be executed by the processor 920 of the electronic device 900 to implement the above sound recognition method.
  • the non-transitory computer readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, and the like.
  • a computer program product is also provided, comprising a computer program executable by a programmable device, the computer program having code portions for performing the steps of the above sound recognition method when executed by the programmable device.
  • FIG. 8 is a block diagram of a vehicle according to an exemplary embodiment. As shown in FIG. 8, the vehicle may include the above-mentioned electronic device 900.


Abstract

The present disclosure relates to a sound recognition method, apparatus, medium, device, program product, and vehicle. The method includes: collecting an environmental sound through a sound detection device; classifying the environmental sound according to a target sound classification model to obtain a target category corresponding to the environmental sound; and displaying the target category through a display device. In this way, objects in the surrounding environment can be comprehensively and accurately recognized and classified by performing sound recognition on environmental sounds, which improves the reliability of object detection.

Description

Sound recognition method, apparatus, medium, device, program product, and vehicle
CROSS-REFERENCE TO RELATED APPLICATIONS
The present disclosure claims priority to Chinese Patent Application No. 202210055284.9, entitled "Sound Recognition Method, Apparatus, Storage Medium, Electronic Device, and Vehicle", filed with the China National Intellectual Property Administration on January 18, 2022, the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
The present disclosure relates to the field of artificial intelligence technology, and in particular, to a sound recognition method, apparatus, medium, device, program product, and vehicle.
BACKGROUND
With the application of artificial intelligence technology in the vehicle field, automatic driving and assisted driving technologies have developed widely. In the related art, information about surrounding objects is mainly acquired by radar or cameras, and automatic or assisted driving is implemented through image processing. However, both radar and cameras have detection blind spots, so environment detection reliability is low in certain scenarios.
SUMMARY
To overcome the above problems in the related art, the present disclosure provides a sound recognition method, apparatus, medium, device, program product, and vehicle.
According to a first aspect of the embodiments of the present disclosure, a sound recognition method is provided, the method including:
collecting an environmental sound through a sound detection device;
classifying the environmental sound according to a target sound classification model to obtain a target category corresponding to the environmental sound; and
displaying the target category through a display device.
Optionally, displaying the target category through the display device includes:
determining a target image corresponding to the target category; and
displaying the target image through the display device.
Optionally, determining the target image corresponding to the target category includes:
determining the target image corresponding to the target category according to a category-image correspondence, where the category-image correspondence includes the correspondence between the target category and the target image.
Optionally, displaying the target image through the display device includes:
displaying the target image in a preset area of the display device.
Optionally, the display device includes one or more of an exterior mirror, an interior mirror, and a central control screen of the vehicle.
Optionally, in the case that the display device includes an exterior mirror of the vehicle, displaying the target image through the display device includes:
displaying the target image through the exterior mirrors on both sides of the vehicle when a passenger is detected in the front passenger seat; or
displaying the target image through the exterior mirror on the driver's side of the vehicle when no passenger is detected in the front passenger seat.
Optionally, there are one or more sound detection devices, the sound detection devices are arranged in the exterior mirror on any one or more sides of the vehicle, and the environmental sound is the environmental sound around the vehicle.
Optionally, in the case of multiple sound detection devices, the multiple sound detection devices are respectively arranged in the left exterior mirror and the right exterior mirror of the vehicle.
Optionally, classifying the environmental sound according to the target sound classification model to obtain the target category corresponding to the environmental sound includes:
inputting the environmental sound into the target sound classification model to obtain one or more first candidate categories and a first target similarity between the environmental sound and each first candidate category; and
determining the target category from the first candidate categories according to the first target similarity.
Optionally, determining the target category from the first candidate categories according to the first target similarity includes:
taking, as second candidate categories, the first candidate categories whose first target similarity ranks in the top N in descending order and whose first target similarity is greater than or equal to a preset similarity threshold; and
determining the target category according to the second candidate categories.
Optionally, there are multiple second candidate categories, and determining the target category according to the second candidate categories includes:
determining a category relationship between each second candidate category and the other second candidate categories according to a preset category correspondence, where the preset category correspondence includes a category relationship between any two second candidate categories, and the category relationship includes a confusion relationship and a same-kind relationship; and
determining the target category according to the second candidate categories and the category relationships.
Optionally, the target sound classification model is obtained by training a target neural network model, and the target neural network model is a model obtained by training a preset neural network model and performing model compression on the trained preset neural network model.
Optionally, the target sound classification model is trained in the following manner:
acquiring multiple sample sounds for training and a sample category corresponding to each sample sound;
performing a preset training step on a preset neural network model according to the sample sounds and the sample categories to obtain a first undetermined model;
performing model compression on the first undetermined model to obtain a target neural network model;
performing the preset training step on the target neural network model according to the sample sounds and the sample categories to obtain a second undetermined model; and
determining the target sound classification model according to the second undetermined model.
Optionally, performing model compression on the first undetermined model to obtain the target neural network model includes:
obtaining the target neural network model according to a preset number of convolutional layers of the first undetermined model, where the preset number is less than the total number of convolutional layers of the first undetermined model.
Optionally, the preset training step includes:
cyclically executing a model training step to train a target model until it is determined, according to the sample category and a predicted category, that the trained target model satisfies a preset iteration-stop condition, where the target model includes the preset neural network model or the target neural network model, and the predicted category is the category output after the sample sound is input into the trained target model;
the model training step includes:
obtaining first sample similarities between the sample sound and the multiple sample categories;
determining, according to the first sample similarities, the predicted category corresponding to the sample sound from the multiple sample categories; and
in the case that it is determined according to the sample category and the predicted category that the trained target model does not satisfy the preset iteration-stop condition, determining a target loss value according to the sample category and the predicted category, updating the parameters of the target model according to the target loss value to obtain a trained target model, and taking the trained target model as a new target model.
Optionally, obtaining the first sample similarities between the sample sound and the multiple sample categories includes:
performing feature extraction on the sample sound according to a preset period to obtain sample features for multiple periods; and
obtaining the first sample similarities between the sample sound and the multiple sample categories according to the sample features of the multiple periods.
Optionally, obtaining the first sample similarities between the sample sound and the multiple sample categories according to the sample features of the multiple periods includes:
for the sample feature of each period, obtaining a first feature code corresponding to the sample feature, and obtaining second sample similarities between the sample feature and the multiple sample categories according to the first feature code; and
calculating the first sample similarities between the sample sound and the multiple sample categories according to the second sample similarities of the multiple sample features.
Optionally, the sample categories include the target category and non-target categories.
Optionally, determining the target sound classification model according to the second undetermined model includes:
performing model quantization on the model parameters of the second undetermined model to obtain the target sound classification model.
根据本公开实施例的第二方面,提供一种声音识别装置,所述装置包括:
声音采集模块,被配置为通过声音检测装置采集环境声音;
声音分类模块,被配置为根据目标声音分类模型对所述环境声音进行分类处理,得到所述环境声音对应的目标类别;
展示模块,被配置为通过展示装置展示所述目标类别。
可选地,所述展示模块,被配置为确定所述目标类别对应的目标图像;通过所述展示装置展示所述目标图像。
可选地,所述展示模块,被配置为根据类别图像对应关系,确定所述目标类别对应的目标图像,所述类别图像对应关系包括所述目标类别与所述目标图像的对应关系。
可选地,所述展示模块,被配置为在所述展示装置的预设区域展示所述目标图像。
可选地,所述展示装置包括车辆的外反光镜、内反光镜和中控屏幕中的一种或多种。
可选地,在所述展示装置包括车辆的外反光镜的情况下,所述展示模块,被配置为在检测到副驾驶座位有乘客乘坐的情况下,通过车辆两侧的外反光镜展示所述目标图像;或者,在未检测到副驾驶座位有乘客乘坐的情况下,通过车辆驾驶员侧的外反光镜展示所述目标图像。
可选地,所述声音检测装置为一个或多个,所述声音检测装置设置在车辆的任意一侧或多侧的外反光镜内,所述环境声音为所述车辆周边的环境声音。
可选地,在所述声音检测装置为多个的情况下,所述多个声音检测装置分别设置在车辆的左侧外反光镜内和右侧外反光镜内。
可选地,所述声音分类模块,被配置为将所述环境声音输入所述目标声音分类模型,得到一个或多个第一候选类别,以及所述环境声音与每个第一候选类别的第一目标相似度;根据所述第一目标相似度从所述第一候选类别中确定所述目标类别。
可选地,所述声音分类模块,被配置为将所述第一目标相似度从大到小排名前N位,且所述第一目标相似度大于或等于预设相似度阈值的第一候选类别,作为第二候选类别;根据所述第二候选类别确定所述目标类别。
可选地,所述第二候选类别为多个,所述声音分类模块,被配置为根据预设类别对应关系,确定每个第二候选类别与其他第二候选类别之间的类别关系;所述预设类别对应关系包括任意两个第二候选类别之间的类别关系,所述类别关系包括混淆关系和同类关系;根据所述第二候选类别和类别关系,确定所述目标类别。
可选地,所述目标声音分类模型为根据目标神经网络模型训练后得到的,所述目标神经网络模型是对预设神经网络模型进行训练,并对训练后的预设神经网络模型进行模型压缩得到的模型。
可选地,所述装置还包括模型训练模块;所述模型训练模块,被配置为:
获取用于训练的多个样本声音和每个所述样本声音对应的样本类别;
根据所述样本声音和所述样本类别对预设神经网络模型执行预设训练步骤,得到第一待定模型;
对所述第一待定模型进行模型压缩,得到目标神经网络模型;
根据所述样本声音和所述样本类别对所述目标神经网络模型执行所述预设训练步骤,得到第二待定模型;
根据所述第二待定模型,确定所述目标声音分类模型。
可选地,所述模型训练模块,被配置为根据所述第一待定模型的预设数目个卷积层,获取所述目标神经网络模型;其中,所述预设数目小于所述第一待定模型的卷积层总层数。
可选地,所述预设训练步骤包括:
循环执行模型训练步骤对目标模型进行训练,直至根据所述样本类别和预测类别确定训练后的目标模型满足预设停止迭代条件,所述目标模型包括预设神经网络模型或者所述目标神经网络模型,所述预测类别为所述样本声音输入该训练后的目标模型后输出的类别;
所述模型训练步骤包括:
获取所述样本声音与多个所述样本类别的第一样本相似度;
根据所述第一样本相似度,从多个所述样本类别中确定所述样本声音对应的预测类别;
在根据所述样本类别和所述预测类别确定训练后的目标模型不满足预设停止迭代条件的情况下,根据所述样本类别和所述预测类别确定目标损失值,根据所述目标损失值更新所述目标模型的参数,得到训练后的目标模型,并将该训练后的目标模型作为新的目标模型。
可选地,所述模型训练模块,被配置为将所述样本声音按照预设周期进行特征提取,得到多个周期的样本特征;根据多个周期的样本特征,获取所述样本声音与多个所述样本类别的第一样本相似度。
可选地,所述模型训练模块,被配置为针对每个周期的样本特征,获取该样本特征对应的第一特征编码;并根据所述第一特征编码,得到该样本特征与多个样本类别的第二样本相似度;根据多个样本特征的所述第二样本相似度,计算得到所述样本声音与多个所述样本类别的第一样本相似度。
可选地,所述样本类别包括所述目标类别和非目标类别。
可选地,所述模型训练模块,被配置为对所述第二待定模型的模型参数进行模型量化处理,得到所述目标声音分类模型。
根据本公开实施例的第三方面,提供一种电子设备,包括:
处理器;
用于存储处理器可执行指令的存储器;
其中,所述处理器被配置为执行本公开第一方面所提供的声音识别方法的步骤。
根据本公开实施例的第四方面,提供一种非临时性计算机可读存储介质,其上存储有计算机程序指令,该程序指令被处理器执行时实现本公开第一方面所提供的声音识别方法的步骤。
根据本公开实施例的第五方面,提供一种车辆,该车辆包括本公开第三方面所提供的电子设备。
根据本公开实施例的第六方面,提供一种计算机程序产品,该计算机程序产品包含能够由可编程的装置执行的计算机程序,该计算机程序具有当由该可编程的装置执行时用于执行本公开第一方面所提供的声音识别方法的步骤的代码部分。
本公开的实施例提供的技术方案可以包括以下有益效果:通过声音检测装置采集环境声音;根据目标声音分类模型对该环境声音进行分类处理,得到该环境声音对应的目标类别;通过展示装置展示所述目标类别。这样,可以通过对环境声音进行声音识别从而实现对周边环境物体的全面准确的识别和分类,从而解决摄像头或雷达检测出现盲区的问题,提高了物体检测的可靠性。进一步地,为了能够在车辆端或设备端进行声音检测,可以通过模型压缩降低目标声音分类模型的复杂度,同时又通过两次训练保障了训练后的目标声音分类模型的声音分类准确度,从而可以将目标声音分类模型部署在车辆端或设备端,提高了声音识别和分类的及时性。
应当理解的是,以上的一般描述和后文的细节描述仅是示例性和解释性的,并不能限制本公开。
附图说明
此处的附图被并入说明书中并构成本说明书的一部分,示出了符合本公开的实施例,并与说明书一起用于解释本公开的原理。
图1是根据一示例性实施例示出的一种声音识别方法的流程图。
图2是根据一示例性实施例示出的一种声音检测装置设置在车辆外反光镜上的示意图。
图3是根据一示例性实施例示出的一种目标声音分类模型的训练方法的流程图。
图4是根据图1所示实施例示出的一种S102步骤的流程图。
图5是根据一示例性实施例示出的一种声音识别装置的框图。
图6是根据一示例性实施例示出的另一种声音识别装置的框图。
图7是根据一示例性实施例示出的电子设备的框图。
图8是根据一示例性实施例示出的一种车辆的框图。
具体实施方式
这里将详细地对示例性实施例进行说明,其示例表示在附图中。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本公开相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本公开的一些方面相一致的装置和方法的例子。
首先,对本公开的应用场景进行说明。本公开可以应用于声音识别场景,例如基于声音识别的车辆自动驾驶或辅助驾驶、智能家居监控、健康检测、机器产线残次品筛选、工业设备故障检测等场景。以车辆自动驾驶为例,在相关技术中,主要以雷达或摄像头获取周边物体的信息,并通过图像处理实现自动驾驶或辅助驾驶。而雷达或摄像头都会存在检测盲区,例如,激光雷达可以对车身周边几米距离的物体进行定位,但无法对超出该范围的远处移动物体进行定位;在采用车身摄像头进行视觉识别的情况下,同样存在一定的视觉检测盲区,例如距离较远时视频模糊导致无法识别,或者在摄像头被遮挡无法检测的情况下,很难通过摄像头实现周边物体的准确识别,这样会导致环境检测可靠性降低,进而会影响车辆自动驾驶的可靠性。
需要说明的是,本公开以车辆自动驾驶的应用场景为例进行说明,但并不局限于该应用场景,例如,在基于声音识别的智能家居监控、健康检测、机器产线残次品筛选、工业设备故障检测等场景均可以使用本公开提供的方法。
为了解决上述问题,本公开提供了一种声音识别方法、装置、介质、设备、程序产品及车辆,可以通过声音检测装置采集环境声音;根据目标声音分类模型对环境声音进行分类处理,得到该环境声音对应的目标类别,从而解决摄像头或雷达检测出现盲区的问题,提高了物体检测的可靠性。
下面结合具体实施例对本公开进行说明。
图1是根据一示例性实施例示出的一种声音识别方法,如图1所示,该方法可以包括:
S101、通过声音检测装置采集环境声音。
示例地,该声音检测装置可以包括一个或多个声音传感器,例如电动麦克风、电容麦克风、或者MEMS(Micro-Electro-Mechanical System,微机电系统)麦克风。
在不同的应用场景下,该声音检测装置的安装位置可以不同,例如,在车辆自动驾驶或辅助驾驶场景下,该声音检测装置可以设置在车辆的车身外侧任意一个或多个位置,例如车辆两侧车身位置、车辆两侧车窗位置、车辆前脸位置、车辆后脸位置、车顶位置或者车辆外后视镜位置等,通过该声音检测装置可以采集车辆周边的环境声音。在智能家居监控场景下,该声音检测装置可以安装在家庭中的每个房间,通过该声音检测装置可以采集每个房间的环境声音。
S102、根据目标声音分类模型对该环境声音进行分类处理,得到该环境声音对应的目标类别。
其中,该目标声音分类模型可以是根据样本声音对通用的声音分类模型进行训练后得到的。
S103、通过展示装置展示该目标类别。
其中,该展示装置可以包括图像展示装置(例如显示屏)和/或声音展示装置(例如蜂鸣器或发声器)。
示例地,在车辆自动驾驶或辅助驾驶场景下,该展示装置可以包括车辆的外反光镜、内反光镜和中控屏幕中的一种或多种,通过该展示装置可以展示目标类别对应的目标图像,以便提示用户在环境中出现目标类别。其中,外反光镜可以包括车辆两侧的两个外反光镜。
进一步地,该展示装置也可以包括车载音响装置,通过该车载音响装置可以给出目标类别对应的目标声音。
采用上述方法,通过声音检测装置采集环境声音;根据目标声音分类模型对该环境声音进行分类处理,得到该环境声音对应的目标类别,并通过展示装置展示该目标类别。这样,可以通过对环境声音进行声音识别从而实现对周边环境物体的全面准确的识别和分类,从而解决摄像头或雷达检测出现盲区的问题,提高了物体检测的可靠性。
进一步地,上述S103步骤可以通过以下方式展示目标类别:
首先,确定目标类别对应的目标图像。
示例地,可以根据类别图像对应关系,确定该目标类别对应的目标图像,该类别图像对应关系包括目标类别与目标图像的对应关系。例如,目标类别“人”对应的目标图像为“人形图像”;目标类别“动物”对应的目标图像为“四足动物图像”;目标类别“救护车”对应的目标图像为“救护车辆图像”等。
然后,通过展示装置展示目标图像。
示例地,可以在展示装置的预设区域展示目标图像。
在该展示装置为外反光镜的情况下,可以在任意一侧或多侧的外反光镜的预设区域展示该目标图像。该预设区域可以是侧边区域,例如该预设区域可以是外反光镜的上侧区域、下侧区域、左侧区域或右侧区域中的一个或多个。
这样,通过展示装置,可以展示目标类别对应的目标图像,从而准确地提示用户环境中出现了目标类别,以便辅助用户进行相应的应急处理。
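上述根据类别图像对应关系确定目标图像的查找过程,可以用如下Python代码草图示意。其中映射内容取自上文示例,函数名和查不到时返回None的处理方式均为说明用的假设,并非本公开限定的实现:

```python
# 类别图像对应关系:目标类别 -> 目标图像(内容取自上文示例)
CATEGORY_IMAGE_MAP = {
    "人": "人形图像",
    "动物": "四足动物图像",
    "救护车": "救护车辆图像",
}

def target_image_for(category, mapping=CATEGORY_IMAGE_MAP):
    """根据类别图像对应关系查找目标类别对应的目标图像,查不到时返回 None"""
    return mapping.get(category)

image = target_image_for("救护车")
```

查找到目标图像后,即可在展示装置的预设区域进行展示。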
需要说明的是,若该展示装置包括车辆的外反光镜、内反光镜和中控屏幕中的多个,多个展示装置可以互为备份,提高图像展示的可靠性,避免由于某个展示装置故障导致无法展示目标图像。
进一步地,在展示装置包括车辆的外反光镜的情况下,可以通过以下方式展示目标图像:
在检测到副驾驶座位有乘客乘坐的情况下,通过车辆两侧的外反光镜展示该目标图像;或者,
在未检测到副驾驶座位有乘客乘坐的情况下,通过车辆驾驶员侧的外反光镜展示该目标图像。
这样,可以通过车辆载客情况,在相应的外反光镜上展示目标图像,以便提示用户检测到相应的目标类别。
在本公开的另一实施例中,上述声音检测装置可以为一个或多个,该声音检测装置可以设置在车辆的任意一侧或多侧的外反光镜内,上述环境声音为车辆周边的环境声音。
图2是根据一示例性实施例示出的一种声音检测装置设置在车辆外反光镜上的示意图,如图2所示,该车辆外反光镜包括反光镜片201、反光镜外壳202和声音检测装置203,该声音检测装置203可以安装在反光镜片201和反光镜外壳202之间。该声音检测装置可以包括一个或多个声音传感器,例如电动麦克风、电容麦克风、或者MEMS麦克风。
这样,既可以方便地收集车辆外部的环境声音,又不会影响车辆的外观,同时还可以避免声音检测装置受到日晒雨淋导致受损,提高声音检测装置的使用寿命。
进一步地,若声音检测装置为多个,则多个声音检测装置可以分别设置在车辆的左侧外反光镜内和右侧外反光镜内。
需要说明的是,该左侧外反光镜可以为驾驶员侧的外反光镜,右侧外反光镜可以为副驾驶座位侧的外反光镜;也可以是相反的,该左侧外反光镜可以为副驾驶座位侧的外反光镜,右侧外反光镜可以为驾驶员侧的外反光镜。本公开对此不作限定。
这样,通过多个声音检测装置可以分别检测得到车辆两侧的声音强度,从而可以确定环境声音的来源方向。
进一步地,还可以根据该来源方向确定目标反光镜,并通过该目标反光镜展示目标图像。
例如,若左侧外反光镜的声音检测装置检测的声音强度大于右侧外反光镜的声音检测装置检测的声音强度,则可以确定环境声音的来源方向为车辆的左侧,将左侧外反光镜作为目标反光镜,并将目标图像展示在该目标反光镜上。
再例如,若右侧外反光镜的声音检测装置检测的声音强度大于左侧外反光镜的声音检测装置检测的声音强度,则可以确定环境声音的来源方向为车辆的右侧,将右侧外反光镜作为目标反光镜,并将目标图像展示在该目标反光镜上。
这样,可以通过展示目标图像的外反光镜提示用户目标类别可能出现的方向。
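上述根据左右两侧声音检测装置检测的声音强度比较确定来源方向的逻辑,可以用如下Python代码草图示意。其中以均方根值度量声音强度,"left"/"right"的返回值约定均为说明用的假设,并非本公开限定的实现:

```python
import math

def rms(samples):
    """计算一段采样的均方根值,作为声音强度的度量"""
    return math.sqrt(sum(v * v for v in samples) / len(samples))

def source_side(left_samples, right_samples):
    """比较左右外反光镜内声音检测装置检测的声音强度,返回环境声音的来源方向"""
    return "left" if rms(left_samples) > rms(right_samples) else "right"

# 示例:左侧声音强度更大时,判定来源方向为车辆左侧,
# 进而可将左侧外反光镜作为展示目标图像的目标反光镜
side = source_side([0.8, -0.9, 0.7], [0.1, -0.2, 0.1])
```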
进一步地,车辆的左侧外反光镜和右侧外反光镜内可以分别设置多个声音检测装置。例如,该多个声音检测装置可以为四个,其中两个设置在车辆左侧外反光镜的外壳和反光镜片之间,另外两个设置在车辆右侧外反光镜的外壳和反光镜片之间。可以对每个声音检测装置进行防震处理和/或防水处理。例如,可以在每个声音检测装置的外面包裹橡胶囊,以防水和防震,并能够降低风噪。
进一步地,该声音检测装置可以包括数据接口,该数据接口可以通过线缆与车辆的车载声音模块相连接,以便将检测到的环境声音通过数据接口传输至车载声音模块,由车载声音模块对环境声音进行分类处理。
另外,该声音检测装置还可以包括电源接口和时钟接口,该电源接口和时钟接口同样可以通过线缆与车辆的车内模块相连接,以便对该声音检测装置提供电源和时钟。
在本公开的另一实施例中,上述目标声音分类模型可以为根据目标神经网络模型训练后得到的,该目标神经网络模型是对预设神经网络模型进行训练,并对训练后的预设神经网络模型进行模型压缩得到的模型。
需要说明的是,相关技术中对声音进行分类识别的人工智能模型可以采用大型化的复杂的神经网络模型,这种大型模型需要运行在服务器上,并且对服务器的硬件要求较高,可以部署在云端服务器,但难以部署在车辆端或设备端。然而,通过云端进行声音识别的实时性不够,影响声音辅助的自动驾驶功能的可靠性和及时性。在本实施例中,该目标声音分类模型为根据目标神经网络模型训练后得到的,该目标神经网络模型是对预设神经网络模型进行训练,并对训练后的预设神经网络模型进行模型压缩得到的模型。这样,可以降低目标声音分类模型的复杂度,降低模型对硬件的依赖程度,从而可以将该目标声音分类模型部署到车辆端或设备端。
采用上述方法,通过声音检测装置采集环境声音;根据目标声音分类模型对该环境声音进行分类处理,得到该环境声音对应的目标类别;其中,该目标声音分类模型为根据目标神经网络模型训练后得到的,该目标神经网络模型是对预设神经网络模型进行训练,并对训练后的预设神经网络模型进行模型压缩得到的模型。这样,可以通过对环境声音进行声音识别从而实现对周边环境物体的全面准确的识别和分类,从而解决摄像头或雷达检测出现盲区的问题,提高了物体检测的可靠性。并且,为了能够在车辆端或设备端进行声音检测,可以通过模型压缩降低目标声音分类模型的复杂度,同时又通过两次训练保障了训练后的目标声音分类模型的声音分类准确度,从而可以将目标声音分类模型部署在车辆端或设备端,提高了声音识别和分类的及时性。
图3是根据一示例性实施例示出的一种目标声音分类模型的训练方法的流程图,如图3所示,该训练方法可以包括:
S301、获取用于训练的多个样本声音和每个样本声音对应的样本类别。
示例地,可以从公开的声音数据库中获取用于训练的多个样本声音,并对每个样本声音标注样本类别;也可以从视频数据库中获取视频数据,并从视频数据中抽取音频数据作为样本声音,然后对每个样本声音标注样本类别。该样本类别可以是声音的类别,例如警报声、人声、哭声、车辆喇叭声、车辆急刹车声等等。
进一步地,该样本类别可以是样本声音的强标签,也可以是弱标签。在该样本类别为强标签的情况下,需要标注样本声音中出现的样本类别以及该样本类别出现的起始和结束时间;在该样本类别为弱标签的情况下,可以只标注样本声音中出现的样本类别,无需标注具体的起始和结束时间。使用弱标签可以减少人工标注的工作量,提高样本获取效率。
进一步地,上述样本类别可以包括目标类别和非目标类别。该目标类别包括通过目标声音分类模型对环境声音进行处理后得到的多个目标类别,该非目标类别表征除了上述目标类别之外的其他类别,也就是目标声音分类模型不输出的类别。
示例地,该目标声音分类模型用于车辆自动驾驶场景,预期的对环境声音进行分类处理后输出的目标类别可以包括“警报声、人声、哭声、车辆喇叭声和车辆急刹车声”,若样本类别中仅包括这些目标类别,由于实际的环境声音中包括的声音类别较多,因此应用这些样本类别进行训练后的目标声音分类模型会存在误识别的问题,例如,环境的风声与哭声有一定的相似度,若不专门设计风声的样本进行训练,会导致训练后的目标声音分类模型误将风声识别为哭声。而在本实施例中,可以通过增加非目标类别的样本声音对模型进行训练,从而可以对目标类别之外的声音进行吸收,使得训练后的模型学到更精细的特征,提升模型的抽象能力,进一步提升目标类别和非目标类别的区分性,也就提高了声音识别的准确性。
S302、根据样本声音和样本类别对预设神经网络模型执行预设训练步骤,得到第一待定模型。
示例地,可以根据傅里叶变换和梅尔滤波器,对样本声音进行特征提取,得到样本音频特征,该样本音频特征可以包括FBANK特征、MFCC特征或者PNCC特征等。例如,可以针对样本声音,每20ms计算一个1024点的傅里叶变换,窗长为64ms,然后经过64个梅尔滤波器组得到64维的FBANK特征。
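上述特征提取过程(针对样本声音每20ms计算一个1024点的傅里叶变换,窗长为64ms,再经64个梅尔滤波器组得到64维FBANK特征)可以用如下numpy代码草图示意。其中采样率16kHz、汉宁窗以及梅尔滤波器组的具体构造方式均为说明用的假设,并非本公开限定的实现:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """构造 n_mels 个三角形梅尔滤波器,形状为 (n_mels, n_fft//2 + 1)"""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)   # 上升沿
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)   # 下降沿
    return fb

def fbank(signal, sr=16000, win_ms=64, hop_ms=20, n_fft=1024, n_mels=64):
    """按 20ms 帧移、64ms 窗长分帧,计算 1024 点傅里叶变换后
    经 64 个梅尔滤波器组,得到每帧 64 维的对数 FBANK 特征"""
    win = int(sr * win_ms / 1000)   # 16kHz 下为 1024 个采样点
    hop = int(sr * hop_ms / 1000)   # 16kHz 下为 320 个采样点
    window = np.hanning(win)
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win] * window
        frames.append(np.abs(np.fft.rfft(frame, n_fft)) ** 2)  # 功率谱
    power = np.array(frames)                       # (帧数, 513)
    fb = mel_filterbank(sr, n_fft, n_mels)         # (64, 513)
    return np.log(power @ fb.T + 1e-10)            # (帧数, 64)

feats = fbank(np.random.default_rng(0).standard_normal(16000))  # 1秒、16kHz的示例信号
```

以1秒、16kHz的信号为例,按20ms帧移、64ms窗长可得到47帧、每帧64维的FBANK特征。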
可以将样本音频特征输入预设神经网络模型执行预设训练步骤,该预设神经网络模型可以为相关技术中的卷积神经网络,该预设训练步骤可以是相关技术中的卷积神经网络训练步骤。
该预设神经网络模型可以是移动端卷积神经网络的模型,例如MobileNet等,该移动端卷积神经网络可以包括N层卷积层,N可以为大于或等于5的任意正整数,例如N可以为10或16。
S303、对该第一待定模型进行模型压缩,得到目标神经网络模型。
示例地,可以根据该第一待定模型的预设数目个卷积层,获取该目标神经网络模型;其中,该预设数目小于该第一待定模型的卷积层总层数。
S304、根据该样本声音和该样本类别对该目标神经网络模型执行上述预设训练步骤,得到第二待定模型。
示例地,该第一待定模型的卷积层总层数为N,该预设数目可以为M,M小于N,例如N为10,M可以为5。
这样,根据训练后得到的第一待定模型的前M层卷积层的参数作为目标神经网络模型的卷积层的初始化参数。然后对目标神经网络模型执行上述预设训练步骤,得到第二待定模型。
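以N=10、M=5为例,"取第一待定模型前M层卷积层的参数作为目标神经网络模型的初始化参数"的模型压缩做法,可以用如下Python代码草图示意。其中以字典模拟各卷积层参数,"conv_i"的层命名方式为说明用的假设,并非本公开限定的实现:

```python
def truncate_conv_layers(first_model_params, m):
    """取第一待定模型前 m 层卷积层的参数,作为目标神经网络模型的初始化参数"""
    return {name: params for name, params in first_model_params.items()
            if int(name.split("_")[1]) < m}

# 第一待定模型共 N = 10 层卷积层(此处以占位参数模拟训练后的参数)
first_model_params = {f"conv_{i}": [0.1 * i] for i in range(10)}

# 预设数目 M = 5,小于卷积层总层数 N,得到压缩后的目标神经网络模型初始化参数
target_init = truncate_conv_layers(first_model_params, 5)
```

初始化后,再对目标神经网络模型执行预设训练步骤即可得到第二待定模型。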
S305、根据该第二待定模型,确定该目标声音分类模型。
示例地,可以将该第二待定模型作为目标声音分类模型。
采用上述方式,根据多个样本声音和每个样本声音对应的样本类别,对预设神经网络模型执行预设训练步骤,得到第一待定模型,对该第一待定模型进行模型压缩,得到目标神经网络模型;在根据样本声音和样本类别对该目标神经网络模型执行预设训练步骤,得到第二待定模型;根据该第二待定模型,确定目标声音分类模型。这样,通过模型压缩和两次训练,既可以降低训练后得到的目标声音分类模型的复杂度,也能够确保该目标声音分类模型对声音分类的准确性,从而使得到的目标声音分类模型更加精简和高效,降低对硬件的依赖程度,从而降低车载端或设备端部署的难度。
进一步地,在上述S305步骤中,还可以对该第二待定模型的模型参数进行模型量化处理后,得到该目标声音分类模型。
其中,该模型量化处理可以包括模型参数压缩,示例地,可以对该模型的参数量化至预设位数,该预设位数可以是8bit或16bit,例如,将所有的浮点参数量化压缩至整数型参数,这样,可以在保证模型性能基本不变的情况下,进一步减少模型的尺寸,降低模型的运算功耗,得到的目标声音分类模型更加精简和高效,降低对硬件的依赖程度,从而降低车载端或设备端部署的难度。
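上述"将浮点参数量化压缩至整数型参数"的模型量化处理,可以用如下对称8bit量化的numpy代码草图示意。其中量化方案与比例因子的计算方式均为说明用的假设,并非本公开限定的实现:

```python
import numpy as np

def quantize_int8(weights):
    """将浮点参数对称量化为 int8,返回量化后的整数参数和比例因子"""
    max_abs = float(np.abs(weights).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """按比例因子反量化回浮点数,用于推理时的近似计算"""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.25], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)   # 量化误差较小,模型性能基本不变
```

float32量化至int8后,参数存储空间约为原来的四分之一,可进一步减少模型尺寸并降低运算功耗。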
进一步地,上述预设训练步骤可以包括以下方式:
循环执行模型训练步骤对目标模型进行训练,直至根据该样本类别和预测类别确定训练后的目标模型满足预设停止迭代条件,该目标模型包括预设神经网络模型或者目标神经网络模型,该预测类别为该样本声音输入训练后的目标模型后输出的类别。
需要说明的是,上述预设停止迭代条件可以是现有技术中常用的停止迭代的条件,例如样本类别和预测类别的相似度差异小于预设相似度差异阈值等条件,本公开对此不作限定。
上述模型训练步骤包括:
S11、获取该样本声音与多个该样本类别的第一样本相似度。
示例地,将该样本声音按照预设周期进行特征提取,得到多个周期的样本特征;然后根据多个周期的样本特征,获取该样本声音与多个该样本类别的第一样本相似度。
示例地,上述样本声音可以是大于5秒的任意样本音频数据,上述预设周期可以是20毫秒至2秒之间的任意时间,例如,预设周期可以是1秒或500毫秒,可以将该样本音频数据按照预设周期进行分割,对分割后的音频片段进行特征提取,可以得到每个分割后音频片段的样本特征。
然后,根据多个周期的样本特征,获取该样本声音与多个该样本类别的第一样本相似度可以包括以下相似度获取方式一和相似度获取方式二中的任意一种,其中:
相似度获取方式一可以包括以下步骤:
首先,针对每个周期的样本特征,获取该样本特征对应的第一特征编码。
其次,根据多个样本特征的第一特征编码计算得到样本声音的第二特征编码。
最后,根据第二特征编码获取该样本声音与多个该样本类别的第一样本相似度。
示例地,可以将多个第一特征编码的平均值作为第二特征编码,也就是在输出层之前的嵌入层(Embedding)取平均,之后再根据取平均后的第二特征编码获取该样本声音与多个该样本类别的第一样本相似度。例如,可以计算得到该第二特征编码与每个样本类别对应的样本特征编码的相似度,并将该相似度作为该样本声音与每个样本类别的第一样本相似度。
采用该方式获取第一样本相似度,并对模型进行训练后得到的目标声音分类模型,对与样本声音的时长基本一致的环境声音的识别准确度较高,为了提高环境声音识别的实时性和准确性,在采用该相似度获取方式一进行训练的模型中,可以获取时长较短的样本声音进行训练。
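相似度获取方式一中"在嵌入层对多个第一特征编码取平均得到第二特征编码,再计算与各样本类别对应特征编码的相似度",可以用如下Python代码草图示意。其中以余弦相似度作为相似度度量,输入的特征编码数据均为说明用的假设,并非本公开限定的实现:

```python
import numpy as np

def first_similarity_by_avg_embedding(segment_embeddings, class_embeddings):
    """先对各周期样本特征的第一特征编码取平均(得到第二特征编码),
    再计算其与每个样本类别对应特征编码的余弦相似度(第一样本相似度)"""
    emb = np.mean(segment_embeddings, axis=0)          # 嵌入层取平均
    emb = emb / (np.linalg.norm(emb) + 1e-9)
    sims = {}
    for cls, code in class_embeddings.items():
        code = code / (np.linalg.norm(code) + 1e-9)
        sims[cls] = float(emb @ code)                  # 余弦相似度
    return sims

segments = np.array([[1.0, 0.0], [1.0, 0.2]])          # 两个周期的第一特征编码
classes = {"siren": np.array([1.0, 0.0]),              # 各样本类别对应的特征编码
           "horn": np.array([0.0, 1.0])}
sims = first_similarity_by_avg_embedding(segments, classes)
```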
相似度获取方式二可以包括以下步骤:
首先,针对每个周期的样本特征,获取该样本特征对应的第一特征编码;并根据该第一特征编码,得到该样本特征与多个样本类别的第二样本相似度。
然后,根据多个样本特征的第二样本相似度,计算得到样本声音与多个样本类别的第一样本相似度。
示例地,可以将每个周期的样本特征输入卷积层,得到该样本特征对应的第一特征编码,计算得到该第一特征编码与每个样本类别对应的样本特征编码的相似度,并将该相似度作为该样本特征与每个样本类别的第二样本相似度。
然后,针对每个样本类别,计算样本声音的多个样本特征与该样本类别的第二样本相似度的平均值,作为该样本声音与该样本类别的第一样本相似度。
采用该方式获取第一样本相似度,对用于训练的样本声音的长度要求不高,由于对样本声音分段后的音频进行了相似度计算,因此,即使是时间较长的样本声音,也可以提高训练后的目标声音识别模型对环境声音识别的及时性和准确性。
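相似度获取方式二则是先得到每个周期样本特征与各样本类别的第二样本相似度,再按类别取平均得到第一样本相似度,可以用如下Python代码草图示意。其中输入的各周期相似度数据均为说明用的假设,并非本公开限定的实现:

```python
def first_similarity_by_segment_average(segment_similarities):
    """segment_similarities: 每个周期的样本特征与各样本类别的第二样本相似度;
    针对每个样本类别,对多个周期的第二样本相似度取平均,得到第一样本相似度"""
    classes = segment_similarities[0].keys()
    n = len(segment_similarities)
    return {cls: sum(s[cls] for s in segment_similarities) / n for cls in classes}

# 两个周期的样本特征与"哭声"、"风声"两个样本类别的第二样本相似度
segment_sims = [{"cry": 0.8, "wind": 0.4},
                {"cry": 0.6, "wind": 0.2}]
first_sims = first_similarity_by_segment_average(segment_sims)  # 按类别取平均
```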
S12、根据该第一样本相似度,从多个该样本类别中确定该样本声音对应的预测类别。
S13、在根据该样本类别和该预测类别确定训练后的目标模型不满足预设停止迭代条件的情况下,根据该样本类别和该预测类别确定目标损失值,根据该目标损失值更新该目标模型的参数,得到训练后的目标模型,并将该训练后的目标模型作为新的目标模型。
同样地,上述预设停止迭代条件可以是现有技术中常用的停止迭代的条件,例如样本类别和预测类别的相似度差异小于预设相似度差异阈值等条件,本公开对此不作限定。
这样,通过上述预设训练步骤对目标模型进行训练,可以提高训练后的目标模型对环境声音进行识别后得到的目标类别的准确性。
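上述"循环执行模型训练步骤,根据目标损失值更新目标模型参数,直至满足预设停止迭代条件"的流程,可以用如下以逻辑回归为例的numpy代码草图示意。其中模型结构、学习率、损失阈值等取值均为说明用的假设,并非本公开限定的实现:

```python
import numpy as np

def preset_training_steps(w, X, y, lr=0.5, loss_threshold=0.05, max_iter=500):
    """循环执行模型训练步骤:预测 -> 判断停止迭代条件 -> 计算目标损失值 -> 更新参数"""
    loss = float("inf")
    for _ in range(max_iter):
        pred = 1.0 / (1.0 + np.exp(-(X @ w)))            # 样本与类别的预测相似度
        loss = -np.mean(y * np.log(pred + 1e-9)
                        + (1 - y) * np.log(1 - pred + 1e-9))
        if loss < loss_threshold:                        # 预设停止迭代条件
            break
        grad = X.T @ (pred - y) / len(y)                 # 由目标损失值得到梯度
        w = w - lr * grad                                # 更新目标模型的参数
    return w, loss

X = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.0, -2.0]])  # 样本特征
y = np.array([1.0, 1.0, 0.0, 0.0])                                # 样本类别标注
w_final, final_loss = preset_training_steps(np.zeros(2), X, y)
```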
图4是根据图1所示实施例示出的一种S102步骤的流程图,如图4所示,上述S102步骤可以包括:
S1021、将该环境声音输入该目标声音分类模型,得到一个或多个第一候选类别,以及该环境声音与每个第一候选类别的第一目标相似度;
S1022、根据第一目标相似度从第一候选类别中确定目标类别。
示例地,可以将该第一目标相似度从大到小排名前N位的第一候选类别作为目标类别;也可以将该第一目标相似度大于或等于预设相似度阈值的第一候选类别作为目标类别;也可以将第一目标相似度从大到小排名前N位,且该第一目标相似度大于或等于预设相似度阈值的第一候选类别,作为目标类别。
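上述"第一目标相似度从大到小排名前N位,且大于或等于预设相似度阈值"的筛选逻辑,可以用如下Python代码草图示意。其中N、阈值及各类别相似度的取值均为说明用的假设,并非本公开限定的实现:

```python
def select_candidates(first_similarities, n=3, threshold=0.5):
    """先取第一目标相似度从大到小排名前 n 位的第一候选类别,
    再保留其中相似度大于或等于预设相似度阈值的类别"""
    top_n = sorted(first_similarities.items(),
                   key=lambda item: item[1], reverse=True)[:n]
    return [cls for cls, sim in top_n if sim >= threshold]

first_sims = {"siren": 0.9, "horn": 0.7, "wind": 0.4, "cry": 0.2}
candidates = select_candidates(first_sims)   # 前3位中仅 siren、horn 达到阈值
```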
在本公开的另一实施例中,上述S1022步骤可以包括:
首先,将该第一目标相似度从大到小排名前N位,且该第一目标相似度大于或等于预设相似度阈值的第一候选类别,作为第二候选类别。
然后,根据该第二候选类别确定该目标类别。
其中,该第二候选类别可以为一个或多个。
若该第二候选类别为一个,可以直接将该第二候选类别作为目标类别。
若该第二候选类别为多个,可以直接将该多个第二候选类别作为目标类别;也可以将第一目标相似度最大的第二候选类别作为目标类别。
进一步地,若该第二候选类别为多个,还可以通过以下方式确定目标类别:
首先,根据预设类别对应关系,确定每个第二候选类别与其他第二候选类别之间的类别关系。
其中,预设类别对应关系包括任意两个第二候选类别之间的类别关系,该类别关系包括混淆关系和同类关系。混淆关系用于表征两个第二候选类别之间为非同类的易混淆的类别,例如“风声”和“哭声”;同类关系用于表征两个第二候选类别之间为相同场景下的类别,例如“哭声”和“人声”。
然后,根据第二候选类别和类别关系,确定该目标类别。
例如,若多个第二候选类别中只包括同类关系的第二候选类别,则可以直接将多个第二候选类别作为目标类别,或者将第一目标相似度最大的第二候选类别作为目标类别。
再例如,若多个第二候选类别中包括混淆关系的第二候选类别,则可以计算该多个第二候选类别的混淆系数,在该混淆系数小于或等于预设混淆门限的情况下,将多个第二候选类别作为目标类别,或者将第一目标相似度最大的第二候选类别作为目标类别;而在该混淆系数大于预设混淆门限的情况下,可以不输出目标类别。
其中,该混淆系数可以表征类别关系为混淆关系的数目在多个第二候选类别的类别关系的总数中所占的比例。示例地,该多个第二候选类别为4个,每两个第二候选类别之间存在类别关系,该类别关系的总数为6;其中类别关系为混淆关系的数目为3,则该混淆系数可以为0.5。上述预设混淆门限可以为0.7,这样,该混淆系数小于预设混淆门限,因此,可以将该多个第二候选类别作为目标类别,或者将第一目标相似度最大的第二候选类别作为目标类别。
通过该方式,可以根据识别出的候选类别的混淆关系确定模型的识别准确性,在混淆关系满足预设条件(混淆系数小于或等于预设混淆门限)的情况下,确定模型的识别准确性满足条件,使得获取到的目标类别更为准确。
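上述混淆系数的计算("混淆关系的数目在类别关系总数中所占比例")及其与预设混淆门限的比较,可以用如下Python代码草图示意。其中类别名称、关系表内容与"confusion"/"same"的取值约定均为说明用的假设,并非本公开限定的实现:

```python
from itertools import combinations

def confusion_coefficient(candidates, relations):
    """candidates: 多个第二候选类别;relations: 预设类别对应关系,
    记录任意两个类别间的类别关系,取值为 "confusion"(混淆)或 "same"(同类)"""
    pairs = list(combinations(candidates, 2))
    confused = sum(1 for a, b in pairs
                   if relations.get((a, b), relations.get((b, a))) == "confusion")
    return confused / len(pairs) if pairs else 0.0

candidates = ["wind", "cry", "voice", "horn"]      # 4个第二候选类别,共6对类别关系
relations = {("wind", "cry"): "confusion",         # 其中3对为混淆关系
             ("wind", "voice"): "confusion",
             ("wind", "horn"): "confusion",
             ("cry", "voice"): "same",
             ("cry", "horn"): "same",
             ("voice", "horn"): "same"}

coeff = confusion_coefficient(candidates, relations)  # 3 / 6 = 0.5
output_allowed = coeff <= 0.7   # 小于或等于预设混淆门限时,可输出目标类别
```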
图5是根据一示例性实施例示出的一种声音识别装置500的框图,如图5所示,该装置500可以包括:
声音采集模块501,被配置为通过声音检测装置采集环境声音;
声音分类模块502,被配置为根据目标声音分类模型对该环境声音进行分类处理,得到该环境声音对应的目标类别;
展示模块503,被配置为通过展示装置展示该目标类别。
可选地,该展示模块503,被配置为确定该目标类别对应的目标图像;通过该展示装置展示该目标图像。
可选地,该展示模块503,被配置为根据类别图像对应关系,确定该目标类别对应的目标图像,该类别图像对应关系包括该目标类别与该目标图像的对应关系。
可选地,该展示模块503,被配置为在该展示装置的预设区域展示该目标图像。
可选地,该展示装置包括车辆的外反光镜、内反光镜和中控屏幕中的一种或多种。
可选地,在该展示装置包括车辆的外反光镜的情况下,该展示模块503,被配置为在检测到副驾驶座位有乘客乘坐的情况下,通过车辆两侧的外反光镜展示该目标图像;或者,在未检测到副驾驶座位有乘客乘坐的情况下,通过车辆驾驶员侧的外反光镜展示该目标图像。
可选地,该声音检测装置为一个或多个,该声音检测装置设置在车辆的任意一侧或多侧的外反光镜内,该环境声音为该车辆周边的环境声音。
可选地,在该声音检测装置为多个的情况下,该多个声音检测装置分别设置在车辆的左侧外反光镜内和右侧外反光镜内。
可选地,该声音分类模块502,被配置为将该环境声音输入该目标声音分类模型,得到一个或多个第一候选类别,以及该环境声音与每个第一候选类别的第一目标相似度;根据该第一目标相似度从该第一候选类别中确定该目标类别。
可选地,该声音分类模块502,被配置为将该第一目标相似度从大到小排名前N位,且该第一目标相似度大于或等于预设相似度阈值的第一候选类别,作为第二候选类别;根据该第二候选类别确定该目标类别。
可选地,该第二候选类别为多个,该声音分类模块502,被配置为根据预设类别对应关系,确定每个第二候选类别与其他第二候选类别之间的类别关系;该预设类别对应关系包括任意两个第二候选类别之间的类别关系,该类别关系包括混淆关系和同类关系;根据该第二候选类别和类别关系,确定该目标类别。
可选地,该目标声音分类模型为根据目标神经网络模型训练后得到的,该目标神经网络模型是对预设神经网络模型进行训练,并对训练后的预设神经网络模型进行模型压缩得到的模型。
图6是根据一示例性实施例示出的另一种声音识别装置的框图,如图6所示,该装置还可以包括模型训练模块601,该模型训练模块601,被配置为:
获取用于训练的多个样本声音和每个该样本声音对应的样本类别;
根据该样本声音和该样本类别对预设神经网络模型执行预设训练步骤,得到第一待定模型;
对该第一待定模型进行模型压缩,得到目标神经网络模型;
根据该样本声音和该样本类别对该目标神经网络模型执行该预设训练步骤,得到第二待定模型;
根据该第二待定模型,确定该目标声音分类模型。
可选地,该模型训练模块601,被配置为根据该第一待定模型的预设数目个卷积层,获取该目标神经网络模型;其中,该预设数目小于该第一待定模型的卷积层总层数。
可选地,该模型训练模块601,被配置为循环执行模型训练步骤对目标模型进行训练,直至根据该样本类别和预测类别确定训练后的目标模型满足预设停止迭代条件,该目标模型包括预设神经网络模型或者该目标神经网络模型,该预测类别为该样本声音输入该训练后的目标模型后输出的类别;
该模型训练步骤包括:
获取该样本声音与多个该样本类别的第一样本相似度;
根据该第一样本相似度,从多个该样本类别中确定该样本声音对应的预测类别;
在根据该样本类别和该预测类别确定训练后的目标模型不满足预设停止迭代条件的情况下,根据该样本类别和该预测类别确定目标损失值,根据该目标损失值更新该目标模型的参数,得到训练后的目标模型,并将该训练后的目标模型作为新的目标模型。
可选地,该模型训练模块601,被配置为将该样本声音按照预设周期进行特征提取,得到多个周期的样本特征;根据多个周期的样本特征,获取该样本声音与多个该样本类别的第一样本相似度。
可选地,该模型训练模块601,被配置为针对每个周期的样本特征,获取该样本特征对应的第一特征编码;并根据该第一特征编码,得到该样本特征与多个样本类别的第二样本相似度;根据多个样本特征的该第二样本相似度,计算得到该样本声音与多个该样本类别的第一样本相似度。
可选地,该样本类别包括该目标类别和非目标类别。
可选地,该模型训练模块601,被配置为对该第二待定模型的模型参数进行模型量化处理,得到该目标声音分类模型。
关于上述实施例中的装置,其中各个模块执行操作的具体方式已经在有关该方法的实施例中进行了详细描述,此处将不做详细阐述说明。
综上所述,采用本公开上述实施例中的装置,通过声音检测装置采集环境声音;根据目标声音分类模型对该环境声音进行分类处理,得到该环境声音对应的目标类别;其中,该目标声音分类模型为根据目标神经网络模型训练后得到的,该目标神经网络模型是对预设神经网络模型进行训练,并对训练后的预设神经网络模型进行模型压缩得到的模型。这样,可以通过对环境声音进行声音识别从而实现对周边环境物体的全面准确的识别和分类,从而解决摄像头或雷达检测出现盲区的问题,提高了物体检测的可靠性。并且,为了能够在车辆端或设备端进行声音检测,可以通过模型压缩降低目标声音分类模型的复杂度,同时又通过两次训练保障了训练后的目标声音分类模型的声音分类准确度,从而可以将目标声音分类模型部署在车辆端或设备端,提高了声音识别和分类的及时性。
本公开还提供一种计算机可读存储介质,其上存储有计算机程序指令,该程序指令被处理器执行时实现本公开提供的声音识别方法的步骤。
图7是根据一示例性实施例示出的电子设备900的框图。例如,电子设备900可以是移动电话,计算机,数字广播终端,消息收发设备,游戏控制台,平板设备,医疗设备,健身设备,个人数字助理、路由器、车载终端等。
参照图7,电子设备900可以包括以下一个或多个组件:处理组件902,存储器904,电力组件906,多媒体组件908,音频组件910,输入/输出(I/O)接口912,传感器组件914,以及通信组件916。
处理组件902通常控制电子设备900的整体操作,诸如与显示,电话呼叫,数据通信,相机操作和记录操作相关联的操作。处理组件902可以包括一个或多个处理器920来执行指令,以完成上述声音识别方法的全部或部分步骤。此外,处理组件902可以包括一个或多个模块,便于处理组件902和其他组件之间的交互。例如,处理组件902可以包括多媒体模块,以方便多媒体组件908和处理组件902之间的交互。
存储器904被配置为存储各种类型的数据以支持在电子设备900的操作。这些数据的示例包括用于在电子设备900上操作的任何应用程序或方法的指令,联系人数据,电话簿数据,消息,图片,视频等。存储器904可以由任何类型的易失性或非易失性存储设备或者它们的组合实现,如静态随机存取存储器(SRAM),电可擦除可编程只读存储器(EEPROM),可擦除可编程只读存储器(EPROM),可编程只读存储器(PROM),只读存储器(ROM),磁存储器,快闪存储器,磁盘或光盘。
电力组件906为电子设备900的各种组件提供电力。电力组件906可以包括电源管理系统,一个或多个电源,及其他与为电子设备900生成、管理和分配电力相关联的组件。
多媒体组件908包括在所述电子设备900和用户之间提供一个输出接口的屏幕。在一些实施例中,屏幕可以包括液晶显示器(LCD)和触摸面板(TP)。如果屏幕包括触摸面板,屏幕可以被实现为触摸屏,以接收来自用户的输入信号。触摸面板包括一个或多个触摸传感器以感测触摸、滑动和触摸面板上的手势。所述触摸传感器可以不仅感测触摸或滑动动作的边界,而且还检测与所述触摸或滑动操作相关的持续时间和压力。在一些实施例中,多媒体组件908包括一个前置摄像头和/或后置摄像头。当电子设备900处于操作模式,如拍摄模式或视频模式时,前置摄像头和/或后置摄像头可以接收外部的多媒体数据。每个前置摄像头和后置摄像头可以是一个固定的光学透镜系统,或具有焦距和光学变焦能力的光学透镜系统。
音频组件910被配置为输出和/或输入音频信号。例如,音频组件910包括一个麦克风(MIC),当电子设备900处于操作模式,如呼叫模式、记录模式和语音识别模式时,麦克风被配置为接收外部音频信号。所接收的音频信号可以被进一步存储在存储器904或经由通信组件916发送。在一些实施例中,音频组件910还包括一个扬声器,用于输出音频信号。
I/O接口912为处理组件902和外围接口模块之间提供接口,上述外围接口模块可以是键盘,点击轮,按钮等。这些按钮可包括但不限于:主页按钮、音量按钮、启动按钮和锁定按钮。
传感器组件914包括一个或多个传感器,用于为电子设备900提供各个方面的状态评估。例如,传感器组件914可以检测到电子设备900的打开/关闭状态,组件的相对定位,例如所述组件为电子设备900的显示器和小键盘,传感器组件914还可以检测电子设备900或电子设备900一个组件的位置改变,用户与电子设备900接触的存在或不存在,电子设备900方位或加速/减速和电子设备900的温度变化。传感器组件914可以包括接近传感器,被配置用来在没有任何的物理接触时检测附近物体的存在。传感器组件914还可以包括光传感器,如CMOS或CCD图像传感器,用于在成像应用中使用。在一些实施例中,该传感器组件914还可以包括加速度传感器,陀螺仪传感器,磁传感器,压力传感器或温度传感器。
通信组件916被配置为便于电子设备900和其他设备之间有线或无线方式的通信。电子设备900可以接入基于通信标准的无线网络,例如Wi-Fi、2G、3G、4G、5G、NB-IoT、eMTC或6G等,或它们的组合。在一个示例性实施例中,通信组件916经由广播信道接收来自外部广播管理系统的广播信号或广播相关信息。在一个示例性实施例中,所述通信组件916还包括近场通信(NFC)模块,以促进短程通信。例如,NFC模块可基于射频识别(RFID)技术,红外数据协会(IrDA)技术,超宽带(UWB)技术,蓝牙(BT)技术和其他技术来实现。
在示例性实施例中,电子设备900可以被一个或多个应用专用集成电路(ASIC)、数字信号处理器(DSP)、数字信号处理设备(DSPD)、可编程逻辑器件(PLD)、现场可编程门阵列(FPGA)、控制器、微控制器、微处理器或其他电子元件实现,用于执行上述声音识别方法。
在示例性实施例中,还提供了一种包括指令的非临时性计算机可读存储介质,例如包括指令的存储器904,上述指令可由电子设备900的处理器920执行以完成上述声音识别方法。例如,所述非临时性计算机可读存储介质可以是ROM、随机存取存储器(RAM)、CD-ROM、磁带、软盘和光数据存储设备等。
在另一示例性实施例中,还提供一种计算机程序产品,该计算机程序产品包含能够由可编程的装置执行的计算机程序,该计算机程序具有当由该可编程的装置执行时用于执行上述声音识别方法的步骤的代码部分。
图8是根据一示例性实施例示出的车辆的框图,如图8所示,该装置可以包括上述电子设备900。
本领域技术人员在考虑说明书及实践本公开后,将容易想到本公开的其它实施方案。本申请旨在涵盖本公开的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本公开的一般性原理并包括本公开未公开的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的,本公开的真正范围和精神由下面的权利要求指出。
应当理解的是,本公开并不局限于上面已经描述并在附图中示出的精确结构,并且可以在不脱离其范围进行各种修改和改变。本公开的范围仅由所附的权利要求来限制。

Claims (24)

  1. 一种声音识别方法,其特征在于,所述方法包括:
    通过声音检测装置采集环境声音;
    根据目标声音分类模型对所述环境声音进行分类处理,得到所述环境声音对应的目标类别;
    通过展示装置展示所述目标类别。
  2. 根据权利要求1所述的方法,其特征在于,所述通过展示装置展示所述目标类别包括:
    确定所述目标类别对应的目标图像;
    通过所述展示装置展示所述目标图像。
  3. 根据权利要求2所述的方法,其特征在于,所述确定所述目标类别对应的目标图像包括:
    根据类别图像对应关系,确定所述目标类别对应的目标图像,所述类别图像对应关系包括所述目标类别与所述目标图像的对应关系。
  4. 根据权利要求2所述的方法,其特征在于,所述通过所述展示装置展示所述目标图像包括:
    在所述展示装置的预设区域展示所述目标图像。
  5. 根据权利要求2所述的方法,其特征在于,所述展示装置包括车辆的外反光镜、内反光镜和中控屏幕中的一种或多种。
  6. 根据权利要求5所述的方法,其特征在于,在所述展示装置包括车辆的外反光镜的情况下,所述通过所述展示装置展示所述目标图像包括:
    在检测到副驾驶座位有乘客乘坐的情况下,通过车辆两侧的外反光镜展示所述目标图像;或者,
    在未检测到副驾驶座位有乘客乘坐的情况下,通过车辆驾驶员侧的外反光镜展示所述目标图像。
  7. 根据权利要求1所述的方法,其特征在于,所述声音检测装置为一个或多个,所述声音检测装置设置在车辆的任意一侧或多侧的外反光镜内,所述环境声音为所述车辆周边的环境声音。
  8. 根据权利要求7所述的方法,其特征在于,在所述声音检测装置为多个的情况下,所述多个声音检测装置分别设置在车辆的左侧外反光镜内和右侧外反光镜内。
  9. 根据权利要求1所述的方法,其特征在于,所述根据目标声音分类模型对所述环境声音进行分类处理,得到所述环境声音对应的目标类别包括:
    将所述环境声音输入所述目标声音分类模型,得到一个或多个第一候选类别,以及所述环境声音与每个第一候选类别的第一目标相似度;
    根据所述第一目标相似度从所述第一候选类别中确定所述目标类别。
  10. 根据权利要求9所述的方法,其特征在于,所述根据所述第一目标相似度从所述第一候选类别中确定所述目标类别包括:
    将所述第一目标相似度从大到小排名前N位,且所述第一目标相似度大于或等于预设相似度阈值的第一候选类别,作为第二候选类别;
    根据所述第二候选类别确定所述目标类别。
  11. 根据权利要求10所述的方法,其特征在于,所述第二候选类别为多个,所述根据所述第二候选类别确定所述目标类别包括:
    根据预设类别对应关系,确定每个第二候选类别与其他第二候选类别之间的类别关系;所述预设类别对应关系包括任意两个第二候选类别之间的类别关系,所述类别关系包括混淆关系和同类关系;
    根据所述第二候选类别和类别关系,确定所述目标类别。
  12. 根据权利要求1至11中任一项所述的方法,其特征在于,所述目标声音分类模型为根据目标神经网络模型训练后得到的,所述目标神经网络模型是对预设神经网络模型进行训练,并对训练后的预设神经网络模型进行模型压缩得到的模型。
  13. 根据权利要求12所述的方法,其特征在于,所述目标声音分类模型是通过以下方式训练得到的:
    获取用于训练的多个样本声音和每个所述样本声音对应的样本类别;
    根据所述样本声音和所述样本类别对预设神经网络模型执行预设训练步骤,得到第一待定模型;
    对所述第一待定模型进行模型压缩,得到目标神经网络模型;
    根据所述样本声音和所述样本类别对所述目标神经网络模型执行所述预设训练步骤,得到第二待定模型;
    根据所述第二待定模型,确定所述目标声音分类模型。
  14. 根据权利要求13所述的方法,其特征在于,所述对所述第一待定模型进行模型压缩,得到目标神经网络模型包括:
    根据所述第一待定模型的预设数目个卷积层,获取所述目标神经网络模型;其中,所述预设数目小于所述第一待定模型的卷积层总层数。
  15. 根据权利要求13所述的方法,其特征在于,所述预设训练步骤包括:
    循环执行模型训练步骤对目标模型进行训练,直至根据所述样本类别和预测类别确定训练后的目标模型满足预设停止迭代条件,所述目标模型包括预设神经网络模型或者所述目标神经网络模型,所述预测类别为所述样本声音输入该训练后的目标模型后输出的类别;
    所述模型训练步骤包括:
    获取所述样本声音与多个所述样本类别的第一样本相似度;
    根据所述第一样本相似度,从多个所述样本类别中确定所述样本声音对应的预测类别;
    在根据所述样本类别和所述预测类别确定训练后的目标模型不满足预设停止迭代条件的情况下,根据所述样本类别和所述预测类别确定目标损失值,根据所述目标损失值更新所述目标模型的参数,得到训练后的目标模型,并将该训练后的目标模型作为新的目标模型。
  16. 根据权利要求15所述的方法,其特征在于,所述获取所述样本声音与多个所述样本类别的第一样本相似度包括:
    将所述样本声音按照预设周期进行特征提取,得到多个周期的样本特征;
    根据多个周期的样本特征,获取所述样本声音与多个所述样本类别的第一样本相似度。
  17. 根据权利要求16所述的方法,其特征在于,所述根据多个周期的样本特征,获取所述样本声音与多个所述样本类别的第一样本相似度包括:
    针对每个周期的样本特征,获取该样本特征对应的第一特征编码;并根据所述第一特征编码,得到该样本特征与多个样本类别的第二样本相似度;
    根据多个样本特征的所述第二样本相似度,计算得到所述样本声音与多个所述样本类别的第一样本相似度。
  18. 根据权利要求13所述的方法,其特征在于,所述样本类别包括所述目标类别和非目标类别。
  19. 根据权利要求13所述的方法,其特征在于,所述根据所述第二待定模型,确定所述目标声音分类模型包括:
    对所述第二待定模型的模型参数进行模型量化处理,得到所述目标声音分类模型。
  20. 一种声音识别装置,其特征在于,所述装置包括:
    声音采集模块,被配置为通过声音检测装置采集环境声音;
    声音分类模块,被配置为根据目标声音分类模型对所述环境声音进行分类处理,得到所述环境声音对应的目标类别;
    展示模块,被配置为通过展示装置展示所述目标类别。
  21. 一种电子设备,其特征在于,包括:
    处理器;
    用于存储处理器可执行指令的存储器;
    其中,所述处理器被配置为执行权利要求1至19中任一项所述方法的步骤。
  22. 一种计算机可读存储介质,其上存储有计算机程序指令,其特征在于,该程序指令被处理器执行时实现权利要求1至19中任一项所述方法的步骤。
  23. 一种车辆,其特征在于,所述车辆包括权利要求21所述的电子设备。
  24. 一种计算机程序产品,其特征在于,该计算机程序产品包含能够由可编程的装置执行的计算机程序,该计算机程序具有当由该可编程的装置执行时用于执行权利要求1至19中任一项所述方法的步骤的代码部分。
PCT/CN2022/090554 2022-01-18 2022-04-29 声音识别方法、装置、介质、设备、程序产品及车辆 WO2023137908A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210055284.9 2022-01-18
CN202210055284.9A CN114420163B (zh) 2022-01-18 2022-01-18 声音识别方法、装置、存储介质、电子设备及车辆

Publications (1)

Publication Number Publication Date
WO2023137908A1 true WO2023137908A1 (zh) 2023-07-27

Family

ID=81273884

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090554 WO2023137908A1 (zh) 2022-01-18 2022-04-29 声音识别方法、装置、介质、设备、程序产品及车辆

Country Status (2)

Country Link
CN (1) CN114420163B (zh)
WO (1) WO2023137908A1 (zh)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107031628A (zh) * 2015-10-27 2017-08-11 福特全球技术公司 使用听觉数据的碰撞规避
US20180108369A1 (en) * 2016-10-19 2018-04-19 Ford Global Technologies, Llc Vehicle Ambient Audio Classification Via Neural Network Machine Learning
CN109803207A (zh) * 2017-11-17 2019-05-24 英特尔公司 对周围声音中的音频信号的标识以及响应于该标识的对自主交通工具的引导
CN110047512A (zh) * 2019-04-25 2019-07-23 广东工业大学 一种环境声音分类方法、系统及相关装置
CN110348572A (zh) * 2019-07-09 2019-10-18 上海商汤智能科技有限公司 神经网络模型的处理方法及装置、电子设备、存储介质
US20200241552A1 (en) * 2019-01-24 2020-07-30 Aptiv Technologies Limited Using classified sounds and localized sound sources to operate an autonomous vehicle
CN111483461A (zh) * 2019-01-25 2020-08-04 三星电子株式会社 包括声音传感器的车辆驾驶控制装置及车辆驾驶控制方法
US10783434B1 (en) * 2019-10-07 2020-09-22 Audio Analytic Ltd Method of training a sound event recognition system
CN111898484A (zh) * 2020-07-14 2020-11-06 华中科技大学 生成模型的方法、装置、可读存储介质及电子设备

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN2897745Y (zh) * 2004-08-03 2007-05-09 彭小毛 汽车车内与车外的声音传输装置
CN107293308B (zh) * 2016-04-01 2019-06-07 腾讯科技(深圳)有限公司 一种音频处理方法及装置
CN106504768B (zh) * 2016-10-21 2019-05-03 百度在线网络技术(北京)有限公司 基于人工智能的电话拨测音频分类方法及装置
DE102018200054A1 (de) * 2018-01-03 2019-07-04 Ford Global Technologies, Llc Vorrichtung zur Totwinkelüberwachung eines Kraftfahrzeugs
US20200184991A1 (en) * 2018-12-05 2020-06-11 Pascal Cleve Sound class identification using a neural network
CN110414406A (zh) * 2019-07-23 2019-11-05 广汽蔚来新能源汽车科技有限公司 车内对象监管方法、装置、系统、车载终端和存储介质
CN112339760A (zh) * 2020-11-06 2021-02-09 广州小鹏汽车科技有限公司 车辆行驶控制方法、控制装置、车辆和可读存储介质
CN113183901B (zh) * 2021-06-03 2022-11-22 亿咖通(湖北)技术有限公司 车载座舱环境控制方法、车辆以及电子设备

Also Published As

Publication number Publication date
CN114420163B (zh) 2023-04-07
CN114420163A (zh) 2022-04-29

Similar Documents

Publication Publication Date Title
CN111741884B (zh) 交通遇险和路怒症检测方法
US20200317190A1 (en) Collision Control Method, Electronic Device and Storage Medium
US11354907B1 (en) Sonic sensing
WO2020134858A1 (zh) 人脸属性识别方法及装置、电子设备和存储介质
US9013575B2 (en) Doorbell communication systems and methods
CN110276235B (zh) 通过感测瞬态事件和连续事件的智能装置的情境感知
JP6251906B2 (ja) 状況(Context)に基づくスマートフォンセンサロジック
EP4064284A1 (en) Voice detection method, prediction model training method, apparatus, device, and medium
US20130070928A1 (en) Methods, systems, and media for mobile audio event recognition
US20160150338A1 (en) Sound event detecting apparatus and operation method thereof
JP2017505477A (ja) ドライバ行動監視システムおよびドライバ行動監視のための方法
US10614693B2 (en) Dangerous situation notification apparatus and method
KR102420567B1 (ko) 음성 인식 장치 및 방법
US20190047578A1 (en) Methods and apparatus for detecting emergency events based on vehicle occupant behavior data
US11355124B2 (en) Voice recognition method and voice recognition apparatus
CN110091877A (zh) 用于车辆安全驾驶的控制方法、系统及车辆
CN110880328B (zh) 到站提醒方法、装置、终端及存储介质
CN106650603A (zh) 车辆周边监控方法、装置和车辆
WO2021115232A1 (zh) 到站提醒方法、装置、终端及存储介质
CN114764911B (zh) 障碍物信息检测方法、装置、电子设备及存储介质
CN114332941A (zh) 基于乘车对象检测的报警提示方法、装置及电子设备
WO2023137908A1 (zh) 声音识别方法、装置、介质、设备、程序产品及车辆
CN115171692A (zh) 一种语音交互方法和装置
CN115497478A (zh) 一种车辆内外交流的方法、装置和可读存储介质
CN115171678A (zh) 语音识别方法、装置、电子设备、存储介质及产品

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22921339

Country of ref document: EP

Kind code of ref document: A1