CN114420163B - Voice recognition method, voice recognition device, storage medium, electronic device, and vehicle


Info

Publication number
CN114420163B
CN114420163B (application CN202210055284.9A)
Authority
CN
China
Prior art keywords
target
sound
sample
category
model
Prior art date
Legal status
Active
Application number
CN202210055284.9A
Other languages
Chinese (zh)
Other versions
CN114420163A (en)
Inventor
闫志勇
丁翰林
王永庆
张俊博
王育军
Current Assignee
Xiaomi Automobile Technology Co Ltd
Original Assignee
Xiaomi Automobile Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Xiaomi Automobile Technology Co Ltd filed Critical Xiaomi Automobile Technology Co Ltd
Priority to CN202210055284.9A priority Critical patent/CN114420163B/en
Priority to PCT/CN2022/090554 priority patent/WO2023137908A1/en
Publication of CN114420163A publication Critical patent/CN114420163A/en
Application granted granted Critical
Publication of CN114420163B publication Critical patent/CN114420163B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/14 Digital output to display device; Cooperation and interconnection of the display device with other functional units
    • G06F 3/1407 General aspects irrespective of display type, e.g. determination of decimal point position, display with fixed or driving decimal point, suppression of non-significant zeros
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The disclosure relates to a voice recognition method, a voice recognition device, a storage medium, an electronic apparatus, and a vehicle. The method comprises the following steps: collecting environmental sound through a sound detection device; classifying the environmental sound according to a target sound classification model to obtain a target category corresponding to the environmental sound; and displaying the target category through a display device. By performing sound recognition on the environmental sound, objects in the surrounding environment can be identified and classified comprehensively and accurately, improving the reliability of object detection.

Description

Voice recognition method, voice recognition device, storage medium, electronic device, and vehicle
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular to a voice recognition method and apparatus, a storage medium, an electronic device, and a vehicle.
Background
With the application of artificial intelligence technology in the vehicle field, automatic driving and driving-assistance technologies have developed rapidly. In the related art, information about surrounding objects is mainly acquired by radar or cameras, and automatic or assisted driving is realized through image processing. However, both radar and cameras have detection blind areas, so environmental detection is less reliable in certain scenes.
Disclosure of Invention
To overcome the above problems in the related art, the present disclosure provides a voice recognition method, apparatus, storage medium, electronic device, and vehicle.
According to a first aspect of embodiments of the present disclosure, there is provided a voice recognition method, the method including:
collecting environmental sound through a sound detection device;
classifying the environmental sound according to a target sound classification model to obtain a target class corresponding to the environmental sound;
and displaying the target category through a display device.
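The three claimed steps (collect, classify, display) can be sketched as a minimal pipeline. All names here (`SoundClassifier`, `recognize_and_display`, the stub classifier) are illustrative placeholders, not part of the patent text:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SoundClassifier:
    # classify() maps an audio buffer to (category, similarity) pairs
    classify: Callable[[List[float]], List[Tuple[str, float]]]


def recognize_and_display(audio: List[float],
                          classifier: SoundClassifier,
                          display: Callable[[str], None]) -> str:
    """Classify the collected ambient sound and show the target category."""
    candidates = classifier.classify(audio)
    # take the category with the highest similarity as the target category
    target_category, _ = max(candidates, key=lambda c: c[1])
    display(target_category)
    return target_category


# toy usage: a stub classifier and a list standing in for the display device
shown: List[str] = []
stub = SoundClassifier(classify=lambda a: [("siren", 0.92), ("horn", 0.41)])
result = recognize_and_display([0.0] * 16000, stub, shown.append)
```

The `display` callable deliberately abstracts over the mirror, screen, or audio output described later in the disclosure.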
Optionally, the displaying the target category by a display device includes:
determining a target image corresponding to the target category;
and displaying the target image through the display device.
Optionally, the determining a target image corresponding to the target category includes:
and determining a target image corresponding to the target category according to a category image corresponding relation, wherein the category image corresponding relation comprises the corresponding relation between the target category and the target image.
Optionally, the displaying the target image through the display device includes:
and displaying the target image in a preset area of the display device.
Optionally, the display device comprises one or more of an exterior mirror, an interior mirror and a centre screen of the vehicle.
Optionally, in a case that the display device includes an external mirror of a vehicle, the displaying the target image by the display device includes:
under the condition that a passenger is detected to be seated in the passenger seat, displaying the target image through the external mirrors on both sides of the vehicle; or
and in the case that the passenger is not detected to be seated in the passenger seat, displaying the target image through an external reflector on the side of the driver of the vehicle.
Optionally, the number of the sound detection devices is one or more, the sound detection devices are arranged in the outer reflector on any one side or multiple sides of the vehicle, and the ambient sound is ambient sound around the vehicle.
Optionally, in the case where there are a plurality of sound detection devices, the plurality of sound detection devices are respectively provided in the left external mirror and the right external mirror of the vehicle.
Optionally, the classifying the environmental sound according to the target sound classification model to obtain the target class corresponding to the environmental sound includes:
inputting the environmental sound into the target sound classification model to obtain one or more first candidate categories and a first target similarity of the environmental sound and each first candidate category;
determining the target class from the first candidate class according to the first target similarity.
Optionally, the determining the target class from the first candidate class according to the first target similarity includes:
taking the top N first candidate categories ranked by first target similarity in descending order, and taking those whose first target similarity is greater than or equal to a preset similarity threshold as second candidate categories;
determining the target class according to the second candidate class.
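The ranking-and-threshold step above can be sketched directly; the function name and toy categories are illustrative, and the exact N and threshold are left open by the patent:

```python
def select_second_candidates(candidates, top_n, threshold):
    """Rank (category, similarity) pairs by similarity in descending order,
    keep the top N, then keep only those at or above the similarity threshold."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)[:top_n]
    return [cat for cat, sim in ranked if sim >= threshold]


picked = select_second_candidates(
    [("car horn", 0.35), ("siren", 0.91), ("engine", 0.72)],
    top_n=2, threshold=0.5)
```

Here `picked` keeps the two highest-similarity candidates because both clear the 0.5 threshold, while "car horn" is ranked out.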
Optionally, the second candidate categories are multiple, and the determining the target category according to the second candidate categories includes:
determining the category relation between each second candidate category and other second candidate categories according to the preset category corresponding relation; the preset category corresponding relation comprises a category relation between any two second candidate categories, and the category relation comprises a confusion relation and a homogeneous relation;
and determining the target category according to the second candidate category and the category relation.
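One way to apply the category relations above is sketched below. The patent does not fix the exact resolution rule; dropping the lower-similarity member of each confusable or homogeneous pair is an assumption made for illustration:

```python
def resolve_related_candidates(candidates, relations):
    """candidates: (category, similarity) pairs (the second candidate categories).
    relations: dict mapping frozenset({a, b}) to "confusion" or "homogeneous".
    Whenever two candidates are related, drop the lower-similarity one."""
    sims = dict(candidates)
    cats = [c for c, _ in candidates]
    dropped = set()
    for i, a in enumerate(cats):
        for b in cats[i + 1:]:
            if frozenset({a, b}) in relations:
                dropped.add(a if sims[a] < sims[b] else b)
    return [c for c in cats if c not in dropped]


targets = resolve_related_candidates(
    [("ambulance", 0.9), ("police car", 0.8), ("dog", 0.6)],
    {frozenset({"ambulance", "police car"}): "confusion"})
```

In the toy run, "ambulance" and "police car" are marked as confusable, so only the higher-similarity "ambulance" survives alongside the unrelated "dog".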
Optionally, the target sound classification model is obtained by training according to a target neural network model, and the target neural network model is obtained by training a preset neural network model and performing model compression on the trained preset neural network model.
Optionally, the target sound classification model is trained by:
obtaining a plurality of sample sounds for training and a sample class corresponding to each sample sound;
performing a preset training step on a preset neural network model according to the sample sounds and sample categories to obtain a first undetermined model;
performing model compression on the first undetermined model to obtain a target neural network model;
executing the preset training step on the target neural network model according to the sample sound and the sample category to obtain a second undetermined model;
and determining the target sound classification model according to the second undetermined model.
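The training steps above can be sketched as a small pipeline. The callables and the toy string-based "model" are placeholders; the patent does not specify the actual training, compression, or quantization procedures:

```python
def train_target_sound_classifier(preset_model, sounds, labels,
                                  train, compress, quantize):
    """The claimed pipeline: train the preset network, compress it, retrain
    the compressed network, then determine the final model (here via a
    quantization callable, per the optional quantization claim)."""
    first_undetermined = train(preset_model, sounds, labels)
    target_network = compress(first_undetermined)
    second_undetermined = train(target_network, sounds, labels)
    return quantize(second_undetermined)


# toy callables that just record which stages ran, in order
model = train_target_sound_classifier(
    "preset", [], [],
    train=lambda m, s, l: m + "|train",
    compress=lambda m: m + "|compress",
    quantize=lambda m: m + "|quantize")
```

The recorded stage order shows the two rounds of training wrapped around the compression step, which is the point of the two-stage scheme.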
Optionally, the performing model compression on the first undetermined model to obtain a target neural network model includes:
obtaining the target neural network model from a preset number of convolutional layers of the first undetermined model, wherein the preset number is smaller than the total number of convolutional layers of the first undetermined model.
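Read literally, this compression step keeps only a prefix of the convolutional layers. A minimal sketch, treating the model as a plain list of layer names (an assumption for illustration):

```python
def truncate_conv_layers(layers, preset_number):
    """Build the target network from the first `preset_number` convolutional
    layers of the first undetermined model; the preset number must be smaller
    than the total layer count, as the claim requires."""
    if preset_number >= len(layers):
        raise ValueError("preset number must be smaller than the layer count")
    return layers[:preset_number]


compressed = truncate_conv_layers(["conv1", "conv2", "conv3", "conv4"], 2)
```

In a real network the retained layers would also need a new classification head, which the subsequent retraining step would fit.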
Optionally, the preset training step includes:
training a target model by circularly executing a model training step until the trained target model meets a preset iteration stopping condition according to the sample class and a prediction class, wherein the target model comprises a preset neural network model or the target neural network model, and the prediction class is a class output after the sample sound is input into the trained target model;
the model training step comprises:
obtaining a first sample similarity of the sample sound and a plurality of sample categories;
determining a prediction category corresponding to the sample sound from a plurality of sample categories according to the first sample similarity;
and under the condition that the trained target model does not meet the preset iteration stopping condition according to the sample category and the prediction category, determining a target loss value according to the sample category and the prediction category, updating parameters of the target model according to the target loss value to obtain the trained target model, and taking the trained target model as a new target model.
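The looped training step above can be sketched as follows. The prediction, loss, update, and stop-condition callables are placeholders for the patent's unspecified implementations, and the integer "model" in the toy run is purely illustrative:

```python
def run_preset_training(model, sounds, labels,
                        predict, loss_fn, update, should_stop):
    """Loop the model-training step until the stop condition holds. On each
    pass the trained model replaces the target model, as the claim describes."""
    while True:
        predictions = [predict(model, s) for s in sounds]
        if should_stop(labels, predictions):
            return model
        loss = loss_fn(labels, predictions)
        model = update(model, loss)  # trained model becomes the new target model


# toy run: the "model" is an integer nudged toward the label value 3
trained = run_preset_training(
    0, [None, None], [3, 3],
    predict=lambda m, s: m,
    loss_fn=lambda y, p: sum(abs(a - b) for a, b in zip(y, p)),
    update=lambda m, loss: m + 1,
    should_stop=lambda y, p: y == p)
```

The loop structure is the claimed one: predict, test the stop condition, otherwise compute the target loss and update the parameters.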
Optionally, the obtaining first sample similarities between the sample sound and the plurality of sample categories includes:
carrying out feature extraction on the sample sound according to a preset period to obtain sample features of multiple periods;
and obtaining the first sample similarities between the sample sound and the plurality of sample categories according to the sample features of the plurality of periods.
Optionally, the obtaining, according to the sample features of the plurality of periods, first sample similarities between the sample sound and the plurality of sample categories includes:
for the sample feature of each period, acquiring a first feature code corresponding to the sample feature, and obtaining, according to the first feature code, second sample similarities between the sample feature and the plurality of sample categories;
and calculating the first sample similarities between the sample sound and the plurality of sample categories according to the second sample similarities of the plurality of sample features.
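The per-period aggregation can be sketched as below. Averaging the per-period (second) similarities is one reasonable choice, not something the patent specifies; the `encode` and `score` callables stand in for the unspecified feature-coding and scoring steps:

```python
def first_sample_similarities(period_features, encode, score):
    """For each period's feature, compute its feature code and the per-category
    (second) similarities, then average the per-period similarities into the
    first sample similarity for each category."""
    per_period = []
    for feature in period_features:
        code = encode(feature)          # first feature code for this period
        per_period.append(score(code))  # dict: category -> second similarity
    n = len(per_period)
    return {cat: sum(p[cat] for p in per_period) / n for cat in per_period[0]}


sims = first_sample_similarities(
    [0.2, 0.4],
    encode=lambda f: f,                      # stand-in encoder
    score=lambda c: {"siren": c, "horn": 1 - c})
```

With two periods scoring 0.2/0.4 for "siren", the averaged first similarity is 0.3, and correspondingly 0.7 for "horn".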
Optionally, the sample categories include the target category and a non-target category.
Optionally, the determining the target sound classification model according to the second undetermined model includes:
and carrying out model quantization processing on the model parameters of the second undetermined model to obtain the target sound classification model.
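A naive sketch of the quantization step: map float weights to the int8 range by scaling, rounding, and clamping. Production schemes (per-channel scales, calibration) are considerably more involved; this only illustrates the idea of shrinking parameter precision for on-vehicle deployment:

```python
def quantize_parameters(weights, scale=127):
    """Post-training quantization sketch: scale each float weight into the
    signed 8-bit integer range, rounding and clamping out-of-range values."""
    return [int(max(-scale, min(scale, round(w * scale)))) for w in weights]


q = quantize_parameters([0.25, -1.0, 2.0])
```

Weights inside [-1, 1] map proportionally (0.25 becomes 32), while the out-of-range 2.0 is clamped to 127.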
According to a second aspect of embodiments of the present disclosure, there is provided a voice recognition apparatus, the apparatus including:
a sound collection module configured to collect an ambient sound by a sound detection device;
the sound classification module is configured to classify the environmental sound according to a target sound classification model to obtain a target class corresponding to the environmental sound;
a display module configured to display the target category through a display device.
Optionally, the display module is configured to determine a target image corresponding to the target category, and display the target image through the display device.
Optionally, the display module is configured to determine a target image corresponding to the target category according to a category image corresponding relationship, where the category image corresponding relationship includes a corresponding relationship between the target category and the target image.
Optionally, the display module is configured to display the target image in a preset area of the display device.
Optionally, the display device comprises one or more of an exterior mirror, an interior mirror and a centre screen of the vehicle.
Optionally, in a case where the display device includes an external mirror of the vehicle, the display module is configured to display the target image through external mirrors on both sides of the vehicle in a case where it is detected that a passenger is seated in the passenger seat; or, in the case that the passenger is not detected to be seated in the passenger seat, the target image is displayed through an external reflecting mirror on the side of the driver of the vehicle.
Optionally, the number of the sound detection devices is one or more, the sound detection devices are arranged in the outer reflector on any one side or multiple sides of the vehicle, and the ambient sound is ambient sound around the vehicle.
Alternatively, in the case where the sound detection device is plural, the plural sound detection devices are respectively provided in the left outer mirror and the right outer mirror of the vehicle.
Optionally, the sound classification module is configured to input the ambient sound into the target sound classification model, to obtain one or more first candidate categories and a first target similarity between the ambient sound and each first candidate category; determining the target class from the first candidate class according to the first target similarity.
Optionally, the sound classification module is configured to take the top N first candidate categories ranked by first target similarity in descending order, and take those whose first target similarity is greater than or equal to a preset similarity threshold as second candidate categories; and determine the target category according to the second candidate categories.
Optionally, the second candidate categories are multiple, and the sound classification module is configured to determine a category relationship between each second candidate category and other second candidate categories according to a preset category correspondence; the preset category corresponding relation comprises a category relation between any two second candidate categories, and the category relation comprises a confusion relation and a homogeneous relation; and determining the target category according to the second candidate category and the category relation.
Optionally, the target sound classification model is obtained by training according to a target neural network model, and the target neural network model is obtained by training a preset neural network model and performing model compression on the trained preset neural network model.
Optionally, the apparatus further comprises a model training module; the model training module configured to:
obtaining a plurality of sample sounds for training and a sample class corresponding to each sample sound;
performing a preset training step on a preset neural network model according to the sample sound and the sample category to obtain a first model to be fixed;
performing model compression on the first model to be fixed to obtain a target neural network model;
executing the preset training step on the target neural network model according to the sample sound and the sample category to obtain a second undetermined model;
and determining the target sound classification model according to the second undetermined model.
Optionally, the model training module is configured to obtain the target neural network model according to a preset number of convolutional layers of the first to-be-determined model; the preset number is smaller than the total number of the convolutional layers of the first to-be-determined model.
Optionally, the preset training step includes:
training a target model by circularly executing a model training step until the trained target model meets a preset iteration stopping condition according to the sample class and a prediction class, wherein the target model comprises a preset neural network model or the target neural network model, and the prediction class is a class output after the sample sound is input into the trained target model;
the model training step comprises:
obtaining first sample similarities between the sample sound and a plurality of sample categories;
determining a prediction category corresponding to the sample sound from a plurality of sample categories according to the first sample similarity;
and under the condition that the trained target model does not meet the preset iteration stopping condition according to the sample category and the prediction category, determining a target loss value according to the sample category and the prediction category, updating parameters of the target model according to the target loss value to obtain the trained target model, and taking the trained target model as a new target model.
Optionally, the model training module is configured to perform feature extraction on the sample sound according to a preset period to obtain sample features of multiple periods; and according to the sample characteristics of a plurality of periods, obtaining the similarity of the sample sound and the first samples of a plurality of sample categories.
Optionally, the model training module is configured to, for a sample feature of each period, obtain a first feature code corresponding to the sample feature; according to the first feature codes, obtaining the similarity of the sample features and second samples of a plurality of sample categories; and calculating to obtain the first sample similarity of the sample sound and the plurality of sample categories according to the second sample similarity of the plurality of sample characteristics.
Optionally, the sample categories include the target category and a non-target category.
Optionally, the model training module is configured to perform model quantization processing on the model parameters of the second undetermined model to obtain the target sound classification model.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the steps of the voice recognition method provided by the first aspect of the present disclosure.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the sound recognition method provided by the first aspect of the present disclosure.
According to a fifth aspect of the embodiments of the present disclosure, there is provided a vehicle including the electronic apparatus provided by the third aspect of the present disclosure.
The technical solutions provided by the embodiments of the disclosure can have the following beneficial effects: the environmental sound is collected through a sound detection device; the environmental sound is classified according to a target sound classification model to obtain a target category corresponding to the environmental sound; and the target category is displayed through a display device. In this way, performing sound recognition on the environmental sound enables comprehensive and accurate identification and classification of surrounding objects, compensating for the blind areas of camera or radar detection and improving the reliability of object detection. Furthermore, so that sound detection can run on the vehicle or device side, the complexity of the target sound classification model is reduced through model compression, while two rounds of training preserve the classification accuracy of the compressed model; the target sound classification model can therefore be deployed on the vehicle or device side, improving the timeliness of sound recognition and classification.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flow diagram illustrating a method of voice recognition according to an example embodiment.
Fig. 2 is a schematic diagram illustrating a sound detection device provided on a vehicle exterior mirror according to an exemplary embodiment.
FIG. 3 is a flow diagram illustrating a method of training a target sound classification model according to an exemplary embodiment.
Fig. 4 is a flowchart illustrating a step S102 according to the embodiment shown in fig. 1.
Fig. 5 is a block diagram illustrating a voice recognition device according to an example embodiment.
Fig. 6 is a block diagram illustrating another voice recognition apparatus according to an example embodiment.
FIG. 7 is a block diagram of an electronic device shown in accordance with an example embodiment.
FIG. 8 is a block diagram of a vehicle shown in accordance with an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
First, an application scenario of the present disclosure will be explained. The disclosure can be applied to sound recognition scenarios based on sound recognition, such as vehicle automatic driving or driving assistance, smart-home monitoring, health detection, defective-product screening on production lines, and industrial equipment fault detection. For vehicle automatic driving, the related art mainly acquires information about surrounding objects by radar or camera and realizes automatic or assisted driving through image processing. However, radar and cameras have detection blind areas. For example, a lidar can locate objects within several meters around the vehicle body but cannot locate distant moving objects beyond that range. With visual recognition by vehicle-body cameras there are likewise blind areas: distant video may be too blurred to recognize, or the camera may be occluded and unable to detect at all, making accurate identification of surrounding objects difficult. This reduces the reliability of environment detection and in turn affects the reliability of vehicle automatic driving.
The present disclosure takes vehicle automatic driving as an example application scenario, but is not limited to it; for example, the provided method may also be used in scenarios such as smart-home monitoring, health detection, production-line defective-product screening, and industrial equipment fault detection based on sound recognition.
In order to solve the above problems, the present disclosure provides a sound recognition method, device, storage medium, electronic device, and vehicle, which can collect an environmental sound through a sound detection device; and classifying the environmental sounds according to the target sound classification model to obtain the target category corresponding to the environmental sounds, so that the problem of blind areas caused by camera or radar detection is solved, and the reliability of object detection is improved.
The present disclosure is described below with reference to specific examples.
Fig. 1 illustrates a voice recognition method according to an example embodiment, which may include, as shown in fig. 1:
s101, collecting environmental sounds through a sound detection device.
The sound detection device may comprise one or more sound sensors, such as an electrodynamic microphone, a capacitive microphone, or a MEMS (Micro-Electro-Mechanical System) microphone.
The installation position of the sound detection device can be different in different application scenes, for example, in a vehicle automatic driving or driving-assistant scene, the sound detection device can be arranged at any one or more positions outside the vehicle body of the vehicle, such as a vehicle side vehicle body position, a vehicle side window position, a vehicle front face position, a vehicle rear face position, a vehicle roof position or a vehicle outside rear view mirror position, and the like, and the ambient sound around the vehicle can be collected through the sound detection device. Under the intelligent home monitoring scene, the sound detection device can be installed in each room in the home, and the environment sound of each room can be collected through the sound detection device.
And S102, classifying the environmental sound according to the target sound classification model to obtain a target class corresponding to the environmental sound.
The target sound classification model may be obtained by training a general sound classification model according to the sample sound.
And S103, displaying the target category through a display device.
The display device may include an image display device (e.g., a display screen), and a sound display device (e.g., a buzzer or sounder).
For example, in an automatic driving or driving-assisted scene of a vehicle, the display device may include one or more of an external mirror, an internal mirror and a center screen of the vehicle, and may display a target image corresponding to a target category so as to prompt a user of the target category appearing in the environment. Wherein the outer reflecting mirror may include two outer reflecting mirrors at both sides of the vehicle.
Further, the display device may also include a car audio device, which may play a target sound corresponding to the target category.
By adopting the method, the environmental sound is collected by the sound detection device, classified according to the target sound classification model to obtain the corresponding target category, and the target category is displayed through the display device. In this way, performing sound recognition on the environmental sound enables comprehensive and accurate identification and classification of surrounding objects, compensating for the blind areas of camera or radar detection and improving the reliability of object detection.
Further, the step S103 may show the target category by:
first, a target image corresponding to a target category is determined.
For example, the target image corresponding to the target category may be determined according to a category image correspondence relationship, where the category image correspondence relationship includes a correspondence relationship between the target category and the target image. For example, the target image corresponding to the target category "person" is a "human-shaped image"; the target image corresponding to the target class "animal" is a "quadruped image"; the target image corresponding to the target category "ambulance" is an "ambulance image" or the like.
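The category-image correspondence described above is essentially a lookup table. A minimal sketch using the examples from the description; the fallback entry is an illustrative addition, not part of the patent text:

```python
# example category-to-image correspondence from the description
CATEGORY_IMAGES = {
    "person": "human-shaped image",
    "animal": "quadruped image",
    "ambulance": "ambulance image",
}


def target_image_for(category, fallback="generic alert image"):
    """Look up the target image for a recognized target category."""
    return CATEGORY_IMAGES.get(category, fallback)
```

In practice the values would be image assets rather than strings, but the correspondence lookup is the same.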
Then, the target image is displayed through the display device.
For example, the target image may be displayed in a preset area of the display apparatus.
In the case that the display device is an external reflector, the target image may be displayed in a preset area of the external reflector on any one or more sides. The preset region may be a side region, for example, the preset region may be one or more of an upper side region, a lower side region, a left side region, or a right side region of the external mirror.
Therefore, the target image corresponding to the target category can be displayed through the display device, so that the user is accurately prompted to appear the target category, and the user is assisted to carry out corresponding emergency treatment.
It should be noted that, if the display device includes several of the vehicle's external mirrors, the internal mirror, and the central control screen, these display devices can serve as backups for one another, improving the reliability of image display and avoiding the situation where a target image cannot be displayed because one display device fails.
Further, in the case where the presentation apparatus includes an external mirror of a vehicle, the target image may be presented by:
under the condition that a passenger is detected to be seated in the passenger seat, the target image is displayed through the external mirrors on both sides of the vehicle; or
in the case where it is not detected that a passenger is seated in the passenger seat, the target image is displayed by an external mirror on the driver's side of the vehicle.
In this way, the target image can be displayed on the appropriate external mirror according to the vehicle's passenger occupancy, prompting the user that the corresponding target category has been detected.
In another embodiment of the present disclosure, there may be one or more sound detection devices, each of which may be disposed in the external mirror on any one or more sides of the vehicle, and the environmental sound is the ambient sound around the vehicle.
Fig. 2 is a schematic view of a sound detection device provided in a vehicle external mirror according to an exemplary embodiment. As shown in Fig. 2, the external mirror includes a mirror plate 201, a mirror housing 202, and a sound detection device 203, where the sound detection device 203 may be installed between the mirror plate 201 and the mirror housing 202. The sound detection device may include one or more sound sensors, such as a dynamic microphone, a condenser microphone, or a MEMS microphone.
In this way, the environmental sound outside the vehicle can be conveniently collected without affecting the vehicle's appearance; at the same time, the sound detection device is protected from sun and rain damage, which prolongs its service life.
Further, if there are a plurality of sound detection devices, the plurality of sound detection devices may be respectively disposed in the left outer mirror and the right outer mirror of the vehicle.
It should be noted that the left external mirror may be the mirror on the driver's side and the right external mirror the mirror on the passenger side, or the left external mirror may be the mirror on the passenger side and the right external mirror the mirror on the driver's side; the present disclosure is not limited thereto.
In this way, the sound intensities of the two sides of the vehicle can be detected by the plurality of sound detection devices, so that the direction of the source of the environmental sound can be determined.
Furthermore, a target reflector can be determined according to the source direction, and a target image is displayed through the target reflector.
For example, if the sound intensity detected by the sound detecting device of the left outer mirror is greater than the sound intensity detected by the sound detecting device of the right outer mirror, it may be determined that the environmental sound originates in the left side of the vehicle, the left outer mirror is set as the target mirror, and the target image is displayed on the target mirror.
For another example, if the sound intensity detected by the sound detection device of the right outer mirror is greater than the sound intensity detected by the sound detection device of the left outer mirror, it may be determined that the direction of the environmental sound is the right side of the vehicle, the right outer mirror is used as the target mirror, and the target image is displayed on the target mirror.
In this way, the external mirror on the side from which the sound originates can display the target image, prompting the user that the target category may appear in that direction.
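The mirror-selection logic described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function name, the intensity inputs, and the equal-intensity fallback of showing the image on both mirrors are assumptions.

```python
def select_target_mirror(left_intensity: float, right_intensity: float) -> str:
    """Pick the target mirror from the sound intensities reported by the
    microphones in the left and right external mirrors."""
    if left_intensity > right_intensity:
        return "left"   # environmental sound originates from the left side
    if right_intensity > left_intensity:
        return "right"  # environmental sound originates from the right side
    return "both"       # no clear direction (assumed fallback behavior)
```

The target image would then be displayed on the mirror returned by this function.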
Further, a plurality of sound detection devices may be provided in the left and right external mirrors of the vehicle. For example, there may be four sound detection devices, two disposed between the housing and the mirror plate of the vehicle's left external mirror and the other two disposed between the housing and the mirror plate of the right external mirror. Each sound detection device may be given vibration-proofing and/or waterproofing treatment; for example, a rubber sleeve may be wrapped around each device to provide waterproofing and shock absorption and to reduce wind noise.
Further, the sound detection device may include a data interface, and the data interface may be connected to a vehicle-mounted sound module of the vehicle through a cable, so as to transmit the detected environmental sound to the vehicle-mounted sound module through the data interface, so that the vehicle-mounted sound module performs classification processing on the environmental sound.
In addition, the sound detection device can also comprise a power supply interface and a clock interface, and the power supply interface and the clock interface can be connected with an in-vehicle module of the vehicle through cables so as to provide power supply and clock for the sound detection device.
In another embodiment of the present disclosure, the target sound classification model may be obtained by training according to a target neural network model, where the target neural network model is obtained by training a preset neural network model and performing model compression on the trained preset neural network model.
It should be noted that, in the related art, the artificial intelligence models used to classify and recognize sounds are often large, complex neural network models. Such large models must run on servers with demanding hardware requirements; they can be deployed on a cloud server but are difficult to deploy on the vehicle side or the device side. However, cloud-based sound recognition lacks real-time performance, which affects the reliability and timeliness of sound-assisted automatic driving functions. In this embodiment, the target sound classification model is trained from a target neural network model, which is obtained by training a preset neural network model and performing model compression on the trained preset neural network model. In this way, the complexity of the target sound classification model and its dependence on hardware are both reduced, so the model can be deployed on the vehicle side or the device side.
By adopting the above method, the environmental sound is collected by the sound detection device and classified according to the target sound classification model to obtain the corresponding target category. The target sound classification model is trained from a target neural network model, which in turn is obtained by training a preset neural network model and performing model compression on the trained model. In this way, objects in the surrounding environment can be accurately identified and classified in all directions through sound recognition, which solves the problem of blind spots in camera or radar detection and improves the reliability of object detection. In addition, so that sound detection can run on the vehicle side or the device side, model compression reduces the complexity of the target sound classification model while the two rounds of training preserve its classification accuracy; the model can therefore be deployed on the vehicle side or the device side, improving the timeliness of sound recognition and classification.
FIG. 3 is a flowchart illustrating a method of training a target sound classification model according to an exemplary embodiment, which may include, as shown in FIG. 3:
S301, a plurality of sample sounds for training and the sample class corresponding to each sample sound are obtained.
For example, a plurality of sample sounds for training may be obtained from a public sound database, and a sample class may be labeled for each sample sound; or video data can be obtained from a video database, audio data is extracted from the video data as sample sounds, and then each sample sound is labeled with a sample type. The sample class may be a class of sounds such as an alarm sound, a person sound, a cry, a vehicle horn sound, a vehicle hard brake sound, and the like.
Further, the sample class may be a strong label or a weak label of the sample sound. With a strong label, both the sample classes appearing in the sample sound and the start and end time of each appearance must be annotated; with a weak label, only the sample classes appearing in the sound need to be annotated, without the specific start and end times. Weak labels reduce the manual annotation workload and improve sample acquisition efficiency.
Further, the sample classes may include target classes and non-target classes. The target classes are the classes the target sound classification model is expected to output when processing environmental sound, and the non-target classes represent all other classes, i.e., classes the model does not output.
For example, suppose the target sound classification model is used in a vehicle automatic driving scene, and the expected target classes output after classifying environmental sound are "alarm sound, human voice, crying, vehicle horn, and hard-braking sound". If the sample classes include only these target classes, the trained model may misrecognize sounds, because real environmental sound contains many more sound classes: wind noise, for instance, is somewhat similar to crying, and without dedicated wind-noise samples the trained model may mistake wind for crying. In this embodiment, sample sounds of non-target classes are added to the training so that sounds outside the target classes can be absorbed; the trained model learns finer-grained features, its abstraction capability improves, the distinction between target and non-target classes is sharpened, and the accuracy of sound recognition increases.
S302, a preset training step is performed on a preset neural network model according to the sample sounds and sample classes to obtain a first undetermined model.
For example, feature extraction may be performed on the sample sound using a Fourier transform and mel filters to obtain sample audio features, which may include FBANK, MFCC, or PNCC features. For example, a 1024-point Fourier transform may be computed every 20 ms with a 64 ms window, and 64-dimensional FBANK features are then obtained through a bank of 64 mel filters.
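The FBANK extraction just described can be sketched as below. This is an illustrative NumPy implementation under the example's parameters (16 kHz audio, 1024-point FFT, 64 ms window, 20 ms hop, 64 mel filters); the function names and the log-energy floor are assumptions, not from the patent.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def fbank_features(signal, sr=16000, win_len=1024, hop=320, n_filters=64):
    # 1024-point FFT over a 64 ms window (1024 samples at 16 kHz),
    # hopped every 20 ms (320 samples), as in the example above.
    window = np.hanning(win_len)
    frames = [np.abs(np.fft.rfft(signal[s:s + win_len] * window)) ** 2
              for s in range(0, len(signal) - win_len + 1, hop)]
    fb = mel_filterbank(n_filters, win_len, sr)
    return np.log(np.array(frames) @ fb.T + 1e-10)  # shape (num_frames, 64)
```

The resulting (num_frames, 64) log-mel matrix is what would be fed into the preset neural network model.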
The preset training step may be performed by inputting the sample audio features into a preset neural network model, which may be a convolutional neural network in the related art.
The preset neural network model may be a model of a mobile-end convolutional neural network, such as MobileNet, and the mobile-end convolutional neural network may include N convolutional layers, where N may be any positive integer greater than or equal to 5, and N may be 10 or 16, for example.
S303, model compression is performed on the first undetermined model to obtain a target neural network model.
For example, the target neural network model may be obtained from a preset number of convolutional layers of the first undetermined model, where the preset number is smaller than the total number of convolutional layers of the first undetermined model.
S304, the preset training step is performed on the target neural network model according to the sample sounds and sample classes to obtain a second undetermined model.
For example, if the total number of convolutional layers of the first undetermined model is N, the preset number may be M with M less than N; for example, N may be 10 and M may be 5.
In this way, the parameters of the first M convolutional layers of the trained first undetermined model are used as the initialization parameters of the convolutional layers of the target neural network model, and the preset training step is then performed on the target neural network model to obtain the second undetermined model.
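The compression step above amounts to truncating the trained N-layer model to its first M layers and reusing those parameters as initialization. A minimal sketch, assuming per-layer parameters are stored as a list of arrays (the function name is illustrative):

```python
import numpy as np

def compress_model(trained_layers, m):
    """Keep the parameters of the first M convolutional layers of the
    trained N-layer model as initialization for the M-layer target model."""
    assert m < len(trained_layers), "preset number M must be less than N"
    return [w.copy() for w in trained_layers[:m]]

# N = 10 trained layers compressed to a target model with M = 5 layers.
trained = [np.full((3, 3), i, dtype=np.float32) for i in range(10)]
target_init = compress_model(trained, 5)
```

The target model then undergoes the preset training step again starting from `target_init`, which is why a second round of training is needed to recover accuracy.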
S305, determining the target sound classification model according to the second undetermined model.
Illustratively, the second pending model may be taken as the target sound classification model.
By adopting this approach, the preset training step is performed on a preset neural network model according to the plurality of sample sounds and the sample class corresponding to each, yielding a first undetermined model; model compression on the first undetermined model yields the target neural network model; performing the preset training step on the target neural network model according to the sample sounds and sample classes yields a second undetermined model; and the target sound classification model is determined from the second undetermined model. Through model compression and two rounds of training, the complexity of the resulting target sound classification model can be reduced while its classification accuracy is preserved, so the model is leaner and more efficient, depends less on hardware, and is easier to deploy on the vehicle side or the device side.
Further, in step S305, model quantization may be performed on the model parameters of the second undetermined model to obtain the target sound classification model.
Model quantization may include model-parameter compression; for example, the parameters may be quantized to a preset number of bits, such as 8 bits or 16 bits. For instance, all floating-point parameters may be quantized and compressed into integer parameters. This further reduces the model's size while keeping its performance essentially unchanged, lowers its operating power consumption, makes the resulting target sound classification model leaner and more efficient, reduces its dependence on hardware, and eases deployment on the vehicle side or the device side.
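One common way to realize the float-to-integer compression described above is symmetric linear quantization to 8 bits. The sketch below is an assumption about the scheme (the patent does not specify one); function names and the per-tensor scale are illustrative.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric linear quantization of float weights to 8-bit integers."""
    scale = np.abs(weights).max() / 127.0 if weights.size else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover approximate float weights for inference-time computation.
    return q.astype(np.float32) * scale
```

Storing `q` plus one scale per tensor replaces 32-bit floats with 8-bit integers, roughly a 4x size reduction, at the cost of a small rounding error.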
Further, the preset training step may include the following steps:
The model training step is executed in a loop to train a target model until, according to the sample class and the prediction class, the trained target model is determined to meet a preset iteration-stop condition. The target model is either the preset neural network model or the target neural network model, and the prediction class is the class output after the sample sound is input into the trained target model.
It should be noted that the preset iteration stop condition may be a condition for stopping iteration, which is commonly used in the prior art, for example, a condition that a similarity difference between a sample category and a prediction category is smaller than a preset similarity difference threshold, and the disclosure does not limit this.
The model training step comprises:
S11, first sample similarities between the sample sound and the plurality of sample classes are obtained.
Exemplarily, feature extraction is performed on the sample sound at a preset period to obtain sample features for a plurality of periods; then, the first sample similarities between the sample sound and the plurality of sample classes are obtained from those per-period sample features.
For example, the sample sound may be any sample audio data longer than 5 seconds, and the preset period may be any duration between 20 milliseconds and 2 seconds, for example 1 second or 500 milliseconds. The sample audio data is divided according to the preset period, feature extraction is performed on each divided audio segment, and the sample feature of each segment is obtained.
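The period-based division can be sketched as below. Dropping a final remainder shorter than one period is an assumption; the patent does not say how a partial segment is handled.

```python
def split_into_periods(samples, sr: int, period_s: float = 1.0):
    """Split audio samples into consecutive segments of period_s seconds.
    A trailing remainder shorter than one period is dropped (assumption)."""
    step = int(sr * period_s)
    return [samples[i:i + step]
            for i in range(0, len(samples) - step + 1, step)]
```

Each returned segment would then go through the feature extraction described above to produce one per-period sample feature.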
Then, obtaining the first sample similarities between the sample sound and the plurality of sample classes from the per-period sample features may follow either of the two similarity acquisition approaches below:
the first similarity obtaining method may include the following steps:
first, for a sample feature of each period, a first feature code corresponding to the sample feature is obtained.
Then, a second feature code of the sample sound is calculated according to the first feature codes of the plurality of sample features.
Finally, the first sample similarities between the sample sound and the plurality of sample classes are obtained according to the second feature code.
For example, the average of the plurality of first feature codes may be used as the second feature code; that is, the embedding layer (Embedding) before the output layer is averaged, and the first sample similarities between the sample sound and the plurality of sample classes are then obtained from this averaged second feature code. For example, the similarity between the second feature code and the class feature code corresponding to each sample class may be computed and used as the first sample similarity between the sample sound and that class.
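The first approach (average the per-period embeddings, then score once per class) can be sketched as follows. Cosine similarity and the dictionary of class feature codes are assumptions for illustration; the patent does not fix a similarity measure.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))

def first_sample_similarities(first_codes, class_codes):
    """Average the per-period first feature codes into the sample's second
    feature code, then score that single code against each sample class."""
    second_code = np.mean(first_codes, axis=0)  # averaged embedding layer
    return {name: cosine_similarity(second_code, code)
            for name, code in class_codes.items()}
```

Because the whole sample is collapsed into one code, this variant works best when training and inference audio have similar durations, which matches the caveat in the text above.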
With this first acquisition approach, the trained target sound classification model is most accurate on environmental sounds whose duration roughly matches that of the sample sounds; to improve the real-time performance and accuracy of environmental sound recognition, a model trained with this approach may therefore use short sample sounds for training.
The second similarity obtaining method may include the following steps:
First, for each period, the first feature code corresponding to that period's sample feature is obtained, and second sample similarities between the sample feature and the plurality of sample classes are obtained from the first feature code.
Then, the first sample similarities between the sample sound and the plurality of sample classes are calculated from the second sample similarities of the plurality of sample features.
For example, the sample feature of each period may be input into the convolutional layer, a first feature code corresponding to the sample feature is obtained, the similarity between the first feature code and the sample feature code corresponding to each sample class is calculated, and the similarity is used as the second sample similarity between the sample feature and each sample class.
Then, for each sample class, the average of the second sample similarities between the sample sound's multiple sample features and that class is calculated and used as the first sample similarity between the sample sound and that class.
With this second approach, the requirement on the duration of the training sample sounds is low; because similarity is computed on the segments obtained by splitting the sample sound, even long sample sounds allow the trained target sound classification model to recognize environmental sounds promptly and accurately.
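The second approach (score every period against every class, then average per class) can be sketched as below; as before, cosine similarity and the class-code dictionary are illustrative assumptions.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))

def first_similarities_from_segments(first_codes, class_codes):
    """Compute per-period (second) similarities against each class, then
    average them into the sample-level first sample similarity."""
    result = {}
    for name, class_code in class_codes.items():
        second_sims = [cosine_similarity(code, class_code)
                       for code in first_codes]
        result[name] = float(np.mean(second_sims))
    return result
```

Averaging after scoring, rather than before, is what makes this variant insensitive to the total sample duration: each segment contributes an independent score.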
And S12, determining a prediction type corresponding to the sample sound from a plurality of sample types according to the first sample similarity.
S13, when it is determined from the sample class and the prediction class that the trained target model does not meet the preset iteration-stop condition, a target loss value is determined from the sample class and the prediction class, the parameters of the target model are updated according to the target loss value to obtain a newly trained target model, and that model is taken as the new target model.
Similarly, the preset iteration stopping condition may be a condition for stopping iteration commonly used in the prior art, for example, a condition that a similarity difference between the sample class and the prediction class is smaller than a preset similarity difference threshold, which is not limited in this disclosure.
In this way, the target model can be trained through the preset training step, improving the accuracy of the target categories the trained target model obtains when recognizing environmental sounds.
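The looped training step above can be illustrated with a toy one-parameter classifier; `predict`, `update`, the misclassification count standing in for the target loss value, and the zero-loss stop condition are all illustrative stand-ins for the real network's forward pass, parameter update, and preset iteration-stop condition.

```python
def predict(model, x):
    # Toy forward pass: a single learned decision threshold.
    return 1 if x >= model["threshold"] else 0

def update(model, samples, labels, lr):
    # Toy parameter update: nudge the threshold toward fewer mistakes.
    for x, y in zip(samples, labels):
        model["threshold"] += lr * (predict(model, x) - y)
    return model

def preset_training_step(samples, labels, lr=0.1, max_iters=100):
    model = {"threshold": 0.0}
    for _ in range(max_iters):
        preds = [predict(model, x) for x in samples]
        loss = sum(p != y for p, y in zip(preds, labels))  # target loss value
        if loss == 0:  # preset iteration-stop condition met
            break
        model = update(model, samples, labels, lr)
    return model
```

The structure (predict, compare against labels, stop or update, repeat) is the point; the real models in this disclosure would use gradient-based updates on the convolutional layers instead.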
Fig. 4 is a flowchart of step S102 according to the embodiment shown in Fig. 1; as shown in Fig. 4, step S102 may include:
S1021, inputting the environmental sound into the target sound classification model to obtain one or more first candidate categories and the first target similarity between the environmental sound and each first candidate category;
and S1022, determining a target category from the first candidate categories according to the first target similarity.
For example, the first candidate categories whose first target similarities rank in the top N from high to low may be taken as target categories; the first candidate categories whose first target similarity is greater than or equal to a preset similarity threshold may be taken as target categories; or the first candidate categories that both rank in the top N by first target similarity and are greater than or equal to the preset similarity threshold may be taken as target categories.
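The third option (top N and above threshold) can be sketched as a small selection function; the function name and dictionary layout are illustrative.

```python
def select_target_categories(similarities: dict, n: int, threshold: float):
    """Keep the top-N first candidate categories by first target similarity
    that also meet the preset similarity threshold."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, sim in ranked[:n] if sim >= threshold]
```

Setting `threshold=0.0` recovers the plain top-N option, and `n=len(similarities)` recovers the threshold-only option, so one function covers all three variants in the paragraph above.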
In another embodiment of the present disclosure, the step S1022 may include:
first, a first candidate category, which is ranked from high similarity to low similarity and has a similarity greater than or equal to a preset similarity threshold, is used as a second candidate category.
The target class is then determined based on the second candidate class.
Wherein the second candidate category may be one or more.
If the second candidate category is one, the second candidate category may be directly used as the target category.
If there are multiple second candidate categories, they may all be taken directly as target categories, or the second candidate category with the greatest first target similarity may be taken as the target category.
Further, if there are a plurality of second candidate categories, the target category may be determined by:
firstly, according to the preset category corresponding relation, the category relation between each second candidate category and other second candidate categories is determined.
The preset category correspondence includes the category relationship between any two second candidate categories, and a category relationship is either a confusion relationship or a homogeneous relationship. A confusion relationship characterizes two non-homogeneous but easily confused categories, such as "wind" and "crying"; a homogeneous relationship characterizes two categories belonging to the same scene, such as "crying" and "human voice".
Then, the target category is determined according to the second candidate category and the category relation.
For example, if the plurality of second candidate categories contain only homogeneous relationships, they may all be taken directly as target categories, or the second candidate category with the greatest first target similarity may be taken as the target category.
For another example, if the plurality of second candidate categories include a confusion relationship, a confusion coefficient of the second candidate categories may be calculated. When the confusion coefficient is less than or equal to a preset confusion threshold, the second candidate categories are taken as target categories, or the one with the greatest first target similarity is taken as the target category; when the confusion coefficient is greater than the preset confusion threshold, no target category is output.
The confusion coefficient may be the ratio of the number of confusion relationships to the total number of category relationships among the second candidate categories. Illustratively, if there are 4 second candidate categories and a category relationship exists between every two of them, the total number of category relationships is 6; if 3 of these are confusion relationships, the confusion coefficient is 0.5. With a preset confusion threshold of 0.7, the confusion coefficient is below the threshold, so the second candidate categories may be taken as target categories, or the one with the greatest first target similarity may be taken as the target category.
In this way, the recognition accuracy of the model can be judged from the confusion relationships among the recognized candidate categories; when the confusion coefficient is less than or equal to the preset confusion threshold, the recognition accuracy is considered sufficient, making the obtained target category more accurate.
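The confusion-coefficient check can be sketched as follows; the relation encoding, the single-best-candidate output, and the function names are illustrative assumptions.

```python
def confusion_coefficient(relations) -> float:
    """relations: pairwise relations among the second candidate categories,
    each either 'confusion' or 'homogeneous'."""
    if not relations:
        return 0.0
    return relations.count("confusion") / len(relations)

def decide_targets(candidates, similarities, relations, threshold=0.7):
    # Above the preset confusion threshold the result is too confusable,
    # so no target category is output.
    if confusion_coefficient(relations) > threshold:
        return []
    # Otherwise, here we output the candidate with the greatest
    # first target similarity (outputting all candidates is also allowed).
    return [max(candidates, key=lambda c: similarities[c])]
```

With 3 confusion relations out of 6, as in the worked example, the coefficient is 0.5, below the 0.7 threshold, so a target category is output.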
Fig. 5 is a block diagram illustrating a voice recognition apparatus 500 according to an exemplary embodiment, and as shown in fig. 5, the apparatus 500 may include:
a sound collection module 501 configured to collect an ambient sound by a sound detection device;
a sound classification module 502 configured to classify the environmental sound according to a target sound classification model, so as to obtain a target class corresponding to the environmental sound;
a presentation module 503 configured to present the object category through a presentation apparatus.
Optionally, the presentation module 503 is configured to determine a target image corresponding to the target category; and displaying the target image through the display device.
Optionally, the presentation module 503 is configured to determine a target image corresponding to the target category according to a category image corresponding relationship, where the category image corresponding relationship includes a corresponding relationship between the target category and the target image.
Optionally, the display module 503 is configured to display the target image in a preset area of the display apparatus.
Optionally, the display device includes one or more of an exterior mirror, an interior mirror, and a center screen of the vehicle.
Optionally, in the case where the display device includes an external mirror of the vehicle, the display module 503 is configured to display the target image through the external mirrors on both sides of the vehicle when a passenger is detected in the passenger seat; or to display the target image through the external mirror on the driver's side when no passenger is detected in the passenger seat.
Optionally, there are one or more sound detection devices, each disposed in the external mirror on any one or more sides of the vehicle, and the environmental sound is the ambient sound around the vehicle.
Alternatively, in the case where the sound detection device is plural, the plural sound detection devices are provided in the left outer mirror and the right outer mirror of the vehicle, respectively.
Optionally, the sound classification module 502 is configured to input the environmental sound into the target sound classification model, to obtain one or more first candidate categories, and a first target similarity between the environmental sound and each first candidate category; determining the target class from the first candidate class according to the first target similarity.
Optionally, the sound classification module 502 is configured to take the first candidate categories that rank in the top N by first target similarity from large to small and whose similarity is greater than or equal to a preset similarity threshold as second candidate categories, and to determine the target category according to the second candidate categories.
Optionally, the second candidate categories are multiple, and the sound classification module 502 is configured to determine a category relationship between each second candidate category and other second candidate categories according to a preset category correspondence; the preset category corresponding relation comprises a category relation between any two second candidate categories, and the category relation comprises a confusion relation and a homogeneous relation; and determining the target category according to the second candidate category and the category relation.
Optionally, the target sound classification model is obtained by training according to a target neural network model, and the target neural network model is obtained by training a preset neural network model and performing model compression on the trained preset neural network model.
Fig. 6 is a block diagram illustrating another voice recognition apparatus according to an exemplary embodiment, and as shown in fig. 6, the apparatus may further include a model training module 601, where the model training module 601 is configured to:
acquiring a plurality of sample sounds for training and a sample class corresponding to each sample sound;
performing a preset training step on a preset neural network model according to the sample sounds and sample classes to obtain a first undetermined model;
performing model compression on the first undetermined model to obtain a target neural network model;
executing the preset training step on the target neural network model according to the sample sound and the sample type to obtain a second undetermined model;
and determining the target sound classification model according to the second undetermined model.
Optionally, the model training module 601 is configured to obtain the target neural network model from a preset number of convolutional layers of the first undetermined model, where the preset number is smaller than the total number of convolutional layers of the first undetermined model.
Optionally, the model training module 601 is configured to perform a model training step in a loop to train a target model until it is determined that the trained target model meets a preset iteration stop condition according to the sample class and a prediction class, where the target model includes a preset neural network model or the target neural network model, and the prediction class is a class output after the sample sound is input into the trained target model;
the model training step comprises:
obtaining first sample similarities between the sample sound and the plurality of sample classes;
determining the prediction class corresponding to the sample sound from the plurality of sample classes according to the first sample similarities;
and under the condition that the trained target model does not meet the preset iteration stopping condition according to the sample class and the prediction class, determining a target loss value according to the sample class and the prediction class, updating parameters of the target model according to the target loss value to obtain the trained target model, and taking the trained target model as a new target model.
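As a minimal sketch of the cyclic training step just described (compute similarities, pick the prediction category, compute a loss, update parameters, repeat until the stop condition holds): the linear scoring model, softmax cross-entropy loss, and learning rate below are illustrative assumptions, not details taken from the disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_categories, n_feats = 4, 8
W = rng.normal(size=(n_categories, n_feats)) * 0.1   # target-model parameters

def first_sample_similarity(x, W):
    return W @ x                                      # similarity to each category

def training_step(x, label, W, lr=0.1):
    sims = first_sample_similarity(x, W)
    pred = int(np.argmax(sims))                       # prediction category
    if pred == label:                                 # iteration stop condition met
        return W, True
    # target loss: cross-entropy over softmax of the similarities
    p = np.exp(sims - sims.max()); p /= p.sum()
    grad = np.outer(p, x); grad[label] -= x           # d(loss)/dW
    return W - lr * grad, False                       # updated target model

x, label = rng.normal(size=n_feats), 2
for _ in range(200):                                  # loop the training step
    W, done = training_step(x, label, W)
    if done:
        break
```

In a real implementation the stop condition would be evaluated over the whole sample set (e.g. accuracy or loss plateau), not a single sample as in this toy loop.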
Optionally, the model training module 601 is configured to perform feature extraction on the sample sound according to a preset period to obtain sample features of a plurality of periods, and obtain, according to the sample features of the plurality of periods, the first sample similarity between the sample sound and each of the plurality of sample categories.
Optionally, the model training module 601 is configured to, for the sample feature of each period, obtain a first feature code corresponding to the sample feature; obtain, according to the first feature code, a second sample similarity between the sample feature and each of the plurality of sample categories; and calculate the first sample similarity between the sample sound and each of the plurality of sample categories according to the second sample similarities of the plurality of sample features.
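The two-level similarity just described can be sketched as follows: split the sample sound into fixed-length periods, extract one feature per period, score each period feature against every category (second sample similarity), then aggregate across periods into the first sample similarity. The mean-energy feature, scalar category codes, and averaging aggregation are illustrative assumptions only.

```python
import numpy as np

def periodic_features(sound, period):
    """Feature extraction per preset period: here, the mean of each frame."""
    n = len(sound) // period
    frames = np.asarray(sound[: n * period]).reshape(n, period)
    return frames.mean(axis=1, keepdims=True)       # one feature per period

def first_sample_similarity(sound, category_codes, period=4):
    feats = periodic_features(sound, period)        # sample features of several periods
    # second sample similarity: each period feature vs. each category code
    second = -np.abs(feats - category_codes[None, :])  # shape (n_periods, n_categories)
    return second.mean(axis=0)                      # aggregate over periods

category_codes = np.array([0.0, 1.0, 5.0])          # one code per sample category
sound = [1.1, 0.9, 1.0, 1.0, 0.8, 1.2, 1.0, 1.0]
sims = first_sample_similarity(sound, category_codes)
print(int(np.argmax(sims)))  # 1
```

A real system would use learned feature codes (e.g. embeddings) rather than scalar codes, but the flow — per-period similarity first, aggregation second — is the same.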
Optionally, the sample class includes the target class and a non-target class.
Optionally, the model training module 601 is configured to perform model quantization processing on the model parameters of the second undetermined model to obtain the target sound classification model.
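As an illustrative stand-in for the model quantization step above, the following sketch applies symmetric 8-bit post-training quantization to a parameter array; the int8 scheme and scale choice are assumptions, since the disclosure does not specify a quantization method.

```python
import numpy as np

def quantize_int8(weights):
    """Map float parameters onto int8 with a single symmetric scale."""
    scale = np.abs(weights).max() / 127.0 or 1.0    # avoid zero scale
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.02, 1.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(q.dtype)  # int8
```

Storing int8 parameters (plus one scale per tensor) cuts the model size roughly fourfold versus float32, which is the motivation for quantizing before deploying at the vehicle end or device end.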
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
In summary, with the apparatus in the above embodiment of the present disclosure, the environmental sound is collected by the sound detection device, and the environmental sound is classified according to a target sound classification model to obtain a target category corresponding to the environmental sound, where the target sound classification model is obtained after training a target neural network model, and the target neural network model is obtained by training a preset neural network model and performing model compression on the trained preset neural network model. In this way, performing sound recognition on the environmental sound enables accurate, omnidirectional recognition and classification of objects in the surrounding environment, which solves the problem of blind areas in camera or radar detection and improves the reliability of object detection. In addition, in order to perform sound detection at the vehicle end or the device end, the complexity of the target sound classification model can be reduced through model compression, while the sound classification accuracy of the trained target sound classification model is guaranteed through two rounds of training, so that the target sound classification model can be deployed at the vehicle end or the device end, improving the timeliness of sound recognition and classification.
The present disclosure also provides a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the sound recognition method provided by the present disclosure.
Fig. 7 is a block diagram of an electronic device 900 shown in accordance with an example embodiment. For example, the electronic device 900 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, a router, a vehicle terminal, and so forth.
Referring to fig. 7, electronic device 900 may include one or more of the following components: a processing component 902, a memory 904, a power component 906, a multimedia component 908, an audio component 910, an input/output (I/O) interface 912, a sensor component 914, and a communications component 916.
The processing component 902 generally controls the overall operation of the electronic device 900, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 902 may include one or more processors 920 to execute instructions to perform all or a portion of the steps of the sound recognition method described above. Further, the processing component 902 may include one or more modules that facilitate interaction between the processing component 902 and other components. For example, the processing component 902 may include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
The memory 904 is configured to store various types of data to support operation at the electronic device 900. Examples of such data include instructions for any application or method operating on the electronic device 900, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 904 may be implemented by any type or combination of volatile or non-volatile storage devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power component 906 provides power to the various components of the electronic device 900. Power components 906 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for electronic device 900.
The multimedia components 908 include a screen that provides an output interface between the electronic device 900 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 908 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 900 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 910 is configured to output and/or input audio signals. For example, the audio component 910 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 900 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 904 or transmitted via the communication component 916. In some embodiments, audio component 910 also includes a speaker for outputting audio signals.
The I/O interface 912 provides an interface between the processing component 902 and a peripheral interface module, which may be a keyboard, click wheel, button, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 914 includes one or more sensors for providing status assessments of various aspects of the electronic device 900. For example, the sensor component 914 may detect the open/closed state of the electronic device 900 and the relative positioning of components, such as the display and keypad of the electronic device 900; the sensor component 914 may also detect a change in the position of the electronic device 900 or a component of the electronic device 900, the presence or absence of user contact with the electronic device 900, the orientation or acceleration/deceleration of the electronic device 900, and a change in the temperature of the electronic device 900. The sensor component 914 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor component 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 916 is configured to facilitate wired or wireless communication between the electronic device 900 and other devices. The electronic device 900 may access a wireless network based on a communication standard, such as Wi-Fi, 2G, 3G, 4G, 5G, 6G, NB-IoT, eMTC, or the like, or a combination thereof. In an exemplary embodiment, the communication component 916 receives a broadcast signal or broadcast-associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 916 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 900 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the sound recognition method described above.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions, such as the memory 904 including instructions executable by the processor 920 of the electronic device 900 to perform the sound recognition method described above, is also provided. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-mentioned sound recognition method when executed by the programmable apparatus.
Fig. 8 is a block diagram illustrating a vehicle according to an exemplary embodiment. As shown in fig. 8, the vehicle may include the electronic device 900 described above.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (20)

1. A sound recognition method, the method comprising:
collecting an environmental sound through a sound detection device;
classifying the environmental sound according to a target sound classification model to obtain a target category corresponding to the environmental sound;
displaying the target category through a display device;
wherein the classifying the environmental sound according to the target sound classification model to obtain the target category corresponding to the environmental sound comprises:
inputting the environmental sound into the target sound classification model to obtain one or more first candidate categories and a first target similarity between the environmental sound and each first candidate category;
determining the target category from the first candidate categories according to the first target similarities;
wherein the determining the target category from the first candidate categories according to the first target similarities comprises:
taking, from the top N first candidate categories ranked in descending order of the first target similarity, each first candidate category whose first target similarity is greater than or equal to a preset similarity threshold as a second candidate category;
determining the target category according to the second candidate categories;
wherein there are a plurality of second candidate categories, and the determining the target category according to the second candidate categories comprises:
determining a category relationship between each second candidate category and the other second candidate categories according to a preset category correspondence, wherein the preset category correspondence comprises a category relationship between any two second candidate categories, and the category relationship comprises a confusion relationship and a homogeneous relationship;
and determining the target category according to the second candidate categories and the category relationships.
2. The method of claim 1, wherein the displaying the target category through a display device comprises:
determining a target image corresponding to the target category;
and displaying the target image through the display device.
3. The method of claim 2, wherein the determining the target image corresponding to the target category comprises:
and determining a target image corresponding to the target category according to a category image corresponding relation, wherein the category image corresponding relation comprises the corresponding relation between the target category and the target image.
4. The method of claim 2, wherein the displaying the target image through the display device comprises:
and displaying the target image in a preset area of the display device.
5. The method of claim 2, wherein the display device comprises one or more of an exterior mirror, an interior mirror, and a center screen of the vehicle.
6. The method of claim 5, wherein, in a case where the display device comprises an exterior mirror of the vehicle, the displaying the target image through the display device comprises:
displaying the target image through the exterior mirrors on both sides of the vehicle in a case where a passenger is detected to be seated in the passenger seat; or,
displaying the target image through the exterior mirror on the driver side of the vehicle in a case where no passenger is detected to be seated in the passenger seat.
7. The method of claim 1, wherein there are one or more sound detection devices disposed in the exterior mirror on any one or more sides of a vehicle, and the environmental sound is the environmental sound in the surroundings of the vehicle.
8. The method according to claim 7, wherein, in a case where there are a plurality of sound detection devices, the plurality of sound detection devices are respectively disposed in a left exterior mirror and a right exterior mirror of the vehicle.
9. The method according to any one of claims 1 to 8, wherein the target sound classification model is obtained after training a target neural network model, and the target neural network model is obtained by training a preset neural network model and performing model compression on the trained preset neural network model.
10. The method of claim 9, wherein the target sound classification model is trained by:
obtaining a plurality of sample sounds for training and a sample class corresponding to each sample sound;
performing a preset training step on a preset neural network model according to the sample sounds and the sample categories to obtain a first undetermined model;
performing model compression on the first undetermined model to obtain a target neural network model;
executing the preset training step on the target neural network model according to the sample sound and the sample category to obtain a second undetermined model;
and determining the target sound classification model according to the second undetermined model.
11. The method of claim 10, wherein the performing model compression on the first undetermined model to obtain a target neural network model comprises:
obtaining the target neural network model from a preset number of convolutional layers of the first undetermined model, wherein the preset number is smaller than the total number of convolutional layers of the first undetermined model.
12. The method of claim 10, wherein the preset training step comprises:
training a target model by cyclically performing a model training step until it is determined, according to the sample category and a prediction category, that the trained target model meets a preset iteration stop condition, wherein the target model comprises the preset neural network model or the target neural network model, and the prediction category is the category output after the sample sound is input into the trained target model;
the model training step comprises:
obtaining a first sample similarity between the sample sound and each of a plurality of sample categories;
determining a prediction category corresponding to the sample sound from a plurality of sample categories according to the first sample similarity;
and under the condition that the trained target model does not meet the preset iteration stopping condition according to the sample category and the prediction category, determining a target loss value according to the sample category and the prediction category, updating parameters of the target model according to the target loss value to obtain the trained target model, and taking the trained target model as a new target model.
13. The method of claim 12, wherein the obtaining a first sample similarity between the sample sound and each of the plurality of sample categories comprises:
carrying out feature extraction on the sample sound according to a preset period to obtain sample features of multiple periods;
and obtaining, according to the sample features of the plurality of periods, the first sample similarity between the sample sound and each of the plurality of sample categories.
14. The method of claim 13, wherein the obtaining, according to the sample features of the plurality of periods, the first sample similarity between the sample sound and each of the plurality of sample categories comprises:
obtaining, for the sample feature of each period, a first feature code corresponding to the sample feature; obtaining, according to the first feature code, a second sample similarity between the sample feature and each of the plurality of sample categories;
and calculating the first sample similarity between the sample sound and each of the plurality of sample categories according to the second sample similarities of the plurality of sample features.
15. The method of claim 10, wherein the sample categories comprise the target category and a non-target category.
16. The method of claim 10, wherein determining the target sound classification model according to the second pending model comprises:
and carrying out model quantization processing on the model parameters of the second undetermined model to obtain the target sound classification model.
17. A sound recognition apparatus, characterized in that the apparatus comprises:
a sound collection module configured to collect an ambient sound by a sound detection device;
the sound classification module is configured to classify the environmental sound according to a target sound classification model to obtain a target category corresponding to the environmental sound;
a display module configured to display the target category through a display device;
the sound classification module is configured to input the environmental sound into the target sound classification model to obtain one or more first candidate categories and a first target similarity between the environmental sound and each first candidate category, and determine the target category from the first candidate categories according to the first target similarities;
the sound classification module is configured to take, from the top N first candidate categories ranked in descending order of the first target similarity, each first candidate category whose first target similarity is greater than or equal to a preset similarity threshold as a second candidate category, and determine the target category according to the second candidate categories;
the sound classification module is configured to determine a category relationship between each second candidate category and the other second candidate categories according to a preset category correspondence, wherein the preset category correspondence comprises a category relationship between any two second candidate categories, and the category relationship comprises a confusion relationship and a homogeneous relationship; and determine the target category according to the second candidate categories and the category relationships.
18. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the steps of the method of any one of claims 1 to 16.
19. A computer-readable storage medium, on which computer program instructions are stored, which program instructions, when executed by a processor, carry out the steps of the method according to any one of claims 1 to 16.
20. A vehicle characterized in that it comprises an electronic device according to claim 18.
CN202210055284.9A 2022-01-18 2022-01-18 Voice recognition method, voice recognition device, storage medium, electronic device, and vehicle Active CN114420163B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210055284.9A CN114420163B (en) 2022-01-18 2022-01-18 Voice recognition method, voice recognition device, storage medium, electronic device, and vehicle
PCT/CN2022/090554 WO2023137908A1 (en) 2022-01-18 2022-04-29 Sound recognition method and apparatus, medium, device, program product and vehicle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210055284.9A CN114420163B (en) 2022-01-18 2022-01-18 Voice recognition method, voice recognition device, storage medium, electronic device, and vehicle

Publications (2)

Publication Number Publication Date
CN114420163A CN114420163A (en) 2022-04-29
CN114420163B true CN114420163B (en) 2023-04-07

Family

ID=81273884

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210055284.9A Active CN114420163B (en) 2022-01-18 2022-01-18 Voice recognition method, voice recognition device, storage medium, electronic device, and vehicle

Country Status (2)

Country Link
CN (1) CN114420163B (en)
WO (1) WO2023137908A1 (en)

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN2897745Y (en) * 2004-08-03 2007-05-09 彭小毛 Voice transmission inside and outside car
US9873428B2 (en) * 2015-10-27 2018-01-23 Ford Global Technologies, Llc Collision avoidance using auditory data
CN107293308B (en) * 2016-04-01 2019-06-07 腾讯科技(深圳)有限公司 A kind of audio-frequency processing method and device
US10276187B2 (en) * 2016-10-19 2019-04-30 Ford Global Technologies, Llc Vehicle ambient audio classification via neural network machine learning
CN106504768B (en) * 2016-10-21 2019-05-03 百度在线网络技术(北京)有限公司 Phone testing audio frequency classification method and device based on artificial intelligence
US10747231B2 (en) * 2017-11-17 2020-08-18 Intel Corporation Identification of audio signals in surrounding sounds and guidance of an autonomous vehicle in response to the same
DE102018200054A1 (en) * 2018-01-03 2019-07-04 Ford Global Technologies, Llc Device for blind spot monitoring of a motor vehicle
US20200184991A1 (en) * 2018-12-05 2020-06-11 Pascal Cleve Sound class identification using a neural network
US11567510B2 (en) * 2019-01-24 2023-01-31 Motional Ad Llc Using classified sounds and localized sound sources to operate an autonomous vehicle
KR102566412B1 (en) * 2019-01-25 2023-08-14 삼성전자주식회사 Apparatus for controlling driving of vehicle and method thereof
CN110047512B (en) * 2019-04-25 2021-04-16 广东工业大学 Environmental sound classification method, system and related device
CN110348572B (en) * 2019-07-09 2022-09-30 上海商汤智能科技有限公司 Neural network model processing method and device, electronic equipment and storage medium
CN110414406A (en) * 2019-07-23 2019-11-05 广汽蔚来新能源汽车科技有限公司 Interior object monitoring and managing method, device, system, car-mounted terminal and storage medium
US10783434B1 (en) * 2019-10-07 2020-09-22 Audio Analytic Ltd Method of training a sound event recognition system
CN111898484A (en) * 2020-07-14 2020-11-06 华中科技大学 Method and device for generating model, readable storage medium and electronic equipment
CN112339760A (en) * 2020-11-06 2021-02-09 广州小鹏汽车科技有限公司 Vehicle travel control method, control device, vehicle, and readable storage medium
CN113183901B (en) * 2021-06-03 2022-11-22 亿咖通(湖北)技术有限公司 Vehicle-mounted cabin environment control method, vehicle and electronic equipment

Also Published As

Publication number Publication date
WO2023137908A1 (en) 2023-07-27
CN114420163A (en) 2022-04-29

Similar Documents

Publication Publication Date Title
US9049352B2 (en) Pool monitor systems and methods
US9013575B2 (en) Doorbell communication systems and methods
JP6251906B2 (en) Smartphone sensor logic based on context
CN110895861B (en) Abnormal behavior early warning method and device, monitoring equipment and storage medium
US20160044287A1 (en) Monitoring systems and methods
US10212778B1 (en) Face recognition systems with external stimulus
CN110751659B (en) Image segmentation method and device, terminal and storage medium
US11195408B1 (en) Sending signals for help during an emergency event
CN111968635B (en) Speech recognition method, device and storage medium
EP4191579A1 (en) Electronic device and speech recognition method therefor, and medium
CN106650603A (en) Vehicle surrounding monitoring method, apparatus and the vehicle
CN110310618A (en) Processing method, processing unit and the vehicle of vehicle running environment sound
CN111899760A (en) Audio event detection method and device, electronic equipment and storage medium
WO2017113078A1 (en) Switching method and portable electronic device
CN110091877A (en) Control method, system and vehicle for safe driving of vehicle
CN114764911B (en) Obstacle information detection method, obstacle information detection device, electronic device, and storage medium
CN113066048A (en) Segmentation map confidence determination method and device
CN111435422B (en) Action recognition method, control method and device, electronic equipment and storage medium
CN114420163B (en) Voice recognition method, voice recognition device, storage medium, electronic device, and vehicle
CN113269307A (en) Neural network training method and target re-identification method
CN110059619B (en) Automatic alarm method and device based on image recognition
CN112061024A (en) Vehicle external speaker system
CN111341307A (en) Voice recognition method and device, electronic equipment and storage medium
CN115312068B (en) Voice control method, equipment and storage medium
CN110194181A (en) Drive support method, vehicle and drive assist system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant