CN110473568A - Scene recognition method, device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN110473568A
CN110473568A
Authority
CN
China
Prior art keywords
scene
channel audio
classification
double
frequency signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910731749.6A
Other languages
Chinese (zh)
Other versions
CN110473568B (en)
Inventor
宋天龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jinsheng Communication Technology Co Ltd
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Shanghai Jinsheng Communication Technology Co Ltd
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jinsheng Communication Technology Co Ltd, Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Shanghai Jinsheng Communication Technology Co Ltd
Priority to CN201910731749.6A priority Critical patent/CN110473568B/en
Publication of CN110473568A publication Critical patent/CN110473568A/en
Application granted granted Critical
Publication of CN110473568B publication Critical patent/CN110473568B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The embodiments of the present application disclose a scene recognition method, a device, a storage medium, and an electronic device. An embodiment first collects a dual-channel audio signal of the scene to be identified; then, through prediction scheme 1, based on the dual-channel audio signal, and prediction scheme 2, based on a single-channel audio signal synthesized from the dual-channel audio signal, two candidate scene classification results of the scene to be identified are obtained; the two candidate scene classification results are then fused to obtain the target scene classification result of the scene to be identified. As a result, the scene in which the electronic device is located can be identified without relying on positioning technology, so no restriction is placed on the environment of the electronic device; compared with the related art, the application can identify the scene to be identified of the electronic device more flexibly and accurately.

Description

Scene recognition method, device, storage medium and electronic equipment
Technical field
This application relates to the technical field of scene recognition, and in particular to a scene recognition method, device, storage medium, and electronic device.
Background technique
At present, electronic devices such as tablet computers and mobile phones can analyze the scene in which the user is located and perform corresponding processing operations based on the analysis result, thereby improving the user experience. In the related art, when analyzing the user's scene, an electronic device usually relies on GPS positioning: it obtains current location information via GPS and determines from that location information the scene in which the electronic device — and therefore the user — is located. However, in indoor environments or environments with many obstructions, GPS positioning is difficult to achieve in the related art, and the environmental scene of the electronic device therefore cannot be identified.
Summary of the invention
The embodiments of the present application provide a scene recognition method, device, storage medium, and electronic device, which can identify the environmental scene in which an electronic device is located.
In a first aspect, an embodiment of the present application provides a scene recognition method applied to an electronic device, the electronic device including two microphones, the method including:
performing audio collection on a scene to be identified through the two microphones to obtain a dual-channel audio signal;
extracting a first acoustic feature of the dual-channel audio signal according to a first preset feature extraction strategy, and calling a pre-trained first scene classification model to perform scene classification based on the first acoustic feature, obtaining a first candidate scene classification result;
performing audio synthesis processing on the dual-channel audio signal to obtain a single-channel audio signal;
extracting a second acoustic feature of the single-channel audio signal according to a second preset feature extraction strategy, and calling a pre-trained second scene classification model to perform scene classification based on the second acoustic feature, obtaining a second candidate scene classification result;
obtaining a target scene classification result of the scene to be identified according to the first candidate scene classification result and the second candidate scene classification result.
In a second aspect, an embodiment of the present application provides a scene recognition device applied to an electronic device, the electronic device including two microphones, the device including:
an audio collection module, configured to perform audio collection on a scene to be identified through the two microphones to obtain a dual-channel audio signal;
a first classification module, configured to extract a first acoustic feature of the dual-channel audio signal according to a first preset feature extraction strategy, and call a pre-trained first scene classification model to perform scene classification based on the first acoustic feature, obtaining a first candidate scene classification result;
an audio synthesis module, configured to perform audio synthesis processing on the dual-channel audio signal to obtain a single-channel audio signal;
a second classification module, configured to extract a second acoustic feature of the single-channel audio signal according to a second preset feature extraction strategy, and call a pre-trained second scene classification model to perform scene classification based on the second acoustic feature, obtaining a second candidate scene classification result;
a classification integration module, configured to obtain a target scene classification result of the scene to be identified according to the first candidate scene classification result and the second candidate scene classification result.
In a third aspect, an embodiment of the present application provides a storage medium on which a computer program is stored; when the computer program is called by a processor, it executes the scene recognition method provided by any embodiment of the present application.
In a fourth aspect, an embodiment of the present application provides an electronic device including a processor and a memory, the memory storing a computer program; by calling the computer program, the processor executes the scene recognition method provided by any embodiment of the present application.
The embodiment of the present application first collects the dual-channel audio signal of the scene to be identified; then, through prediction scheme 1, based on the dual-channel audio signal, and prediction scheme 2, based on the single-channel audio signal synthesized from the dual-channel audio signal, two candidate scene classification results of the scene to be identified are obtained; the two candidate scene classification results are then fused to obtain the target scene classification result of the scene to be identified. As a result, the scene of the electronic device can be identified without relying on positioning technology, so no restriction is placed on the environment of the electronic device; compared with the related art, the application can identify the scene to be identified of the electronic device more flexibly and accurately.
Detailed description of the invention
In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some examples of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flow diagram of the scene recognition method provided by the embodiments of the present application.
Fig. 2 is a schematic diagram of the arrangement of the two microphones of the electronic device in the embodiment of the present application.
Fig. 3 is a schematic diagram of predicting the target scene classification result from the dual-channel audio signal of the scene to be identified in the embodiment of the present application.
Fig. 4 is an example diagram of the scene type information input interface provided in the embodiment of the present application.
Fig. 5 is a schematic diagram of extracting mel-frequency cepstrum coefficients in the embodiment of the present application.
Fig. 6 is a schematic diagram of extracting per-channel energy normalization features in the embodiment of the present application.
Fig. 7 is another flow diagram of the scene recognition method provided by the embodiments of the present application.
Fig. 8 is a structural schematic diagram of the scene recognition device provided by the embodiments of the present application.
Fig. 9 is a structural schematic diagram of the electronic device provided by the embodiments of the present application.
Fig. 10 is another structural schematic diagram of the electronic device provided by the embodiments of the present application.
Specific embodiment
Please refer to the drawings, in which identical reference symbols represent identical components. The principles of the application are illustrated as implemented in a suitable computing environment. The following description is based on the illustrated specific embodiments of the application and should not be regarded as limiting other specific embodiments not detailed herein.
The embodiments of the present application provide a scene recognition method. The execution subject of the method may be the scene recognition device provided by the embodiments of the present application, or an electronic device integrated with that device, where the device may be implemented in hardware or software. The electronic device may be a smartphone, tablet computer, palmtop computer, laptop, desktop computer, or similar device.
Please refer to Fig. 1, which is a flow diagram of the scene recognition method provided by the embodiments of the present application. The detailed process of the method may be as follows:
In 101, audio collection is performed on the scene to be identified through two microphones to obtain a dual-channel audio signal.
The scene to be identified may be the scene in which the electronic device is currently located.
It should be noted that the electronic device includes two microphones, which may be built-in microphones or external microphones (wired or wireless); the embodiment of the present application places no particular limitation on this. For example, referring to Fig. 2, the electronic device includes two microphones arranged back to back: microphone 1, arranged on the lower side of the electronic device with its pickup hole facing downward, and microphone 2, arranged on the upper side with its pickup hole facing upward. In addition, the two microphones may be configured as non-directional (in other words, omnidirectional) microphones.
In the embodiment of the present application, the electronic device first performs audio collection on the scene to be identified through the two microphones. For example, taking the current scene as the scene to be identified, the electronic device can collect audio of the current scene synchronously through the two microphones, obtaining a dual-channel audio signal of identical duration.
It should be noted that if the microphones included in the electronic device are analog microphones, an analog audio signal will be collected; the analog audio signal then needs to be converted from analog to digital to obtain a digitized audio signal for subsequent processing. For example, after collecting the two-channel analog audio signal of the scene to be identified through the two microphones, the electronic device can sample each of the two analog audio signals at a sampling frequency of 16 kHz, obtaining two digitized audio signals.
Those of ordinary skill in the art will appreciate that if the microphones included in the electronic device are digital microphones, a digitized audio signal is collected directly and no analog-to-digital conversion is needed.
In 102, a first acoustic feature of the dual-channel audio signal is extracted according to a first preset feature extraction strategy, and a pre-trained first scene classification model is called to perform scene classification based on the first acoustic feature, obtaining a first candidate scene classification result.
It should be noted that in the embodiment of the present application, a first scene classification model and a second scene classification model are trained in advance. The two models are of different types: the first scene classification model takes a dual-channel acoustic feature as input, the second scene classification model takes a single-channel acoustic feature as input, and both output a scene classification result predicted from the input acoustic feature.
Correspondingly, after collecting the dual-channel audio signal of the scene to be identified, the electronic device extracts the first acoustic feature of the dual-channel audio signal, a dual-channel acoustic feature, according to the first preset feature extraction strategy. The electronic device then inputs the extracted first acoustic feature into the pre-trained first scene classification model, which predicts the scene type of the scene to be identified based on this input. Finally, the electronic device takes the scene classification result output by the first scene classification model as the first candidate scene classification result of the scene to be identified.
In 103, audio synthesis processing is performed on the dual-channel audio signal to obtain a single-channel audio signal.
In the embodiment of the present application, the electronic device also performs audio synthesis processing on the dual-channel audio signal, synthesizing the dual-channel audio signal into a single-channel audio signal. For example, the average of the two channels of the dual-channel audio signal may be taken to obtain the single-channel audio signal.
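The channel-averaging synthesis mentioned above can be sketched as follows. This is a minimal illustration; the function name and the (2, n_samples) array layout are assumptions, as the patent does not prescribe an implementation:

```python
import numpy as np

def synthesize_mono(dual_channel: np.ndarray) -> np.ndarray:
    """Average the two channels of a (2, n_samples) signal into mono."""
    assert dual_channel.shape[0] == 2, "expected a dual-channel signal"
    return dual_channel.mean(axis=0)

# Two toy channels of 4 samples each.
stereo = np.array([[0.2, 0.4, -0.2, 0.0],
                   [0.0, 0.4,  0.2, 0.0]])
mono = synthesize_mono(stereo)  # → [0.1, 0.4, 0.0, 0.0]
```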
It should be noted that the execution order of steps 102 and 103 is not dictated by their numbering: 103 may be executed after 102 completes, 102 may be executed after 103 and 104 complete, or 102 and 103 may be executed simultaneously.
In 104, a second acoustic feature of the single-channel audio signal is extracted according to a second preset feature extraction strategy, and a pre-trained second scene classification model is called to perform scene classification based on the second acoustic feature, obtaining a second candidate scene classification result.
As described above, a second scene classification model is also trained in the embodiment of the present application; the second scene classification model takes a single-channel acoustic feature as input.
Correspondingly, after synthesizing the single-channel audio signal from the collected dual-channel audio signal, the electronic device extracts the second acoustic feature of the synthesized single-channel audio signal, a single-channel acoustic feature, according to the second preset feature extraction strategy. The electronic device then inputs the extracted second acoustic feature into the pre-trained second scene classification model, which predicts the scene type of the scene to be identified based on this input. Finally, the electronic device takes the scene classification result output by the second scene classification model as the second candidate scene classification result of the scene to be identified.
In 105, the target scene classification result of the scene to be identified is obtained according to the first candidate scene classification result and the second candidate scene classification result.
In the embodiment of the present application, after obtaining the first candidate scene classification result and the second candidate scene classification result of the scene to be identified, the electronic device can obtain the target scene classification result of the scene to be identified from these two results. For example, of the first candidate scene classification result and the second candidate scene classification result, the electronic device may take the one with the higher corresponding probability value as the target scene classification result of the scene to be identified.
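The probability-based fusion described above can be sketched as follows; the (label, probability) pair representation and the function name are illustrative assumptions:

```python
def fuse_by_probability(candidate_1, candidate_2):
    """Pick the candidate scene whose classifier assigned the higher probability.

    Each candidate is a (scene_label, probability) pair; names are illustrative.
    """
    return candidate_1 if candidate_1[1] >= candidate_2[1] else candidate_2

scheme_1 = ("subway scene", 0.72)   # from the dual-channel model
scheme_2 = ("bus scene", 0.55)      # from the single-channel model
target = fuse_by_probability(scheme_1, scheme_2)
# target → ("subway scene", 0.72)
```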
In addition, after obtaining the target scene classification result of the scene to be identified, the electronic device can also perform a preset operation corresponding to that target scene classification result. For example, when the target scene classification result obtained for the scene to be identified is "subway scene", the electronic device can configure its audio output parameters to the preset audio output parameters corresponding to the subway scene.
As shown in Fig. 3, in the embodiment of the present application, the dual-channel audio signal of the scene to be identified is collected first; then, through prediction scheme 1, based on the dual-channel audio signal, and prediction scheme 2, based on the single-channel audio signal synthesized from the dual-channel audio signal, two candidate scene classification results of the scene to be identified are obtained; the two candidate scene classification results are then fused to obtain the target scene classification result of the scene to be identified. As a result, the scene of the electronic device can be identified without positioning technology, and no restriction is placed on the environment of the electronic device; compared with the related art, the application can identify the scene to be identified of the electronic device more flexibly and accurately.
In one embodiment, "performing audio synthesis processing on the dual-channel audio signal to obtain a single-channel audio signal" includes:
synthesizing the dual-channel audio signal into a single-channel audio signal according to a preset beamforming algorithm.
In the embodiment of the present application, beamforming may be used to synthesize the dual-channel audio signal into a single-channel audio signal. The electronic device can perform beamforming on the collected dual-channel audio signal of the scene to be identified according to the preset beamforming algorithm, obtaining an enhanced single-channel audio signal. The enhanced single-channel audio signal thus retains the sound from a specific direction in the original dual-channel audio signal and can characterize the scene to be identified more accurately.
It should be noted that the embodiment of the present application places no particular limitation on which beamforming algorithm is used; it can be chosen by those of ordinary skill in the art according to actual needs. For example, the generalized sidelobe canceller algorithm is used for beamforming in the embodiment of the present application.
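As a simplified stand-in for the generalized sidelobe canceller named above, the following sketch shows the fixed delay-and-sum beamformer that a GSC builds on; the steering delay, the synthetic test signal, and all names are assumptions for illustration only:

```python
import numpy as np

def delay_and_sum(ch1: np.ndarray, ch2: np.ndarray, delay_samples: int) -> np.ndarray:
    """Steer toward a direction by delaying one channel, then average.

    A minimal delay-and-sum beamformer; the patent itself uses a generalized
    sidelobe canceller, which builds an adaptive stage on top of this.
    """
    ch2_delayed = np.roll(ch2, delay_samples)
    ch2_delayed[:delay_samples] = 0.0  # zero the wrapped-around samples
    return 0.5 * (ch1 + ch2_delayed)

fs = 16000                      # 16 kHz, as in the sampling example above
t = np.arange(fs) / fs
source = np.sin(2 * np.pi * 440 * t)
ch1 = source.copy()
ch2 = np.roll(source, -2)       # channel 2 hears the source 2 samples early
enhanced = delay_and_sum(ch1, ch2, delay_samples=2)
```

After aligning the two channels, the in-phase source adds coherently while uncorrelated noise would average down, which is the enhancement effect described above.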
In one embodiment, "obtaining the target scene classification result of the scene to be identified according to the first candidate scene classification result and the second candidate scene classification result" includes:
(1) judging whether the first candidate scene classification result and the second candidate scene classification result are the same scene classification result;
(2) if so, setting that common scene classification result as the target scene classification result.
In the embodiment of the present application, the target scene classification result of the scene to be identified can be obtained by fusing the first candidate scene classification result and the second candidate scene classification result only when the two agree.
The electronic device first judges whether the first candidate scene classification result and the second candidate scene classification result are the same scene classification result. If they are the same, the electronic device sets that common scene classification result as the target scene classification result of the scene to be identified. If they are not the same, the electronic device determines that this identification of the scene to be identified has failed, and re-acquires the dual-channel audio signal of the scene to be identified for another identification.
For example, if the first candidate classification result is "subway scene" and the second candidate classification result is also "subway scene", the electronic device takes "subway scene" as the target scene classification result of the scene to be identified.
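The agreement-based fusion just described can be sketched as follows; this is a minimal illustration, and returning None to signal a failed identification (triggering re-acquisition) is an assumption:

```python
def fuse_by_agreement(candidate_1: str, candidate_2: str):
    """Return the common scene label if both schemes agree, else None.

    None signals that this identification attempt failed and the device
    should re-acquire the dual-channel audio signal.
    """
    return candidate_1 if candidate_1 == candidate_2 else None

assert fuse_by_agreement("subway scene", "subway scene") == "subway scene"
assert fuse_by_agreement("subway scene", "bus scene") is None
```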
In one embodiment, before "performing audio collection on the scene to be identified through the two microphones", the method further includes:
(1) obtaining dual-channel audio signals of multiple known scenes of different types through the two microphones;
(2) extracting the mel-frequency cepstrum coefficients of the dual-channel audio signals of each type of known scene, and constructing a first sample set corresponding to the multiple known scenes of different types;
(3) constructing a residual convolutional neural network model, training the residual convolutional neural network model on the first sample set, and setting the trained residual convolutional neural network model as the first scene classification model.
The embodiment of the present application further provides a scheme for training the first scene classification model, as follows:
The electronic device first obtains dual-channel audio signals of multiple known scenes of different types through the two microphones. When obtaining these signals, on the one hand, the electronic device can be carried by the relevant technical personnel into multiple known scenes of different types and triggered to collect an audio signal in each known scene. On the other hand, when triggered to obtain an audio signal, the electronic device collects a dual-channel audio signal of a first preset duration through the two microphones (a suitable duration can be configured according to actual needs by those skilled in the art, for example, 5 minutes). Referring to Fig. 4, after collecting the dual-channel audio signal of the first preset duration, the electronic device presents a scene type information input interface and receives the input scene type information through that interface (the scene type information is entered by the relevant technical personnel; for example, if the electronic device is carried into a subway carriage for audio signal collection, the scene type information can be entered as "subway carriage scene"). After receiving the input scene type information, the electronic device associates the collected dual-channel audio signal with the received scene type information.
In this way, the electronic device can obtain dual-channel audio signals corresponding to known scenes of different types, for example audio signals of known scenes such as a dining room scene, subway carriage scene, bus scene, office scene, and street scene.
In addition, when obtaining the dual-channel audio signals of known scenes of different types, a preset number of dual-channel audio signals can be obtained for each scene type (a suitable number can be configured according to actual needs by those skilled in the art, for example, 50). For the bus scene, for example, dual-channel audio signals of the same bus at different times can be obtained, for a total of 50 dual-channel audio signals of that bus; alternatively, dual-channel audio signals of different buses can be obtained, covering a total of 50 buses.
It should be noted that when obtaining multiple dual-channel audio signals of the same scene type, a folder named after the received scene type information can be created, and the multiple dual-channel audio signals of the same type stored in that folder.
In the embodiment of the present application, after obtaining the dual-channel audio signals of multiple known scenes of different types, the electronic device further extracts the mel-frequency cepstrum coefficients of the dual-channel audio signals of each type of known scene, in order to construct the first sample set corresponding to the multiple known scenes of different types.
For example, referring to Fig. 5, taking one channel of the dual-channel audio signal as an example, the electronic device first pre-processes that channel, e.g., applies high-pass filtering with the transfer function H(z) = 1 - a·z^(-1), where a is a correction factor, generally 0.95-0.97. It then performs framing and windowing on the filtered audio signal to smooth the frame edges, for example applying a Hamming window of the form w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1, where N is the frame length. Next, a Fourier transform (e.g., a fast Fourier transform) is applied to each windowed audio frame, and the mel-frequency cepstrum coefficients are extracted: the Fourier transform result is filtered by a mel filter bank to obtain mel frequencies matching the characteristics of human hearing, where the unit conversion is F_mel(f) = 2595·log10(1 + f/700), F_mel(f) being the resulting mel frequency and f a frequency point after the Fourier transform, and the logarithm of the filter-bank outputs is then taken. Finally, the electronic device applies a discrete cosine transform to the log mel outputs to obtain the mel-frequency cepstrum coefficients. Correspondingly, for any dual-channel audio signal, the electronic device extracts the mel-frequency cepstrum coefficients of both channels.
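The pre-emphasis → framing → Hamming window → FFT → mel filter bank → log → DCT pipeline described above can be sketched compactly as follows; the frame length, hop, FFT size, and filter counts are illustrative assumptions, not values from the patent:

```python
import numpy as np

def mfcc_single_channel(signal, fs=16000, frame_len=400, hop=160,
                        n_mels=26, n_ceps=13, a=0.97):
    """Minimal MFCC sketch under assumed parameter values."""
    # Pre-emphasis: y[n] = x[n] - a*x[n-1]  (i.e., H(z) = 1 - a*z^-1)
    emphasized = np.append(signal[0], signal[1:] - a * signal[:-1])
    # Framing + Hamming window
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    # Power spectrum via FFT
    n_fft = 512
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Mel filter bank built from F_mel(f) = 2595*log10(1 + f/700)
    mel_max = 2595 * np.log10(1 + (fs / 2) / 700)
    mel_pts = np.linspace(0, mel_max, n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / fs).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)
    # DCT-II over the mel axis; keep the first n_ceps coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1)) / (2 * n_mels))
    return log_energy @ dct.T   # shape: (n_frames, n_ceps)

coeffs = mfcc_single_channel(np.random.default_rng(0).standard_normal(16000))
# coeffs.shape → (n_frames, 13)
```

For the dual-channel case described above, this function would simply be applied to each channel in turn.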
After extracting the mel-frequency cepstral coefficients of the two-channel audio signals of each type of known scene, the electronic device associates each set of two-channel mel-frequency cepstral coefficients with the corresponding scene type information, so as to construct the first sample set corresponding to the multiple known scenes of different types.
After the first sample set is constructed, the electronic device further constructs an initialized residual convolutional neural network model and performs supervised training on it based on the first sample set, obtaining a trained residual convolutional neural network model as the first scene classification model.
For example, the electronic device builds on the structure of ResNet-50, keeps the input vector dimension identical to the dimension of the input data, and modifies the nodes of the final classification layer so that their number equals the total number of scene categories, thereby obtaining the initialized residual convolutional neural network.
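Modifying the final classification layer as described above amounts to giving the network a classification head whose output size equals the number of scene categories. A framework-agnostic NumPy sketch (the 2048-dimensional feature size matches ResNet-50's pooled output; the five scene categories are an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(42)
feature_dim = 2048    # ResNet-50 global-average-pooled feature size
n_scene_types = 5     # e.g. restaurant, subway carriage, bus, office, street

# Re-initialized classification head sized to the category count
W = rng.standard_normal((feature_dim, n_scene_types)) * 0.01
b = np.zeros(n_scene_types)

def classify(features):
    """Map a pooled feature vector to a probability per scene category."""
    logits = features @ W + b
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

probs = classify(rng.standard_normal(feature_dim))
print(probs.shape)  # one probability per scene category
```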
In one embodiment, "extracting a first acoustic feature of the two-channel audio signal according to a first preset feature extraction strategy, and calling a pre-trained first scene classification model to perform scene classification based on the first acoustic feature to obtain a first candidate scene classification result" includes:
(1) extracting the mel-frequency cepstral coefficients of the two-channel audio signal and setting them as the first acoustic feature;
(2) inputting the extracted mel-frequency cepstral coefficients of the two-channel audio signal into the trained residual convolutional neural network model, and obtaining multiple scene classification results output by the trained model together with their corresponding probability values;
(3) when the largest probability value output by the trained residual convolutional neural network model reaches a preset probability value, setting the scene classification result corresponding to that largest probability value as the first candidate scene classification result.
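The probability-threshold screening of step (3) can be sketched as follows; the 0.76 threshold is the example value given later in this description.

```python
def screen_candidate(class_probs: dict, threshold: float = 0.76):
    """Return the top scene class as the candidate result if its
    probability reaches the preset threshold, otherwise None."""
    best_scene = max(class_probs, key=class_probs.get)
    if class_probs[best_scene] >= threshold:
        return best_scene
    return None

# Model output: multiple scene classification results with probability values
output = {"subway_carriage": 0.81, "bus": 0.12, "street": 0.07}
print(screen_candidate(output))                       # reaches 0.76 -> candidate
print(screen_candidate({"bus": 0.5, "street": 0.5}))  # below 0.76 -> no candidate
```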
As described above, since the first scene classification model is trained on two-channel mel-frequency cepstral coefficients, when identifying the scene to be identified through the first scene classification model, the electronic device correspondingly first extracts the mel-frequency cepstral coefficients of the two-channel audio signal and sets them as the first acoustic feature. For how the mel-frequency cepstral coefficients are extracted, reference may be made to the related description of the above embodiments, which is not repeated here.
After extracting the mel-frequency cepstral coefficients of the two-channel audio signal of the scene to be identified and setting them as the first acoustic feature, the electronic device inputs them into the trained residual convolutional neural network model for prediction. The trained residual convolutional neural network outputs multiple possible scene classification results and their probability values, and the electronic device obtains these multiple scene classification results together with the corresponding probability values.
It should be noted that a preset probability value for screening scene classification results is provided in the embodiment of the present application (it may be set empirically by those of ordinary skill in the art according to actual needs, for example, 0.76). The electronic device judges whether the largest probability value output by the trained residual convolutional neural network model reaches the preset probability value; if so, the electronic device sets the scene classification result corresponding to that largest probability value as the first candidate scene classification result.
In one embodiment, after "obtaining the two-channel audio signals of multiple known scenes of different types through the two microphones", the method further includes:
(1) synthesizing the two-channel audio signal of each of the multiple known scenes of different types into a single-channel audio signal;
(2) extracting the per-channel energy normalization features of the single-channel audio signals synthesized for each type of known scene, and constructing a second sample set corresponding to the multiple known scenes of different types;
(3) constructing a lightweight convolutional neural network model and optimizing it to obtain an optimized lightweight convolutional neural network model;
(4) training the optimized lightweight convolutional neural network model according to the second sample set, and setting the trained lightweight convolutional neural network model as the second scene classification model.
The embodiment of the present application also provides a scheme for training the second scene classification model, as follows:
After obtaining the two-channel audio signals of the multiple known scenes of different types through the two microphones, the electronic device also synthesizes each two-channel audio signal into a single-channel audio signal, thereby obtaining the single-channel audio signals of the multiple known scenes of different types.
Then, after synthesizing the single-channel audio signals of each type of known scene, the electronic device further extracts their per-channel energy normalization features, so as to construct a second sample set corresponding to the multiple known scenes of different types.
For example, referring to Fig. 6, taking a certain single-channel audio signal as an example, the electronic device first pre-processes the signal, for example by applying a high-pass filter H(z) = 1 - az^(-1), where H(z) denotes the filtered audio signal, z denotes the audio signal before filtering, and a is a correction factor, generally taken as 0.95-0.97. The filtered audio signal is then divided into frames and windowed to smooth the edges of the audio frames obtained by framing, for example with a Hamming window w(n) = 0.54 - 0.46cos(2πn/(N - 1)). Next, a Fourier transform, such as a fast Fourier transform, is applied to each windowed audio frame, and the result is filtered by a mel filter bank to obtain mel frequencies that match the perception of the human auditory system, according to F_mel(f) = 2595·lg(1 + f/700), where F_mel(f) denotes the obtained mel frequency and f is a frequency bin after the Fourier transform; the logarithm is then taken. The electronic device then smooths the obtained mel energies over time, expressed as M(t, f) = (1 - s)M(t - 1, f) + sE(t, f), where E(t, f) is the mel energy of frame t at frequency f, M(t, f) denotes the smoothed result, obtained by adjusting and combining through the weight s of each audio frame in the time sequence, and t and f denote time and frequency, respectively. Finally, the electronic device extracts the per-channel energy normalization feature from the smoothed result, expressed as PCEN(t, f) = (E(t, f)/(μ + M(t, f))^α + δ)^r - δ^r, where μ is a small positive number used to avoid division by zero, and the parameters α, δ and r are learnable dynamic parameters.
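The temporal smoothing and per-channel energy normalization steps described above can be sketched in NumPy as follows; the values of s, μ, α, δ and r are illustrative assumptions (in the embodiment, the dynamic parameters are learned during training).

```python
import numpy as np

def pcen(E, s=0.04, mu=1e-6, alpha=0.98, delta=2.0, r=0.5):
    """E: mel energies of shape (time, freq).
    Smoothing: M(t,f) = (1-s)*M(t-1,f) + s*E(t,f), then
    PCEN(t,f) = (E/(mu + M)^alpha + delta)^r - delta^r."""
    M = np.zeros_like(E)
    M[0] = E[0]
    for t in range(1, E.shape[0]):
        M[t] = (1 - s) * M[t - 1] + s * E[t]
    return (E / (mu + M) ** alpha + delta) ** r - delta ** r

rng = np.random.default_rng(1)
E = rng.random((100, 40)) + 0.1   # dummy mel energies, 100 frames x 40 bands
feat = pcen(E)
print(feat.shape)
```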
After extracting the per-channel energy normalization features of the single-channel audio signals of each type of known scene, the electronic device associates each extracted per-channel energy normalization feature with the corresponding scene type information, so as to construct the second sample set corresponding to the multiple known scenes of different types.
After the second sample set is constructed, the electronic device further constructs an initialized lightweight convolutional neural network model, optimizes it to obtain an optimized lightweight convolutional neural network model, and then performs supervised training on the optimized model based on the second sample set, obtaining a trained lightweight convolutional neural network model as the second scene classification model.
For example, the electronic device builds on the structure of the Xception network and optimizes it so that it learns through separable convolutions over 36 convolutional layers, performs all pooling operations at the 32nd, 34th and 36th layers, and merges the three resulting features through feature synthesis for the final classification. In addition, focal loss may be used to apply compensated training for scenes with poor classification performance (such as park scenes). Finally, model training and convergence are carried out in the deep learning framework TensorFlow, and after training, accuracy testing and quantization compression are performed to obtain the second scene classification model.
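Focal loss, mentioned above for compensated training on hard scenes, down-weights examples the model already classifies confidently so that hard scenes contribute more to the gradient. A NumPy sketch of the standard formulation FL(p_t) = -(1 - p_t)^γ·log(p_t); the value of γ is an illustrative assumption:

```python
import numpy as np

def focal_loss(probs, label, gamma=2.0):
    """probs: predicted class probabilities; label: true class index.
    FL = -(1 - p_t)^gamma * log(p_t): near zero when p_t is already
    high, large for hard, poorly classified scenes."""
    p_t = probs[label]
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

easy = np.array([0.9, 0.05, 0.05])   # confidently correct scene
hard = np.array([0.3, 0.4, 0.3])     # hard scene, e.g. a park
print(focal_loss(easy, 0) < focal_loss(hard, 0))  # hard scene weighs more
```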
In one embodiment, "extracting a second acoustic feature of the single-channel audio signal according to a second preset feature extraction strategy, and calling a pre-trained second scene classification model to perform scene classification based on the second acoustic feature to obtain a second candidate scene classification result" includes:
(1) extracting the per-channel energy normalization features of the single-channel audio signal and setting them as the second acoustic feature;
(2) inputting the per-channel energy normalization features of the single-channel audio signal into the trained lightweight convolutional neural network model, and obtaining multiple scene classification results output by the trained model together with their corresponding probability values;
(3) when the largest probability value output by the trained lightweight convolutional neural network model reaches a preset probability value, setting the scene classification result corresponding to that largest probability value as the second candidate scene classification result.
As described above, since the second scene classification model is trained on per-channel energy normalization features, when identifying the scene to be identified through the second scene classification model, the electronic device correspondingly first extracts the per-channel energy normalization features of the single-channel audio signal and sets them as the second acoustic feature. For how the per-channel energy normalization features are extracted, reference may be made to the related description of the above embodiments, which is not repeated here.
After extracting the per-channel energy normalization features of the single-channel audio signal of the scene to be identified and setting them as the second acoustic feature, the electronic device inputs them into the trained lightweight convolutional neural network model for prediction. The trained lightweight convolutional neural network model outputs multiple possible scene classification results and their probability values, and the electronic device obtains these multiple scene classification results together with the corresponding probability values.
It should be noted that a preset probability value for screening scene classification results is provided in the embodiment of the present application (it may be set empirically by those of ordinary skill in the art according to actual needs, for example, 0.76). The electronic device judges whether the largest probability value output by the trained lightweight convolutional neural network model reaches the preset probability value; if so, the electronic device sets the scene classification result corresponding to that largest probability value as the second candidate scene classification result.
On the basis of the method described in the above embodiments, the scene recognition method of the present application is further introduced below. Referring to Fig. 7, the scene recognition method may include:
In 201, the electronic device obtains the two-channel audio signals of multiple known scenes of different types through the two microphones, and trains a residual convolutional neural network model according to the two-channel audio signals of the multiple known scenes of different types.
The electronic device first obtains the two-channel audio signals of the multiple known scenes of different types through the two microphones. When obtaining these signals, on the one hand, the electronic device may be carried by related technical personnel into multiple known scenes of different types, and in each known scene the electronic device is triggered to collect audio signals. On the other hand, when triggered to obtain an audio signal, the electronic device collects through the two microphones a two-channel audio signal of a first preset duration (a suitable duration may be configured by those skilled in the art according to actual needs, for example, 5 minutes). Referring to Fig. 4, after collecting the two-channel audio signal of the first preset duration, the electronic device provides a scene type information input interface and receives the input scene type information through that interface (the scene type information is input by the related technical personnel; for example, if the related technical personnel carry the electronic device into a subway carriage for audio signal sampling, the input scene type information may be the subway carriage scene). After receiving the input scene type information, the electronic device associates the collected two-channel audio signal with the received scene type information.
In this way, the electronic device can obtain the two-channel audio signals corresponding to different types of known scenes, for example, the audio signals of known scenes of different types such as the restaurant scene, the subway carriage scene, the bus scene, the office scene and the street scene.
In addition, when obtaining the two-channel audio signals of known scenes of different types, a preset quantity (a suitable quantity may be configured by those skilled in the art according to actual needs, for example, 50) of two-channel audio signals may be obtained for each scene type. For example, for the bus scene, two-channel audio signals of the same bus in different time periods may be obtained, 50 two-channel audio signals of that bus in total, or two-channel audio signals of different buses may be obtained, one for each of 50 buses, and so on.
It should be noted that, when obtaining multiple two-channel audio signals of the same scene type, a folder named after the received scene type information may be created, and the acquired two-channel audio signals of the same type may all be stored in that folder.
In the embodiment of the present application, after obtaining the two-channel audio signals of the multiple known scenes of different types, the electronic device further extracts the mel-frequency cepstral coefficients of the two-channel audio signals of each type of known scene, so as to construct a first sample set corresponding to the multiple known scenes of different types.
For example, referring to Fig. 5, taking one channel of a two-channel audio signal as an example, the electronic device first pre-processes that channel, for example by applying a high-pass filter expressed as H(z) = 1 - az^(-1), where H(z) denotes the filtered audio signal, z denotes the audio signal before filtering, and a is a correction factor, generally taken as 0.95-0.97. The filtered audio signal is then divided into frames and windowed, to smooth the edges of the audio frames obtained by framing, for example with a Hamming window w(n) = 0.54 - 0.46cos(2πn/(N - 1)), where N is the frame length. Next, a Fourier transform, such as a fast Fourier transform, is applied to each windowed audio frame, and mel-frequency cepstral coefficient extraction is carried out: the Fourier transform result is filtered by a mel filter bank to obtain mel frequencies that match the perception of the human auditory system, according to F_mel(f) = 2595·lg(1 + f/700), where F_mel(f) denotes the obtained mel frequency and f is a frequency bin after the Fourier transform, and the logarithm of the filter bank output is then taken. Finally, the electronic device applies a discrete cosine transform to the log mel energies to obtain the mel-frequency cepstral coefficients. Accordingly, for any two-channel audio signal, the electronic device extracts the mel-frequency cepstral coefficients of both channels.
After extracting the mel-frequency cepstral coefficients of the two-channel audio signals of each type of known scene, the electronic device associates each set of two-channel mel-frequency cepstral coefficients with the corresponding scene type information, so as to construct the first sample set corresponding to the multiple known scenes of different types.
After the first sample set is constructed, the electronic device further constructs an initialized residual convolutional neural network model and performs supervised training on it based on the first sample set, obtaining a trained residual convolutional neural network model.
For example, the electronic device builds on the structure of ResNet-50, keeps the input vector dimension identical to the dimension of the input data, and modifies the nodes of the final classification layer so that their number equals the total number of scene categories, thereby obtaining the initialized residual convolutional neural network.
In 202, the electronic device synthesizes the two-channel audio signal of each of the multiple known scenes of different types into a single-channel audio signal, and trains a lightweight convolutional neural network model according to the single-channel audio signals of the multiple known scenes of different types.
After obtaining the two-channel audio signals of the multiple known scenes of different types through the two microphones, the electronic device also synthesizes each two-channel audio signal into a single-channel audio signal, thereby obtaining the single-channel audio signals of the multiple known scenes of different types.
Then, after synthesizing the single-channel audio signals of each type of known scene, the electronic device further extracts their per-channel energy normalization features, so as to construct a second sample set corresponding to the multiple known scenes of different types.
For example, referring to Fig. 6, taking a certain single-channel audio signal as an example, the electronic device first pre-processes the signal, for example by applying a high-pass filter H(z) = 1 - az^(-1), where H(z) denotes the filtered audio signal, z denotes the audio signal before filtering, and a is a correction factor, generally taken as 0.95-0.97. The filtered audio signal is then divided into frames and windowed to smooth the edges of the audio frames obtained by framing, for example with a Hamming window w(n) = 0.54 - 0.46cos(2πn/(N - 1)). Next, a Fourier transform, such as a fast Fourier transform, is applied to each windowed audio frame, and the result is filtered by a mel filter bank to obtain mel frequencies that match the perception of the human auditory system, according to F_mel(f) = 2595·lg(1 + f/700), where F_mel(f) denotes the obtained mel frequency and f is a frequency bin after the Fourier transform; the logarithm is then taken. The electronic device then smooths the obtained mel energies over time, expressed as M(t, f) = (1 - s)M(t - 1, f) + sE(t, f), where E(t, f) is the mel energy of frame t at frequency f, M(t, f) denotes the smoothed result, obtained by adjusting and combining through the weight s of each audio frame in the time sequence, and t and f denote time and frequency, respectively. Finally, the electronic device extracts the per-channel energy normalization feature from the smoothed result, expressed as PCEN(t, f) = (E(t, f)/(μ + M(t, f))^α + δ)^r - δ^r, where μ is a small positive number used to avoid division by zero, and the parameters α, δ and r are learnable dynamic parameters.
After extracting the per-channel energy normalization features of the single-channel audio signals of each type of known scene, the electronic device associates each extracted per-channel energy normalization feature with the corresponding scene type information, so as to construct the second sample set corresponding to the multiple known scenes of different types.
After the second sample set is constructed, the electronic device further constructs an initialized lightweight convolutional neural network model, optimizes it to obtain an optimized lightweight convolutional neural network model, and then performs supervised training on the optimized model based on the second sample set, obtaining a trained lightweight convolutional neural network model.
For example, the electronic device builds on the structure of the Xception network and optimizes it so that it learns through separable convolutions over 36 convolutional layers, performs all pooling operations at the 32nd, 34th and 36th layers, and merges the three resulting features through feature synthesis for the final classification. In addition, focal loss may be used to apply compensated training for scenes with poor classification performance (such as park scenes). Finally, model training and convergence are carried out in the deep learning framework TensorFlow, and after training, accuracy testing and quantization compression are performed.
In 203, the electronic device performs audio collection on the scene to be identified through the two microphones to obtain a two-channel audio signal.
The scene to be identified may be the scene in which the electronic device is currently located. When the current scene is taken as the scene to be identified, the electronic device first performs audio collection on it through the two microphones; for example, the electronic device may synchronously collect audio of the current scene through the two microphones, obtaining two audio signals of identical duration, that is, a two-channel audio signal.
In 204, the electronic device calls the trained residual convolutional neural network model and obtains a first candidate scene classification result of the scene to be identified based on its two-channel audio signal.
After collecting the two-channel audio signal of the scene to be identified, the electronic device further extracts its mel-frequency cepstral coefficients and inputs them into the trained residual convolutional neural network model, obtaining multiple scene classification results output by the trained model together with their corresponding probability values. When the largest probability value output by the trained residual convolutional neural network model reaches the preset probability value, the electronic device sets the scene classification result corresponding to that largest probability value as the first candidate scene classification result.
In 205, the electronic device synthesizes the two-channel audio signal of the scene to be identified into a single-channel audio signal, calls the trained lightweight convolutional neural network model, and obtains a second candidate scene classification result of the scene to be identified based on the single-channel audio signal.
In addition, the electronic device synthesizes the two-channel audio signal of the scene to be identified into a single-channel audio signal, extracts its per-channel energy normalization features, and inputs them into the trained lightweight convolutional neural network model, obtaining multiple scene classification results output by the trained network together with their corresponding probability values. When the largest probability value output by the trained lightweight convolutional neural network model reaches the preset probability value, the scene classification result corresponding to that largest probability value is set as the second candidate scene classification result.
In 206, the electronic device judges whether the first candidate scene classification result and the second candidate scene classification result are the same scene classification result, and if so, sets that same scene classification result as the target scene classification result.
The electronic device judges whether the first candidate scene classification result and the second candidate scene classification result are the same scene classification result; if they are, the electronic device sets that same scene classification result as the target scene classification result of the scene to be identified. If the first candidate scene classification result and the second candidate scene classification result are not the same scene classification result, the electronic device judges that the current recognition operation on the scene to be identified has failed, and reacquires the two-channel audio signal of the scene to be identified for identification.
For example, if the first candidate classification result is "subway scene" and the second candidate classification result is also "subway scene", the electronic device takes "subway scene" as the target scene classification result of the scene to be identified.
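The agreement check of step 206 can be sketched as follows; returning None to signal that the audio should be reacquired is an illustrative convention.

```python
def fuse_candidates(first, second):
    """Return the target scene classification result when the two
    candidate results are the same scene classification result;
    otherwise return None, signalling that recognition failed and
    the two-channel audio should be reacquired."""
    if first is not None and first == second:
        return first
    return None

print(fuse_candidates("subway_scene", "subway_scene"))  # agree -> target result
print(fuse_candidates("subway_scene", "bus_scene"))     # disagree -> retry
```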
In one embodiment, a scene recognition apparatus is also provided. Referring to Fig. 8, Fig. 8 is a structural schematic diagram of the scene recognition apparatus provided by the embodiment of the present application. The scene recognition apparatus is applied to an electronic device including two microphones, and includes an audio collection module 301, a first classification module 302, an audio synthesis module 303, a second classification module 304 and a classification integration module 305, as follows:
the audio collection module 301 is configured to perform audio collection on the scene to be identified through the two microphones to obtain a two-channel audio signal;
the first classification module 302 is configured to extract a first acoustic feature of the two-channel audio signal according to a first preset feature extraction strategy, and call a pre-trained first scene classification model to perform scene classification based on the first acoustic feature, obtaining a first candidate scene classification result;
the audio synthesis module 303 is configured to perform audio synthesis processing on the two-channel audio signal to obtain a single-channel audio signal;
the second classification module 304 is configured to extract a second acoustic feature of the single-channel audio signal according to a second preset feature extraction strategy, and call a pre-trained second scene classification model to perform scene classification based on the second acoustic feature, obtaining a second candidate scene classification result;
the classification integration module 305 is configured to obtain the target scene classification result of the scene to be identified according to the first candidate scene classification result and the second candidate scene classification result.
In one embodiment, when performing audio synthesis processing on the two-channel audio signal to obtain the single-channel audio signal, the audio synthesis module 303 is configured to:
synthesize the two-channel audio signal into the single-channel audio signal according to a preset beamforming algorithm.
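The preset beamforming algorithm is not specified here; as an illustrative assumption, a two-microphone delay-and-sum beamformer with zero steering delay, which reduces to averaging the two channels, can be sketched as:

```python
import numpy as np

def delay_and_sum(left, right, delay=0):
    """Delay-and-sum beamforming for two microphones: shift one channel
    by the steering delay (in samples), then average the channels.
    With delay=0 this reduces to plain channel averaging."""
    if delay:
        right = np.roll(right, delay)
    return 0.5 * (left + right)

rng = np.random.default_rng(2)
left = rng.standard_normal(16000)    # one second per channel at 16 kHz
right = rng.standard_normal(16000)
mono = delay_and_sum(left, right)
print(mono.shape)                    # single-channel signal, same length
```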
In one embodiment, when obtaining the target scene classification result of the scene to be identified according to the first candidate scene classification result and the second candidate scene classification result, the classification integration module 305 is configured to:
judge whether the first candidate scene classification result and the second candidate scene classification result are the same scene classification result;
and if so, set that same scene classification result as the target scene classification result.
In one embodiment, the scene recognition device further includes a model training module, which is configured, before audio collection is performed on the scene to be identified through the two microphones, to:
obtain dual-channel audio signals of a plurality of known scenes of different types through the two microphones;
extract the Mel-frequency cepstral coefficients of the dual-channel audio signal of each type of known scene, and construct a first sample set corresponding to the plurality of known scenes of different types;
construct a residual convolutional neural network model, train the residual convolutional neural network model according to the first sample set, and set the trained residual convolutional neural network model as the first scene classification model.
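The Mel-frequency cepstral coefficient extraction named in these steps can be sketched from scratch with NumPy. The frame size, hop, filter count, and Hann window are conventional defaults rather than values specified by the patent; a library such as librosa would normally be used instead:

```python
import numpy as np

def mel_filterbank(n_filters: int, n_fft: int, sr: int) -> np.ndarray:
    """Triangular mel filterbank mapping an rfft power spectrum to mel bands."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    return fb

def mfcc(signal: np.ndarray, sr: int = 16000, n_fft: int = 512,
         hop: int = 256, n_mels: int = 26, n_ceps: int = 13) -> np.ndarray:
    """MFCC matrix of shape (n_frames, n_ceps) for a mono signal.

    For the dual-channel signal in the text, run this once per channel
    and stack the two feature maps.
    """
    # Frame the signal and apply a Hann window.
    frames = np.stack([signal[s:s + n_fft] * np.hanning(n_fft)
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2              # power spectrum
    logmel = np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)
    # DCT-II of the log-mel energies yields the cepstral coefficients.
    k = np.arange(n_ceps)[:, None]
    dct = np.cos(np.pi / n_mels * (np.arange(n_mels) + 0.5)[None, :] * k)
    return logmel @ dct.T
```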
In one embodiment, when extracting the first acoustic feature of the dual-channel audio signal according to the first preset feature extraction strategy and calling the pre-trained first scene classification model to perform scene classification based on the first acoustic feature to obtain the first candidate scene classification result, the first classification module 302 is configured to:
extract the Mel-frequency cepstral coefficients of the dual-channel audio signal and set them as the first acoustic feature;
input the extracted Mel-frequency cepstral coefficients of the dual-channel audio signal into the trained residual convolutional neural network model, and obtain the plurality of scene classification results output by the trained residual convolutional neural network model together with their corresponding probability values;
when the maximum probability value output by the trained residual convolutional neural network model reaches a preset probability value, set the scene classification result corresponding to that maximum probability value as the first candidate scene classification result.
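The probability-threshold selection described here can be sketched as a softmax over the model's raw outputs followed by a confidence gate. The threshold value 0.8 and the behavior when no class reaches it are assumptions; the patent only requires the maximum probability to reach a preset value:

```python
import numpy as np

def pick_candidate(logits: np.ndarray, labels: list, threshold: float = 0.8):
    """Return the scene label with the highest probability if that
    probability reaches the preset threshold, else None."""
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    best = int(np.argmax(probs))
    return labels[best] if probs[best] >= threshold else None
```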
In one embodiment, after obtaining the dual-channel audio signals of the plurality of known scenes of different types through the two microphones, the model training module is further configured to:
synthesize the dual-channel audio signal of each of the plurality of known scenes of different types into a single-channel audio signal;
extract the per-channel energy normalization features of the single-channel audio signal synthesized for each type of known scene, and construct a second sample set corresponding to the plurality of known scenes of different types;
construct a lightweight convolutional neural network model, and perform optimization processing on the lightweight convolutional neural network model to obtain an optimized lightweight convolutional neural network model;
train the optimized lightweight convolutional neural network model according to the second sample set, and set the trained lightweight convolutional neural network model as the second scene classification model.
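Reading the translated phrase "channel energy regularization feature" as per-channel energy normalization (PCEN), a common acoustic-scene feature, is an interpretive assumption. Under that reading, the feature can be sketched over a mel-energy matrix as follows; the smoothing and compression constants are the commonly cited defaults, not values from the patent:

```python
import numpy as np

def pcen(energy: np.ndarray, s: float = 0.025, alpha: float = 0.98,
         delta: float = 2.0, r: float = 0.5, eps: float = 1e-6) -> np.ndarray:
    """Per-channel energy normalization of a (n_frames, n_bands) matrix.

    An AGC-style division by a smoothed energy estimate, followed by
    root compression.
    """
    m = np.zeros_like(energy)
    m[0] = energy[0]
    for t in range(1, energy.shape[0]):
        # First-order IIR smoother tracking the per-band energy envelope.
        m[t] = (1.0 - s) * m[t - 1] + s * energy[t]
    return (energy / (eps + m) ** alpha + delta) ** r - delta ** r
```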
In one embodiment, when extracting the second acoustic feature of the single-channel audio signal according to the second preset feature extraction strategy and calling the pre-trained second scene classification model to perform scene classification based on the second acoustic feature to obtain the second candidate scene classification result, the second classification module 304 is configured to:
extract the per-channel energy normalization features of the single-channel audio signal, and set the per-channel energy normalization features of the single-channel audio signal as the second acoustic feature;
input the per-channel energy normalization features of the single-channel audio signal into the trained lightweight convolutional neural network model, and obtain the plurality of scene classification results output by the trained lightweight convolutional neural network model together with their corresponding probability values;
when the maximum probability value output by the trained lightweight convolutional neural network model reaches the preset probability value, set the scene classification result corresponding to that maximum probability value as the second candidate scene classification result.
It should be noted that the scene recognition device provided by the embodiments of the present application and the scene recognition method in the foregoing embodiments belong to the same concept. Any of the methods provided in the scene recognition method embodiments can be run on the scene recognition device, and the specific implementation process is detailed in the scene recognition method embodiments, which is not repeated here.
In one embodiment, an electronic device is also provided. Referring to Fig. 9, the electronic device includes a processor 401, a memory 402, and two microphones 403.
The processor 401 in the embodiment of the present application is a general-purpose processor, such as a processor of the ARM architecture.
The memory 402 stores a computer program; it may be a high-speed random access memory, or a non-volatile memory such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage component. Correspondingly, the memory 402 may further include a memory controller to provide the processor 401 with access to the computer program in the memory 402, implementing functions such as the following:
performing audio collection on a scene to be identified through the two microphones to obtain a dual-channel audio signal;
extracting a first acoustic feature of the dual-channel audio signal according to a first preset feature extraction strategy, and calling a pre-trained first scene classification model to perform scene classification based on the first acoustic feature, obtaining a first candidate scene classification result;
performing audio synthesis processing on the dual-channel audio signal to obtain a single-channel audio signal;
extracting a second acoustic feature of the single-channel audio signal according to a second preset feature extraction strategy, and calling a pre-trained second scene classification model to perform scene classification based on the second acoustic feature, obtaining a second candidate scene classification result;
obtaining a target scene classification result of the scene to be identified according to the first candidate scene classification result and the second candidate scene classification result.
Referring to Fig. 10, Fig. 10 is another structural schematic diagram of the electronic device provided by the embodiments of the present application. It differs from the electronic device shown in Fig. 9 in that the electronic device further includes components such as an input unit 404 and an output unit 405.
The input unit 404 may be used to receive input numbers, character information, or user characteristic information (for example, fingerprints), and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
The output unit 405 may be used to display information entered by the user or information provided to the user, such as a screen.
In the embodiment of the present application, the processor 401 in the electronic device loads instructions corresponding to the processes of one or more computer programs into the memory 402 according to the following steps, and runs the computer program stored in the memory 402, thereby implementing various functions as follows:
performing audio collection on a scene to be identified through the two microphones to obtain a dual-channel audio signal;
extracting a first acoustic feature of the dual-channel audio signal according to a first preset feature extraction strategy, and calling a pre-trained first scene classification model to perform scene classification based on the first acoustic feature, obtaining a first candidate scene classification result;
performing audio synthesis processing on the dual-channel audio signal to obtain a single-channel audio signal;
extracting a second acoustic feature of the single-channel audio signal according to a second preset feature extraction strategy, and calling a pre-trained second scene classification model to perform scene classification based on the second acoustic feature, obtaining a second candidate scene classification result;
obtaining a target scene classification result of the scene to be identified according to the first candidate scene classification result and the second candidate scene classification result.
In one embodiment, when performing audio synthesis processing on the dual-channel audio signal to obtain the single-channel audio signal, the processor 401 may execute:
synthesizing the dual-channel audio signal into a single-channel audio signal according to a preset beamforming algorithm.
In one embodiment, when obtaining the target scene classification result of the scene to be identified according to the first candidate scene classification result and the second candidate scene classification result, the processor 401 may execute:
judging whether the first candidate scene classification result and the second candidate scene classification result are the same scene classification result;
and if so, setting that same scene classification result as the target scene classification result.
In one embodiment, before performing audio collection on the scene to be identified through the two microphones, the processor 401 may execute:
obtaining dual-channel audio signals of a plurality of known scenes of different types through the two microphones;
extracting the Mel-frequency cepstral coefficients of the dual-channel audio signal of each type of known scene, and constructing a first sample set corresponding to the plurality of known scenes of different types;
constructing a residual convolutional neural network model, training the residual convolutional neural network model according to the first sample set, and setting the trained residual convolutional neural network model as the first scene classification model.
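The defining element of the residual convolutional neural network named in this step is the identity shortcut, y = ReLU(x + F(x)). A toy single-channel sketch is shown below; real models stack many such blocks with learned multi-channel kernels, so this is illustrative only:

```python
import numpy as np

def conv2d_same(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """'Same'-padded 2-D cross-correlation (the deep-learning 'convolution')."""
    kh, kw = w.shape
    pad = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(pad[i:i + kh, j:j + kw] * w)
    return out

def residual_block(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """y = ReLU(x + F(x)): the identity-shortcut block of a residual CNN."""
    h = np.maximum(conv2d_same(x, w1), 0.0)         # conv + ReLU
    return np.maximum(x + conv2d_same(h, w2), 0.0)  # add the shortcut, then ReLU
```

The shortcut lets gradients bypass the convolutions, which is what allows such networks to be trained at depth on the MFCC feature maps.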
In one embodiment, when extracting the first acoustic feature of the dual-channel audio signal according to the first preset feature extraction strategy and calling the pre-trained first scene classification model to perform scene classification based on the first acoustic feature to obtain the first candidate scene classification result, the processor 401 may also execute:
extracting the Mel-frequency cepstral coefficients of the dual-channel audio signal and setting them as the first acoustic feature;
inputting the extracted Mel-frequency cepstral coefficients of the dual-channel audio signal into the trained residual convolutional neural network model, and obtaining the plurality of scene classification results output by the trained residual convolutional neural network model together with their corresponding probability values;
when the maximum probability value output by the trained residual convolutional neural network model reaches a preset probability value, setting the scene classification result corresponding to that maximum probability value as the first candidate scene classification result.
In one embodiment, after obtaining the dual-channel audio signals of the plurality of known scenes of different types through the two microphones, the processor 401 may also execute:
synthesizing the dual-channel audio signal of each of the plurality of known scenes of different types into a single-channel audio signal;
extracting the per-channel energy normalization features of the single-channel audio signal synthesized for each type of known scene, and constructing a second sample set corresponding to the plurality of known scenes of different types;
constructing a lightweight convolutional neural network model, and performing optimization processing on the lightweight convolutional neural network model to obtain an optimized lightweight convolutional neural network model;
training the optimized lightweight convolutional neural network model according to the second sample set, and setting the trained lightweight convolutional neural network model as the second scene classification model.
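The patent does not say what the "optimization processing" applied to the lightweight model consists of. Post-training weight quantization is one common way to shrink such a model, sketched here purely as an assumption:

```python
import numpy as np

def quantize_weights(w: np.ndarray, n_bits: int = 8):
    """Uniform affine quantization of a float weight tensor to n_bits
    unsigned integers -- one common optimization for a lightweight model."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / (2 ** n_bits - 1) or 1.0  # avoid zero scale for constant tensors
    q = np.round((w - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q: np.ndarray, scale: float, zero_point: float) -> np.ndarray:
    """Recover approximate float weights for inference."""
    return q.astype(np.float32) * scale + zero_point
```

Each weight is stored in one byte instead of four, at the cost of a reconstruction error bounded by half the quantization step.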
In one embodiment, when extracting the second acoustic feature of the single-channel audio signal according to the second preset feature extraction strategy and calling the pre-trained second scene classification model to perform scene classification based on the second acoustic feature to obtain the second candidate scene classification result, the processor 401 may execute:
extracting the per-channel energy normalization features of the single-channel audio signal, and setting the per-channel energy normalization features of the single-channel audio signal as the second acoustic feature;
inputting the per-channel energy normalization features of the single-channel audio signal into the trained lightweight convolutional neural network model, and obtaining the plurality of scene classification results output by the trained lightweight convolutional neural network model together with their corresponding probability values;
when the maximum probability value output by the trained lightweight convolutional neural network model reaches the preset probability value, setting the scene classification result corresponding to that maximum probability value as the second candidate scene classification result.
It should be noted that the electronic device provided by the embodiments of the present application and the scene recognition method in the foregoing embodiments belong to the same concept. Any of the methods provided in the scene recognition method embodiments can be run on the electronic device, and the specific implementation process is detailed in the scene recognition method embodiments, which is not repeated here.
It should be noted that, for the scene recognition method of the embodiments of the present application, those of ordinary skill in the art can understand that all or part of the process of implementing the scene recognition method of the embodiments of the present application can be completed by controlling the relevant hardware through a computer program. The computer program may be stored in a computer-readable storage medium, for example in the memory of the electronic device, and be executed by the processor and the dedicated voice recognition chip of the electronic device; its execution process may include the process of the embodiments of the scene recognition method. The storage medium may be a magnetic disk, an optical disc, a read-only memory, a random access memory, or the like.
The scene recognition method, device, storage medium, and electronic device provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method of the present application and its core concept. Meanwhile, for those skilled in the art, there will be changes in the specific implementation and scope of application according to the idea of the present application. In summary, the contents of this specification should not be construed as limiting the present application.

Claims (10)

1. A scene recognition method, applied to an electronic device, wherein the electronic device includes two microphones, and the scene recognition method comprises:
performing audio collection on a scene to be identified through the two microphones to obtain a dual-channel audio signal;
extracting a first acoustic feature of the dual-channel audio signal according to a first preset feature extraction strategy, and calling a pre-trained first scene classification model to perform scene classification based on the first acoustic feature to obtain a first candidate scene classification result;
performing audio synthesis processing on the dual-channel audio signal to obtain a single-channel audio signal;
extracting a second acoustic feature of the single-channel audio signal according to a second preset feature extraction strategy, and calling a pre-trained second scene classification model to perform scene classification based on the second acoustic feature to obtain a second candidate scene classification result;
obtaining a target scene classification result of the scene to be identified according to the first candidate scene classification result and the second candidate scene classification result.
2. The scene recognition method according to claim 1, wherein performing audio synthesis processing on the dual-channel audio signal to obtain the single-channel audio signal comprises:
synthesizing the dual-channel audio signal into the single-channel audio signal according to a preset beamforming algorithm.
3. The scene recognition method according to claim 1, wherein obtaining the target scene classification result of the scene to be identified according to the first candidate scene classification result and the second candidate scene classification result comprises:
judging whether the first candidate scene classification result and the second candidate scene classification result are the same scene classification result;
and if so, setting the same scene classification result as the target scene classification result.
4. The scene recognition method according to any one of claims 1 to 3, wherein before performing audio collection on the scene to be identified through the two microphones, the method further comprises:
obtaining dual-channel audio signals of a plurality of known scenes of different types through the two microphones;
extracting Mel-frequency cepstral coefficients of the dual-channel audio signal of each type of known scene, and constructing a first sample set corresponding to the plurality of known scenes of different types;
constructing a residual convolutional neural network model, training the residual convolutional neural network model according to the first sample set, and setting the trained residual convolutional neural network model as the first scene classification model.
5. The scene recognition method according to claim 4, wherein extracting the first acoustic feature of the dual-channel audio signal according to the first preset feature extraction strategy, and calling the pre-trained first scene classification model to perform scene classification based on the first acoustic feature to obtain the first candidate scene classification result, comprises:
extracting the Mel-frequency cepstral coefficients of the dual-channel audio signal and setting them as the first acoustic feature;
inputting the extracted Mel-frequency cepstral coefficients of the dual-channel audio signal into the trained residual convolutional neural network model to obtain a plurality of scene classification results output by the trained residual convolutional neural network model and their corresponding probability values;
when the maximum probability value output by the trained residual convolutional neural network model reaches a preset probability value, setting the scene classification result corresponding to the maximum probability value as the first candidate scene classification result.
6. The scene recognition method according to claim 4, wherein after obtaining the dual-channel audio signals of the plurality of known scenes of different types through the two microphones, the method further comprises:
synthesizing the dual-channel audio signal of each of the plurality of known scenes of different types into a single-channel audio signal;
extracting per-channel energy normalization features of the single-channel audio signal synthesized for each type of known scene, and constructing a second sample set corresponding to the plurality of known scenes of different types;
constructing a lightweight convolutional neural network model, and performing optimization processing on the lightweight convolutional neural network model to obtain an optimized lightweight convolutional neural network model;
training the optimized lightweight convolutional neural network model according to the second sample set, and setting the trained lightweight convolutional neural network model as the second scene classification model.
7. The scene recognition method according to claim 6, wherein extracting the second acoustic feature of the single-channel audio signal according to the second preset feature extraction strategy, and calling the pre-trained second scene classification model to perform scene classification based on the second acoustic feature to obtain the second candidate scene classification result, comprises:
extracting the per-channel energy normalization features of the single-channel audio signal, and setting the per-channel energy normalization features of the single-channel audio signal as the second acoustic feature;
inputting the per-channel energy normalization features of the single-channel audio signal into the trained lightweight convolutional neural network model to obtain a plurality of scene classification results output by the trained lightweight convolutional neural network model and their corresponding probability values;
when the maximum probability value output by the trained lightweight convolutional neural network model reaches the preset probability value, setting the scene classification result corresponding to the maximum probability value as the second candidate scene classification result.
8. A scene recognition device, applied to an electronic device, wherein the scene recognition device comprises:
an audio collection module, configured to perform audio collection on a scene to be identified through two microphones of the electronic device to obtain a dual-channel audio signal;
a first classification module, configured to extract a first acoustic feature of the dual-channel audio signal according to a first preset feature extraction strategy, and to call a pre-trained first scene classification model to perform scene classification based on the first acoustic feature to obtain a first candidate scene classification result;
an audio synthesis module, configured to perform audio synthesis processing on the dual-channel audio signal to obtain a single-channel audio signal;
a second classification module, configured to extract a second acoustic feature of the single-channel audio signal according to a second preset feature extraction strategy, and to call a pre-trained second scene classification model to perform scene classification based on the second acoustic feature to obtain a second candidate scene classification result;
a classification integration module, configured to obtain a target scene classification result of the scene to be identified according to the first candidate scene classification result and the second candidate scene classification result.
9. A storage medium on which a computer program is stored, wherein when the computer program is called by a processor, the scene recognition method according to any one of claims 1 to 7 is executed.
10. An electronic device, comprising a processor and a memory, the memory storing a computer program, wherein the processor is configured, by calling the computer program, to execute the scene recognition method according to any one of claims 1 to 7.
CN201910731749.6A 2019-08-08 2019-08-08 Scene recognition method and device, storage medium and electronic equipment Active CN110473568B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910731749.6A CN110473568B (en) 2019-08-08 2019-08-08 Scene recognition method and device, storage medium and electronic equipment


Publications (2)

Publication Number Publication Date
CN110473568A true CN110473568A (en) 2019-11-19
CN110473568B CN110473568B (en) 2022-01-07

Family

ID=68510551

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910731749.6A Active CN110473568B (en) 2019-08-08 2019-08-08 Scene recognition method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN110473568B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112750448A (en) * 2020-08-07 2021-05-04 腾讯科技(深圳)有限公司 Sound scene recognition method, device, equipment and storage medium
CN112767967A (en) * 2020-12-30 2021-05-07 深延科技(北京)有限公司 Voice classification method and device and automatic voice classification method
CN113129917A (en) * 2020-01-15 2021-07-16 荣耀终端有限公司 Speech processing method based on scene recognition, and apparatus, medium, and system thereof
WO2021189979A1 (en) * 2020-10-26 2021-09-30 平安科技(深圳)有限公司 Speech enhancement method and apparatus, computer device, and storage medium
CN114220458A (en) * 2021-11-16 2022-03-22 武汉普惠海洋光电技术有限公司 Sound identification method and device based on array hydrophone
WO2022233061A1 (en) * 2021-05-07 2022-11-10 Oppo广东移动通信有限公司 Signal processing method, communication device, and communication system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101593522A (en) * 2009-07-08 2009-12-02 清华大学 A kind of full frequency domain digital hearing aid method and apparatus
US20180157920A1 (en) * 2016-12-01 2018-06-07 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for recognizing obstacle of vehicle
CN108764304A (en) * 2018-05-11 2018-11-06 Oppo广东移动通信有限公司 scene recognition method, device, storage medium and electronic equipment
CN108831505A (en) * 2018-05-30 2018-11-16 百度在线网络技术(北京)有限公司 The method and apparatus for the usage scenario applied for identification
CN110082135A (en) * 2019-03-14 2019-08-02 中科恒运股份有限公司 Equipment fault recognition methods, device and terminal device


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI, Qi: "Research on Audio Scene Recognition Methods Based on Deep Learning", Master's Thesis *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113129917A (en) * 2020-01-15 2021-07-16 荣耀终端有限公司 Speech processing method based on scene recognition, and apparatus, medium, and system thereof
CN112750448A (en) * 2020-08-07 2021-05-04 腾讯科技(深圳)有限公司 Sound scene recognition method, device, equipment and storage medium
CN112750448B (en) * 2020-08-07 2024-01-16 腾讯科技(深圳)有限公司 Sound scene recognition method, device, equipment and storage medium
WO2021189979A1 (en) * 2020-10-26 2021-09-30 平安科技(深圳)有限公司 Speech enhancement method and apparatus, computer device, and storage medium
CN112767967A (en) * 2020-12-30 2021-05-07 深延科技(北京)有限公司 Voice classification method and device and automatic voice classification method
WO2022233061A1 (en) * 2021-05-07 2022-11-10 Oppo广东移动通信有限公司 Signal processing method, communication device, and communication system
CN114220458A (en) * 2021-11-16 2022-03-22 武汉普惠海洋光电技术有限公司 Sound identification method and device based on array hydrophone
CN114220458B (en) * 2021-11-16 2024-04-05 武汉普惠海洋光电技术有限公司 Voice recognition method and device based on array hydrophone

Also Published As

Publication number Publication date
CN110473568B (en) 2022-01-07

Similar Documents

Publication Publication Date Title
CN110473568A (en) Scene recognition method, device, storage medium and electronic equipment
US10621971B2 (en) Method and device for extracting speech feature based on artificial intelligence
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN108564963B (en) Method and apparatus for enhancing voice
CN103426435B (en) The source by independent component analysis with mobile constraint separates
CN110503971A (en) Time-frequency mask neural network based estimation and Wave beam forming for speech processes
JP2019522810A (en) Neural network based voiceprint information extraction method and apparatus
US20160189730A1 (en) Speech separation method and system
CN110428842A (en) Speech model training method, device, equipment and computer readable storage medium
CN108364662B (en) Voice emotion recognition method and system based on paired identification tasks
CN110444202B (en) Composite voice recognition method, device, equipment and computer readable storage medium
CN112949708B (en) Emotion recognition method, emotion recognition device, computer equipment and storage medium
CN112581978A (en) Sound event detection and positioning method, device, equipment and readable storage medium
CN110400571A (en) Audio-frequency processing method, device, storage medium and electronic equipment
CN112562648A (en) Adaptive speech recognition method, apparatus, device and medium based on meta learning
Ting Yuan et al. Frog sound identification system for frog species recognition
CN112185342A (en) Voice conversion and model training method, device and system and storage medium
CN111312223A (en) Training method and device of voice segmentation model and electronic equipment
CN110169082A (en) Combining audio signals output
WO2024114303A1 (en) Phoneme recognition method and apparatus, electronic device and storage medium
CN117542373A (en) Non-air conduction voice recovery system and method
CN111863035A (en) Method, system and equipment for recognizing heart sound data
CN110136741A (en) A kind of single-channel voice Enhancement Method based on multiple dimensioned context
CN114333769B (en) Speech recognition method, computer program product, computer device and storage medium
CN109637555A (en) A kind of business meetings japanese voice identification translation system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant