CN110473568A - Scene recognition method, device, storage medium and electronic equipment - Google Patents
- Publication number
- CN110473568A (application CN201910731749.6A)
- Authority
- CN
- China
- Prior art keywords
- scene
- dual-channel audio
- classification
- frequency signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
Abstract
The embodiments of the present application disclose a scene recognition method, device, storage medium, and electronic equipment. The electronic equipment first collects a dual-channel audio signal of the scene to be identified, then obtains two candidate scene classification results through prediction scheme 1, based on the dual-channel audio signal, and prediction scheme 2, based on a single-channel audio signal synthesized from the dual-channel audio signal. The two candidate scene classification results are then fused to obtain the target scene classification result of the scene to be identified. As a result, the scene in which the electronic equipment is located can be identified without relying on positioning technology, and thus without any restriction on the environment in which the electronic equipment is located; compared with the related art, the application can identify the scene to be identified more flexibly and accurately.
Description
Technical field
This application relates to the technical field of scene recognition, and in particular to a scene recognition method, device, storage medium, and electronic equipment.
Background technique
Currently, electronic equipment such as tablet computers and mobile phones can analyze the scene in which the user is located and perform corresponding processing operations based on the analysis results, thereby improving the user experience. In the related art, when the electronic equipment analyzes the scene in which the user is located, it usually relies on GPS positioning: current location information is obtained via GPS, and the scene in which the electronic equipment, and therefore the user, is located is determined from that location information. However, in indoor environments or environments with many obstructions, GPS positioning is difficult to achieve in the related art, and the environmental scene in which the electronic equipment is located cannot be identified.
Summary of the invention
The embodiments of the present application provide a scene recognition method, device, storage medium, and electronic equipment that can identify the environmental scene in which the electronic equipment is located.
In a first aspect, an embodiment of the present application provides a scene recognition method applied to electronic equipment, the electronic equipment including two microphones. The scene recognition method includes:
performing audio collection on a scene to be identified through the two microphones to obtain a dual-channel audio signal;
extracting a first acoustic feature of the dual-channel audio signal according to a first preset feature extraction strategy, and calling a pre-trained first scene classification model to perform scene classification based on the first acoustic feature, obtaining a first candidate scene classification result;
performing audio synthesis processing on the dual-channel audio signal to obtain a single-channel audio signal;
extracting a second acoustic feature of the single-channel audio signal according to a second preset feature extraction strategy, and calling a pre-trained second scene classification model to perform scene classification based on the second acoustic feature, obtaining a second candidate scene classification result;
obtaining a target scene classification result of the scene to be identified according to the first candidate scene classification result and the second candidate scene classification result.
In a second aspect, an embodiment of the present application provides a scene recognition device applied to electronic equipment, the electronic equipment including two microphones. The scene recognition device includes:
an audio collection module, configured to perform audio collection on a scene to be identified through the two microphones to obtain a dual-channel audio signal;
a first classification module, configured to extract a first acoustic feature of the dual-channel audio signal according to a first preset feature extraction strategy, and to call a pre-trained first scene classification model to perform scene classification based on the first acoustic feature, obtaining a first candidate scene classification result;
an audio synthesis module, configured to perform audio synthesis processing on the dual-channel audio signal to obtain a single-channel audio signal;
a second classification module, configured to extract a second acoustic feature of the single-channel audio signal according to a second preset feature extraction strategy, and to call a pre-trained second scene classification model to perform scene classification based on the second acoustic feature, obtaining a second candidate scene classification result;
a classification integration module, configured to obtain a target scene classification result of the scene to be identified according to the first candidate scene classification result and the second candidate scene classification result.
In a third aspect, an embodiment of the present application provides a storage medium on which a computer program is stored; when the computer program is called by a processor, the scene recognition method provided by any embodiment of the present application is executed.
In a fourth aspect, an embodiment of the present application provides electronic equipment including a processor and a memory; the memory stores a computer program, and the processor, by calling the computer program, executes the scene recognition method provided by any embodiment of the present application.
In the embodiments of the present application, the dual-channel audio signal of the scene to be identified is collected first; then, through prediction scheme 1, based on the dual-channel audio signal, and prediction scheme 2, based on the single-channel audio signal synthesized from the dual-channel audio signal, two candidate scene classification results of the scene to be identified are obtained; the two candidate scene classification results are then fused to obtain the target scene classification result of the scene to be identified. As a result, the scene in which the electronic equipment is located can be identified without positioning technology, and thus without any restriction on the environment in which the electronic equipment is located; compared with the related art, the application can identify the scene to be identified more flexibly and accurately.
Detailed description of the invention
In order to explain the technical solutions in the embodiments of the present application more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flow diagram of scene recognition method provided by the embodiments of the present application.
Fig. 2 is the setting schematic diagram of two microphones of electronic equipment in the embodiment of the present application.
Fig. 3 is a schematic diagram of predicting the target scene classification result from the dual-channel audio signal of the scene to be identified in an embodiment of the present application.
Fig. 4 is an exemplary diagram of the scene type information input interface provided in an embodiment of the present application.
Fig. 5 is a schematic diagram of mel-frequency cepstral coefficient extraction in an embodiment of the present application.
Fig. 6 is a schematic diagram of per-channel energy normalization feature extraction in an embodiment of the present application.
Fig. 7 is another flow diagram of scene recognition method provided by the embodiments of the present application.
Fig. 8 is a structural schematic diagram of scene Recognition device provided by the embodiments of the present application.
Fig. 9 is a structural schematic diagram of electronic equipment provided by the embodiments of the present application.
Figure 10 is another structural schematic diagram of electronic equipment provided by the embodiments of the present application.
Specific embodiment
Referring to the drawings, in which identical reference symbols represent identical components, the principles of the present application are illustrated in a suitable computing environment. The following description is based on the illustrated specific embodiments of the present application and should not be regarded as limiting other specific embodiments not detailed herein.
An embodiment of the present application provides a scene recognition method. The execution subject of the scene recognition method may be the scene recognition device provided by the embodiment of the present application, or electronic equipment integrated with the scene recognition device, where the scene recognition device may be implemented in hardware or software. The electronic equipment may be a smart phone, tablet computer, palmtop computer, laptop, desktop computer, or similar equipment.
Please refer to Fig. 1, which is a flow diagram of the scene recognition method provided by an embodiment of the present application. The detailed process of the scene recognition method provided by the embodiment of the present application may be as follows:
In 101, audio collection is performed on a scene to be identified through two microphones to obtain a dual-channel audio signal.
The scene to be identified may be the scene in which the electronic equipment is currently located.
It should be noted that the electronic equipment includes two microphones, which may be built-in microphones or external microphones (wired or wireless); the embodiment of the present application does not specifically limit this. For example, referring to Fig. 2, the electronic equipment includes two microphones arranged back to back: microphone 1, arranged on the lower side of the electronic equipment, and microphone 2, arranged on the upper side, where the pickup hole of microphone 1 faces downward and the pickup hole of microphone 2 faces upward. In addition, the two microphones may be non-directional (in other words, omnidirectional) microphones.
In the embodiment of the present application, the electronic equipment first performs audio collection on the scene to be identified through the two microphones. For example, taking the scene in which the equipment is currently located as the scene to be identified, the electronic equipment can synchronously collect audio of the current scene through the two microphones, obtaining a dual-channel audio signal whose two channels have identical duration.
It should be noted that if the microphones included in the electronic equipment are analog microphones, analog audio signals will be collected; these need to be converted from analog to digital to obtain digitized audio signals for subsequent processing. For example, after collecting the two-channel analog audio signal of the scene to be identified through the two microphones, the electronic equipment can sample each of the two analog channels at a sampling frequency of 16 kHz to obtain two channels of digitized audio. One of ordinary skill in the art will appreciate that if the microphones included in the electronic equipment are digital microphones, digitized audio signals are collected directly, and no analog-to-digital conversion is needed.
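The 16 kHz digitization step described above can be illustrated with a short sketch. This is not code from the patent; it simulates ADC sampling of two hypothetical analog microphone signals in NumPy, with 440 Hz sine tones standing in for real microphone input.

```python
import numpy as np

SAMPLE_RATE = 16_000  # 16 kHz, as in the embodiment

def sample_analog(analog_fn, duration_s, fs=SAMPLE_RATE):
    """Simulate ADC sampling of a continuous-time signal at fs Hz."""
    t = np.arange(int(duration_s * fs)) / fs
    return analog_fn(t).astype(np.float32)

# Two hypothetical "analog" channels; the phase offset mimics the
# slightly different path from the source to each microphone.
left = sample_analog(lambda t: np.sin(2 * np.pi * 440 * t), duration_s=1.0)
right = sample_analog(lambda t: np.sin(2 * np.pi * 440 * t + 0.1), duration_s=1.0)
stereo = np.stack([left, right])  # shape (2, 16000): the dual-channel signal
```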
In 102, a first acoustic feature of the dual-channel audio signal is extracted according to a first preset feature extraction strategy, and a pre-trained first scene classification model is called to perform scene classification based on the first acoustic feature, obtaining a first candidate scene classification result.
It should be noted that in the embodiment of the present application, a first scene classification model and a second scene classification model are trained in advance, the two models being of different types: the first scene classification model takes a dual-channel acoustic feature as input, the second scene classification model takes a single-channel acoustic feature as input, and both output the scene classification result predicted from the input acoustic feature.
Accordingly, after collecting the dual-channel audio signal of the scene to be identified, the electronic equipment extracts the first acoustic feature of the dual-channel audio signal, a dual-channel acoustic feature, according to the first preset feature extraction strategy. The electronic equipment then inputs the extracted first acoustic feature into the pre-trained first scene classification model, which predicts the scene type of the scene to be identified based on that input. The scene classification result predicted and output by the first scene classification model is taken as the first candidate scene classification result of the scene to be identified.
In 103, audio synthesis processing is performed on the dual-channel audio signal to obtain a single-channel audio signal.
In the embodiment of the present application, the electronic equipment also performs audio synthesis processing on the dual-channel audio signal, synthesizing it into a single-channel audio signal. For example, the average of the two channels of the dual-channel audio signal can be taken to obtain the single-channel audio signal.
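The channel-averaging synthesis mentioned above is a one-liner in practice. A minimal NumPy sketch (the function name and array shapes are illustrative conventions, not from the patent):

```python
import numpy as np

def to_mono(stereo: np.ndarray) -> np.ndarray:
    """Average the two channels of a (2, N) signal into an (N,) mono signal."""
    assert stereo.shape[0] == 2
    return stereo.mean(axis=0)

stereo = np.array([[1.0, 0.0, -1.0],
                   [0.0, 0.0,  1.0]])
mono = to_mono(stereo)  # → [0.5, 0.0, 0.0]
```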
It should be noted that the execution order of 102 and 103 is not constrained by their numbering: 103 may be executed after 102 is completed, 102 may be executed after 103 and 104 are completed, or 102 and 103 may be executed simultaneously.
In 104, a second acoustic feature of the single-channel audio signal is extracted according to a second preset feature extraction strategy, and a pre-trained second scene classification model is called to perform scene classification based on the second acoustic feature, obtaining a second candidate scene classification result.
As described above, in the embodiment of the present application a second scene classification model is also trained, the second scene classification model taking a single-channel acoustic feature as input.
Accordingly, after synthesizing the single-channel audio signal from the collected dual-channel audio signal, the electronic equipment extracts the second acoustic feature of the synthesized single-channel audio signal, a single-channel acoustic feature, according to the second preset feature extraction strategy. The electronic equipment then inputs the extracted second acoustic feature into the pre-trained second scene classification model, which predicts the scene type of the scene to be identified based on that input. The scene classification result output by the second scene classification model is taken as the second candidate scene classification result of the scene to be identified.
In 105, a target scene classification result of the scene to be identified is obtained according to the first candidate scene classification result and the second candidate scene classification result.
In the embodiment of the present application, after obtaining the first candidate scene classification result and the second candidate scene classification result of the scene to be identified, the electronic equipment can derive the target scene classification result of the scene to be identified from the two candidate results. For example, the electronic equipment can take whichever of the first and second candidate scene classification results has the higher corresponding probability value as the target scene classification result of the scene to be identified.
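The higher-probability fusion rule just described can be sketched as follows. The (label, probability) pair format is an assumption made for illustration, since the patent does not specify how the two classification models expose their confidence values:

```python
def fuse_by_confidence(result1, result2):
    """Pick the candidate result whose classifier assigned the higher
    probability; each result is an assumed (label, probability) pair."""
    return result1 if result1[1] >= result2[1] else result2

# Hypothetical outputs of the two prediction schemes
target = fuse_by_confidence(("subway scene", 0.91), ("street scene", 0.64))
# → ("subway scene", 0.91)
```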
In addition, after obtaining the target scene classification result of the scene to be identified, the electronic equipment can also perform a preset operation corresponding to that target scene classification result. For example, when the target scene classification result of the scene to be identified is "subway scene", the electronic equipment can set its audio output parameters to the preset audio output parameters corresponding to the subway scene.
As shown in Fig. 3, in the embodiment of the present application the dual-channel audio signal of the scene to be identified is collected first; then, through prediction scheme 1, based on the dual-channel audio signal, and prediction scheme 2, based on the single-channel audio signal synthesized from the dual-channel audio signal, two candidate scene classification results of the scene to be identified are obtained; the two candidate scene classification results are then fused to obtain the target scene classification result of the scene to be identified. As a result, the scene in which the electronic equipment is located can be identified without positioning technology, and thus without any restriction on the environment in which the electronic equipment is located; compared with the related art, the application can identify the scene to be identified more flexibly and accurately.
In one embodiment, "performing audio synthesis processing on the dual-channel audio signal to obtain a single-channel audio signal" includes:
synthesizing the dual-channel audio signal into a single-channel audio signal according to a preset beamforming algorithm.
In the embodiment of the present application, beamforming can be used to synthesize the dual-channel audio signal into a one-dimensional single-channel audio signal. The electronic equipment performs beamforming on the collected dual-channel audio signal of the scene to be identified according to the preset beamforming algorithm, obtaining an enhanced single-channel audio signal. The enhanced single-channel audio signal retains the sound coming from a specific direction in the original dual-channel audio signal and can therefore characterize the scene to be identified more accurately.
It should be noted that the embodiment of the present application does not specifically limit which beamforming algorithm is used; it can be chosen by those of ordinary skill in the art according to actual needs. For example, in the embodiment of the present application, a generalized sidelobe canceller algorithm is used for the beamforming processing.
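The patent names the generalized sidelobe canceller but gives no implementation; a full GSC combines a fixed beamformer, a blocking matrix, and an adaptive noise canceller. As a hedged illustration only, the sketch below shows just the fixed delay-and-sum component, which already demonstrates how delaying one channel steers a two-microphone array toward a look direction:

```python
import numpy as np

def delay_and_sum(ch1, ch2, delay_samples):
    """Steer toward a direction by time-aligning channel 2 to channel 1,
    then averaging. delay_samples is the assumed inter-microphone delay
    (in samples) for the desired look direction; 0 reduces to plain
    channel averaging."""
    ch2_aligned = np.roll(ch2, -delay_samples)
    return 0.5 * (ch1 + ch2_aligned)

# Toy example: the same 300 Hz source arrives 4 samples later at mic 2.
fs = 16_000
t = np.arange(fs) / fs
sig = np.sin(2 * np.pi * 300 * t)
ch1 = sig
ch2 = np.roll(sig, 4)
mono = delay_and_sum(ch1, ch2, delay_samples=4)  # delays cancel; source reinforced
```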
In one embodiment, "obtaining a target scene classification result of the scene to be identified according to the first candidate scene classification result and the second candidate scene classification result" includes:
(1) judging whether the first candidate scene classification result and the second candidate scene classification result are the identical scene classification result;
(2) if so, setting the identical scene classification result as the target scene classification result.
In the embodiment of the present application, the target scene classification result of the scene to be identified can be obtained by fusing the first candidate scene classification result and the second candidate scene classification result on the basis of their agreement.
The electronic equipment first judges whether the first candidate scene classification result and the second candidate scene classification result are the identical scene classification result. If they are, the electronic equipment sets the identical scene classification result as the target scene classification result of the scene to be identified. If, on the other hand, the two candidate scene classification results are not identical, the electronic equipment judges that this recognition of the scene to be identified has failed, re-collects a dual-channel audio signal of the scene to be identified, and performs recognition again.
For example, if the first candidate classification result is "subway scene" and the second candidate classification result is also "subway scene", the electronic equipment takes "subway scene" as the target scene classification result of the scene to be identified.
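The agreement-based fusion of steps (1)-(2) reduces to a comparison. A minimal sketch, where returning None as the "recognition failed, re-collect the audio" signal is an illustrative convention rather than something the patent specifies:

```python
def fuse_by_agreement(label1, label2):
    """Accept the classification only when the two schemes agree.

    Returns the common label, or None to signal that recognition failed
    and the dual-channel audio should be re-collected.
    """
    return label1 if label1 == label2 else None

assert fuse_by_agreement("subway scene", "subway scene") == "subway scene"
assert fuse_by_agreement("subway scene", "street scene") is None
```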
In one embodiment, before "performing audio collection on a scene to be identified through the two microphones", the method further includes:
(1) obtaining dual-channel audio signals of multiple different types of known scenes through the two microphones;
(2) extracting mel-frequency cepstral coefficients of the dual-channel audio signals of each type of known scene, and constructing a first sample set corresponding to the multiple different types of known scenes;
(3) constructing a residual convolutional neural network model, training the residual convolutional neural network model on the first sample set, and setting the trained residual convolutional neural network model as the first scene classification model.
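The patent does not detail the residual convolutional neural network's architecture. As an illustration of the defining idea only, the NumPy sketch below implements a single residual block, y = x + F(x), on a toy single-channel feature map; a real model would stack many such blocks over the MFCC feature maps and be trained with a deep-learning framework:

```python
import numpy as np

def conv2d_same(x, kernel):
    """Minimal 'same'-padded 2-D convolution (single channel, stride 1)."""
    kh, kw = kernel.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * kernel)
    return out

def residual_block(x, k1, k2):
    """y = x + F(x), where F is conv → ReLU → conv; the identity shortcut
    is what makes the network 'residual'."""
    h = np.maximum(conv2d_same(x, k1), 0.0)  # first conv + ReLU
    return x + conv2d_same(h, k2)            # second conv + skip connection

x = np.random.default_rng(0).standard_normal((8, 8))
identity = np.zeros((3, 3)); identity[1, 1] = 1.0
y = residual_block(x, identity, np.zeros((3, 3)))
# with an all-zero second kernel, F(x) contributes nothing, so y == x
```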
The embodiment of the present application further provides a scheme for training the first scene classification model, as follows.
The electronic equipment first obtains dual-channel audio signals of multiple different types of known scenes through the two microphones. When obtaining these signals, on the one hand, the electronic equipment can be carried by relevant technical personnel into multiple different types of known scenes, and in each scene of known type the electronic equipment is triggered to collect audio signals. On the other hand, when triggered to obtain an audio signal, the electronic equipment collects through the two microphones a dual-channel audio signal of a first preset duration (a suitable duration can be configured by those skilled in the art according to actual needs, for example 5 minutes). Referring to Fig. 4, after collecting the dual-channel audio signal of the first preset duration, a scene type information input interface is provided, and scene type information is received through the scene type information input interface (the scene type information is entered by the relevant technical personnel; for example, if the personnel carry the electronic equipment into a subway carriage to collect the audio signal, the entered scene type information can be "subway carriage scene"). After receiving the entered scene type information, the electronic equipment associates the collected dual-channel audio signal with the received scene type information.
As a result, the electronic equipment can obtain dual-channel audio signals corresponding to different types of known scenes, for example audio signals of known scenes such as a restaurant scene, subway carriage scene, bus scene, office scene, and street scene.
In addition, when obtaining the dual-channel audio signals of the different types of known scenes, a preset quantity (a suitable quantity can be configured by those skilled in the art according to actual needs, for example 50) of dual-channel audio signals can be obtained for each scene type. For the bus scene, for example, dual-channel audio signals of the same bus during different periods can be obtained, 50 dual-channel audio signals of that bus in total, or dual-channel audio signals of different buses can be obtained, dual-channel audio signals of 50 buses in total.
It should be noted that when obtaining multiple dual-channel audio signals of the same scene type, a folder named with the received scene type information can be created, and the multiple dual-channel audio signals of the same type are stored in that folder.
In the embodiment of the present application, after obtaining the double-channel audio signals of the known scenes of multiple different types, the electronic equipment further extracts the mel-frequency cepstrum coefficients of the double-channel audio signals of the known scenes of each type, so as to construct a first sample set corresponding to the known scenes of the multiple different types.
For example, referring to Fig. 5, taking one channel of a double-channel audio signal as an example, the electronic equipment first pre-processes that channel of the audio signal, for example by high-pass (pre-emphasis) filtering it, with the mathematical expression H(z) = 1 − a·z⁻¹, where H(z) denotes the filtered audio signal, z denotes the audio signal before filtering, and a is a correction coefficient, generally taken as 0.95-0.97. Then, framing and windowing are applied to the filtered audio signal to smooth the edges of the audio frames obtained by framing, for example windowing with a Hamming window, w(n) = 0.54 − 0.46·cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1, where N is the frame length. Then, a Fourier transform, such as a fast Fourier transform (FFT), is applied to the windowed audio frames, followed by extraction of the mel-frequency cepstrum coefficients: the Fourier transform result is filtered by a mel filter bank to obtain mel frequencies that conform to human auditory habits, the logarithm converting the units, with the mathematical expression F_mel(f) = 2595·log10(1 + f/700), where F_mel(f) denotes the obtained mel frequency and f is the frequency bin after the Fourier transform. Then, the electronic equipment applies a discrete cosine transform to the obtained mel frequencies to obtain the mel-frequency cepstrum coefficients. Accordingly, for any double-channel audio signal, the electronic equipment extracts the mel-frequency cepstrum coefficients of both channels.
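The extraction pipeline described above (pre-emphasis, framing, Hamming windowing, FFT, mel filter bank, log, DCT) can be sketched as follows for one channel. The sampling rate, frame length, hop, filter count and coefficient count are illustrative assumptions; the text does not specify them.

```python
import numpy as np

def mfcc(signal, sr=16000, a=0.97, frame_len=400, hop=160,
         n_fft=512, n_mels=26, n_ceps=13):
    """Sketch of the described MFCC pipeline for one audio channel."""
    # Pre-emphasis: y[n] = x[n] - a*x[n-1], i.e. H(z) = 1 - a*z^-1
    y = np.append(signal[0], signal[1:] - a * signal[:-1])
    # Framing, with a Hamming window to smooth the frame edges
    n_frames = 1 + max(0, (len(y) - frame_len) // hop)
    win = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
    frames = np.stack([y[i * hop:i * hop + frame_len] * win
                       for i in range(n_frames)])
    # Power spectrum via FFT
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular mel filter bank, spaced on F_mel(f) = 2595*log10(1 + f/700)
    f_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    f_hz = lambda m: 700.0 * (10 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(f_mel(0), f_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * f_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)
    # DCT of the log mel energies; keep the first n_ceps coefficients
    n = log_mel.shape[1]
    dct = np.cos(np.pi / n * (np.arange(n) + 0.5)[:, None]
                 * np.arange(n_ceps)[None, :])
    return log_mel @ dct
```

For a double-channel signal, the same function would simply be applied to each channel and the two coefficient matrices kept together as one training sample.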
After extracting the mel-frequency cepstrum coefficients of the double-channel audio signals of the known scenes of each type, the electronic equipment associates each double-channel set of mel-frequency cepstrum coefficients with its corresponding scene type information, so as to construct the first sample set corresponding to the known scenes of the multiple different types.
After constructing the first sample set, the electronic equipment further constructs an initialized residual convolutional neural network model and performs supervised training on it based on the first sample set, obtaining a trained residual convolutional neural network model, which serves as the first scene classification model.
For example, the electronic equipment takes the structure of Resnet-50 as the basis, keeps its input vector dimension identical to the dimension of the input data, and modifies the nodes of the final classification layer so that their number equals the total number of scene categories, thereby obtaining the initialized residual convolutional neural network.
In one embodiment, " the first acoustics for extracting double-channel audio frequency signal according to the first default feature extraction strategy is special
Sign, and call the first scene classification model of training to be in advance based on the first acoustic feature and carry out scene classification, obtain the first candidate
Scene classification result ", comprising:
(1) the mel-frequency cepstrum coefficient for extracting double-channel audio frequency signal, is set as the first acoustic feature;
(2) by the residual error convolutional Neural after the mel-frequency cepstrum coefficient input training of the double-channel audio frequency signal extracted
Network model, multiple scene classification results of the residual error convolutional neural networks model output after being trained and its corresponding probability
Value;
It (3), will when the most probable value of the residual error convolutional neural networks model output after training reaches predetermined probabilities value
The corresponding scene classification result of most probable value of residual error convolutional neural networks model output after training is set as the first candidate field
Scape classification results.
It is obtained as noted previously, as the first scene classification model is based on the training of twin-channel mel-frequency cepstrum coefficient, phase
It answers, electronic equipment extracts dual-channel audio when identifying by the first scene classification model to scene to be identified first
The mel-frequency cepstrum coefficient of signal, is set as the first acoustic feature, wherein for how to extract to obtain mel-frequency cepstrum system
Number, specifically can refer to the associated description of above embodiments, details are not described herein again.
Electronic equipment extracts to obtain the mel-frequency cepstrum coefficient of the double-channel audio frequency signal of scene to be identified, and is set
After the first acoustic feature, after the mel-frequency cepstrum coefficient of the double-channel audio frequency signal extracted being inputted training
Residual error convolutional neural networks model is predicted.Wherein, the residual error convolutional neural networks after training will export multiple possible fields
The probability value of scape classification results and these possible scene classification results.Correspondingly, after electronic equipment will acquire training
Multiple scene classification results of residual error convolutional neural networks model output and its corresponding probability value.
It should be noted that the predetermined probabilities value for being provided with screening scene classification result in the embodiment of the present application (specifically may be used
Empirical value is taken according to actual needs by those of ordinary skill in the art, for example, value is 0.76) electronics in the embodiment of the present application
Equipment may determine that whether the most probable value of the residual error convolutional neural networks model output after training reaches predetermined probabilities value, if
Reach, then the corresponding scene classification knot of most probable value that electronic equipment exports the residual error convolutional neural networks model after training
Fruit is set as the first alternate scenes classification results.
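The screening step above can be sketched as follows. The 0.76 threshold comes from the text; the softmax helper and the label names are illustrative assumptions.

```python
import numpy as np

def screen_candidate(logits, labels, threshold=0.76):
    """Keep the top scene classification result only when its
    probability reaches the preset probability value; otherwise
    return no candidate."""
    exp = np.exp(logits - np.max(logits))  # numerically stable softmax
    probs = exp / exp.sum()
    best = int(np.argmax(probs))
    if probs[best] >= threshold:
        return labels[best], float(probs[best])
    return None, float(probs[best])  # below threshold: no candidate
```

The same screening applies to both classification models, each against the same preset probability value.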
In one embodiment, " double-channel audio frequency signal of scene known to multiple and different types is obtained by two microphones "
Later, further includes:
(1) double-channel audio frequency signal of scene known to multiple and different types is synthesized into channel audio signal respectively;
(2) each channel energy regularization feature for the channel audio signal that all types of known scenes synthesize is extracted,
Second sample set of scene known to the corresponding multiple and different types of building;
(3) lightweight convolutional neural networks model is constructed, and processing is optimized to lightweight convolutional neural networks model,
Lightweight convolutional neural networks model after being optimized;
(4) the lightweight convolutional neural networks model after optimization is trained according to the second sample set, after training
Lightweight convolutional neural networks model is set as the second scene classification model.
The embodiment of the present application also provides training and obtains the scheme of the second scene classification model, as follows:
Wherein, electronic equipment is in the double-channel audio frequency signal for obtaining scene known to multiple and different types by two microphones
Later, the double-channel audio frequency signal of scene known to multiple and different types is also synthesized into channel audio signal respectively, thus
To the channel audio signal of scene known to multiple and different types.
Then, electronic equipment further extracts after the channel audio signal that synthesis obtains all types of known scenes
Each channel energy regularization feature of the channel audio signal of all types of known scenes, to construct corresponding multiple and different types
Second sample set of known scene.
For example, referring to Fig. 6, taking a certain single-channel audio signal as an example, the electronic equipment first pre-processes the single-channel audio signal, for example by high-pass (pre-emphasis) filtering it, with the mathematical expression H(z) = 1 − a·z⁻¹, where H(z) denotes the filtered audio signal, z denotes the audio signal before filtering, and a is a correction coefficient, generally taken as 0.95-0.97. Then, framing and windowing are applied to the filtered audio signal to smooth the edges of the audio frames obtained by framing, for example windowing with a Hamming window, w(n) = 0.54 − 0.46·cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1, where N is the frame length. Then, a Fourier transform, such as a fast Fourier transform (FFT), is applied to the windowed audio frames, after which the mel spectrum is extracted: the Fourier transform result is filtered by a mel filter bank to obtain mel frequencies that conform to human auditory habits, the logarithm converting the units, with the mathematical expression F_mel(f) = 2595·log10(1 + f/700), where F_mel(f) denotes the obtained mel frequency and f is the frequency bin after the Fourier transform. Then, the electronic equipment smooths the obtained mel spectrum E(t, f) over time, with the mathematical expression M(t, f) = (1 − s)·M(t − 1, f) + s·E(t, f), where M(t, f) denotes the smoothing result, obtained by adjusted synthesis through the weight s of each audio frame in the time sequence, and t and f denote time and frequency respectively. Finally, the electronic equipment extracts the per-channel energy normalization feature from the smoothing result, with the mathematical expression PCEN(t, f) = (E(t, f) / (μ + M(t, f))^α + δ)^r − δ^r, where μ is a minimal positive number to avoid a divisor of 0, and the parameters α, δ and r are learnable dynamic parameters.
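A minimal sketch of the smoothing and per-channel energy normalization steps above, assuming a mel spectrogram E of shape (time, frequency). The parameter values are illustrative defaults, not values given in the text.

```python
import numpy as np

def pcen(E, s=0.04, alpha=0.98, delta=2.0, r=0.5, mu=1e-6):
    """Per-channel energy normalization: first-order smoothing
    M(t,f) = (1-s)*M(t-1,f) + s*E(t,f), then
    PCEN(t,f) = (E/(mu + M)**alpha + delta)**r - delta**r,
    where mu avoids division by zero and alpha, delta, r are the
    parameters the text describes as learnable."""
    M = np.zeros_like(E)
    M[0] = E[0]
    for t in range(1, E.shape[0]):
        M[t] = (1 - s) * M[t - 1] + s * E[t]
    return (E / (mu + M) ** alpha + delta) ** r - delta ** r
```

In a trained model, alpha, delta and r would be tensors updated by backpropagation rather than fixed floats.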
After extracting the per-channel energy normalization features of the single-channel audio signals of the known scenes of each type, the electronic equipment associates each extracted per-channel energy normalization feature with its corresponding scene type information, so as to construct the second sample set corresponding to the known scenes of the multiple different types.
After constructing the second sample set, the electronic equipment further constructs an initialized lightweight convolutional neural network model, performs optimization processing on it to obtain an optimized lightweight convolutional neural network model, and then performs supervised training on the optimized model based on the second sample set, obtaining a trained lightweight convolutional neural network model, which serves as the second scene classification model.
For example, the electronic equipment takes the structure of the Xception network as the basis and performs optimization processing on it, so that it learns through separable convolutions on 36 convolutional layers, performs pooling operations at the 32nd, 34th and 36th layers, and synthesizes the three resulting kinds of features for the final classification. Furthermore, Focal loss may be used to perform compensatory training on scenes with poor classification performance (such as park scenes). Finally, model training and convergence are carried out in the deep learning framework tensorflow, and after training, accuracy testing and quantization compression are performed to obtain the second scene classification model.
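The compensatory training mentioned above relies on focal loss down-weighting well-classified examples. A minimal sketch of the focal-loss computation for a single example follows; γ and the per-class weights are illustrative, since the text names Focal loss without giving its parameters.

```python
import numpy as np

def focal_loss(probs, target, gamma=2.0, weights=None):
    """Focal loss FL(p_t) = -w_t * (1 - p_t)**gamma * log(p_t).
    When the model is already confident (p_t near 1) the loss is
    scaled down, so gradient updates concentrate on hard scenes
    such as the park scene mentioned in the text."""
    p_t = probs[target]
    w_t = 1.0 if weights is None else weights[target]
    return -w_t * (1.0 - p_t) ** gamma * np.log(p_t + 1e-12)
```

With gamma = 0 and unit weights this reduces to ordinary cross-entropy, which is the sense in which it "compensates" the under-performing classes.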
In one embodiment, " the second acoustics for extracting channel audio signal according to the second default feature extraction strategy is special
Sign, and call the second scene classification model of training to be in advance based on the second acoustic feature and carry out scene classification, obtain the second candidate
Scene classification result ", comprising:
(1) each channel energy regularization feature for extracting channel audio signal, by each channel of channel audio signal
Energy regularization feature is set as the second acoustic feature;
(2) by the lightweight convolutional Neural net after each channel energy regularization feature input training of channel audio signal
Network model, multiple scene classification results of the lightweight convolutional neural networks output after being trained and its corresponding probability value;
(3) when the most probable value of the lightweight convolutional neural networks model output after training reaches predetermined probabilities value,
The corresponding scene classification result of most probable value of lightweight convolutional neural networks model output after training is set as the second time
Selected scenes scape classification results.
It is obtained as noted previously, as the second scene classification model is based on the training of each channel energy regularization feature, correspondingly,
Electronic equipment extracts channel audio signal when identifying by the second scene classification model to scene to be identified first
Each channel energy regularization feature, is set as the second acoustic feature, wherein for how to extract to obtain each channel energy regularization spy
Sign, specifically can refer to the associated description of above embodiments, details are not described herein again.
Electronic equipment extracts to obtain each channel energy regularization feature of the channel audio signal of scene to be identified, and will
It is set as after the second acoustic feature, can input each channel energy regularization feature of the channel audio signal extracted
Lightweight convolutional neural networks model after training is predicted.Wherein, the lightweight convolutional neural networks model after training will
Export multiple possible scene classifications as a result, and these possible scene classification results probability value.Correspondingly, electronic equipment
Multiple scene classification results of lightweight convolutional neural networks model output after will acquire training and its corresponding probability value.
It should be noted that the predetermined probabilities value for being provided with screening scene classification result in the embodiment of the present application (specifically may be used
Empirical value is taken according to actual needs by those of ordinary skill in the art, for example, value is 0.76) electronics in the embodiment of the present application
Equipment may determine that whether the most probable value of the lightweight convolutional neural networks model output after training reaches predetermined probabilities value,
If reaching, the corresponding scene point of the most probable value that electronic equipment exports the lightweight convolutional neural networks model after training
Class result is set as the second alternate scenes classification results.
On the basis of the method described in the above embodiments, the scene recognition method of the present application is further introduced below. Referring to Fig. 7, the scene recognition method may include:
In 201, the electronic equipment obtains the double-channel audio signals of known scenes of multiple different types through the two microphones, and trains a residual convolutional neural network model according to the double-channel audio signals of the known scenes of the multiple different types.
The electronic equipment first obtains the double-channel audio signals of the known scenes of the multiple different types through the two microphones. When obtaining these signals, on the one hand, the electronic equipment may be carried by a technician into multiple known scenes of different types, and in each known scene the technician triggers the electronic equipment to acquire audio signals. On the other hand, when triggered, the electronic equipment collects, through the two microphones, a double-channel audio signal of a first preset duration (a suitable duration may be configured by those skilled in the art according to actual needs, for example, 5 minutes). Referring to Fig. 4, after collecting the double-channel audio signal of the first preset duration, the electronic equipment provides a scene type information input interface and receives the input scene type information through that interface (the scene type information is input by the technician; for example, when the technician carries the electronic equipment into a subway carriage for audio signal acquisition, the input scene type information may be "subway carriage scene"). After receiving the input scene type information, the electronic equipment associates the collected double-channel audio signal with it.
In this way, the electronic equipment can obtain double-channel audio signals corresponding to known scenes of different types, for example, audio signals of known scene types such as a dining room scene, a subway carriage scene, a bus scene, an office scene and a street scene.
In addition, when obtaining the double-channel audio signals of the known scenes of different types, for each scene type a preset quantity (a suitable quantity may be configured by those skilled in the art according to actual needs, for example, 50) of double-channel audio signals may be obtained. For a bus scene, for example, double-channel audio signals of the same bus in different periods may be obtained until 50 double-channel audio signals of that bus have been collected, or double-channel audio signals of different buses may be obtained until double-channel audio signals of 50 buses have been collected.
It should be noted that, when obtaining multiple double-channel audio signals of the same scene type, a folder named after the received scene type information may be created, and the multiple double-channel audio signals of that type may be stored in the same folder.
In the embodiment of the present application, after obtaining the double-channel audio signals of the known scenes of multiple different types, the electronic equipment further extracts the mel-frequency cepstrum coefficients of the double-channel audio signals of the known scenes of each type, so as to construct a first sample set corresponding to the known scenes of the multiple different types.
For example, referring to Fig. 5, taking one channel of a double-channel audio signal as an example, the electronic equipment first pre-processes that channel of the audio signal, for example by high-pass (pre-emphasis) filtering it, with the mathematical expression H(z) = 1 − a·z⁻¹, where H(z) denotes the filtered audio signal, z denotes the audio signal before filtering, and a is a correction coefficient, generally taken as 0.95-0.97. Then, framing and windowing are applied to the filtered audio signal to smooth the edges of the audio frames obtained by framing, for example windowing with a Hamming window, w(n) = 0.54 − 0.46·cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1, where N is the frame length. Then, a Fourier transform, such as a fast Fourier transform (FFT), is applied to the windowed audio frames, followed by extraction of the mel-frequency cepstrum coefficients: the Fourier transform result is filtered by a mel filter bank to obtain mel frequencies that conform to human auditory habits, the logarithm converting the units, with the mathematical expression F_mel(f) = 2595·log10(1 + f/700), where F_mel(f) denotes the obtained mel frequency and f is the frequency bin after the Fourier transform. Then, the electronic equipment applies a discrete cosine transform to the obtained mel frequencies to obtain the mel-frequency cepstrum coefficients. Accordingly, for any double-channel audio signal, the electronic equipment extracts the mel-frequency cepstrum coefficients of both channels.
After extracting the mel-frequency cepstrum coefficients of the double-channel audio signals of the known scenes of each type, the electronic equipment associates each double-channel set of mel-frequency cepstrum coefficients with its corresponding scene type information, so as to construct the first sample set corresponding to the known scenes of the multiple different types.
After constructing the first sample set, the electronic equipment further constructs an initialized residual convolutional neural network model and performs supervised training on it based on the first sample set, obtaining a trained residual convolutional neural network model.
For example, the electronic equipment takes the structure of Resnet-50 as the basis, keeps its input vector dimension identical to the dimension of the input data, and modifies the nodes of the final classification layer so that their number equals the total number of scene categories, thereby obtaining the initialized residual convolutional neural network.
In 202, the electronic equipment synthesizes the double-channel audio signals of the known scenes of the multiple different types into single-channel audio signals respectively, and trains a lightweight convolutional neural network model according to the single-channel audio signals of the known scenes of the multiple different types.
After obtaining the double-channel audio signals of the known scenes of the multiple different types through the two microphones, the electronic equipment also synthesizes these double-channel audio signals into single-channel audio signals respectively, thereby obtaining the single-channel audio signals of the known scenes of the multiple different types.
Then, after synthesizing the single-channel audio signals of the known scenes of each type, the electronic equipment further extracts the per-channel energy normalization features of these single-channel audio signals, so as to construct a second sample set corresponding to the known scenes of the multiple different types.
For example, referring to Fig. 6, taking a certain single-channel audio signal as an example, the electronic equipment first pre-processes the single-channel audio signal, for example by high-pass (pre-emphasis) filtering it, with the mathematical expression H(z) = 1 − a·z⁻¹, where H(z) denotes the filtered audio signal, z denotes the audio signal before filtering, and a is a correction coefficient, generally taken as 0.95-0.97. Then, framing and windowing are applied to the filtered audio signal to smooth the edges of the audio frames obtained by framing, for example windowing with a Hamming window, w(n) = 0.54 − 0.46·cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1, where N is the frame length. Then, a Fourier transform, such as a fast Fourier transform (FFT), is applied to the windowed audio frames, after which the mel spectrum is extracted: the Fourier transform result is filtered by a mel filter bank to obtain mel frequencies that conform to human auditory habits, the logarithm converting the units, with the mathematical expression F_mel(f) = 2595·log10(1 + f/700), where F_mel(f) denotes the obtained mel frequency and f is the frequency bin after the Fourier transform. Then, the electronic equipment smooths the obtained mel spectrum E(t, f) over time, with the mathematical expression M(t, f) = (1 − s)·M(t − 1, f) + s·E(t, f), where M(t, f) denotes the smoothing result, obtained by adjusted synthesis through the weight s of each audio frame in the time sequence, and t and f denote time and frequency respectively. Finally, the electronic equipment extracts the per-channel energy normalization feature from the smoothing result, with the mathematical expression PCEN(t, f) = (E(t, f) / (μ + M(t, f))^α + δ)^r − δ^r, where μ is a minimal positive number to avoid a divisor of 0, and the parameters α, δ and r are learnable dynamic parameters.
After extracting the per-channel energy normalization features of the single-channel audio signals of the known scenes of each type, the electronic equipment associates each extracted per-channel energy normalization feature with its corresponding scene type information, so as to construct the second sample set corresponding to the known scenes of the multiple different types.
After constructing the second sample set, the electronic equipment further constructs an initialized lightweight convolutional neural network model, performs optimization processing on it to obtain an optimized lightweight convolutional neural network model, and then performs supervised training on the optimized model based on the second sample set, obtaining a trained lightweight convolutional neural network model.
For example, the electronic equipment takes the structure of the Xception network as the basis and performs optimization processing on it, so that it learns through separable convolutions on 36 convolutional layers, performs pooling operations at the 32nd, 34th and 36th layers, and synthesizes the three resulting kinds of features for the final classification. Furthermore, Focal loss may be used to perform compensatory training on scenes with poor classification performance (such as park scenes). Finally, model training and convergence are carried out in the deep learning framework tensorflow, and after training, accuracy testing and quantization compression are performed.
In 203, the electronic equipment performs audio collection on the scene to be identified through the two microphones, obtaining a double-channel audio signal.
The scene to be identified may be the scene in which the electronic equipment is currently located. When the current scene is set as the scene to be identified, the electronic equipment first performs audio collection on it through the two microphones: the two microphones synchronously collect audio of the current scene, obtaining two channels of audio signals of identical duration, i.e., the double-channel audio signal.
In 204, the electronic equipment calls the trained residual convolutional neural network model to obtain the first candidate scene classification result of the scene to be identified based on its double-channel audio signal.
After collecting the double-channel audio signal of the scene to be identified, the electronic equipment further extracts its mel-frequency cepstrum coefficients and inputs them into the trained residual convolutional neural network model, obtaining multiple scene classification results output by the trained residual convolutional neural network model and their corresponding probability values. When the largest probability value output by the trained residual convolutional neural network model reaches the preset probability value, the electronic equipment sets the scene classification result corresponding to that largest probability value as the first candidate scene classification result.
In 205, the electronic equipment synthesizes the double-channel audio signal of the scene to be identified into a single-channel audio signal, and calls the trained lightweight convolutional neural network model to obtain the second candidate scene classification result of the scene to be identified based on the single-channel audio signal.
In addition, the electronic equipment synthesizes the double-channel audio signal of the scene to be identified into a single-channel audio signal, extracts the per-channel energy normalization features of the single-channel audio signal, and inputs them into the trained lightweight convolutional neural network model, obtaining multiple scene classification results output by the trained lightweight convolutional neural network and their corresponding probability values. When the largest probability value output by the trained lightweight convolutional neural network model reaches the preset probability value, the scene classification result corresponding to that largest probability value is set as the second candidate scene classification result.
In 206, the electronic equipment judges whether the first candidate scene classification result and the second candidate scene classification result are the same scene classification result, and if so, sets that same scene classification result as the target scene classification result.
The electronic equipment judges whether the first candidate scene classification result and the second candidate scene classification result are the same scene classification result. If they are, the electronic equipment sets that same scene classification result as the target scene classification result of the scene to be identified. If they are not the same scene classification result, the electronic equipment judges that the current recognition operation on the scene to be identified has failed, reacquires the double-channel audio signal of the scene to be identified, and performs recognition again.
For example, if the first candidate classification result is "subway scene" and the second candidate classification result is also "subway scene", the electronic equipment takes "subway scene" as the target scene classification result of the scene to be identified.
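The decision logic of 206 can be sketched as follows; the function and label names are illustrative assumptions.

```python
def fuse_results(first_candidate, second_candidate):
    """Return the target scene classification result only when both
    candidate scene classification results agree; otherwise signal a
    failed recognition so the double-channel audio signal of the
    scene to be identified can be reacquired."""
    if first_candidate is not None and first_candidate == second_candidate:
        return first_candidate   # target scene classification result
    return None                  # recognition failed: reacquire audio
```

Requiring agreement between two models trained on different features (double-channel MFCC versus single-channel PCEN) trades recall for precision: a scene is only labeled when both views concur.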
In one embodiment, a scene recognition device is also provided. Referring to Fig. 8, Fig. 8 is a structural schematic diagram of the scene recognition device provided by the embodiment of the present application. The scene recognition device is applied to electronic equipment that includes two microphones, and includes an audio collection module 301, a first classification module 302, an audio synthesis module 303, a second classification module 304 and a classification integration module 305, as follows:
the audio collection module 301 is configured to perform audio collection on the scene to be identified through the two microphones to obtain a double-channel audio signal;
the first classification module 302 is configured to extract a first acoustic feature of the double-channel audio signal according to a first preset feature extraction strategy, and call a first scene classification model trained in advance to perform scene classification based on the first acoustic feature to obtain a first candidate scene classification result;
the audio synthesis module 303 is configured to perform audio synthesis processing on the double-channel audio signal to obtain a single-channel audio signal;
the second classification module 304 is configured to extract a second acoustic feature of the single-channel audio signal according to a second preset feature extraction strategy, and call a second scene classification model trained in advance to perform scene classification based on the second acoustic feature to obtain a second candidate scene classification result;
the classification integration module 305 is configured to obtain the target scene classification result of the scene to be identified according to the first candidate scene classification result and the second candidate scene classification result.
In one embodiment, when performing audio synthesis processing on the dual-channel audio signal to obtain the single-channel audio signal, the audio synthesis module 303 is configured to:
synthesize the dual-channel audio signal into a single-channel audio signal according to a preset beamforming algorithm.
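As a rough illustration of what a beamforming-style synthesis does, the delay-and-sum sketch below aligns the second channel to the first by the integer lag that maximizes cross-correlation, then averages the channels. The patent does not specify its beamforming algorithm; real beamformers use fractional delays and adaptive weights, so this is only a minimal stand-in:

```python
import numpy as np

def delay_and_sum(left: np.ndarray, right: np.ndarray, max_lag: int = 8) -> np.ndarray:
    """Align the right channel to the left channel by the integer lag
    that maximizes cross-correlation, then average the two channels
    into a single-channel signal."""
    # score each candidate lag by the correlation after a circular shift
    scores = {lag: float(np.dot(left, np.roll(right, lag)))
              for lag in range(-max_lag, max_lag + 1)}
    best_lag = max(scores, key=scores.get)
    return 0.5 * (left + np.roll(right, best_lag))
```

Aligning before averaging avoids the comb-filtering that a naive channel average produces when the two microphones receive the same source with a small time offset.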
In one embodiment, when obtaining the target scene classification result of the scene to be identified according to the first candidate scene classification result and the second candidate scene classification result, the classification integration module 305 is configured to:
judge whether the first candidate scene classification result and the second candidate scene classification result are the same scene classification result;
if so, set the identical scene classification result as the target scene classification result.
In one embodiment, the scene recognition device further includes a model training module, which, before the audio collection is performed on the scene to be identified through the two microphones, is configured to:
obtain dual-channel audio signals of a plurality of known scenes of different types through the two microphones;
extract mel-frequency cepstral coefficients of the dual-channel audio signals of each type of known scene, and construct a first sample set corresponding to the plurality of known scenes of different types;
construct a residual convolutional neural network model, train the residual convolutional neural network model according to the first sample set, and set the trained residual convolutional neural network model as the first scene classification model.
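For readers unfamiliar with mel-frequency cepstral coefficients, the sketch below walks through the textbook pipeline (framing, power spectrum, triangular mel filterbank, log, DCT-II) in plain NumPy. The parameter values are assumed defaults, not the patent's; a production system would typically use a library such as librosa or torchaudio:

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=256, n_mels=26, n_mfcc=13):
    """Minimal MFCC extraction: windowed power spectrogram -> triangular
    mel filterbank -> log -> DCT-II. Returns (frames, n_mfcc)."""
    # frame the signal and apply a Hann window
    frames = np.lib.stride_tricks.sliding_window_view(signal, n_fft)[::hop]
    frames = frames * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # triangular mel filterbank over the FFT bins
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, power.shape[1]))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log-mel bands into cepstral coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (n[:, None] + 0.5) * np.arange(n_mfcc)[None, :])
    return logmel @ dct
```

For a dual-channel signal, the extraction would be applied to each channel (or to stacked channels) before the features are assembled into the first sample set.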
In one embodiment, when extracting the first acoustic feature of the dual-channel audio signal according to the first preset feature extraction strategy, and calling the pre-trained first scene classification model to perform scene classification based on the first acoustic feature to obtain the first candidate scene classification result, the first classification module 302 is configured to:
extract the mel-frequency cepstral coefficients of the dual-channel audio signal and set them as the first acoustic feature;
input the extracted mel-frequency cepstral coefficients of the dual-channel audio signal into the trained residual convolutional neural network model to obtain a plurality of scene classification results output by the trained residual convolutional neural network model and their corresponding probability values;
when the maximum probability value output by the trained residual convolutional neural network model reaches a preset probability value, set the scene classification result corresponding to that maximum probability value as the first candidate scene classification result.
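The thresholding step can be sketched as: take the model's highest-probability class and accept it as the candidate only if that probability reaches the preset value. The scene labels and the 0.6 threshold below are illustrative assumptions:

```python
import numpy as np

SCENES = ["subway scene", "office scene", "street scene", "home scene"]  # illustrative

def candidate_from_probs(probs, threshold=0.6):
    """Return the top-scoring scene label when its probability reaches
    the preset threshold; otherwise return None (no candidate produced)."""
    top = int(np.argmax(probs))
    return SCENES[top] if probs[top] >= threshold else None
```

Rejecting low-confidence maxima keeps an uncertain classifier from forcing a candidate into the later agreement check.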
In one embodiment, after the dual-channel audio signals of the plurality of known scenes of different types are obtained through the two microphones, the model training module is further configured to:
synthesize the dual-channel audio signals of the plurality of known scenes of different types into single-channel audio signals respectively;
extract per-channel energy normalization features of the single-channel audio signals synthesized for each type of known scene, and construct a second sample set corresponding to the plurality of known scenes of different types;
construct a lightweight convolutional neural network model, and perform optimization processing on the lightweight convolutional neural network model to obtain an optimized lightweight convolutional neural network model;
train the optimized lightweight convolutional neural network model according to the second sample set, and set the trained lightweight convolutional neural network model as the second scene classification model.
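The patent does not say which optimization is applied to the lightweight model; post-training int8 quantization is one common choice on mobile hardware, sketched below as an assumed example:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization: map float weights onto
    [-127, 127] with a single scale factor, shrinking storage roughly
    4x at a small accuracy cost."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale  # dequantize with q * scale
```

Other optimizations in the same spirit include channel pruning and replacing standard convolutions with depthwise-separable ones.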
In one embodiment, when extracting the second acoustic feature of the single-channel audio signal according to the second preset feature extraction strategy, and calling the pre-trained second scene classification model to perform scene classification based on the second acoustic feature to obtain the second candidate scene classification result, the second classification module 304 is configured to:
extract the per-channel energy normalization features of the single-channel audio signal, and set the per-channel energy normalization features of the single-channel audio signal as the second acoustic feature;
input the per-channel energy normalization features of the single-channel audio signal into the trained lightweight convolutional neural network model to obtain a plurality of scene classification results output by the trained lightweight convolutional neural network model and their corresponding probability values;
when the maximum probability value output by the trained lightweight convolutional neural network model reaches a preset probability value, set the scene classification result corresponding to that maximum probability value as the second candidate scene classification result.
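The "channel energy regularization feature" here reads like per-channel energy normalization (PCEN), a loudness-robust alternative to plain log-mel features. The NumPy sketch below implements the standard PCEN recipe over a (time, frequency-band) energy matrix, with commonly used parameter values, as an assumption about the intended feature:

```python
import numpy as np

def pcen(E: np.ndarray, s: float = 0.025, alpha: float = 0.98,
         delta: float = 2.0, r: float = 0.5, eps: float = 1e-6) -> np.ndarray:
    """Per-channel energy normalization of a (time, bands) energy matrix:
    each band is smoothed with a first-order IIR filter, then used as an
    adaptive gain before root compression."""
    M = np.empty_like(E)
    M[0] = E[0]
    for t in range(1, E.shape[0]):
        M[t] = (1.0 - s) * M[t - 1] + s * E[t]  # per-band smoother
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r
```

Because each band is divided by its own smoothed history, slowly varying loudness differences between recordings are normalized away, which suits ambient-scene classification.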
It should be noted that the scene recognition device provided by the embodiments of the present application belongs to the same concept as the scene recognition method in the foregoing embodiments. Any method provided in the scene recognition method embodiments can be run on the scene recognition device; the specific implementation process is detailed in the scene recognition method embodiments and is not repeated here.
In one embodiment, an electronic equipment is also provided. Please refer to Fig. 9: the electronic equipment includes a processor 401, a memory 402 and two microphones 403.
The processor 401 in the embodiment of the present application is a general-purpose processor, such as a processor of the ARM architecture.
A computer program is stored in the memory 402, which may be a high-speed random access memory or a non-volatile memory, such as at least one magnetic disk storage device, flash memory device or other non-volatile solid-state storage component. Correspondingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the computer program in the memory 402, so as to implement functions such as:
performing audio collection on a scene to be identified through the two microphones to obtain a dual-channel audio signal;
extracting a first acoustic feature of the dual-channel audio signal according to a first preset feature extraction strategy, and calling a pre-trained first scene classification model to perform scene classification based on the first acoustic feature to obtain a first candidate scene classification result;
performing audio synthesis processing on the dual-channel audio signal to obtain a single-channel audio signal;
extracting a second acoustic feature of the single-channel audio signal according to a second preset feature extraction strategy, and calling a pre-trained second scene classification model to perform scene classification based on the second acoustic feature to obtain a second candidate scene classification result;
obtaining a target scene classification result of the scene to be identified according to the first candidate scene classification result and the second candidate scene classification result.
Please refer to Figure 10, which is another structural schematic diagram of the electronic equipment provided by the embodiment of the present application. It differs from the electronic equipment shown in Fig. 9 in that the electronic equipment further includes components such as an input unit 404 and an output unit 405.
The input unit 404 may be used to receive input numbers, character information or user characteristic information (for example fingerprints), and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
The output unit 405 may be used to display information input by the user or information provided to the user, and may be, for example, a display screen.
In the embodiment of the present application, the processor 401 in the electronic equipment loads the instructions corresponding to the processes of one or more computer programs into the memory 402 according to the following steps, and runs the computer programs stored in the memory 402, thereby implementing various functions, as follows:
performing audio collection on a scene to be identified through the two microphones to obtain a dual-channel audio signal;
extracting a first acoustic feature of the dual-channel audio signal according to a first preset feature extraction strategy, and calling a pre-trained first scene classification model to perform scene classification based on the first acoustic feature to obtain a first candidate scene classification result;
performing audio synthesis processing on the dual-channel audio signal to obtain a single-channel audio signal;
extracting a second acoustic feature of the single-channel audio signal according to a second preset feature extraction strategy, and calling a pre-trained second scene classification model to perform scene classification based on the second acoustic feature to obtain a second candidate scene classification result;
obtaining a target scene classification result of the scene to be identified according to the first candidate scene classification result and the second candidate scene classification result.
In one embodiment, when performing audio synthesis processing on the dual-channel audio signal to obtain the single-channel audio signal, the processor 401 may execute:
synthesizing the dual-channel audio signal into a single-channel audio signal according to a preset beamforming algorithm.
In one embodiment, when obtaining the target scene classification result of the scene to be identified according to the first candidate scene classification result and the second candidate scene classification result, the processor 401 may execute:
judging whether the first candidate scene classification result and the second candidate scene classification result are the same scene classification result;
if so, setting the identical scene classification result as the target scene classification result.
In one embodiment, before the audio collection is performed on the scene to be identified through the two microphones, the processor 401 may execute:
obtaining dual-channel audio signals of a plurality of known scenes of different types through the two microphones;
extracting mel-frequency cepstral coefficients of the dual-channel audio signals of each type of known scene, and constructing a first sample set corresponding to the plurality of known scenes of different types;
constructing a residual convolutional neural network model, training the residual convolutional neural network model according to the first sample set, and setting the trained residual convolutional neural network model as the first scene classification model.
In one embodiment, when extracting the first acoustic feature of the dual-channel audio signal according to the first preset feature extraction strategy, and calling the pre-trained first scene classification model to perform scene classification based on the first acoustic feature to obtain the first candidate scene classification result, the processor 401 may also execute:
extracting the mel-frequency cepstral coefficients of the dual-channel audio signal and setting them as the first acoustic feature;
inputting the extracted mel-frequency cepstral coefficients of the dual-channel audio signal into the trained residual convolutional neural network model to obtain a plurality of scene classification results output by the trained residual convolutional neural network model and their corresponding probability values;
when the maximum probability value output by the trained residual convolutional neural network model reaches a preset probability value, setting the scene classification result corresponding to that maximum probability value as the first candidate scene classification result.
In one embodiment, after the dual-channel audio signals of the plurality of known scenes of different types are obtained through the two microphones, the processor 401 may also execute:
synthesizing the dual-channel audio signals of the plurality of known scenes of different types into single-channel audio signals respectively;
extracting per-channel energy normalization features of the single-channel audio signals synthesized for each type of known scene, and constructing a second sample set corresponding to the plurality of known scenes of different types;
constructing a lightweight convolutional neural network model, and performing optimization processing on the lightweight convolutional neural network model to obtain an optimized lightweight convolutional neural network model;
training the optimized lightweight convolutional neural network model according to the second sample set, and setting the trained lightweight convolutional neural network model as the second scene classification model.
In one embodiment, when extracting the second acoustic feature of the single-channel audio signal according to the second preset feature extraction strategy, and calling the pre-trained second scene classification model to perform scene classification based on the second acoustic feature to obtain the second candidate scene classification result, the processor 401 may execute:
extracting the per-channel energy normalization features of the single-channel audio signal, and setting the per-channel energy normalization features of the single-channel audio signal as the second acoustic feature;
inputting the per-channel energy normalization features of the single-channel audio signal into the trained lightweight convolutional neural network model to obtain a plurality of scene classification results output by the trained lightweight convolutional neural network model and their corresponding probability values;
when the maximum probability value output by the trained lightweight convolutional neural network model reaches a preset probability value, setting the scene classification result corresponding to that maximum probability value as the second candidate scene classification result.
It should be noted that the electronic equipment provided by the embodiments of the present application belongs to the same concept as the scene recognition method in the foregoing embodiments. Any method provided in the scene recognition method embodiments can be run on the electronic equipment; the specific implementation process is detailed in the scene recognition method embodiments and is not repeated here.
It should be noted that, for the scene recognition method of the embodiments of the present application, those of ordinary skill in the art can understand that all or part of the process of implementing the scene recognition method of the embodiments of the present application can be completed by controlling relevant hardware through a computer program. The computer program may be stored in a computer-readable storage medium, for example in the memory of the electronic equipment, and executed by the processor and a dedicated speech recognition chip of the electronic equipment; its execution may include the processes of the embodiments of the scene recognition method. The storage medium may be a magnetic disk, an optical disc, a read-only memory, a random access memory, or the like.
The scene recognition method, device, storage medium and electronic equipment provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method of the present application and its core concept. Meanwhile, those skilled in the art may make changes to the specific implementations and application scope according to the idea of the present application. In conclusion, the content of this specification should not be construed as limiting the present application.
Claims (10)
1. A scene recognition method, applied to an electronic equipment, wherein the electronic equipment includes two microphones, and the scene recognition method comprises:
performing audio collection on a scene to be identified through the two microphones to obtain a dual-channel audio signal;
extracting a first acoustic feature of the dual-channel audio signal according to a first preset feature extraction strategy, and calling a pre-trained first scene classification model to perform scene classification based on the first acoustic feature to obtain a first candidate scene classification result;
performing audio synthesis processing on the dual-channel audio signal to obtain a single-channel audio signal;
extracting a second acoustic feature of the single-channel audio signal according to a second preset feature extraction strategy, and calling a pre-trained second scene classification model to perform scene classification based on the second acoustic feature to obtain a second candidate scene classification result;
obtaining a target scene classification result of the scene to be identified according to the first candidate scene classification result and the second candidate scene classification result.
2. The scene recognition method according to claim 1, wherein the performing audio synthesis processing on the dual-channel audio signal to obtain a single-channel audio signal comprises:
synthesizing the dual-channel audio signal into a single-channel audio signal according to a preset beamforming algorithm.
3. The scene recognition method according to claim 1, wherein the obtaining a target scene classification result of the scene to be identified according to the first candidate scene classification result and the second candidate scene classification result comprises:
judging whether the first candidate scene classification result and the second candidate scene classification result are the same scene classification result;
if so, setting the identical scene classification result as the target scene classification result.
4. The scene recognition method according to any one of claims 1-3, wherein before the performing audio collection on a scene to be identified through the two microphones, the method further comprises:
obtaining dual-channel audio signals of a plurality of known scenes of different types through the two microphones;
extracting mel-frequency cepstral coefficients of the dual-channel audio signals of each type of known scene, and constructing a first sample set corresponding to the plurality of known scenes of different types;
constructing a residual convolutional neural network model, training the residual convolutional neural network model according to the first sample set, and setting the trained residual convolutional neural network model as the first scene classification model.
5. The scene recognition method according to claim 4, wherein the extracting a first acoustic feature of the dual-channel audio signal according to a first preset feature extraction strategy, and calling a pre-trained first scene classification model to perform scene classification based on the first acoustic feature to obtain a first candidate scene classification result comprises:
extracting the mel-frequency cepstral coefficients of the dual-channel audio signal and setting them as the first acoustic feature;
inputting the extracted mel-frequency cepstral coefficients of the dual-channel audio signal into the trained residual convolutional neural network model to obtain a plurality of scene classification results output by the trained residual convolutional neural network model and their corresponding probability values;
when the maximum probability value output by the trained residual convolutional neural network model reaches a preset probability value, setting the scene classification result corresponding to the maximum probability value output by the trained residual convolutional neural network model as the first candidate scene classification result.
6. The scene recognition method according to claim 4, wherein after the obtaining dual-channel audio signals of a plurality of known scenes of different types through the two microphones, the method further comprises:
synthesizing the dual-channel audio signals of the plurality of known scenes of different types into single-channel audio signals respectively;
extracting per-channel energy normalization features of the single-channel audio signals synthesized for each type of known scene, and constructing a second sample set corresponding to the plurality of known scenes of different types;
constructing a lightweight convolutional neural network model, and performing optimization processing on the lightweight convolutional neural network model to obtain an optimized lightweight convolutional neural network model;
training the optimized lightweight convolutional neural network model according to the second sample set, and setting the trained lightweight convolutional neural network model as the second scene classification model.
7. The scene recognition method according to claim 6, wherein the extracting a second acoustic feature of the single-channel audio signal according to a second preset feature extraction strategy, and calling a pre-trained second scene classification model to perform scene classification based on the second acoustic feature to obtain a second candidate scene classification result comprises:
extracting the per-channel energy normalization features of the single-channel audio signal, and setting the per-channel energy normalization features of the single-channel audio signal as the second acoustic feature;
inputting the per-channel energy normalization features of the single-channel audio signal into the trained lightweight convolutional neural network model to obtain a plurality of scene classification results output by the trained lightweight convolutional neural network model and their corresponding probability values;
when the maximum probability value output by the trained lightweight convolutional neural network model reaches the preset probability value, setting the scene classification result corresponding to the maximum probability value output by the trained lightweight convolutional neural network model as the second candidate scene classification result.
8. A scene recognition device, applied to an electronic equipment including two microphones, wherein the scene recognition device comprises:
an audio collection module, configured to perform audio collection on a scene to be identified through the two microphones to obtain a dual-channel audio signal;
a first classification module, configured to extract a first acoustic feature of the dual-channel audio signal according to a first preset feature extraction strategy, and call a pre-trained first scene classification model to perform scene classification based on the first acoustic feature to obtain a first candidate scene classification result;
an audio synthesis module, configured to perform audio synthesis processing on the dual-channel audio signal to obtain a single-channel audio signal;
a second classification module, configured to extract a second acoustic feature of the single-channel audio signal according to a second preset feature extraction strategy, and call a pre-trained second scene classification model to perform scene classification based on the second acoustic feature to obtain a second candidate scene classification result;
a classification integration module, configured to obtain a target scene classification result of the scene to be identified according to the first candidate scene classification result and the second candidate scene classification result.
9. A storage medium on which a computer program is stored, wherein when the computer program is called by a processor, the scene recognition method according to any one of claims 1 to 7 is executed.
10. An electronic equipment, comprising a processor and a memory, the memory storing a computer program, wherein the processor is configured to execute the scene recognition method according to any one of claims 1 to 7 by calling the computer program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910731749.6A CN110473568B (en) | 2019-08-08 | 2019-08-08 | Scene recognition method and device, storage medium and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110473568A true CN110473568A (en) | 2019-11-19 |
CN110473568B CN110473568B (en) | 2022-01-07 |
Family
ID=68510551
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910731749.6A Active CN110473568B (en) | 2019-08-08 | 2019-08-08 | Scene recognition method and device, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110473568B (en) |
2019 · 2019-08-08 · CN CN201910731749.6A patent/CN110473568B/en · Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101593522A (en) * | 2009-07-08 | 2009-12-02 | 清华大学 | A kind of full frequency domain digital hearing aid method and apparatus |
US20180157920A1 (en) * | 2016-12-01 | 2018-06-07 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and apparatus for recognizing obstacle of vehicle |
CN108764304A (en) * | 2018-05-11 | 2018-11-06 | Oppo广东移动通信有限公司 | scene recognition method, device, storage medium and electronic equipment |
CN108831505A (en) * | 2018-05-30 | 2018-11-16 | 百度在线网络技术(北京)有限公司 | The method and apparatus for the usage scenario applied for identification |
CN110082135A (en) * | 2019-03-14 | 2019-08-02 | 中科恒运股份有限公司 | Equipment fault recognition methods, device and terminal device |
Non-Patent Citations (1)
Title |
---|
李琪 (Li Qi): "Research on Audio Scene Recognition Methods Based on Deep Learning", Master's Thesis * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113129917A (en) * | 2020-01-15 | 2021-07-16 | 荣耀终端有限公司 | Speech processing method based on scene recognition, and apparatus, medium, and system thereof |
CN112750448A (en) * | 2020-08-07 | 2021-05-04 | 腾讯科技(深圳)有限公司 | Sound scene recognition method, device, equipment and storage medium |
CN112750448B (en) * | 2020-08-07 | 2024-01-16 | 腾讯科技(深圳)有限公司 | Sound scene recognition method, device, equipment and storage medium |
WO2021189979A1 (en) * | 2020-10-26 | 2021-09-30 | 平安科技(深圳)有限公司 | Speech enhancement method and apparatus, computer device, and storage medium |
CN112767967A (en) * | 2020-12-30 | 2021-05-07 | 深延科技(北京)有限公司 | Voice classification method and device and automatic voice classification method |
WO2022233061A1 (en) * | 2021-05-07 | 2022-11-10 | Oppo广东移动通信有限公司 | Signal processing method, communication device, and communication system |
CN114220458A (en) * | 2021-11-16 | 2022-03-22 | 武汉普惠海洋光电技术有限公司 | Sound identification method and device based on array hydrophone |
CN114220458B (en) * | 2021-11-16 | 2024-04-05 | 武汉普惠海洋光电技术有限公司 | Voice recognition method and device based on array hydrophone |
Also Published As
Publication number | Publication date |
---|---|
CN110473568B (en) | 2022-01-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110473568A (en) | Scene recognition method, device, storage medium and electronic equipment | |
US10621971B2 (en) | Method and device for extracting speech feature based on artificial intelligence | |
CN110600017B (en) | Training method of voice processing model, voice recognition method, system and device | |
CN108564963B (en) | Method and apparatus for enhancing voice | |
CN103426435B (en) | Source separation by independent component analysis with moving constraints |
CN110503971A (en) | Neural network based time-frequency mask estimation and beamforming for speech processing |
JP2019522810A (en) | Neural network based voiceprint information extraction method and apparatus | |
US20160189730A1 (en) | Speech separation method and system | |
CN110428842A (en) | Speech model training method, device, equipment and computer readable storage medium | |
CN108364662B (en) | Voice emotion recognition method and system based on paired identification tasks | |
CN110444202B (en) | Composite voice recognition method, device, equipment and computer readable storage medium | |
CN112949708B (en) | Emotion recognition method, emotion recognition device, computer equipment and storage medium | |
CN112581978A (en) | Sound event detection and positioning method, device, equipment and readable storage medium | |
CN110400571A (en) | Audio processing method, device, storage medium and electronic equipment |
CN112562648A (en) | Adaptive speech recognition method, apparatus, device and medium based on meta learning | |
Ting Yuan et al. | Frog sound identification system for frog species recognition | |
CN112185342A (en) | Voice conversion and model training method, device and system and storage medium | |
CN111312223A (en) | Training method and device of voice segmentation model and electronic equipment | |
CN110169082A (en) | Combined audio signal output |
WO2024114303A1 (en) | Phoneme recognition method and apparatus, electronic device and storage medium | |
CN117542373A (en) | Non-air conduction voice recovery system and method | |
CN111863035A (en) | Method, system and equipment for recognizing heart sound data | |
CN110136741A (en) | Single-channel speech enhancement method based on multi-scale context |
CN114333769B (en) | Speech recognition method, computer program product, computer device and storage medium | |
CN109637555A (en) | Japanese speech recognition and translation system for business meetings |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||