CN111241336A - Audio scene recognition method and device, electronic equipment and medium


Info

Publication number
CN111241336A
Authority
CN
China
Prior art keywords
audio
audio data
event detection
data
detection result
Prior art date
Legal status
Pending
Application number
CN202010015772.8A
Other languages
Chinese (zh)
Inventor
陈剑超
肖龙源
李稀敏
蔡振华
刘晓葳
Current Assignee
Xiamen Kuaishangtong Technology Co Ltd
Original Assignee
Xiamen Kuaishangtong Technology Co Ltd
Priority date: 2020-01-07
Filing date: 2020-01-07
Publication date: 2020-06-05
Application filed by Xiamen Kuaishangtong Technology Co Ltd
Priority to CN202010015772.8A
Publication of CN111241336A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 - Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/686 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title or artist information, time, location or usage information, user ratings
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 - Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/63 - Querying
    • G06F16/632 - Query formulation

Abstract

The application provides an audio scene recognition method and device, an electronic device, and a computer-readable medium. The method comprises the following steps: receiving audio data and performing audio segmentation on the audio data to form a plurality of audio segments; performing audio event detection based on the plurality of audio segments to obtain an audio event detection result; and performing scene recognition and labeling on the audio data according to the audio event detection result and a preset recognition model. This approach avoids feature extraction of the underlying audio: a background sound is first identified from the audio data, and the likely boundaries within the audio data are then determined from that background sound. Because segmentation operates directly on the sample values of the original audio data, the amount of manual annotation is reduced, the running efficiency of the related algorithms is improved, and a unified standard is provided for audio annotation.

Description

Audio scene recognition method and device, electronic equipment and medium
Technical Field
The present application relates to the field of audio recognition technologies, and in particular, to an audio scene recognition method and apparatus, an electronic device, and a computer-readable medium.
Background
Audio scene recognition refers to identifying, from a piece of audio data, the environment in which that audio was produced; in other words, it is the perception of the surrounding environment through audio information. The technology has very wide application value: when used on mobile terminal devices, it allows the device to sense its surroundings well and adjust its state automatically.
Text-based audio retrieval stores each piece of audio as an object in a database, generally labeled with an audio name (keyword) and textual information; retrieval then amounts to exact or fuzzy search over the keyword descriptions of the audio. Text-based audio retrieval is therefore carried out entirely by text retrieval techniques, and the audio content itself plays no role in the retrieval process. Most audio retrieval systems are text-based; for example, current music search engines generally retrieve audio in this way.
In real life, the sounds we encounter are extremely varied: natural sounds such as wind, rain, animal calls, and running water; everyday sounds such as the roar of machinery and automobile engines; and the various audio, speech, and synthesized sounds encountered through computers.
In traditional text-based audio retrieval, the audio in the library must be summarized and annotated manually in advance, and the retrieval results depend entirely on manually annotated information such as the audio name, number, and notes. Because every section of audio must be labeled with text, a large audio database requires a great deal of manual labor. Moreover, text labeling is highly subjective: different people may describe the same section of audio differently, so the annotations become inconsistent, and a few words can hardly express the full content of an audio recording.
Disclosure of Invention
The application aims to provide an audio scene identification method and device, electronic equipment and a computer readable medium.
A first aspect of the present application provides an audio scene recognition method, including:
receiving audio data, and performing audio segmentation on the audio data to form a plurality of audio segments;
performing audio event detection based on the plurality of audio segments to obtain an audio event detection result;
and carrying out scene recognition and labeling on the audio data according to the audio event detection result and a preset recognition model.
In some possible implementations, performing audio segmentation on the audio data to form a plurality of audio segments includes:
inputting the audio data into a preset background sound recognition model to obtain background sounds in the audio data;
extracting waveform values of the audio data to form a matrix, projecting the matrix and the background sound into a feature space, and obtaining a feature vector of the audio data and a feature vector of the background sound;
calculating a normalized distance between the feature vector of the audio data and the feature vector of the background sound;
and determining the position of a division point of the audio data according to the normalized distance, and performing audio division on the audio data according to the position of the division point to form a plurality of audio segments.
In some possible implementations, performing audio event detection based on the plurality of audio segments to obtain an audio event detection result includes:
according to a preset spectral clustering algorithm, carrying out audio event detection on the plurality of audio segments to obtain an audio event detection result of each audio segment;
the audio event detection result comprises: frequency of occurrence of audio events, total length of time, importance, and label.
In some possible implementations, performing scene recognition and labeling on the audio data according to the audio event detection result and a preset recognition model includes:
according to the audio event detection result, sorting the audio events by importance and expanding them on the event axis in descending order to obtain audio scene data;
and carrying out scene recognition and labeling on the audio scene data corresponding to the audio data through a preset recognition model.
A second aspect of the present application provides an audio scene recognition apparatus, including:
the segmentation module is used for receiving audio data and performing audio segmentation on the audio data to form a plurality of audio segments;
the detection module is used for performing audio event detection based on the plurality of audio segments to obtain an audio event detection result;
and the recognition module is used for carrying out scene recognition and labeling on the audio data according to the audio event detection result and a preset recognition model.
In some possible implementations, the segmentation module is specifically configured to:
inputting the audio data into a preset background sound recognition model to obtain background sounds in the audio data;
extracting waveform values of the audio data to form a matrix, projecting the matrix and the background sound into a feature space, and obtaining a feature vector of the audio data and a feature vector of the background sound;
calculating a normalized distance between the feature vector of the audio data and the feature vector of the background sound;
and determining the position of a division point of the audio data according to the normalized distance, and performing audio division on the audio data according to the position of the division point to form a plurality of audio segments.
In some possible implementations, the detection module is specifically configured to:
according to a preset spectral clustering algorithm, carrying out audio event detection on the plurality of audio segments to obtain an audio event detection result of each audio segment;
the audio event detection result comprises: frequency of occurrence of audio events, total length of time, importance, and label.
In some possible implementations, the identification module is specifically configured to:
according to the audio event detection result, sorting the audio events by importance and expanding them on the event axis in descending order to obtain audio scene data;
and carrying out scene recognition and labeling on the audio scene data corresponding to the audio data through a preset recognition model.
A third aspect of the present application provides an electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, performs the method of the first aspect of the application.
A fourth aspect of the present application provides a computer readable medium having computer readable instructions stored thereon which are executable by a processor to implement the method of the first aspect of the present application.
Compared with the prior art, the audio scene recognition method, device, electronic device, and medium provided by the application receive audio data and perform audio segmentation on the audio data to form a plurality of audio segments; perform audio event detection based on the plurality of audio segments to obtain an audio event detection result; and perform scene recognition and labeling on the audio data according to the audio event detection result and a preset recognition model. This approach avoids feature extraction of the underlying audio: a background sound is first identified from the audio data, and the likely boundaries within the audio data are then determined from that background sound. Because segmentation operates directly on the sample values of the original audio data, the amount of manual annotation is reduced, the running efficiency of the related algorithms is improved, and a unified standard is provided for audio annotation.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
Fig. 1 illustrates a flow chart of an audio scene recognition method provided by some embodiments of the present application;
Fig. 2 illustrates a flow chart of an audio scene recognition method provided by some embodiments of the present application;
Fig. 3 illustrates a flow diagram for spectral clustering of segmented audio segments provided by some embodiments of the present application;
Fig. 4 shows a schematic diagram of an audio scene recognition apparatus provided by some embodiments of the present application;
Fig. 5 illustrates a schematic diagram of an electronic device provided by some embodiments of the present application.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which this application belongs.
In addition, the terms "first" and "second", etc. are used to distinguish different objects rather than to describe a particular order. Furthermore, the terms "include" and "have", as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may also include other steps or elements not listed, or steps or elements inherent to such a process, method, article, or apparatus.
Audio retrieval can be viewed as a pattern matching problem. An audio retrieval system typically includes two stages: training (database generation) and pattern matching (database query).
The first step in audio retrieval is to build the database: features are extracted from the audio data; the audio data is loaded into the original-audio part of the database and the features into the feature-library part; the audio data is then clustered by its features, and the clustering information is loaded into the clustering-parameter part. Once the database is built, the audio information can be retrieved.
Audio retrieval mainly uses query-by-example over features: the user selects a sample through the query interface, sets attribute values, and submits the query. The system extracts features from the sample and combines them with the attribute values to form a query feature vector. The retrieval engine then matches this feature vector against the clustering parameter set, matches a certain amount of corresponding data in the feature library and the original audio library in descending order of correlation, and returns the results to the user through the query interface. The original audio library stores the audio data; the feature library stores the features of the audio data, one record per item; and the clustering parameter library is the parameter set obtained by clustering the audio features.
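As an illustration of this matching stage, here is a minimal sketch in Python (all names are hypothetical; the patent does not specify the similarity measure, so cosine similarity is assumed):

```python
import numpy as np

def retrieve(query_vec, feature_library, top_k=10):
    """Rank records in the feature library by correlation with the query.

    query_vec: (d,) feature vector extracted from the user's sample.
    feature_library: (n, d) matrix, one row per stored audio record.
    Returns the indices of the top_k most similar records, best first.
    """
    # Cosine similarity stands in for the "correlation" described above;
    # this is an assumption, not a detail given by the patent.
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    lib = feature_library / (
        np.linalg.norm(feature_library, axis=1, keepdims=True) + 1e-12)
    sims = lib @ q
    return np.argsort(sims)[::-1][:top_k]
```

In a full system, the returned indices would be used to fetch the corresponding records from the feature library and the original audio library.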
The embodiment of the application provides an audio scene recognition method and device, an electronic device and a computer readable medium, which are described below with reference to the accompanying drawings.
Referring to Fig. 1 and Fig. 2, which show flowcharts of an audio scene recognition method provided in some embodiments of the present application: as shown in Fig. 1, the audio scene recognition method may include the following steps:
step S101: receiving audio data, and performing audio segmentation on the audio data to form a plurality of audio segments;
specifically, the scene sound is generally composed of a structured foreground sound and an unstructured background sound, and the segmentation based on the scene change is performed based on the unstructured background sound. A background sound is first identified from the test audio and then the likely boundaries of the test audio are determined from the background sound. This algorithm avoids feature extraction of the underlying audio. The segmentation of the audio data is achieved by sampling values of the original audio data.
In this embodiment, step S101 may be specifically implemented as:
inputting the audio data into a preset background sound recognition model to obtain background sounds in the audio data;
extracting waveform values of the audio data to form a matrix, projecting the matrix and the background sound into a feature space, and obtaining a feature vector of the audio data and a feature vector of the background sound;
calculating a normalized distance between the feature vector of the audio data and the feature vector of the background sound;
and determining the position of a division point of the audio data according to the normalized distance, and performing audio division on the audio data according to the position of the division point to form a plurality of audio segments.
Specifically, in the audio segmentation process, a section of background sound in the test audio is first found by a simple background sound recognition algorithm; the waveform values of the test audio are then extracted to form a matrix, and this matrix and the background sound feature vector are projected into a feature space; finally, the normalized distance between the test audio feature vector and the background sound feature vector is computed to determine the positions of the segmentation points, thereby realizing audio segmentation.
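The following is a minimal sketch of this segmentation step, assuming fixed-length frames and a simple distance threshold (the frame length, threshold value, and exact normalization are illustrative choices, not given by the patent):

```python
import numpy as np

def segment_audio(samples, bg_frame, frame_len=1024, threshold=0.5):
    """Split raw audio at points where frames deviate from the background.

    samples: 1-D array of raw waveform sample values.
    bg_frame: a representative background-sound frame of length frame_len,
              standing in for the background feature vector.
    Returns a list of (start, end) sample indices, one pair per segment.
    """
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)

    # Normalized distance between each frame and the background sound.
    diff = np.linalg.norm(frames - bg_frame, axis=1)
    norm = np.linalg.norm(frames, axis=1) + np.linalg.norm(bg_frame) + 1e-12
    dist = diff / norm

    # Place a segmentation point wherever the distance crosses the threshold.
    above = dist > threshold
    boundaries = np.flatnonzero(np.diff(above.astype(int))) + 1

    edges = [0, *[int(b) * frame_len for b in boundaries], n_frames * frame_len]
    return list(zip(edges[:-1], edges[1:]))
```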
Step S102: performing audio event detection based on the plurality of audio segments to obtain an audio event detection result;
in this embodiment, step S102 may be specifically implemented as:
according to a preset spectral clustering algorithm, carrying out audio event detection on the plurality of audio segments to obtain an audio event detection result of each audio segment;
the audio event detection result comprises: frequency of occurrence of audio events, total length of time, importance, and label.
Specifically, in audio event detection, similar segmented segments of the test audio are clustered together by a spectral clustering algorithm; each resulting cluster is called an audio event. An audio event has four attributes: frequency of occurrence, total length of time, importance, and label. Frequency of occurrence and total length of time are relatively easy to calculate, and the importance of an audio event is computed by combining these two attributes with the average length of the audio segments in the event. To label an audio event, singular value decomposition (SVD) is applied to compute its dominant feature vectors (DFVs); the similarity between the test audio event and each training audio event is then computed from the DFVs, and the label of the training event with the largest similarity value becomes the label of the test event, as shown in Fig. 2.
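The attribute computation might look as follows; the patent states only that frequency, total duration, and average segment length are combined into the importance score, so the product used here is an assumed combination:

```python
import numpy as np

def event_attributes(segment_lengths):
    """Compute frequency of occurrence, total duration, and importance
    for one audio event (one cluster of similar segments).

    segment_lengths: durations, in seconds, of the segments in the event.
    """
    freq = len(segment_lengths)                  # number of occurrences
    total_time = float(np.sum(segment_lengths))  # total length of time
    avg_len = total_time / freq                  # average segment length
    # Assumed combination; the patent does not give the exact formula.
    importance = freq * total_time * avg_len
    return freq, total_time, importance
```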
Spectral clustering algorithm: spectral clustering is applied twice. The first pass finds the dominant feature vectors of the segmented audio segments; the mean of these dominant feature vectors is then used as the feature vector representing each segment, and the second pass clusters those vectors, which greatly improves the clustering result. Fig. 3 shows the flow of spectral clustering over the segmented audio segments: features are extracted from the segments S1 to SN to give feature vectors x_ij; a first spectral clustering yields intermediate clusters (C_i1 and the like) with their feature vectors; the dominant feature vectors V1 to VN are then computed; and a second spectral clustering produces the final clusters C1 to Ck.
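A sketch of the two-pass procedure using scikit-learn's SpectralClustering follows; the cluster counts and the way each segment is summarized after the first pass are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def two_pass_clustering(segment_frames, n_inner=4, n_events=5):
    """segment_frames: list of (n_frames_i, d) arrays, one per audio
    segment S1..SN. Returns an event label for each segment."""
    segment_vecs = []
    for frames in segment_frames:
        # Pass 1: cluster the frames within one segment, then summarize
        # the segment by the mean vector of its largest cluster (an
        # approximation of "mean of the dominant feature vectors").
        labels = SpectralClustering(n_clusters=n_inner,
                                    random_state=0).fit_predict(frames)
        biggest = np.bincount(labels).argmax()
        segment_vecs.append(frames[labels == biggest].mean(axis=0))

    # Pass 2: cluster the per-segment vectors; each cluster of segments
    # forms one audio event.
    return SpectralClustering(n_clusters=n_events,
                              random_state=0).fit_predict(np.vstack(segment_vecs))
```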
Dominant feature vector representation: an audio event usually comprises several audio segments. After the first spectral clustering pass, each audio segment can be represented by a feature vector, and the audio event is represented by the feature matrix formed from the feature vectors of its segments. To capture the prominent features of the audio event, it is represented by its dominant feature vectors (DFVs), which are computed by applying SVD over the feature space of the audio event.
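The DFV computation and the similarity-based labeling described above can be sketched as follows; taking the leading singular vector as the DFV and using cosine similarity are assumptions consistent with, but not spelled out by, the description:

```python
import numpy as np

def dominant_feature_vector(feature_matrix):
    """feature_matrix: (n_segments, d) matrix of one audio event.
    Returns the singular vector associated with the largest singular
    value, used here as the event's dominant feature vector (DFV)."""
    _, _, vt = np.linalg.svd(feature_matrix, full_matrices=False)
    return vt[0]

def assign_label(test_event_matrix, training_events):
    """training_events: list of (label, feature_matrix) pairs.
    Returns the label of the most similar training audio event."""
    v = dominant_feature_vector(test_event_matrix)
    best_label, best_sim = None, -np.inf
    for label, mat in training_events:
        w = dominant_feature_vector(mat)
        # abs() because singular vectors are defined only up to sign;
        # singular vectors are unit norm, so the dot product is cosine.
        sim = abs(float(v @ w))
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label
```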
Step S103: and carrying out scene recognition and labeling on the audio data according to the audio event detection result and a preset recognition model.
The preset recognition model may be an HMM (hidden Markov model).
In this embodiment, step S103 may be specifically implemented as:
according to the audio event detection result, sorting the audio events by importance and expanding them on the event axis in descending order to obtain audio scene data;
and carrying out scene recognition and labeling on the audio scene data corresponding to the audio data through a preset recognition model.
Specifically, in audio scene recognition, mislabeled audio segments within audio events are first corrected by a context model. The main idea is to use the order and frequency of occurrence among audio events to construct an audio event occurrence probability matrix and to compute the probability of two audio events occurring together. For a given audio scene training set, the occurrence probability relations of the audio events in a particular scene can be computed; a likelihood function is usually formed and then maximized, yielding the optimal audio event occurrence probability matrix over the training set. The mislabeled audio segments within the audio events are then corrected by a greedy search algorithm. Next, the audio events are sorted by importance and expanded on the event axis in descending order to obtain audio scene data; finally, recognition and classification of the audio scene are achieved through an HMM algorithm.
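A compact sketch of this final stage appears below: the event occurrence probability matrix is estimated from training sequences of event labels, and each candidate scene's HMM is scored with the forward algorithm. The discrete-observation formulation, the smoothing, and all parameter shapes are illustrative assumptions:

```python
import numpy as np

def occurrence_matrix(event_sequences, n_events):
    """Estimate the audio event occurrence probability matrix from
    training sequences of event ids, with add-one smoothing."""
    A = np.ones((n_events, n_events))
    for seq in event_sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            A[a, b] += 1
    return A / A.sum(axis=1, keepdims=True)

def log_likelihood(seq, pi, A, B):
    """Scaled forward algorithm for a discrete HMM.
    pi: (k,) initial state probabilities; A: (k, k) state transitions;
    B: (k, m) per-state emission probabilities over m event labels."""
    alpha = pi * B[:, seq[0]]
    ll = np.log(alpha.sum())
    alpha /= alpha.sum()
    for obs in seq[1:]:
        alpha = (alpha @ A) * B[:, obs]
        ll += np.log(alpha.sum())
        alpha /= alpha.sum()
    return ll

def recognize_scene(seq, scene_models):
    """scene_models: dict mapping scene name -> (pi, A, B).
    Returns the scene whose HMM best explains the event sequence."""
    return max(scene_models, key=lambda s: log_likelihood(seq, *scene_models[s]))
```

Scoring every scene's HMM and taking the argmax realizes the "recognition and classification" step; the greedy relabeling correction is omitted here for brevity.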
The audio scene recognition method can be applied to a client. In the embodiments of the present application, the client may include hardware or software. When the client includes hardware, it may be any of various electronic devices having a display screen and supporting information interaction, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like. When the client includes software, it may be installed in such an electronic device and implemented as multiple pieces of software or software modules, or as a single piece of software or a single module; no specific limitation is imposed here.
Compared with the prior art, the audio scene recognition method provided by the embodiments of the application receives audio data and performs audio segmentation on it to form a plurality of audio segments; performs audio event detection based on the plurality of audio segments to obtain an audio event detection result; and performs scene recognition and labeling on the audio data according to the audio event detection result and a preset recognition model. This approach avoids feature extraction of the underlying audio: a background sound is first identified from the audio data, and the likely boundaries within the audio data are then determined from that background sound. Because segmentation operates directly on the sample values of the original audio data, the amount of manual annotation is reduced, the running efficiency of the related algorithms is improved, and a unified standard is provided for audio annotation.
The foregoing embodiments provide an audio scene recognition method; correspondingly, the present application also provides an audio scene recognition apparatus. The apparatus provided by the embodiments of the application can implement the audio scene recognition method above and can be realized through software, hardware, or a combination of the two. For example, the apparatus may comprise integrated or separate functional modules or units to perform the corresponding steps of the above method. Please refer to Fig. 4, which illustrates a schematic diagram of an audio scene recognition apparatus provided by some embodiments of the present application. Since the apparatus embodiments are substantially similar to the method embodiments, they are described relatively simply; for relevant points, refer to the description of the method embodiments. The apparatus embodiments described below are merely illustrative.
As shown in Fig. 4, the audio scene recognition apparatus 10 may include:
the segmentation module 101 is configured to receive audio data, and perform audio segmentation on the audio data to form a plurality of audio segments;
the detection module 102 is configured to perform audio event detection based on the plurality of audio segments to obtain an audio event detection result;
and the identification module 103 is configured to perform scene identification and labeling on the audio data according to the audio event detection result and a preset identification model.
In some possible implementations, the segmentation module 101 is specifically configured to:
inputting the audio data into a preset background sound recognition model to obtain background sounds in the audio data;
extracting waveform values of the audio data to form a matrix, projecting the matrix and the background sound into a feature space, and obtaining a feature vector of the audio data and a feature vector of the background sound;
calculating a normalized distance between the feature vector of the audio data and the feature vector of the background sound;
and determining the position of a division point of the audio data according to the normalized distance, and performing audio division on the audio data according to the position of the division point to form a plurality of audio segments.
In some possible implementations, the detection module 102 is specifically configured to:
according to a preset spectral clustering algorithm, carrying out audio event detection on the plurality of audio segments to obtain an audio event detection result of each audio segment;
the audio event detection result comprises: frequency of occurrence of audio events, total length of time, importance, and label.
In some possible implementations, the identification module 103 is specifically configured to:
according to the audio event detection result, sorting the audio events by importance and expanding them on the event axis in descending order to obtain audio scene data;
and carrying out scene recognition and labeling on the audio scene data corresponding to the audio data through a preset recognition model.
The audio scene recognition device 10 provided in the embodiment of the present application and the audio scene recognition method provided in the foregoing embodiment of the present application have the same beneficial effects and the same inventive concepts.
The embodiment of the present application further provides an electronic device corresponding to the audio scene recognition method provided by the foregoing embodiment, where the electronic device may be an electronic device for a client, such as a mobile phone, a notebook computer, a tablet computer, a desktop computer, and the like, so as to execute the audio scene recognition method.
Please refer to Fig. 5, which illustrates a schematic diagram of an electronic device provided by some embodiments of the present application. As shown in Fig. 5, the electronic device 20 includes: a processor 200, a memory 201, a bus 202, and a communication interface 203, where the processor 200, the communication interface 203, and the memory 201 are connected through the bus 202. The memory 201 stores a computer program executable on the processor 200, and when the processor 200 runs the computer program, it performs the audio scene recognition method provided by any of the foregoing embodiments.
The memory 201 may include high-speed random access memory (RAM) and may further include non-volatile memory, such as at least one disk memory. The communication connection between this system's network element and at least one other network element is realized through at least one communication interface 203 (which may be wired or wireless), and may use the Internet, a wide area network, a local area network, a metropolitan area network, and the like.
Bus 202 may be an ISA bus, a PCI bus, an EISA bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. The memory 201 stores a program; upon receiving an execution instruction, the processor 200 executes it. The audio scene recognition method disclosed in any embodiment of the present application may be applied in, or implemented by, the processor 200.
The processor 200 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 200.
The electronic device provided by the embodiments of the application shares the same inventive concept as the audio scene recognition method provided above and has the same beneficial effects as the method it adopts, runs, or implements.
The present application further provides a computer-readable medium corresponding to the audio scene recognition method provided in the foregoing embodiments, on which a computer program (i.e., a program product) is stored; when executed by a processor, the computer program performs the audio scene recognition method provided in any of the foregoing embodiments.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present disclosure and should be covered by the claims and the specification.

Claims (10)

1. An audio scene recognition method, comprising:
receiving audio data, and performing audio segmentation on the audio data to form a plurality of audio segments;
performing audio event detection based on the plurality of audio segments to obtain an audio event detection result;
and carrying out scene recognition and labeling on the audio data according to the audio event detection result and a preset recognition model.
2. The method of claim 1, wherein performing audio segmentation on the audio data to form a plurality of audio segments comprises:
inputting the audio data into a preset background sound recognition model to obtain background sounds in the audio data;
extracting waveform values of the audio data to form a matrix, projecting the matrix and the background sound into a feature space, and obtaining a feature vector of the audio data and a feature vector of the background sound;
calculating a normalized distance between the feature vector of the audio data and the feature vector of the background sound;
and determining the position of a division point of the audio data according to the normalized distance, and performing audio division on the audio data according to the position of the division point to form a plurality of audio segments.
3. The method of claim 2, wherein the performing audio event detection based on the plurality of audio segments to obtain an audio event detection result comprises:
according to a preset spectral clustering algorithm, carrying out audio event detection on the plurality of audio segments to obtain an audio event detection result of each audio segment;
the audio event detection result comprises: frequency of occurrence of audio events, total length of time, importance, and label.
4. The method of claim 3, wherein the performing scene recognition and labeling on the audio data according to the audio event detection result and a preset recognition model comprises:
according to the audio event detection result, sorting the audio events by importance and expanding them on the event axis in descending order to obtain audio scene data;
and carrying out scene recognition and labeling on the audio scene data corresponding to the audio data through a preset recognition model.
5. An audio scene recognition apparatus, comprising:
the segmentation module is used for receiving audio data and performing audio segmentation on the audio data to form a plurality of audio segments;
the detection module is used for performing audio event detection based on the plurality of audio segments to obtain an audio event detection result;
and the recognition module is used for carrying out scene recognition and labeling on the audio data according to the audio event detection result and a preset recognition model.
6. The apparatus of claim 5, wherein the segmentation module is specifically configured to:
inputting the audio data into a preset background sound recognition model to obtain background sounds in the audio data;
extracting waveform values of the audio data to form a matrix, projecting the matrix and the background sound into a feature space, and obtaining a feature vector of the audio data and a feature vector of the background sound;
calculating a normalized distance between the feature vector of the audio data and the feature vector of the background sound;
and determining the position of a division point of the audio data according to the normalized distance, and performing audio division on the audio data according to the position of the division point to form a plurality of audio segments.
7. The apparatus according to claim 6, wherein the detection module is specifically configured to:
according to a preset spectral clustering algorithm, carrying out audio event detection on the plurality of audio segments to obtain an audio event detection result of each audio segment;
the audio event detection result comprises: frequency of occurrence of audio events, total length of time, importance, and label.
8. The apparatus according to claim 7, wherein the identification module is specifically configured to:
according to the audio event detection result, sorting the audio events by importance and expanding them on the event axis in descending order to obtain audio scene data;
and carrying out scene recognition and labeling on the audio scene data corresponding to the audio data through a preset recognition model.
9. An electronic device, comprising: memory, processor and computer program stored on the memory and executable on the processor, characterized in that the processor executes the computer program to implement the method according to any of claims 1 to 4.
10. A computer readable medium having computer readable instructions stored thereon which are executable by a processor to implement the method of any one of claims 1 to 4.

Priority Applications (1)

Application Number: CN202010015772.8A
Priority Date: 2020-01-07
Filing Date: 2020-01-07
Title: Audio scene recognition method and device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number: CN202010015772.8A
Priority Date: 2020-01-07
Filing Date: 2020-01-07
Title: Audio scene recognition method and device, electronic equipment and medium

Publications (1)

Publication Number: CN111241336A
Publication Date: 2020-06-05

Family

ID=70870342

Family Applications (1)

Application Number: CN202010015772.8A
Title: Audio scene recognition method and device, electronic equipment and medium

Country Status (1)

CN: CN111241336A



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101477798A (en) * 2009-02-17 2009-07-08 北京邮电大学 Method for analyzing and extracting audio data of set scene
CN103226948A (en) * 2013-04-22 2013-07-31 山东师范大学 Audio scene recognition method based on acoustic events
CN104167211A (en) * 2014-08-08 2014-11-26 南京大学 Multi-source scene sound abstracting method based on hierarchical event detection and context model
US20190115045A1 (en) * 2017-10-12 2019-04-18 Qualcomm Incorporated Audio activity tracking and summaries

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王公友: "Content-based audio analysis and scene recognition" (基于内容的音频分析与场景识别), China Masters' Theses Full-text Database (Information Science and Technology series) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112015942A (en) * 2020-08-28 2020-12-01 上海掌门科技有限公司 Audio processing method and device
CN113645439A (en) * 2021-06-22 2021-11-12 宿迁硅基智能科技有限公司 Event detection method and system, storage medium and electronic device
CN113645439B (en) * 2021-06-22 2022-07-29 宿迁硅基智能科技有限公司 Event detection method and system, storage medium and electronic device


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication

Application publication date: 20200605