CN115995014A - Speaker unit detection method, audio detection method, and related apparatus - Google Patents

Speaker unit detection method, audio detection method, and related apparatus

Info

Publication number
CN115995014A
Authority
CN
China
Prior art keywords
spectrum, image, audio, sub, spectrum image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111210712.2A
Other languages
Chinese (zh)
Inventor
杨伟明
唐惠忠
王少鸣
郭润增
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202111210712.2A priority Critical patent/CN115995014A/en
Publication of CN115995014A publication Critical patent/CN115995014A/en
Pending legal-status Critical Current

Landscapes

  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The application discloses a speaker unit detection method based on artificial intelligence, comprising the following steps: acquiring to-be-tested audio data corresponding to a speaker unit under test, wherein the to-be-tested audio data is audio data played by the speaker unit under test; acquiring N spectrum images according to the to-be-tested audio data; performing feature extraction on each of the N spectrum images to obtain target audio features corresponding to each spectrum image; acquiring, through an audio detection model, a target class label corresponding to each spectrum image based on the target audio features corresponding to that spectrum image; and determining the detection result of the speaker unit under test according to the target class label corresponding to each spectrum image. The application also provides an audio detection method and a related apparatus. The method adapts to different detection environments and can detect speaker units in batches, thereby reducing detection cost, improving detection efficiency, and lowering the difficulty of speaker unit detection.

Description

Speaker unit detection method, audio detection method, and related apparatus
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a speaker unit detection method, an audio detection method, and a related apparatus.
Background
In the world of sound, the speaker is a very important component because it determines the quality and timbre of most emitted sound. A loudspeaker typically comprises a speaker unit, a modern electroacoustic element that converts an electrical signal into sound. According to how the unit produces sound, speaker units can be divided into moving-coil, inductive, electrostatic, planar, ribbon, horn, and other types.
Some devices on the market (for example, face-scanning payment devices) ship with defective speaker units. To check speaker unit quality, the common practice in the prior art is to sample a fixed number of speaker units from each batch and send them to an audio laboratory for inspection.
However, the inventors found that the prior art has at least the following problems: on the one hand, the equipment cost of an audio laboratory is high and the test environment is complex; on the other hand, the test cycle is long and testers must spend considerable time on the inspection, making large-scale batch verification difficult. The cost and difficulty of the existing detection method are therefore high.
Disclosure of Invention
The embodiments of the application provide a speaker unit detection method, an audio detection method, and a related apparatus. The method adapts to different detection environments and can detect speaker units in batches, thereby reducing detection cost, improving detection efficiency, and lowering the difficulty of speaker unit detection.
In view of this, the application provides a speaker unit detection method, which includes:
acquiring to-be-tested audio data corresponding to a speaker unit under test, wherein the to-be-tested audio data is audio data played by the speaker unit under test;
acquiring N spectrum images according to the to-be-tested audio data, wherein N is an integer greater than or equal to 1;
performing feature extraction on each of the N spectrum images to obtain target audio features corresponding to each spectrum image, wherein the target audio features comprise at least one type of audio feature;
acquiring, through an audio detection model, a target class label corresponding to each spectrum image based on the target audio features corresponding to that spectrum image;
and determining the detection result of the speaker unit under test according to the target class label corresponding to each spectrum image.
Another aspect of the application provides an audio detection method, including:
acquiring to-be-tested audio data;
acquiring N spectrum images according to the to-be-tested audio data, wherein N is an integer greater than or equal to 1;
performing feature extraction on each of the N spectrum images to obtain target audio features corresponding to each spectrum image, wherein the target audio features comprise at least one type of audio feature;
acquiring, through an audio detection model, a target class label corresponding to each spectrum image based on the target audio features corresponding to that spectrum image;
and determining the detection result of the to-be-tested audio data according to the target class label corresponding to each spectrum image.
Another aspect of the application provides a speaker unit detection apparatus, comprising:
an acquisition module, configured to acquire to-be-tested audio data corresponding to a speaker unit under test, wherein the to-be-tested audio data is audio data played by the speaker unit under test;
the acquisition module is further configured to acquire N spectrum images according to the to-be-tested audio data, wherein N is an integer greater than or equal to 1;
an extraction module, configured to perform feature extraction on each of the N spectrum images to obtain target audio features corresponding to each spectrum image, wherein the target audio features comprise at least one type of audio feature;
the acquisition module is further configured to acquire, through an audio detection model, a target class label corresponding to each spectrum image based on the target audio features corresponding to that spectrum image;
and a determining module, configured to determine the detection result of the speaker unit under test according to the target class label corresponding to each spectrum image.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
the acquisition module is specifically configured to convert the to-be-tested audio data into an audio spectrogram;
and divide the audio spectrogram according to a first preset duration to obtain the N spectrum images;
the extraction module is specifically configured to divide each of the N spectrum images according to a second preset duration to obtain M spectrum sub-images of each spectrum image, wherein the second preset duration is shorter than the first preset duration and M is an integer greater than 1;
for each spectrum sub-image of each spectrum image, acquire to-be-processed unit audio features of the spectrum sub-image;
and, for each spectrum sub-image of each spectrum image, normalize the to-be-processed unit audio features of the spectrum sub-image to obtain the unit audio features of the spectrum sub-image, wherein the unit audio features are contained in the target audio features.
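As an illustrative sketch of the splitting and normalization steps above: the function below divides a spectrogram into M sub-images along the time axis and z-score normalizes each one. The use of `numpy.array_split` and z-score normalization are assumptions made for illustration; the patent specifies only that each spectrum image is divided by a second preset duration and that the extracted features are normalized.

```python
import numpy as np

def split_and_normalize(spectrogram: np.ndarray, m: int) -> list:
    """Split a spectrogram (freq_bins x time_frames) into M sub-images
    along the time axis and z-score normalize each one.

    Both the split strategy and the normalization scheme are assumptions;
    the patent does not fix either.
    """
    sub_images = np.array_split(spectrogram, m, axis=1)  # split along time
    normalized = []
    for sub in sub_images:
        mu, sigma = sub.mean(), sub.std()
        normalized.append((sub - mu) / (sigma + 1e-8))   # avoid divide-by-zero
    return normalized
```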
In one possible design, in another implementation of another aspect of the embodiments of the present application,
the extraction module is specifically configured to extract to-be-processed Mel spectrum features of the spectrum sub-image using a first fast Fourier transform size;
extract to-be-processed Mel-frequency cepstral coefficient (MFCC) features of the spectrum sub-image using a second fast Fourier transform size;
extract to-be-processed zero-crossing rate features of the spectrum sub-image using a third fast Fourier transform size;
extract to-be-processed spectral flatness features of the spectrum sub-image using a fourth fast Fourier transform size;
and extract to-be-processed spectral centroid features of the spectrum sub-image using a fifth fast Fourier transform size.
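Three of the five features above (zero-crossing rate, spectral flatness, spectral centroid) have simple closed forms and can be sketched directly in NumPy with a configurable frame/FFT size; the Mel spectrum and MFCC features are more involved and are commonly computed with a library such as librosa. The framing parameters and formulas below are standard textbook definitions, not values taken from the patent.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice a 1-D signal into overlapping frames (no padding)."""
    n = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def zero_crossing_rate(x, frame_len=1024, hop=512):
    """Fraction of adjacent sample pairs per frame whose sign changes."""
    frames = frame_signal(x, frame_len, hop)
    return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

def spectral_centroid(x, n_fft=2048, hop=512):
    """Magnitude-weighted mean of normalized frequency, per frame."""
    mag = np.abs(np.fft.rfft(frame_signal(x, n_fft, hop), axis=1))
    freqs = np.fft.rfftfreq(n_fft)  # normalized frequencies in [0, 0.5]
    return (mag * freqs).sum(axis=1) / (mag.sum(axis=1) + 1e-10)

def spectral_flatness(x, n_fft=2048, hop=512):
    """Geometric mean / arithmetic mean of the power spectrum, per frame."""
    power = np.abs(np.fft.rfft(frame_signal(x, n_fft, hop), axis=1)) ** 2 + 1e-10
    return np.exp(np.log(power).mean(axis=1)) / power.mean(axis=1)
```

A pure tone yields low flatness (energy in one bin) while white noise yields flatness near 1, which is why flatness is useful for spotting distorted or noisy speaker output.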
In one possible design, in another implementation of another aspect of the embodiments of the application, the speaker unit detection apparatus further includes an encoding module;
the acquisition module is further configured to acquire format data corresponding to the to-be-tested audio data, wherein the format data comprises bit depth and sampling rate;
the encoding module is configured to encode the format data to obtain a bit depth feature and a sampling rate feature;
the acquisition module is specifically configured to acquire, through the audio detection model, the target class label corresponding to each spectrum image based on the target audio features, the bit depth feature, and the sampling rate feature corresponding to each spectrum image.
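A minimal sketch of how the format data might be encoded into a bit depth feature and a sampling rate feature. One-hot encoding and the vocabularies `BIT_DEPTHS`/`SAMPLE_RATES` are assumptions chosen for illustration; the patent does not specify the encoding scheme.

```python
# Hypothetical vocabularies of common values; not taken from the patent.
BIT_DEPTHS = [8, 16, 24, 32]
SAMPLE_RATES = [8000, 16000, 22050, 44100, 48000]

def one_hot(value, vocab):
    """Encode a categorical value as a one-hot vector over a fixed vocabulary."""
    vec = [0.0] * len(vocab)
    vec[vocab.index(value)] = 1.0
    return vec

def encode_format(bit_depth, sample_rate):
    """Return (bit depth feature, sampling rate feature) for the model input."""
    return one_hot(bit_depth, BIT_DEPTHS), one_hot(sample_rate, SAMPLE_RATES)
```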
In one possible design, in another implementation of another aspect of the embodiments of the application, each spectrum image includes M spectrum sub-images, the target audio features include M unit audio features, the unit audio features correspond to the spectrum sub-images, and M is an integer greater than 1;
the acquisition module is specifically configured to, for each spectrum image, acquire a first feature map of each spectrum sub-image through a convolutional network included in the audio detection model, based on the unit audio features of that spectrum sub-image;
for each spectrum image, acquire a second feature map of each spectrum sub-image through an activation network included in the audio detection model, based on the first feature map of that spectrum sub-image;
for each spectrum image, acquire a feature vector of each spectrum sub-image through a time-sequence network included in the audio detection model, based on the second feature map of that spectrum sub-image;
for each spectrum image, acquire a class probability distribution of each spectrum sub-image through a fully connected layer included in the audio detection model, based on the feature vector of that spectrum sub-image;
and, for each spectrum image, determine the target class label corresponding to the spectrum image according to the class probability distribution of each of its spectrum sub-images.
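The per-sub-image forward pass described above (convolutional network, activation network, time-sequence network, fully connected layer, class probability distribution) can be sketched as follows. This is a heavily simplified stand-in: a 1-D convolution plays the role of the convolutional network, ReLU the activation network, and mean pooling substitutes for the time-sequence (e.g., GRU) network; all weights are placeholders rather than the patent's trained parameters.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify_sub_image(unit_feature, conv_kernel, fc_weight, fc_bias):
    """Simplified forward pass for one spectrum sub-image (placeholder weights)."""
    fmap1 = np.convolve(unit_feature, conv_kernel, mode="valid")  # convolutional network -> first feature map
    fmap2 = relu(fmap1)                                           # activation network -> second feature map
    vec = fmap2.mean(keepdims=True)                               # stand-in for the time-sequence (GRU) network
    logits = fc_weight @ vec + fc_bias                            # fully connected layer
    return softmax(logits)                                        # class probability distribution
```

In the patent's scheme, one such distribution is produced per spectrum sub-image and the M distributions of a spectrum image are then combined into the image's target class label.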
In one possible design, in another implementation of another aspect of the embodiments of the application, each spectrum image includes M spectrum sub-images, the target audio features include M unit audio features, the unit audio features correspond to the spectrum sub-images, and M is an integer greater than 1;
the acquisition module is specifically configured to, for each spectrum image, acquire a first feature map of each spectrum sub-image through a convolutional network included in the audio detection model, based on the unit audio features of that spectrum sub-image;
for each spectrum image, acquire a second feature map of each spectrum sub-image through an activation network included in the audio detection model, based on the first feature map of that spectrum sub-image;
for each spectrum image, acquire a class probability distribution of each spectrum sub-image through a fully connected layer included in the audio detection model, based on the second feature map of that spectrum sub-image;
and, for each spectrum image, determine the target class label corresponding to the spectrum image according to the class probability distribution of each of its spectrum sub-images.
In one possible design, in another implementation of another aspect of the embodiments of the application, the speaker unit detection apparatus further includes a processing module and a training module;
the acquisition module is further configured to acquire an audio data sample, wherein the audio data sample corresponds to one annotated class label;
the acquisition module is further configured to acquire P spectrum image samples according to the audio data sample, wherein P is an integer greater than or equal to 1;
the acquisition module is further configured to divide each of the P spectrum image samples according to a second preset duration to obtain M spectrum sub-image samples corresponding to each spectrum image sample;
the acquisition module is further configured to acquire to-be-processed unit audio features of each spectrum sub-image sample;
the processing module is configured to, for each spectrum sub-image sample of each spectrum image sample, normalize the to-be-processed unit audio features of the spectrum sub-image sample to obtain the unit audio features of the spectrum sub-image sample;
the acquisition module is further configured to acquire, through the audio detection model, a class probability distribution of each spectrum sub-image sample based on the unit audio features of the M spectrum sub-image samples;
and the training module is configured to, for each spectrum image sample, update the model parameters of the audio detection model according to the annotated class label and the class probability distribution of each spectrum sub-image sample.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
the acquisition module is specifically configured to, for each spectrum image, determine the classification label of each spectrum sub-image according to the class probability distribution of that spectrum sub-image;
for each spectrum image, determine the number of first normal labels according to the classification labels of its spectrum sub-images;
for each spectrum image, determine a first normal-label proportion according to the number of first normal labels;
for each spectrum image, if the first normal-label proportion is greater than or equal to a first proportion threshold, determine that the target class label corresponding to the spectrum image is a second normal label;
and, for each spectrum image, if the first normal-label proportion is less than the first proportion threshold, determine that the target class label corresponding to the spectrum image is a second abnormal label.
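The proportion-threshold voting above can be sketched as a short helper; the 0.8 threshold is an illustrative value, not one specified in the patent.

```python
def image_label_from_sub_labels(sub_labels, ratio_threshold=0.8):
    """Derive a spectrum image's target class label from its sub-image labels.

    sub_labels: list of "normal"/"abnormal" strings, one per sub-image.
    ratio_threshold: illustrative placeholder for the first proportion threshold.
    """
    normal_ratio = sub_labels.count("normal") / len(sub_labels)
    return "normal" if normal_ratio >= ratio_threshold else "abnormal"
```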
In one possible design, in another implementation of another aspect of the embodiments of the present application,
the acquisition module is specifically configured to, for each spectrum image, determine the classification label of each spectrum sub-image according to the class probability distribution of that spectrum sub-image;
for each spectrum image, determine the number of first normal labels according to the classification labels of its spectrum sub-images;
for each spectrum image, determine a first normal-label proportion according to the number of first normal labels;
for each spectrum image, determine the average confidence of the first normal label according to the class probability distributions of its spectrum sub-images;
if the normal-label proportion is greater than or equal to the first proportion threshold and the average confidence is greater than or equal to a confidence threshold, determine that the target class label corresponding to the spectrum image is a second normal label;
if the normal-label proportion is greater than or equal to the first proportion threshold and the average confidence is less than the confidence threshold, determine that the target class label corresponding to the spectrum image is a second abnormal label;
if the normal-label proportion is less than the first proportion threshold and the average confidence is greater than or equal to the confidence threshold, determine that the target class label corresponding to the spectrum image is a second abnormal label;
and, if the normal-label proportion is less than the first proportion threshold and the average confidence is less than the confidence threshold, determine that the target class label corresponding to the spectrum image is a second abnormal label.
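The four branches above reduce to: the spectrum image is labeled normal only when both the normal-label proportion and the average normal confidence clear their thresholds; otherwise it is abnormal. A sketch, with illustrative threshold values and the assumption that the average confidence is taken over all sub-images:

```python
def image_label_with_confidence(sub_probs, ratio_threshold=0.8, conf_threshold=0.9):
    """sub_probs: list of (p_normal, p_abnormal) tuples, one per sub-image.

    Both thresholds are illustrative placeholders; averaging p_normal over
    all sub-images is an assumption, as the patent does not pin this down.
    """
    labels = ["normal" if p[0] >= p[1] else "abnormal" for p in sub_probs]
    normal_ratio = labels.count("normal") / len(labels)
    avg_conf = sum(p[0] for p in sub_probs) / len(sub_probs)
    if normal_ratio >= ratio_threshold and avg_conf >= conf_threshold:
        return "normal"
    return "abnormal"  # all three remaining branches collapse to abnormal
```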
In one possible design, in another implementation of another aspect of the embodiments of the present application,
the acquisition module is specifically configured to acquire a first feature map of each spectrum image through a convolutional network included in the audio detection model, based on the target audio features corresponding to that spectrum image;
acquire a second feature map of each spectrum image through an activation network included in the audio detection model, based on the first feature map of that spectrum image;
and acquire the target class label corresponding to each spectrum image through a fully connected layer included in the audio detection model, based on the second feature map of that spectrum image.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
the determining module is specifically configured to determine the number of second normal labels according to the target class labels corresponding to the spectrum images;
determine a second normal-label proportion according to the number of second normal labels;
if the second normal-label proportion is greater than or equal to a second proportion threshold, determine that the detection result of the speaker unit under test is a normal result;
and, if the second normal-label proportion is less than the second proportion threshold, determine that the detection result of the speaker unit under test is an abnormal result.
Another aspect of the application provides an audio detection apparatus, including:
an acquisition module, configured to acquire to-be-tested audio data;
the acquisition module is further configured to acquire N spectrum images according to the to-be-tested audio data, wherein N is an integer greater than or equal to 1;
an extraction module, configured to perform feature extraction on each of the N spectrum images to obtain target audio features corresponding to each spectrum image, wherein the target audio features comprise at least one type of audio feature;
the acquisition module is further configured to acquire, through an audio detection model, a target class label corresponding to each spectrum image based on the target audio features corresponding to that spectrum image;
and a determining module, configured to determine the detection result of the to-be-tested audio data according to the target class label corresponding to each spectrum image.
Another aspect of the present application provides a computer device comprising: a memory, a processor, and a bus system;
wherein the memory is used for storing programs;
the processor is configured to execute the program in the memory, performing the methods of the above aspects according to instructions in the program code;
the bus system is used to connect the memory and the processor to communicate the memory and the processor.
Another aspect of the present application provides a computer-readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the methods of the above aspects.
In another aspect of the present application, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the methods provided in the above aspects.
From the above technical solutions, the embodiments of the present application have the following advantages:
In the embodiments of the application, a speaker unit detection method is provided. First, to-be-tested audio data corresponding to a speaker unit under test is acquired, the to-be-tested audio data being audio data played by the speaker unit under test, and N spectrum images are then acquired from the to-be-tested audio data. Feature extraction is performed on each of the N spectrum images to obtain target audio features corresponding to each spectrum image, and a target class label corresponding to each spectrum image is acquired through an audio detection model based on those target audio features. Finally, the target class labels corresponding to the spectrum images are combined to determine the detection result of the speaker unit under test. By introducing an audio detection model to detect speaker units in this way, the method adapts to different detection environments and supports batch detection, thereby reducing detection cost, improving detection efficiency, and lowering the difficulty of speaker unit detection.
Drawings
FIG. 1 is a schematic diagram of an implementation flow of a product lifecycle management system according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a speaker unit detection system according to an embodiment of the present disclosure;
FIG. 3 is a schematic flow chart of a speaker unit detection method according to an embodiment of the present disclosure;
FIG. 4 is a schematic flow chart of audio feature processing according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an audio spectrogram according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a convolutional network and a timing network in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an audio detection model according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a gated recurrent unit according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of another structure of an audio detection model according to an embodiment of the present application;
FIG. 10 is a schematic flow chart of training an audio detection model according to an embodiment of the present application;
FIG. 11 is a schematic diagram of speaker unit detection based on an audio detection model in an embodiment of the present application;
FIG. 12 is a schematic diagram of another structure of an audio detection model according to an embodiment of the present application;
fig. 13 is a schematic flow chart of an audio detection method in an embodiment of the present application;
FIG. 14 is a schematic structural diagram of a speaker unit detection apparatus according to an embodiment of the present disclosure;
FIG. 15 is a schematic diagram of an audio detection device according to an embodiment of the present application;
fig. 16 is a schematic structural diagram of a terminal device in an embodiment of the present application;
fig. 17 is a schematic structural diagram of a server in an embodiment of the present application.
Detailed Description
The embodiments of the application provide a speaker unit detection method, an audio detection method, and a related apparatus. The method adapts to different detection environments and can detect speaker units in batches, thereby reducing detection cost, improving detection efficiency, and lowering the difficulty of speaker unit detection.
The terms "first," "second," "third," "fourth," and the like (if any) in the description, the claims, and the above figures are used to distinguish between similar objects and are not necessarily used to describe a particular sequence or chronological order. It is to be understood that data so used may be interchanged where appropriate, so that the embodiments of the application described herein can, for example, be implemented in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "includes," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, and may include other steps or elements inherent to such process, method, article, or apparatus.
A speaker unit is a modern electroacoustic component used to convert an electrical signal into sound. Speaker units come in many types and are classified in different ways; by electroacoustic conversion principle they include cone units, flat-panel units, dome units, ribbon units, and the like. By covered frequency band, speaker units can further be divided into bass units, midrange units, treble units, and full-range units. With the growth of smart products, the application scenarios of speaker units have also become increasingly broad, including but not limited to face-scanning payment devices, smart speakers, robots, cameras, and the like. Taking a face-scanning payment device as an example, face-scanning payment is a payment mode in which the user's face is recognized by the device's camera for identity verification; it is safe and convenient. During payment, the speaker unit may play a voice prompt, for example, "please face the camera." Taking a smart speaker as an example, a user may ask the speaker to play light music in the morning when getting up, or play upbeat music during a home workout.
Before a product goes on line, a series of tests is usually required so that the design meets the quality requirements of customers and the company. For convenience of description, referring to FIG. 1, FIG. 1 is a schematic flow chart of an implementation of a product lifecycle management system according to an embodiment of the application. As shown in the figure, product lifecycle management (PLM) assists a product in smoothly completing the relevant engineering work after new product introduction (NPI), and may be divided into five stages, as follows:
Stage one, the product planning stage:
first, new product ideas or suggestions for product improvement are proposed; the ideas are then screened, and finally a product concept is formed.
Stage two, engineering verification and test (Engineering Verification Test, EVT) stage:
the focus at this stage is on catching design problems the product may have. For example, whether the speaker unit, camera, wireless module, display screen, and other functions of a face-scanning payment device work is tested. The test may be performed multiple times.
Stage three, the design verification and test (Design Verification Test, DVT) stage:
the focus at this stage is to find design and manufacturing issues, ensuring that all designs meet specifications and can be produced; at this point the product has essentially taken shape.
Stage four, the production verification and test (Production Verification Test, PVT) stage:
the purpose of trial production at this stage is to test the pre-mass-production manufacturing flow and confirm that the factory can build the product as originally designed according to a standard work flow.
Stage five, the mass production (MP) stage:
after each testing stage has passed, the factory can mass-produce the design. In theory, by the time the mass production stage is entered, all design and production issues should be resolved with no omissions or errors, so the design can become a formally marketed product.
Based on this, the speaker unit detection method provided in the application can be applied in the DVT stage, where defective speaker units can be found in advance, thereby reducing the speaker unit's defects per million opportunities (DPMO). The speaker unit detection method provided in the application can also be applied in the MP stage, where defective speaker units can be diagnosed and analyzed after the fact. In order to reduce detection cost and difficulty and improve detection efficiency in the above stages, the application provides a speaker unit detection method, applied to the speaker unit detection system shown in FIG. 2. Referring to FIG. 2, FIG. 2 is a schematic architecture diagram of the speaker unit detection system according to an embodiment of the application. As shown in the figure, the speaker unit detection system may include a device under test, a sound pickup device, and a test device. The test device may specifically be a terminal device on which a client is deployed; the client may run on the terminal device in the form of a browser, an application (APP), or the like, and the specific presentation form of the client is not limited here. The device under test is equipped with the speaker unit under test, and includes but is not limited to a face-scanning payment device, a smart speaker, a robot, and the like.
Illustratively, it is assumed that the test device has a trained audio detection model stored therein. The tested device plays a section of audio through the loudspeaker monomer to be tested, and collects audio data to be tested through the radio receiving device (such as a microphone), and the audio data to be tested are stored in the testing device in the form of audio files. Therefore, the test equipment can directly call a local audio detection model to analyze the audio data to be tested, and further a detection result of the loudspeaker monomer to be tested is obtained.
Illustratively, assume that a trained audio detection model is stored in a server, and the test device establishes a communication connection with the server. The device under test plays a segment of audio through the speaker unit to be tested, the audio data to be tested is collected through the sound-pickup device (such as a microphone), and the audio data to be tested is stored in the test device in the form of an audio file. Therefore, the test device can call the audio detection model stored in the server to analyze the audio data to be tested, thereby obtaining the detection result of the speaker unit to be tested.
The server related to the present application may be an independent physical server, a server cluster or a distributed system composed of a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data, and artificial intelligence platforms. The terminal device may be, but is not limited to, a smartphone, a tablet computer, a notebook computer, a palmtop computer, a personal computer, a smart television, a smart watch, a vehicle-mounted device, a wearable device, and the like. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, which is not limited herein. The number of servers and terminal devices is not limited either. The scheme provided by the present application may be completed independently by the terminal device, independently by the server, or by the terminal device and the server in cooperation, which is not specifically limited in the present application.
The audio detection model may be trained based on machine learning (Machine Learning, ML) technology, where ML is a technology in the field of artificial intelligence (Artificial Intelligence, AI). AI is the theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, AI is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. AI is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making. AI technology is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level technologies. Basic AI technologies generally include technologies such as sensors, dedicated AI chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics. AI software technology mainly includes directions such as computer vision technology, speech processing technology, natural language processing technology, machine learning/deep learning, automatic driving, and intelligent transportation.
ML is a multi-domain interdisciplinary subject involving multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, and algorithm complexity theory. It specializes in studying how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. ML is the core of AI and the fundamental way to make computers intelligent; it is applied throughout all fields of AI. ML and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
With the research and progress of AI technology, AI technology has been researched and applied in many fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart healthcare, smart customer service, the Internet of Vehicles, and smart transportation. It is believed that, with the development of technology, AI technology will be applied in more fields and play an increasingly important role.
With reference to the foregoing description, the solution provided in the embodiments of the present application relates to technologies such as machine learning in artificial intelligence. The speaker unit detection method in the present application is described below. Referring to fig. 3, one embodiment of the speaker unit detection method in the embodiments of the present application includes:
110. Acquiring audio data to be tested corresponding to the speaker unit to be tested, where the audio data to be tested is audio data played by the speaker unit to be tested;
in one or more embodiments, the speaker unit detection device obtains the audio data to be tested, where the audio data to be tested is audio data played by the speaker unit to be tested. Specifically, the speaker unit to be tested plays a segment of audio; the audio is recorded by the sound-pickup device and stored in the form of a file, thereby obtaining the audio data to be tested.
The speaker unit detection device may be deployed in a server, in a terminal device, or in a system composed of a server and a terminal device, which is not limited herein.
120. Acquiring N frequency spectrum images according to audio data to be detected, wherein N is an integer greater than or equal to 1;
in one or more embodiments, the speaker unit detection device converts the audio data to be tested into an audio spectrogram. The audio spectrogram can be understood as a frequency distribution map: the abscissa of the audio spectrogram is time, the ordinate is frequency, and the value at each coordinate point represents the audio energy. Since three-dimensional information is expressed on a two-dimensional plane, the magnitude of the energy value is represented by color; that is, the darker the color, the stronger the audio energy at that point.
Specifically, if the audio data to be measured is dual-channel audio data, the audio data to be measured may be converted into a dual-channel audio spectrogram first, and then converted into a mono-channel audio spectrogram. Based on this, the audio spectrogram may be divided into at least one spectral image.
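For ease of understanding, the conversion described above can be sketched as follows. This is an illustrative numpy sketch under assumed parameters (a 1024-point FFT with a hop of 512), not the exact pipeline of the application: the dual-channel audio is averaged to mono and a magnitude spectrogram is computed with a short-time Fourier transform.

```python
import numpy as np

def stereo_to_mono(stereo: np.ndarray) -> np.ndarray:
    """Average the two channels of a (num_samples, 2) array into mono."""
    return stereo.mean(axis=1)

def magnitude_spectrogram(signal: np.ndarray, n_fft: int = 1024, hop: int = 512) -> np.ndarray:
    """Simple STFT magnitude: rows are frequency bins, columns are time frames."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape (n_fft // 2 + 1, n_frames)

# Example: one second of a 440 Hz tone, duplicated into two channels, at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
stereo = np.stack([tone, tone], axis=1)
spec = magnitude_spectrogram(stereo_to_mono(stereo))
```

In `spec`, the energy of the 440 Hz tone concentrates near frequency bin 440 / (sr / n_fft) ≈ 28, which is the "dark stripe" one would see in the rendered spectrogram.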
130. Extracting features of each of the N frequency spectrum images to obtain target audio features corresponding to each frequency spectrum image, wherein the target audio features comprise at least one type of audio features;
in one or more embodiments, the speaker unit detection device may perform feature extraction processing on each spectrum image, to obtain a target audio feature corresponding to each spectrum image, that is, obtain N groups of target audio features. Wherein each set of target audio features comprises at least one type of audio feature, e.g., each set of target audio features comprises 5 types of audio features.
Specifically, for convenience of explanation, referring to fig. 4, fig. 4 is a schematic flow chart of audio feature processing in the embodiment of the present application, and as shown in the drawing, in step S1, two-channel audio data to be detected is converted into single-channel audio data to be detected. In step S2, the audio data to be detected of the mono channel is subjected to block and frame processing, where the audio data to be detected may be divided into N blocks of audio data, each block of audio data is represented as one spectral image, and further, each spectral image may be subjected to frame processing, so as to obtain a spectral sub-image.
In one case, step S3 is executed, and feature extraction is performed on the segmented spectrum image, that is, features of the entire spectrum image are extracted, so as to obtain N groups of target audio features. Thus, the target audio feature is input to the audio detection model, and the prediction result of each spectral image is output by the audio detection model.
In another case, step S4 is executed, further framing processing is performed on the segmented spectrum images, and then feature extraction is performed on each frame of spectrum sub-image in each spectrum image, so as to obtain N groups of target audio features. Thus, the target audio features are input to the audio detection model, and the prediction result of each spectrum sub-image is output by the audio detection model.
140. Acquiring a target class label corresponding to each spectrum image through an audio detection model based on a target audio feature corresponding to each spectrum image;
in one or more embodiments, the speaker unit detection device uses the target audio feature corresponding to each spectrum image as the input of the audio detection model, and the audio detection model outputs a prediction result for each spectrum image. If the prediction result is a probability distribution, the corresponding target class label is determined according to the probability distribution, where the probability distribution may be obtained for one spectrum image or for each spectrum sub-image in the same spectrum image.
In the actual prediction, the target audio features may be directly input to the audio detection model, or may be sequentially input to the audio detection model according to the unit audio features obtained after framing, which is not limited herein.
150. And determining the detection result of the loudspeaker monomer to be detected according to the target class label corresponding to each frequency spectrum image.
In one or more embodiments, after the target class label corresponding to each spectrum image is obtained, the speaker unit detection device may determine the detection result of the speaker unit to be tested according to the N target class labels. If the detection result of the speaker unit to be tested is an abnormal result, further processing may be adopted, for example, a related prompt may be initiated directly.
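The application does not fix the aggregation rule at this point, but one plausible rule (an assumption for illustration only) is to flag the unit as abnormal if any of the N spectrum images is labelled abnormal:

```python
def detect_speaker_unit(labels):
    """Aggregate the N per-spectrum-image class labels into a unit-level result.

    Assumed rule (not taken from the application): the speaker unit is
    abnormal if any spectrum image carries an abnormal label.
    """
    return "abnormal" if any(label == "abnormal" for label in labels) else "normal"
```

A stricter or looser rule (e.g., a majority vote, or a minimum count of abnormal images) could be substituted without changing the rest of the pipeline.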
In this embodiment of the present application, a speaker unit detection method is provided. By introducing an audio detection model in the above manner, the method can not only adapt to different detection environments but also detect speaker units in batches, thereby reducing detection cost, improving detection efficiency, and helping to reduce the difficulty of speaker unit detection.
Optionally, based on the foregoing respective embodiments corresponding to fig. 3, in another optional embodiment provided in the present application, acquiring N spectral images according to audio data to be measured may specifically include:
Converting the audio data to be tested into an audio spectrogram;
dividing an audio frequency spectrogram according to a first preset duration to obtain N frequency spectrum images;
feature extraction is performed on each spectrum image in the N spectrum images to obtain a target audio feature corresponding to each spectrum image, which may specifically include:
dividing each spectrum image in the N spectrum images according to a second preset duration to obtain M spectrum sub-images of each spectrum image, wherein the second preset duration is smaller than the first preset duration, and M is an integer greater than 1;
aiming at each spectrum sub-image of each spectrum image, acquiring the audio characteristics of a unit to be processed of the spectrum sub-image;
and carrying out normalization processing on the to-be-processed unit audio features of each spectrum sub-image aiming at each spectrum sub-image of each spectrum image to obtain the unit audio features of the spectrum sub-image, wherein the unit audio features are contained in the target audio features.
In one or more embodiments, a manner of chunking and framing audio data to be measured in a time dimension is presented. As can be seen from the foregoing embodiments, the audio data to be detected can be converted into a dual-channel audio spectrogram, and then converted into a mono-channel audio spectrogram.
Specifically, for an audio spectrogram, the audio spectrogram is divided according to a first preset duration, so that N frequency spectrum images are obtained. Assuming that the first preset time period is 30 seconds, if the audio spectrogram is 5 minutes, 10 spectral images can be divided. If the audio spectrogram is 5 minutes and 20 seconds, 11 spectral images can be divided. It can be seen that for an audio spectrogram, if there is a remainder after the first preset time period is divided, one spectral image is added. And dividing the frequency spectrum images according to a second preset time length aiming at each frequency spectrum image, thereby obtaining M frequency spectrum sub-images of each frequency spectrum image. Assuming that the second preset time period is 1 second, if each spectrum image is 30 seconds (at this time, M is equal to 30), each spectrum image may be divided into 30 spectrum sub-images. It is understood that the second preset time period is less than the first preset time period.
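The counting described above is ceiling division: a remainder after dividing by the preset duration contributes one extra image. A minimal sketch (the 30-second and 1-second durations are the example values from the text, not fixed by the application):

```python
import math

def count_spectrum_images(total_seconds: float, block_seconds: float = 30.0) -> int:
    """Number N of spectrum images; a remainder adds one extra image."""
    return math.ceil(total_seconds / block_seconds)

def count_spectrum_sub_images(block_seconds: float = 30.0, frame_seconds: float = 1.0) -> int:
    """Number M of spectrum sub-images per full-length spectrum image."""
    return math.ceil(block_seconds / frame_seconds)
```

With these values, a 5-minute spectrogram yields 10 spectrum images, a 5-minute-20-second one yields 11, and each full 30-second image yields M = 30 sub-images, matching the examples above.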
For each spectrum sub-image of each spectrum image, the corresponding unit audio feature to be processed needs to be extracted, where the unit audio feature to be processed includes at least one type of audio feature to be processed. In the machine learning field, different evaluation indexes often have different dimensions and units, and this affects the result of data analysis. Therefore, data normalization is needed so that all indexes are of the same order of magnitude, making them suitable for comprehensive comparison and evaluation. Based on this, normalization may be performed on the unit audio features to be processed of each spectrum sub-image in each spectrum image, finally obtaining the unit audio features of the spectrum sub-images.
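The application does not name the exact normalization used; min-max scaling per feature dimension is one common choice and is sketched below as an illustrative assumption:

```python
import numpy as np

def normalize_features(feats: np.ndarray) -> np.ndarray:
    """Min-max normalize each feature dimension (column) to [0, 1].

    Assumed scheme: the application only states that indexes are brought
    to the same order of magnitude, not which scaling is applied.
    """
    lo = feats.min(axis=0)
    hi = feats.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero for constant columns
    return (feats - lo) / span

# Two feature dimensions with very different scales, one row per sub-image.
demo = normalize_features(np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]))
```

After scaling, both columns span [0, 1], so no single feature dominates purely because of its unit.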
It should be noted that one spectral image has a set of target audio features, and one spectral image includes M spectral sub-images, each having a set of unit audio features. It can be seen that one spectral image has M sets of unit audio features, i.e. the unit audio features are included in the target audio features.
Secondly, in the embodiment of the application, a manner of blocking and framing the audio data to be detected according to the time dimension is provided, through the manner, the audio data to be detected is divided into finer granularity, the audio features with finer granularity are extracted to serve as the input of the model, the features with finer granularity can be learned in the model training stage, and meanwhile, the accuracy and reliability of prediction can be improved in the model reasoning stage.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 3, in another optional embodiment provided in the present application, the obtaining the audio feature of the unit to be processed of the spectrum sub-image may specifically include:
extracting to-be-processed Mel spectrum features of a spectrum sub-image by adopting a first fast Fourier transform size;
extracting the characteristic of the to-be-processed mel cepstrum coefficient of the spectrum sub-image by adopting a second fast Fourier transform size;
Extracting zero crossing rate characteristics to be processed of the spectrum sub-image by adopting a third fast Fourier transform size;
extracting the flatness of the frequency spectrum to be processed of the frequency spectrum sub-image by adopting a fourth fast Fourier transform size;
and extracting the mass center characteristics of the spectrum to be processed of the spectrum sub-image by adopting a fifth fast Fourier transform size.
In one or more embodiments, a way of extracting audio features of a unit to be processed is presented. As can be seen from the foregoing embodiments, each spectral sub-image can extract corresponding audio features of the unit to be processed, where the audio features of the unit to be processed include, but are not limited to, energy features, time domain features, frequency domain features, music theory features, and perception features. The manner in which the audio features of the unit to be processed are extracted will be described below.
1. The audio features of the unit to be processed comprise Mel frequency spectrum features to be processed;
for example, a first fast Fourier transform (Fast Fourier Transform, FFT) size may be employed to extract the mel spectrum features to be processed of the spectrum sub-images. The first FFT size may be set to 4096, and the mel spectrum feature to be processed may be a mel128 feature, which belongs to the frequency domain features.
It should be noted that the first FFT size may be set to other values, which are not limited herein.
2. The audio features of the unit to be processed include mel-frequency cepstral coefficient (Mel-Frequency Cepstral Coefficients, MFCC) features to be processed;
for example, the second FFT size may be employed to extract the MFCC features to be processed of the spectrum sub-image. The second FFT size may be set to 2048. Extraction of the MFCC features to be processed typically involves pre-emphasis, framing, windowing, FFT, a mel filter bank, and a discrete cosine transform (Discrete Cosine Transform, DCT), where the FFT and the mel filter bank mainly perform dimensionality reduction. The MFCC features to be processed belong to the frequency domain features and are also features commonly used in speech recognition; since these parameters take into account the human ear's perception of different frequencies, the MFCC features can also be classified as perceptual features.
It should be noted that the second FFT size may be set to other values, which are not limited herein.
3. The unit-to-be-processed audio features include a zero-crossing rate-to-be-processed (Zero Crossing Rate) feature;
for example, a third FFT size may be employed to extract the zero crossing rate characteristic to be processed of the spectral sub-image. Wherein the third FFT size may be set to 1024. The zero crossing rate feature to be processed belongs to a time domain feature, and the zero crossing rate feature to be processed represents the rate of change of a signal sign, namely the number of times the signal passes through 0 point (from positive to negative or from negative to positive) in unit time. In general, the larger the zero crossing rate, the higher the frequency approximation.
The third FFT size may be set to other values, which are not limited herein.
4. The audio features of the unit to be processed include a spectral flatness (Spectral Flatness) feature to be processed;
illustratively, a fourth FFT size may be employed to extract the spectral flatness feature to be processed of the spectrum sub-image. The fourth FFT size may be set to 1024. The spectral flatness feature to be processed belongs to the frequency domain features; spectral flatness quantifies how similar a signal is to noise, and the larger its value, the more likely the signal is noise.
The fourth FFT size may be set to other values, which are not limited herein.
5. The unit-to-be-processed audio features include a spectrum centroid (Spectral Centroid) feature to be processed;
for example, a fifth FFT size may be employed to extract the spectral centroid feature of the spectral sub-image to be processed. Wherein the fifth FFT size may be set to 1024. The feature of the centroid of the spectrum to be processed belongs to the feature of the frequency domain, and the centroid of the spectrum to be processed is one of important physical parameters describing the attribute of tone and is used for representing the energy concentration point in the signal spectrum.
The fifth FFT size may be set to other values, which are not limited herein.
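Several of the features above have compact textbook definitions; a numpy sketch of three of them follows (libraries such as librosa provide equivalent extractors, and the formulas here are the standard definitions, not code taken from the application):

```python
import numpy as np

def zero_crossing_rate(frame: np.ndarray) -> float:
    """Fraction of adjacent sample pairs whose sign differs."""
    signs = np.sign(frame)
    return float(np.mean(signs[:-1] != signs[1:]))

def spectral_centroid(frame: np.ndarray, sr: int, n_fft: int = 1024) -> float:
    """Magnitude-weighted mean frequency: the energy concentration point."""
    mag = np.abs(np.fft.rfft(frame, n=n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    return float(np.sum(freqs * mag) / np.sum(mag))

def spectral_flatness(frame: np.ndarray, n_fft: int = 1024) -> float:
    """Geometric mean over arithmetic mean of the power spectrum (close to 1 for noise)."""
    power = np.abs(np.fft.rfft(frame, n=n_fft)) ** 2 + 1e-12
    return float(np.exp(np.mean(np.log(power))) / np.mean(power))

# A 1 kHz tone at 16 kHz: ZCR near 2 * 1000 / 16000, centroid near 1000 Hz,
# flatness far below that of white noise.
sr = 16000
tone = np.sin(2 * np.pi * 1000 * np.arange(1024) / sr)
noise = np.random.default_rng(0).standard_normal(1024)
```

As the text notes, a higher zero crossing rate roughly indicates higher frequency content, and flatness near 1 indicates a noise-like signal.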
In the embodiment of the present application, a method for extracting the unit audio features to be processed is provided. By extracting the unit audio features to be processed from the time domain, the frequency domain, and other dimensions, different spectrum sub-images can be distinguished based on these features, and the spectrum sub-images are classified into predefined types, for example, a second normal label and a second abnormal label, by means of an audio detection model used for classification. It can be seen that the type of a spectrum sub-image can be well identified using effective unit audio features to be processed.
Optionally, on the basis of the respective embodiments corresponding to fig. 3, another optional embodiment provided in the embodiment of the present application may further include:
obtaining format data corresponding to audio data to be detected, wherein the format data comprises bit depth and sampling rate;
encoding the format data to obtain bit depth characteristics and sampling rate characteristics;
based on the target audio feature corresponding to each spectrum image, obtaining the target category label corresponding to each spectrum image through the audio detection model specifically may include:
and acquiring a target class label corresponding to each spectrum image through an audio detection model based on the target audio characteristic, the bit depth characteristic and the sampling rate characteristic corresponding to each spectrum image.
In one or more embodiments, a manner of incorporating a data format predictive target class label is presented. As can be seen from the foregoing embodiments, the audio data to be detected can also be obtained in a corresponding data format, where the data format includes, but is not limited to, bit depth, sampling rate, and bit rate.
1. Bit depth (bit depth);
the bit depth is also referred to as the sampling precision, and is in bits (bits). Common bits are 16 bits and 24 bits. The bit depth influences the signal-to-noise ratio and the dynamic range of the signal and also determines the size of the file, theoretically, the higher the bit depth is, the better the quality is, and the larger the file generated by the bit depth is.
2. Sampling rate (sampling rate);
the total number of samples sampled per second is referred to as the sampling rate in hertz (Hz). The higher the sampling rate, the higher the frequency of the sound wave can be described, and the more true and natural the degree of restoration of the sound wave. Typical sample rates are 8KHz, 16KHz, 32KHz, 48KHz, 44.1KHz, 96KHz, and 192KHz.
3. Bit rate (bit rate);
how many bits are processed per second is referred to as the bit rate in bits per second (bits/s). The higher the bit rate, the better its sound quality. Typical bit rates are 32kbit/s, 96kbit/s, 128kbit/s, 192kbit/s and 256kbit/s.
Specifically, for ease of understanding, referring to fig. 5, fig. 5 is a schematic diagram of audio spectrograms in the embodiment of the present application. Fig. 5 (A) shows a dual-channel audio spectrogram corresponding to a bit depth of 16 bits and a sampling rate of 48 kHz. Fig. 5 (B) shows a dual-channel audio spectrogram corresponding to a bit depth of 16 bits and a sampling rate of 32 kHz. Fig. 5 (C) shows a dual-channel audio spectrogram corresponding to a bit depth of 16 bits and a sampling rate of 16 kHz.
Assume that the bit depth includes two types, "16bit" and "24bit", and the sampling rate includes three types, "16KHz", "32KHz", and "48KHz". Illustratively, the bit depth and the sampling rate may be encoded in a one-hot encoding manner. For example, if the bit depth of the audio data to be tested is "16bit" and the sampling rate is "48KHz", the bit depth feature is (1, 0) and the sampling rate feature is (0, 0, 1).
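The one-hot encoding of these two categorical formats can be sketched as follows (the vocabularies are the two-type and three-type sets assumed in the text):

```python
BIT_DEPTHS = ["16bit", "24bit"]
SAMPLE_RATES = ["16KHz", "32KHz", "48KHz"]

def one_hot(value: str, vocabulary: list) -> tuple:
    """Encode a categorical value as a tuple with a single 1 at its index."""
    if value not in vocabulary:
        raise ValueError(f"unknown category: {value}")
    return tuple(1 if v == value else 0 for v in vocabulary)
```

The two resulting tuples can then be concatenated with the target audio feature before being fed to the audio detection model.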
Based on this, the target audio feature, bit depth feature, and sampling rate feature corresponding to each spectral image are input to the audio detection model. The prediction result of each spectrum image can be obtained through the audio detection model, so that the target class label corresponding to the spectrum image can be further determined.
Secondly, in the embodiment of the application, a mode of adding the data format prediction target class label is provided, by using the bit depth feature and the sampling rate feature as the basis of the prediction target class label, the feature richness can be increased, and the accuracy and the reliability of prediction can be improved.
Optionally, in another optional embodiment provided in the embodiment of the present application on the basis of the respective embodiment corresponding to fig. 3, each spectral image includes M spectral sub-images, the target audio feature includes M unit audio features, and the unit audio features have a correspondence with the spectral sub-images, where M is an integer greater than 1;
based on the target audio feature corresponding to each spectrum image, obtaining the target category label corresponding to each spectrum image through the audio detection model specifically may include:
acquiring a first feature map of each spectrum sub-image through a convolution network included in an audio detection model based on unit audio features of each spectrum sub-image for each spectrum image;
acquiring a second feature map of each spectrum sub-image through an activation network included in the audio detection model based on the first feature map of each spectrum sub-image for each spectrum image;
For each spectrum image, acquiring a feature vector of each spectrum sub-image through a time sequence network included in the audio detection model based on a second feature map of each spectrum sub-image;
aiming at each spectrum image, acquiring class probability distribution of each spectrum sub-image through a full-connection layer included in the audio detection model based on the feature vector of each spectrum sub-image;
and determining a target category label corresponding to the spectrum image according to the category probability distribution of each spectrum sub-image aiming at each spectrum image.
In one or more embodiments, a manner of predicting a target probability value based on an audio detection model is presented. As can be seen from the foregoing embodiments, the audio data to be detected is processed to obtain N spectral images, where each spectral image includes M spectral sub-images, and each spectral sub-image extracts a corresponding unit audio feature.
Illustratively, the audio detection model may include two networks: a convolutional network and a time-sequence network. For ease of illustration, referring to fig. 6, fig. 6 is a schematic diagram of a convolutional network and a time-sequence network according to an embodiment of the present application. The convolutional network illustrated in fig. 6 (A) may be a convolutional neural network (Convolutional Neural Network, CNN), which includes an input layer, convolutional layers, and pooling layers. The purpose of the convolution operation is to extract different features of the input; the first convolutional layer may only extract low-level features such as edges, lines, and corners, while deeper convolutional layers can iteratively extract more complex features from the low-level features. The pooling layer, which typically follows a convolutional layer producing features of very large dimension, cuts the features into several regions and takes the maximum or average value of each, thereby obtaining new features of smaller dimension.
The time-sequence network illustrated in fig. 6 (B) may be a gated recurrent unit (Gated Recurrent Unit, GRU), where (1-z) may be considered the "forget gate" and z the "update gate". Because the two are coupled, the amount forgotten, (1-z), determines how much new information is admitted; that is, the transferred information can be selectively forgotten. "r" may be considered a "reset gate", which determines how the new input information is combined with the previous memory.
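The gate behavior described above corresponds to the standard GRU update equations (a textbook formulation, given here for reference rather than taken from the application):

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1}) && \text{(update gate)} \\
r_t &= \sigma(W_r x_t + U_r h_{t-1}) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\bigl(W x_t + U\,(r_t \odot h_{t-1})\bigr) && \text{(candidate state)} \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

The last line shows the coupling noted in the text: the fraction (1-z_t) of the old state that is kept and the fraction z_t of new information that is admitted always sum to one.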
Specifically, for ease of understanding, referring to fig. 7, fig. 7 is a schematic structural diagram of an audio detection model according to an embodiment of the present application. As shown in the figure, the audio spectrogram is divided into N spectrum images, and each spectrum image is further divided into M spectrum sub-images. Based on this, the unit audio features corresponding to the M spectrum sub-images belonging to the same spectrum image are respectively input into the convolutional network in the audio detection model, and the first feature map of each spectrum sub-image is output through the convolutional network. It will be appreciated that the convolutional network may include 6 network layers. For example, the first layer of the convolutional network employs 64 convolution kernels, each of size 3×3, with a stride of 1 and a padding of 1. The second layer is a max pooling layer with a 2×2 window and a stride of 2. The third layer employs 128 convolution kernels of size 3×3, with a stride of 1 and a padding of 1. The fourth layer is a max pooling layer with a 2×2 window and a stride of 2. The fifth layer employs 256 convolution kernels of size 3×3, with a stride of 1 and a padding of 1. The sixth layer is a max pooling layer with a 2×2 window and a stride of 2.
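The spatial dimensions flowing through these six layers can be traced with the standard convolution/pooling size formulas. The input size below is a hypothetical example (the application does not state the feature-map dimensions):

```python
def conv_out(size: int, kernel: int = 3, stride: int = 1, padding: int = 1) -> int:
    """Output size of a convolution: (size + 2*padding - kernel) // stride + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size: int, window: int = 2, stride: int = 2) -> int:
    """Output size of max pooling: (size - window) // stride + 1."""
    return (size - window) // stride + 1

def trace_conv_network(h: int, w: int) -> tuple:
    """Trace (channels, height, width) through the six layers described above:
    three blocks of (3x3 conv, stride 1, padding 1 -> 2x2 max pool, stride 2),
    with 64, 128, and 256 kernels respectively."""
    channels = 0
    for channels in (64, 128, 256):
        h, w = conv_out(h), conv_out(w)  # 3x3/s1/p1 conv preserves spatial size
        h, w = pool_out(h), pool_out(w)  # 2x2/s2 pooling halves it
    return channels, h, w
```

Each conv layer preserves the spatial size and each pooling layer halves it, so a hypothetical 128×32 input leaves the sixth layer as a 256-channel feature map of size 16×4.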
The first feature map of each spectrum sub-image is taken as the input of an activation network, and the second feature map of each spectrum sub-image is output through the activation network. It is to be appreciated that the activation network may include 2 network layers; for example, the first layer of the activation network employs a batch normalization (Batch Normalization) layer, and the second layer employs a max pooling layer.
The second feature map of each spectrum sub-image is taken as the input of the time-sequence network, and the feature vector of each spectrum sub-image is output through the time-sequence network. It is understood that the time-sequence network may be a GRU, a bidirectional GRU (BiGRU), a long short-term memory network (Long Short-Term Memory, LSTM), a bidirectional LSTM (BiLSTM), or the like. Taking the GRU as an example, referring to fig. 8, fig. 8 is a schematic structural diagram of a gated recurrent unit according to an embodiment of the present application. A BiGRU is a network composed of two unidirectional GRUs with opposite directions. At each instant, the input is provided to the two GRUs in opposite directions simultaneously, while the output is determined jointly by the two unidirectional GRUs. The hidden unit is 256-dimensional. Here, x_t represents the input of the t-th frame, and the current hidden state of the BiGRU is determined by three parts: x_t, the forward hidden state of frame (t-1), and the backward hidden state of frame (t-1).
The feature vector of each spectrum sub-image is taken as the input of the fully connected layer, and the class probability distribution of each spectrum sub-image is output through the fully connected layer. Further, the target class label of the spectrum image is determined from the class labels of its spectrum sub-images. The fully connected layer may employ a sigmoid function; for example, a class probability of "1" indicates that the class label is the first normal label, and a class probability of "0" indicates that the class label is the first abnormal label. For another example, a class probability of "0.8" indicates that the probability of the class label being the first normal label is 80%. It can be appreciated that M×N class probability distributions may be obtained for the audio data to be tested.
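The mapping from a sigmoid output to a class label can be sketched as follows. The 0.5 decision threshold is an assumption; the application only states that probabilities near 1 correspond to the first normal label and near 0 to the first abnormal label:

```python
import math

def sigmoid(x: float) -> float:
    """Squash a raw score into a class probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def label_from_probability(p: float, threshold: float = 0.5) -> str:
    """Map a class probability to a label; 0.5 is an assumed cut-off."""
    return "first normal label" if p >= threshold else "first abnormal label"
```

Under this rule, the "0.8" example above (80% probability of being normal) is mapped to the first normal label.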
Secondly, in the embodiment of the present application, a manner of predicting a target probability value based on an audio detection model is provided, in which the unit audio feature of each frame of spectrum sub-image in the spectrum image is used as the input of the audio detection model, so as to output the class probability distribution of each frame of spectrum sub-image. Because the audio detection model includes both convolutional and recursive calculation and is a feedforward neural network with a depth structure, it can learn, on the one hand, the features within each spectrum sub-image and, on the other hand, the features across spectrum sub-images, thereby improving the prediction accuracy.
Optionally, in another optional embodiment provided in the embodiment of the present application on the basis of the respective embodiment corresponding to fig. 3, each spectral image includes M spectral sub-images, the target audio feature includes M unit audio features, and the unit audio features have a correspondence with the spectral sub-images, where M is an integer greater than 1;
based on the target audio feature corresponding to each spectrum image, obtaining the target category label corresponding to each spectrum image through the audio detection model specifically may include:
acquiring a first feature map of each spectrum sub-image through a convolution network included in an audio detection model based on unit audio features of each spectrum sub-image for each spectrum image;
acquiring a second feature map of each spectrum sub-image through an activation network included in the audio detection model based on the first feature map of each spectrum sub-image for each spectrum image;
aiming at each spectrum image, acquiring category probability distribution of each spectrum sub-image through a full-connection layer included in the audio detection model based on a second feature map of each spectrum sub-image;
and determining a target category label corresponding to the spectrum image according to the category probability distribution of each spectrum sub-image aiming at each spectrum image.
In one or more embodiments, a manner of predicting a target probability value based on an audio detection model is presented. As can be seen from the foregoing embodiments, the audio data to be detected is processed to obtain N spectrum images, where each spectrum image includes M spectrum sub-images, and a corresponding unit audio feature is extracted from each spectrum sub-image.
Illustratively, the audio detection model may include a convolutional network, and it is understood that the structure of the convolutional network is shown in fig. 6, which is not described herein.
Specifically, for ease of understanding, please refer to fig. 9, which is another schematic structural diagram of an audio detection model according to an embodiment of the present application; as shown in the drawing, the audio spectrogram is divided into N spectrum images, and each spectrum image is further divided into M spectrum sub-images. Based on this, the unit audio features corresponding to the M spectrum sub-images belonging to the same spectrum image are respectively input into the convolution network in the audio detection model, and the first feature map of each spectrum sub-image is output through the convolution network. It is understood that the convolution network may include the 6 network layers described in the foregoing embodiments, which are not described herein again. The first feature map of each spectrum sub-image is taken as the input of the activation network, and the second feature map of each spectrum sub-image is output through the activation network. It is understood that the activation network may include the 2 network layers described in the foregoing embodiments, which are not described herein again. The second feature map of each spectrum sub-image is taken as the input of the fully connected layer, and the class probability distribution of each spectrum sub-image is output through the fully connected layer. The classification label of each spectrum sub-image may be determined based on its class probability distribution, and further, the target class label of the spectrum image may be determined from the classification labels of its spectrum sub-images. It can be appreciated that M×N class probability distributions may be obtained for the audio data to be measured.
Secondly, in the embodiment of the present application, a manner of predicting a target probability value based on an audio detection model is provided, in which the unit audio feature of each frame of spectrum sub-image in the spectrum image is used as the input of the audio detection model, so as to output the class probability distribution of each frame of spectrum sub-image. Because the audio detection model includes convolutional calculation and is a feedforward neural network with a depth structure, the features of each spectrum sub-image can be learned, thereby improving the prediction accuracy.
Optionally, on the basis of the respective embodiments corresponding to fig. 3, another optional embodiment provided in the embodiment of the present application may further include:
acquiring an audio data sample, wherein the audio data sample corresponds to a labeling category label;
obtaining P frequency spectrum image samples according to the audio data samples, wherein P is an integer greater than or equal to 1;
dividing each spectrum image sample in the P spectrum image samples according to a second preset duration to obtain M spectrum sub-image samples corresponding to each spectrum image sample;
aiming at each spectrum sub-image sample of each spectrum image sample, acquiring the to-be-processed unit audio features of the spectrum sub-image sample;
aiming at each spectrum sub-image sample of each spectrum image sample, carrying out normalization processing on the to-be-processed unit audio features of the spectrum sub-image sample to obtain the unit audio features of the spectrum sub-image sample;
aiming at each spectrum image sample, acquiring class probability distribution of each spectrum sub-image sample through an audio detection model based on unit audio features of M spectrum sub-image samples;
and updating model parameters of the audio detection model according to the labeling class labels and class probability distribution of each spectrum sub-image sample aiming at each spectrum image sample.
In one or more embodiments, a manner of training an audio detection model is presented. As can be seen from the foregoing embodiments, the audio detection model needs to be trained based on a large number of audio data samples. For ease of understanding, please refer to fig. 10, which is a schematic flow chart of training the audio detection model in the embodiment of the application. As shown in fig. 10, the audio data samples first need to be collected, then audio features need to be extracted from the audio data samples by audio feature engineering, and in addition, the model structure needs to be designed. Next, model training is started, and finally, the trained model is evaluated.
Specifically, for convenience of explanation, one audio data sample is taken as an example below. The audio data sample needs to be marked with a labeling class label; for example, a labeling class label of "1" indicates that the speaker monomer is normal, and a labeling class label of "0" indicates that the speaker monomer is abnormal, where the abnormal situations of the speaker monomer include, but are not limited to, sound breaking, silence, and noise. A normal speaker monomer produces a normal speaker audio waveform, and an abnormal speaker monomer produces an abnormal speaker audio waveform. Referring to fig. 11, fig. 11 is a schematic diagram of implementing speaker monomer detection based on an audio detection model in the embodiment of the present application. As shown in the drawing, the audio data sample is converted and divided into P spectrum image samples; each spectrum image sample is then divided according to the second preset duration to obtain M spectrum sub-image samples corresponding to each spectrum image sample. Features of each spectrum sub-image sample are extracted according to the audio rules to obtain the to-be-processed unit audio features of the spectrum sub-image sample, and normalization processing is then performed on the to-be-processed unit audio features to obtain the unit audio features of the spectrum sub-image sample. The M unit audio features corresponding to each spectrum image sample are sequentially input into the audio detection model, thereby obtaining the class probability distribution of each spectrum sub-image sample.
During training, a loss function (e.g., a cross entropy loss function) is used to calculate a loss value based on the true values (i.e., the labeling class labels) and the predicted values (i.e., the class probability distributions), and the model parameters of the audio detection model are updated with the loss value. In practical training, the numbers of normal and abnormal audio data samples should be kept similar, and in general, a large number of audio data samples need to undergo data cleaning to remove invalid samples.
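As a minimal illustration of the update step described above, the following sketch trains a single logistic unit with binary cross entropy by gradient descent; the toy model, random features, and learning rate are stand-ins for illustration, not the actual audio detection model:

```python
import numpy as np

def binary_cross_entropy(p, y, eps=1e-12):
    # Cross entropy between predicted probabilities p and 0/1 labels y.
    p = np.clip(p, eps, 1 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 5))                 # stand-in unit audio features
y = (x[:, 0] > 0).astype(float)              # labeling class labels: 1 normal, 0 abnormal
w = np.zeros(5)                              # stand-in model parameters

for _ in range(200):                         # update model parameters with the loss gradient
    p = 1.0 / (1.0 + np.exp(-x @ w))         # predicted class probability distribution
    grad = x.T @ (p - y) / len(y)            # gradient of the cross entropy w.r.t. w
    w -= 0.5 * grad                          # gradient-descent parameter update

loss = binary_cross_entropy(1.0 / (1.0 + np.exp(-x @ w)), y)
```

In a real setup the logistic unit would be replaced by the convolutional/recurrent model, and an optimizer such as Adam would typically perform the same loss-driven parameter updates.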
In the embodiment of the application, a mode for training an audio detection model is provided, through the mode, an audio data sample is added for training the audio detection model, so that model parameters of the audio detection model can be continuously optimized, a better fitting effect is achieved, and the detection accuracy is improved.
Optionally, in another optional embodiment provided in the embodiment of the present application, for each spectrum image, the determining, according to the class probability distribution of each spectrum sub-image, the target class label corresponding to the spectrum image may specifically include:
for each spectrum image, determining a corresponding classification label of each spectrum sub-image according to the classification probability distribution of each spectrum sub-image;
For each spectrum image, determining the number of first normal labels according to the corresponding classification label of each spectrum sub-image;
for each spectrum image, determining a first normal label duty ratio according to the number of first normal labels;
for each spectrum image, if the first normal label duty ratio is greater than or equal to a first duty ratio threshold value, determining that the target class label corresponding to the spectrum image is a second normal label;
for each spectrum image, if the first normal label duty ratio is smaller than a first duty ratio threshold value, determining that the target class label corresponding to the spectrum image is a second abnormal label.
In one or more embodiments, a method for determining a target class label based on label duty ratio is presented. As can be seen from the foregoing embodiments, the classification label can be determined according to the class probability distribution of a spectrum sub-image; for example, a class probability distribution of "1" indicates that the classification label is the first normal label, and a class probability distribution of "0" indicates that the classification label is the first abnormal label. Based on this, for one spectrum image, the total number of first normal labels among the M classification labels is counted.
Specifically, taking one spectrum image as an example, the first normal label duty ratio may be calculated according to the total number of first normal labels among the M classification labels. Assuming that M is 10 and the total number of first normal labels is 8, the first normal label duty ratio is 80%. Assuming that the first duty ratio threshold is 60%: if the first normal label duty ratio is greater than or equal to the first duty ratio threshold, the target class label of the spectrum image is the second normal label; conversely, if the first normal label duty ratio is less than the first duty ratio threshold, the target class label of the spectrum image is the second abnormal label.
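The duty-ratio decision rule in this example can be sketched as follows; the label encoding (1 for the first normal label, 0 for the first abnormal label) and the 60% threshold follow the example above:

```python
def target_label_from_ratio(labels, ratio_threshold=0.6):
    """labels: per-sub-image classification labels,
    1 = first normal label, 0 = first abnormal label."""
    normal_ratio = sum(labels) / len(labels)            # first normal label duty ratio
    return ("second normal label" if normal_ratio >= ratio_threshold
            else "second abnormal label")

# Example above: M = 10 sub-images, 8 first normal labels -> 80% >= 60%.
result = target_label_from_ratio([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
```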
In the embodiment of the present application, a method of determining the target class label based on the label duty ratio is provided. By adopting the above manner, whether the target class label belongs to the second normal label is determined from the duty ratio dimension, thereby improving the feasibility of the scheme.
Optionally, in another optional embodiment provided in the embodiment of the present application, for each spectrum image, the determining, according to the class probability distribution of each spectrum sub-image, the target class label corresponding to the spectrum image may specifically include:
for each spectrum image, determining a corresponding classification label of each spectrum sub-image according to the classification probability distribution of each spectrum sub-image;
for each spectrum image, determining the number of first normal labels according to the corresponding classification label of each spectrum sub-image;
for each spectrum image, determining a first normal label duty ratio according to the number of first normal labels;
for each spectrum image, determining the average confidence coefficient belonging to the first normal label according to the class probability distribution of each spectrum sub-image;
if the duty ratio of the normal label is larger than or equal to the first duty ratio threshold value and the average confidence coefficient is larger than or equal to the confidence coefficient threshold value, determining the target class label corresponding to the frequency spectrum image as a second normal label;
If the duty ratio of the normal label is larger than or equal to the first duty ratio threshold and the average confidence coefficient is smaller than the confidence coefficient threshold, determining that the target class label corresponding to the frequency spectrum image is a second abnormal label;
if the normal label duty ratio is smaller than the first duty ratio threshold and the average confidence coefficient is larger than or equal to the confidence coefficient threshold, determining that the target class label corresponding to the frequency spectrum image is a second abnormal label;
if the normal label duty ratio is smaller than the first duty ratio threshold and the average confidence coefficient is smaller than the confidence coefficient threshold, determining the target class label corresponding to the frequency spectrum image as a second abnormal label.
In one or more embodiments, a method for determining a target class label based on label duty ratio and confidence is presented. From the foregoing embodiments, it can be seen that, according to the class probability distribution of a spectrum sub-image, not only the corresponding classification label but also the confidence of belonging to the first normal label can be determined.
Specifically, taking one spectrum image as an example, the first normal label duty ratio may be calculated according to the total number of first normal labels among the M classification labels. Assuming that M is 10 and the total number of first normal labels is 8, the first normal label duty ratio is 80%. Assuming further that the average of the 8 class probabilities belonging to the first normal label is "0.8875", the first duty ratio threshold is 60%, and the confidence threshold is 0.8: since the first normal label duty ratio is greater than or equal to the first duty ratio threshold and the average confidence is greater than or equal to the confidence threshold, the target class label of the spectrum image is the second normal label. Conversely, if the normal label duty ratio is less than the first duty ratio threshold, the target class label of the spectrum image is the second abnormal label; likewise, if the average confidence is less than the confidence threshold, the target class label of the spectrum image is the second abnormal label.
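The combined ratio-and-confidence rule can be sketched as follows; the 0.5 cut-off for treating a probability as the first normal label and the concrete probability values are assumptions for illustration, chosen so that the 8 normal-label confidences average 0.8875 as in the example above:

```python
def target_label_from_ratio_and_confidence(probs, ratio_threshold=0.6,
                                           confidence_threshold=0.8):
    # Probabilities at or above 0.5 are treated as first normal labels (assumption).
    normal = [p for p in probs if p >= 0.5]
    ratio = len(normal) / len(probs)                    # first normal label duty ratio
    avg_conf = sum(normal) / len(normal) if normal else 0.0
    if ratio >= ratio_threshold and avg_conf >= confidence_threshold:
        return "second normal label"
    return "second abnormal label"

# Hypothetical M = 10 model outputs: 8 normal labels whose confidences average 0.8875.
probs = [1.0, 0.8, 0.9, 1.0, 0.8, 0.9, 0.8, 0.9, 0.2, 0.1]
label = target_label_from_ratio_and_confidence(probs)
```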
In the embodiment of the application, the target class label is determined based on the label duty ratio and the confidence. By adopting the above manner, whether the target class label belongs to the second normal label is judged from different dimensions, thereby improving the diversity and reliability of the judgment.
Optionally, based on the foregoing respective embodiments corresponding to fig. 3, in another optional embodiment provided in the present application, acquiring, by an audio detection model, a target class label corresponding to each spectrum image based on a target audio feature corresponding to each spectrum image may specifically include:
acquiring a first feature map of each spectrum image through a convolution network included in an audio detection model based on a target audio feature corresponding to each spectrum image;
acquiring a second feature map of each spectrum image through an activation network included in the audio detection model based on the first feature map of each spectrum image;
and acquiring a target category label corresponding to each spectrum image through a full connection layer included in the audio detection model based on the second feature map of each spectrum image.
In one or more embodiments, a manner of predicting a target probability value based on an audio detection model is presented. As can be seen from the foregoing embodiments, the audio data to be detected is processed to obtain N spectrum images, and a corresponding target audio feature is extracted from each spectrum image.
Illustratively, the audio detection model may include a convolutional network, and it is understood that the structure of the convolutional network is shown in fig. 6, which is not described herein.
Specifically, for ease of understanding, please refer to fig. 12, which is another schematic structural diagram of an audio detection model according to an embodiment of the present application; as shown in the drawing, the audio spectrogram is divided into N spectrum images. Based on this, the target audio features corresponding to the N spectrum images are respectively input into the convolution network in the audio detection model, and the first feature map of each spectrum image is output through the convolution network. It is understood that the convolution network may include the 6 network layers described in the foregoing embodiments, which are not described herein again. The first feature map of each spectrum image is taken as the input of the activation network, and the second feature map of each spectrum image is output through the activation network. It is understood that the activation network may include the 2 network layers described in the foregoing embodiments, which are not described herein again. The second feature map of each spectrum image is taken as the input of the fully connected layer, and the class probability distribution of each spectrum image is output through the fully connected layer. Thus, the target class label of each spectrum image may be determined based on its class probability distribution. It will be appreciated that N class probability distributions may be obtained for the audio data to be tested.
Next, in the embodiment of the present application, a manner of predicting a target probability value based on an audio detection model is provided, in which the target audio feature of each spectrum image is used as the input of the audio detection model, so as to output the class probability distribution of each spectrum image. Because the audio detection model includes convolutional calculation and is a feedforward neural network with a depth structure, the features of each spectrum image can be learned, thereby improving the prediction accuracy.
Optionally, in another optional embodiment provided in the embodiment of the present application, based on the respective embodiments corresponding to fig. 3, the determining, according to the target class label corresponding to each spectrum image, a detection result of the speaker monomer to be detected may specifically include:
determining the number of second normal labels according to the target class labels corresponding to each spectrum image;
determining a second normal label duty ratio according to the number of the second normal labels;
if the second normal label duty ratio is greater than or equal to a second duty ratio threshold value, determining that the detection result of the loudspeaker monomer to be detected is a normal result;
if the second normal label duty ratio is smaller than the second duty ratio threshold value, determining that the detection result of the to-be-detected loudspeaker monomer is an abnormal result.
In one or more embodiments, a manner of determining a detection result based on a target class label is presented. As can be seen from the foregoing embodiments, the target class label corresponding to each spectrum image can be determined according to the class label of each spectrum sub-image in each spectrum image. Based on the above, for the speaker monomer to be tested, the total number of the second normal tags in the N target class tags is counted.
Specifically, taking one speaker monomer to be tested as an example, the second normal label duty ratio can be calculated according to the total number of second normal labels among the N target class labels. Assuming that N is 100 and the total number of second normal labels is 90, the second normal label duty ratio is 90%. Assuming that the second duty ratio threshold is 80%: since the second normal label duty ratio is greater than or equal to the second duty ratio threshold, the detection result of the speaker monomer to be tested is a normal result; otherwise, if the second normal label duty ratio is less than the second duty ratio threshold, the detection result of the speaker monomer to be tested is an abnormal result.
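The aggregation step in this example can be sketched as follows; the label encoding (1 for the second normal label, 0 for the second abnormal label) and the 80% second duty ratio threshold follow the example above:

```python
def speaker_detection_result(target_labels, ratio_threshold=0.8):
    """Aggregate the N per-spectrum-image target class labels into the final
    detection result for one speaker monomer under test."""
    ratio = sum(target_labels) / len(target_labels)     # second normal label duty ratio
    return "normal" if ratio >= ratio_threshold else "abnormal"

# Example above: N = 100 spectrum images, 90 second normal labels -> 90% >= 80%.
labels = [1] * 90 + [0] * 10
result = speaker_detection_result(labels)
```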
Secondly, in the embodiment of the present application, a manner of determining a detection result based on a target class label is provided, by which whether the detection result belongs to a normal result is determined from a duty ratio dimension, thereby improving feasibility of a scheme.
With reference to the foregoing description, the method for audio detection in the present application will be described below. Referring to fig. 13, an embodiment of the audio detection method in the embodiment of the present application includes:
210. acquiring audio data to be tested;
in one or more embodiments, the audio detection device obtains audio data to be detected, where the audio data to be detected is data stored in a file form.
The audio detection device may be disposed in a server, a terminal device, or a system including the server and the terminal device, and is not limited herein.
220. Acquiring N frequency spectrum images according to audio data to be detected, wherein N is an integer greater than or equal to 1;
in one or more embodiments, the audio detection device converts the audio data to be detected into an audio spectrogram, which can be understood as a frequency distribution graph: the abscissa of the audio spectrogram is time, the ordinate is frequency, and the value at each coordinate point represents the audio energy. Since three-dimensional information is expressed on a two-dimensional plane, the magnitude of the energy value is represented by color, i.e., the darker the color, the stronger the audio energy at that point.
Specifically, if the audio data to be measured is dual-channel audio data, the audio data to be measured may be converted into a dual-channel audio spectrogram first, and then converted into a mono-channel audio spectrogram. Based on this, the audio spectrogram may be divided into at least one spectral image.
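A minimal sketch of this conversion and division, using a short-time Fourier transform with a Hann window; the FFT size, hop length, and frames-per-image values are assumptions, and the two channels are simply averaged down to one channel before the transform:

```python
import numpy as np

def to_spectrogram(samples, n_fft=512, hop=256):
    """Convert mono audio samples into a magnitude spectrogram:
    rows are frequency bins (ordinate), columns are time frames (abscissa)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(samples) - n_fft) // hop
    frames = np.stack([samples[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T      # (n_fft // 2 + 1, n_frames)

def split_spectrogram(spec, frames_per_image):
    """Divide the full audio spectrogram into N spectrum images of a fixed
    duration (frames_per_image frames each), dropping any remainder."""
    n = spec.shape[1] // frames_per_image
    return [spec[:, i * frames_per_image:(i + 1) * frames_per_image]
            for i in range(n)]

stereo = np.random.randn(32000, 2)          # dual-channel audio data (2 s at 16 kHz)
mono = stereo.mean(axis=1)                  # down-mix to a single channel
spec = to_spectrogram(mono)                 # full audio spectrogram
images = split_spectrogram(spec, 20)        # N spectrum images
```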
230. Extracting features of each of the N frequency spectrum images to obtain target audio features corresponding to each frequency spectrum image, wherein the target audio features comprise at least one type of audio features;
in one or more embodiments, the audio detection device may perform feature extraction processing on each spectrum image, to obtain a target audio feature corresponding to each spectrum image, that is, obtain N groups of target audio features. Wherein each set of target audio features includes at least one type of audio feature.
It should be noted that, the process of the audio feature processing may refer to the corresponding description of fig. 4, which is not repeated here.
240. Acquiring a target class label corresponding to each spectrum image through an audio detection model based on a target audio feature corresponding to each spectrum image;
in one or more embodiments, the audio detection apparatus uses the target audio feature corresponding to each spectrum image as an input of an audio detection model, and the audio detection model outputs a prediction result for each spectrum image. If the prediction result is probability distribution, determining a corresponding target class label according to the probability distribution, wherein the probability distribution can be probability distribution obtained for one spectrum image or probability distribution obtained for each spectrum sub-image in the same spectrum image.
In actual prediction, the target audio features may be directly input into the audio detection model, or the unit audio features obtained after framing may be sequentially input into the audio detection model, which is not limited herein.
250. And determining a detection result of the audio data to be detected according to the target class label corresponding to each frequency spectrum image.
In one or more embodiments, after the target class label corresponding to each spectrum image is obtained, the audio detection device may determine the detection result of the audio data to be detected according to the N target class labels. If the detection result of the audio data to be detected is an abnormal result, further processing can be adopted, for example, a related prompt is directly initiated.
In this embodiment of the application, an audio detection method is provided, by adopting the above manner, an audio detection model is introduced to detect audio data, so that the method is not only suitable for different detection environments, but also can detect audio data in batches, thereby reducing detection cost, improving detection efficiency and being beneficial to reducing detection difficulty of audio data.
Referring to fig. 14, fig. 14 is a schematic diagram illustrating an embodiment of a speaker unit detection apparatus according to an embodiment of the present application, and the speaker unit detection apparatus 30 includes:
The obtaining module 310 is configured to obtain audio data to be tested corresponding to the speaker monomer to be tested, where the audio data to be tested is audio data played by the speaker monomer to be tested;
the acquiring module 310 is further configured to acquire N spectrum images according to the audio data to be detected, where N is an integer greater than or equal to 1;
the extracting module 320 is configured to perform feature extraction on each of the N spectrum images to obtain a target audio feature corresponding to each spectrum image, where the target audio feature includes at least one type of audio feature;
the obtaining module 310 is further configured to obtain, based on the target audio feature corresponding to each spectrum image, a target class label corresponding to each spectrum image through an audio detection model;
the determining module 330 is configured to determine a detection result of the speaker monomer to be detected according to the target class label corresponding to each spectrum image.
In this application embodiment, a loudspeaker monomer detection device is provided. By adopting the device, the audio detection model is introduced to detect the loudspeaker monomers, so that the device is not only suitable for different detection environments, but also can detect the loudspeaker monomers in batches, thereby reducing the detection cost, improving the detection efficiency and being beneficial to reducing the detection difficulty of the loudspeaker monomers.
Alternatively, in another embodiment of the speaker unit detection apparatus 30 provided in the embodiment of the present application based on the embodiment corresponding to fig. 14,
the obtaining module 310 is specifically configured to convert the audio data to be tested into an audio spectrogram;
dividing an audio frequency spectrogram according to a first preset duration to obtain N frequency spectrum images;
the extracting module 320 is specifically configured to divide each spectrum image in the N spectrum images according to a second preset duration, so as to obtain M spectrum sub-images of each spectrum image, where the second preset duration is smaller than the first preset duration, and M is an integer greater than 1;
aiming at each spectrum sub-image of each spectrum image, acquiring the to-be-processed unit audio features of the spectrum sub-image;
and carrying out normalization processing on the to-be-processed unit audio features of each spectrum sub-image aiming at each spectrum sub-image of each spectrum image to obtain the unit audio features of the spectrum sub-image, wherein the unit audio features are contained in the target audio features.
In this application embodiment, a loudspeaker monomer detection device is provided. By adopting the device, the audio data to be detected is divided at a finer granularity, and the finer-grained audio features are extracted as the input of the model, so that finer details can be learned in the model training stage, and the accuracy and reliability of prediction can be improved in the model inference stage.
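A minimal sketch of the finer-grained division and normalization this paragraph refers to; per-sub-image z-score normalization is an assumption, as the source does not fix the normalization scheme:

```python
import numpy as np

def split_into_sub_images(spec_image, m):
    """Divide one spectrum image into M spectrum sub-images along time."""
    width = spec_image.shape[1] // m
    return [spec_image[:, i * width:(i + 1) * width] for i in range(m)]

def normalize_unit_feature(feature, eps=1e-8):
    """Normalize a to-be-processed unit audio feature to zero mean and
    unit variance (the exact normalization scheme is an assumption)."""
    return (feature - feature.mean()) / (feature.std() + eps)

rng = np.random.default_rng(1)
image = rng.normal(size=(64, 40))           # one spectrum image
subs = split_into_sub_images(image, m=10)   # M = 10 spectrum sub-images
units = [normalize_unit_feature(s) for s in subs]  # unit audio features
```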
Alternatively, in another embodiment of the speaker unit detection apparatus 30 provided in the embodiment of the present application based on the embodiment corresponding to fig. 14,
the extracting module 320 is specifically configured to extract the to-be-processed mel spectrum feature of the spectrum sub-image by adopting a first fast Fourier transform size;
extract the to-be-processed mel cepstrum coefficient feature of the spectrum sub-image by adopting a second fast Fourier transform size;
extract the to-be-processed zero crossing rate feature of the spectrum sub-image by adopting a third fast Fourier transform size;
extract the to-be-processed spectral flatness feature of the spectrum sub-image by adopting a fourth fast Fourier transform size;
and extract the to-be-processed spectral centroid feature of the spectrum sub-image by adopting a fifth fast Fourier transform size.
In this application embodiment, a loudspeaker monomer detection device is provided. With the device, the to-be-processed unit audio features are extracted from the time-domain and frequency-domain dimensions respectively, different spectrum sub-images can be distinguished based on these features, and the spectrum sub-images are divided into predefined types, such as the first normal label and the first abnormal label, by means of the audio detection model for classification. It can be seen that the type of a spectrum sub-image can be well identified using valid to-be-processed unit audio features.
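Several of the listed features can be computed directly from their definitions; the sketch below illustrates the zero crossing rate (time domain), spectral flatness, and spectral centroid (frequency domain) with NumPy, while the mel spectrum and mel cepstrum coefficient features would typically come from an audio library such as librosa. The sample rate and FFT size are assumptions:

```python
import numpy as np

def frame_features(frame, sr=16000, n_fft=512):
    # Zero crossing rate: fraction of adjacent samples whose sign differs (time domain).
    signs = np.signbit(frame).astype(np.int8)
    zcr = np.mean(np.abs(np.diff(signs)))
    # Magnitude spectrum of the frame at the chosen fast Fourier transform size.
    mag = np.abs(np.fft.rfft(frame, n=n_fft))
    eps = 1e-10
    power = mag ** 2 + eps
    # Spectral flatness: geometric mean over arithmetic mean of the power spectrum
    # (near 0 for tonal signals, near 1 for noise-like signals).
    flatness = np.exp(np.mean(np.log(power))) / np.mean(power)
    # Spectral centroid: magnitude-weighted mean frequency.
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    centroid = np.sum(freqs * mag) / (np.sum(mag) + eps)
    return zcr, flatness, centroid

tone = np.sin(2 * np.pi * 1000 * np.arange(512) / 16000)  # 1 kHz sine at 16 kHz
zcr, flatness, centroid = frame_features(tone)            # tonal: low flatness, centroid near 1 kHz
```

A pure tone gives a low flatness and a centroid near its frequency, while noise from an abnormal speaker monomer would show higher flatness, which is why such features help separate the classes.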
Optionally, in another embodiment of the speaker unit detection apparatus 30 provided in the embodiment of the present application based on the embodiment corresponding to fig. 14, the speaker unit detection apparatus 30 further includes an encoding module 340;
the obtaining module 310 is further configured to obtain format data corresponding to the audio data to be tested, where the format data includes a bit depth and a sampling rate;
the encoding module 340 is configured to encode the format data to obtain a bit depth feature and a sampling rate feature;
the obtaining module 310 is specifically configured to obtain, through an audio detection model, a target class label corresponding to each spectrum image based on the target audio feature, the bit depth feature, and the sampling rate feature corresponding to each spectrum image.
In this embodiment of the present application, a speaker unit detection apparatus is provided. With this apparatus, the bit depth feature and the sampling rate feature serve as additional bases for predicting the target class label, which enriches the features and improves the accuracy and reliability of prediction.
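The filing does not fix how the format data is encoded into features. One plausible scheme, sketched below, is one-hot coding over the bit depths and sampling rates expected in practice; the vocabularies and function names are assumptions for illustration only.

```python
# Hypothetical vocabularies of formats seen in practice (assumed, not
# specified by the filing).
BIT_DEPTHS = [8, 16, 24, 32]
SAMPLE_RATES = [8000, 16000, 44100, 48000]

def one_hot(value, vocabulary):
    """Return a one-hot vector marking the position of value in vocabulary."""
    vec = [0.0] * len(vocabulary)
    vec[vocabulary.index(value)] = 1.0
    return vec

def encode_format(bit_depth, sample_rate):
    """Encode the audio format data as a bit depth feature and a sampling
    rate feature, ready to be concatenated with the target audio feature."""
    return one_hot(bit_depth, BIT_DEPTHS), one_hot(sample_rate, SAMPLE_RATES)
```

For example, 16-bit audio at 44.1 kHz encodes to the pair `[0, 1, 0, 0]` and `[0, 0, 1, 0]`.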
Optionally, in another embodiment of the speaker unit detection apparatus 30 provided in the embodiments of the present application based on the embodiment corresponding to fig. 14, each spectrum image includes M spectrum sub-images, the target audio feature includes M unit audio features, the unit audio features correspond to the spectrum sub-images, and M is an integer greater than 1;
the obtaining module 310 is specifically configured to obtain, for each spectrum image, a first feature map of each spectrum sub-image through a convolution network included in the audio detection model, based on the unit audio feature of each spectrum sub-image;
obtain, for each spectrum image, a second feature map of each spectrum sub-image through an activation network included in the audio detection model, based on the first feature map of each spectrum sub-image;
obtain, for each spectrum image, a feature vector of each spectrum sub-image through a time-series network included in the audio detection model, based on the second feature map of each spectrum sub-image;
obtain, for each spectrum image, a class probability distribution of each spectrum sub-image through a fully connected layer included in the audio detection model, based on the feature vector of each spectrum sub-image;
and determine, for each spectrum image, a target class label corresponding to the spectrum image according to the class probability distribution of each spectrum sub-image.
In this embodiment of the present application, a speaker unit detection apparatus is provided. With this apparatus, the unit audio feature of each spectrum sub-image frame in the spectrum image is used as the input of the audio detection model, and a class probability distribution is output for each frame. Because the audio detection model combines convolutional and recurrent computation in a deep network structure, it can learn both the features within each spectrum sub-image and the relationships between spectrum sub-images, thereby improving prediction accuracy.
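The convolution, activation, time-series, and fully connected stages described above can be sketched as a toy forward pass. The layer sizes, the per-frame linear "convolution", and the tanh recurrence below are illustrative stand-ins chosen for brevity, not the filing's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class TinyCrnn:
    """Toy conv -> activation -> recurrent -> fully connected stack.

    Shapes are assumptions; this shows only the data flow the text describes.
    """

    def __init__(self, feat_dim=8, hidden=4, classes=2):
        self.conv_w = rng.standard_normal((feat_dim, feat_dim)) * 0.1
        self.rnn_w = rng.standard_normal((hidden, feat_dim + hidden)) * 0.1
        self.fc_w = rng.standard_normal((classes, hidden)) * 0.1

    def forward(self, unit_features):
        """unit_features: (M, feat_dim), one row per spectrum sub-image."""
        h = np.zeros(self.rnn_w.shape[0])
        probs = []
        for frame in unit_features:
            fmap1 = self.conv_w @ frame                # convolution network
            fmap2 = np.maximum(fmap1, 0.0)             # activation network (ReLU)
            # time-series network: hidden state carries context across frames
            h = np.tanh(self.rnn_w @ np.concatenate([fmap2, h]))
            probs.append(softmax(self.fc_w @ h))       # fully connected layer
        return np.stack(probs)                          # (M, classes)
```

Each row of the output is a class probability distribution for one sub-image frame, which is exactly what the label-decision steps below consume.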
Optionally, in another embodiment of the speaker unit detection apparatus 30 provided in the embodiments of the present application based on the embodiment corresponding to fig. 14, each spectrum image includes M spectrum sub-images, the target audio feature includes M unit audio features, the unit audio features correspond to the spectrum sub-images, and M is an integer greater than 1;
the obtaining module 310 is specifically configured to obtain, for each spectrum image, a first feature map of each spectrum sub-image through a convolution network included in the audio detection model, based on the unit audio feature of each spectrum sub-image;
obtain, for each spectrum image, a second feature map of each spectrum sub-image through an activation network included in the audio detection model, based on the first feature map of each spectrum sub-image;
obtain, for each spectrum image, a class probability distribution of each spectrum sub-image through a fully connected layer included in the audio detection model, based on the second feature map of each spectrum sub-image;
and determine, for each spectrum image, a target class label corresponding to the spectrum image according to the class probability distribution of each spectrum sub-image.
In this embodiment of the present application, a speaker unit detection apparatus is provided. With this apparatus, the unit audio feature of each spectrum sub-image frame in the spectrum image is used as the input of the audio detection model, and a class probability distribution is output for each frame. Because the audio detection model is a deep feedforward network with convolutional computation, it can learn the features of each spectrum sub-image, thereby improving prediction accuracy.
Optionally, in another embodiment of the speaker unit detection apparatus 30 provided in the embodiments of the present application based on the embodiment corresponding to fig. 14, the speaker unit detection apparatus 30 further includes a processing module 350 and a training module 360;
the obtaining module 310 is further configured to obtain an audio data sample, where the audio data sample corresponds to a labeled class label;
the obtaining module 310 is further configured to obtain P spectrum image samples according to the audio data sample, where P is an integer greater than or equal to 1;
the obtaining module 310 is further configured to divide each spectrum image sample in the P spectrum image samples according to a second preset duration, to obtain M spectrum sub-image samples corresponding to each spectrum image sample;
the obtaining module 310 is further configured to obtain, for each spectrum sub-image sample of each spectrum image sample, a to-be-processed unit audio feature of the spectrum sub-image sample;
the processing module 350 is configured to normalize, for each spectrum sub-image sample of each spectrum image sample, the to-be-processed unit audio feature of the spectrum sub-image sample to obtain a unit audio feature of the spectrum sub-image sample;
the obtaining module 310 is further configured to obtain, for each spectrum image sample, a class probability distribution of each spectrum sub-image sample through the audio detection model, based on the unit audio features of the M spectrum sub-image samples;
and the training module 360 is configured to update, for each spectrum image sample, the model parameters of the audio detection model according to the labeled class label and the class probability distribution of each spectrum sub-image sample.
In this embodiment of the present application, a speaker unit detection apparatus is provided. With this apparatus, audio data samples are used to train the audio detection model, so that its model parameters can be continuously optimized for a better fit, thereby improving detection accuracy.
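A minimal stand-in for the training procedure above: normalize the to-be-processed unit audio features, run a per-frame classifier, and update its parameters from a cross-entropy loss against the labeled class label. Per-frame logistic regression replaces the real audio detection model here, and applying the image-level label to each of its M sub-image frames is an assumption made for the sketch.

```python
import numpy as np

def normalize(features):
    """Min-max normalize to-be-processed unit audio features per dimension."""
    lo, hi = features.min(axis=0), features.max(axis=0)
    return (features - lo) / (hi - lo + 1e-9)

def train_step(w, unit_features, label, lr=0.1):
    """One gradient update on a single spectrum image sample.

    w: (D,) classifier weights; unit_features: (M, D) raw frame features;
    label: 1.0 for a normal sample, 0.0 for an abnormal one.
    Returns the updated weights and the cross-entropy loss.
    """
    x = normalize(unit_features)                  # (M, D)
    logits = x @ w                                # (M,)
    p = 1.0 / (1.0 + np.exp(-logits))             # P(normal) per sub-image
    grad = x.T @ (p - label) / len(x)             # cross-entropy gradient
    loss = -np.mean(label * np.log(p + 1e-9)
                    + (1 - label) * np.log(1 - p + 1e-9))
    return w - lr * grad, loss
```

Iterating `train_step` over labeled samples drives the loss down, which is the continuous parameter optimization the paragraph refers to.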
Optionally, in another embodiment of the speaker unit detection apparatus 30 provided in the embodiments of the present application based on the embodiment corresponding to fig. 14,
the obtaining module 310 is specifically configured to determine, for each spectrum image, a classification label of each spectrum sub-image according to the class probability distribution of each spectrum sub-image;
determine, for each spectrum image, the number of first normal labels according to the classification label of each spectrum sub-image;
determine, for each spectrum image, a first normal label proportion according to the number of first normal labels;
determine, for each spectrum image, that the target class label corresponding to the spectrum image is the second normal label if the first normal label proportion is greater than or equal to a first proportion threshold;
and determine, for each spectrum image, that the target class label corresponding to the spectrum image is the second abnormal label if the first normal label proportion is less than the first proportion threshold.
In this embodiment of the present application, a speaker unit detection apparatus is provided. With this apparatus, whether the target class label is the second normal label is judged in the proportion dimension, which improves the feasibility of the solution.
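The proportion-based decision can be sketched in a few lines. Here "normal" and "abnormal" stand in for the first/second normal and abnormal labels, and the 0.5 default is an assumed value for the first proportion threshold, which the filing leaves unspecified.

```python
def classify_spectrum_image(frame_labels, ratio_threshold=0.5):
    """Decide the image-level label from the per-sub-image labels.

    frame_labels: list of "normal" / "abnormal" strings, one per sub-image.
    Returns "normal" when the normal-label proportion reaches the threshold.
    """
    normal_ratio = frame_labels.count("normal") / len(frame_labels)
    return "normal" if normal_ratio >= ratio_threshold else "abnormal"
```

Note the boundary case: a proportion exactly equal to the threshold counts as normal, matching the "greater than or equal to" wording.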
Optionally, in another embodiment of the speaker unit detection apparatus 30 provided in the embodiments of the present application based on the embodiment corresponding to fig. 14,
the obtaining module 310 is specifically configured to determine, for each spectrum image, a classification label of each spectrum sub-image according to the class probability distribution of each spectrum sub-image;
determine, for each spectrum image, the number of first normal labels according to the classification label of each spectrum sub-image;
determine, for each spectrum image, a first normal label proportion according to the number of first normal labels;
determine, for each spectrum image, the average confidence of belonging to the first normal label according to the class probability distribution of each spectrum sub-image;
determine that the target class label corresponding to the spectrum image is the second normal label if the first normal label proportion is greater than or equal to the first proportion threshold and the average confidence is greater than or equal to a confidence threshold;
determine that the target class label corresponding to the spectrum image is the second abnormal label if the first normal label proportion is greater than or equal to the first proportion threshold and the average confidence is less than the confidence threshold;
determine that the target class label corresponding to the spectrum image is the second abnormal label if the first normal label proportion is less than the first proportion threshold and the average confidence is greater than or equal to the confidence threshold;
and determine that the target class label corresponding to the spectrum image is the second abnormal label if the first normal label proportion is less than the first proportion threshold and the average confidence is less than the confidence threshold.
In this embodiment of the present application, a speaker unit detection apparatus is provided. With this apparatus, whether the target class label is the second normal label is judged from different dimensions, which improves the diversity and reliability of the judgment.
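The four-branch rule above reduces to a single conjunction: the image is normal only when both the proportion test and the confidence test pass. A sketch, with both threshold defaults assumed rather than specified by the filing:

```python
def classify_with_confidence(frame_probs, ratio_threshold=0.5,
                             confidence_threshold=0.8):
    """Image-level label from per-frame P(normal) values.

    frame_probs: list of probabilities that each sub-image is normal.
    A frame counts as "normal" when its probability is at least 0.5; the
    image is normal only when the normal-label proportion AND the average
    normal confidence both clear their thresholds.
    """
    labels = ["normal" if p >= 0.5 else "abnormal" for p in frame_probs]
    normal_ratio = labels.count("normal") / len(labels)
    avg_confidence = sum(frame_probs) / len(frame_probs)
    if normal_ratio >= ratio_threshold and avg_confidence >= confidence_threshold:
        return "normal"
    return "abnormal"
```

This makes explicit why three of the four branches yield the second abnormal label: failing either test is sufficient for an abnormal verdict.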
Optionally, in another embodiment of the speaker unit detection apparatus 30 provided in the embodiments of the present application based on the embodiment corresponding to fig. 14,
the obtaining module 310 is specifically configured to obtain a first feature map of each spectrum image through a convolution network included in the audio detection model, based on the target audio feature corresponding to each spectrum image;
obtain a second feature map of each spectrum image through an activation network included in the audio detection model, based on the first feature map of each spectrum image;
and obtain a target class label corresponding to each spectrum image through a fully connected layer included in the audio detection model, based on the second feature map of each spectrum image.
In this embodiment of the present application, a speaker unit detection apparatus is provided. With this apparatus, the target audio feature of each spectrum image is used as the input of the audio detection model, and a class probability distribution is output for each spectrum image. Because the audio detection model is a deep feedforward network with convolutional computation, it can learn the features of each spectrum image, thereby improving prediction accuracy.
Optionally, in another embodiment of the speaker unit detection apparatus 30 provided in the embodiments of the present application based on the embodiment corresponding to fig. 14,
the determining module 330 is specifically configured to determine the number of second normal labels according to the target class label corresponding to each spectrum image;
determine a second normal label proportion according to the number of second normal labels;
determine that the detection result of the speaker unit to be detected is a normal result if the second normal label proportion is greater than or equal to a second proportion threshold;
and determine that the detection result of the speaker unit to be detected is an abnormal result if the second normal label proportion is less than the second proportion threshold.
In this embodiment of the present application, a speaker unit detection apparatus is provided. With this apparatus, whether the detection result is a normal result is judged in the proportion dimension, which improves the feasibility of the solution.
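Chaining the first-stage (per-image) and second-stage (per-unit) proportion tests gives the overall verdict for a speaker unit. Both default thresholds below are assumed values standing in for the first and second proportion thresholds of the filing.

```python
def detect_speaker_unit(per_image_frame_labels, first_threshold=0.5,
                        second_threshold=0.5):
    """Two-stage aggregation: frame labels -> image labels -> unit verdict.

    per_image_frame_labels: one list of "normal"/"abnormal" frame labels
    per spectrum image. Returns "normal" or "abnormal" for the unit.
    """
    image_labels = []
    for frame_labels in per_image_frame_labels:
        # Stage 1: label each spectrum image by its normal-frame proportion.
        ratio = frame_labels.count("normal") / len(frame_labels)
        image_labels.append("normal" if ratio >= first_threshold else "abnormal")
    # Stage 2: label the unit by its normal-image proportion.
    unit_ratio = image_labels.count("normal") / len(image_labels)
    return "normal" if unit_ratio >= second_threshold else "abnormal"
```

For example, three images whose frames vote normal, split, and abnormal yield image labels normal/normal/abnormal, and a unit-level proportion of 2/3 passes the second threshold.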
Referring to fig. 15, fig. 15 is a schematic diagram of an embodiment of an audio detection device according to an embodiment of the present application, and the audio detection device 40 includes:
an obtaining module 410, configured to obtain audio data to be detected;
the obtaining module 410 is further configured to obtain N spectrum images according to the audio data to be detected, where N is an integer greater than or equal to 1;
an extracting module 420, configured to perform feature extraction on each of the N spectrum images to obtain a target audio feature corresponding to each spectrum image, where the target audio feature includes at least one type of audio feature;
the obtaining module 410 is further configured to obtain, through an audio detection model, a target class label corresponding to each spectrum image based on the target audio feature corresponding to each spectrum image;
and the determining module 430 is configured to determine a detection result of the audio data to be detected according to the target class label corresponding to each spectrum image.
In this embodiment of the present application, an audio detection device is provided. By introducing the audio detection model, this device not only adapts to different detection environments but can also detect audio data in batches, which reduces detection cost, improves detection efficiency, and helps lower the difficulty of audio data detection.
The embodiments of the present application further provide another speaker unit detection device and another audio detection device, which may be deployed in a terminal device. As shown in fig. 16, for convenience of explanation, only the portion related to the embodiments of the present application is shown; for specific technical details that are not disclosed, please refer to the method portion of the embodiments of the present application. The terminal device may be any terminal device, including a mobile phone, a tablet computer, a personal digital assistant (Personal Digital Assistant, PDA), a point of sale (Point of Sales, POS) terminal, a vehicle-mounted computer, and the like. The following takes a mobile phone as an example:
Fig. 16 is a block diagram of part of the structure of a mobile phone related to the terminal device provided in an embodiment of the present application. Referring to fig. 16, the mobile phone includes: radio frequency (Radio Frequency, RF) circuitry 510, a memory 520, an input unit 530, a display unit 540, a sensor 550, audio circuitry 560, a wireless fidelity (wireless fidelity, WiFi) module 570, a processor 580, and a power supply 590. Those skilled in the art will appreciate that the handset structure shown in fig. 16 does not limit the handset, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The following describes the components of the mobile phone in detail with reference to fig. 16:
The RF circuit 510 may be used to receive and transmit signals during the sending and receiving of messages or during a call. In particular, after receiving downlink information from a base station, the RF circuit 510 passes it to the processor 580 for processing; in addition, it sends uplink data to the base station. Typically, the RF circuit 510 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (Low Noise Amplifier, LNA), a duplexer, and the like. In addition, the RF circuit 510 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to the global system for mobile communications (Global System of Mobile communication, GSM), general packet radio service (General Packet Radio Service, GPRS), code division multiple access (Code Division Multiple Access, CDMA), wideband code division multiple access (Wideband Code Division Multiple Access, WCDMA), long term evolution (Long Term Evolution, LTE), email, short message service (Short Messaging Service, SMS), and the like.
The memory 520 may be used to store software programs and modules, and the processor 580 performs the various functional applications and data processing of the handset by running the software programs and modules stored in the memory 520. The memory 520 may mainly include a program storage area and a data storage area: the program storage area may store an operating system, application programs required by at least one function (such as a sound playing function and an image playing function), and the like; the data storage area may store data created according to the use of the handset (such as audio data and a phonebook), and the like. In addition, the memory 520 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The input unit 530 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the handset. In particular, the input unit 530 may include a touch panel 531 and other input devices 532. The touch panel 531, also referred to as a touch screen, may collect touch operations by a user on or near it (for example, operations performed by the user on or near the touch panel 531 with a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection device according to a preset program. Optionally, the touch panel 531 may include two parts: a touch detection device and a touch controller. The touch detection device detects the position touched by the user, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and sends them to the processor 580, and can also receive and execute commands sent by the processor 580. In addition, the touch panel 531 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch panel 531, the input unit 530 may include other input devices 532. In particular, the other input devices 532 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, and a joystick.
The display unit 540 may be used to display information input by the user or information provided to the user, as well as the various menus of the handset. The display unit 540 may include a display panel 541; optionally, the display panel 541 may be configured in the form of a liquid crystal display (Liquid Crystal Display, LCD), an organic light-emitting diode (Organic Light-Emitting Diode, OLED), or the like. Further, the touch panel 531 may cover the display panel 541. When the touch panel 531 detects a touch operation on or near it, the operation is transferred to the processor 580 to determine the type of the touch event, and the processor 580 then provides a corresponding visual output on the display panel 541 according to the type of the touch event. Although in fig. 16 the touch panel 531 and the display panel 541 serve as two independent components to implement the input and output functions of the handset, in some embodiments the touch panel 531 and the display panel 541 may be integrated to implement those input and output functions.
The handset may also include at least one sensor 550, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor: the ambient light sensor may adjust the brightness of the display panel 541 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 541 and/or the backlight when the handset is moved to the ear. As one kind of motion sensor, an accelerometer can detect the magnitude of acceleration in all directions (generally on three axes) and the magnitude and direction of gravity when stationary, and can be used in applications that recognize the posture of the handset (such as landscape/portrait switching, related games, and magnetometer attitude calibration) and in vibration-recognition functions (such as a pedometer and tap detection). Other sensors that may also be configured on the handset, such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, are not described in detail here.
The audio circuit 560, speaker 561, and microphone 562 may provide an audio interface between the user and the handset. The audio circuit 560 may transmit an electrical signal, converted from received audio data, to the speaker 561, which converts it into a sound signal for output; conversely, the microphone 562 converts collected sound signals into electrical signals, which the audio circuit 560 receives and converts into audio data. The audio data is then processed by the processor 580 and either sent via the RF circuit 510 to, for example, another mobile phone, or output to the memory 520 for further processing.
WiFi is a short-range wireless transmission technology. Through the WiFi module 570, the handset can help the user send and receive emails, browse web pages, access streaming media, and so on, providing wireless broadband Internet access. Although fig. 16 shows the WiFi module 570, it is understood that the module is not an essential part of the handset and may be omitted as needed without changing the essence of the invention.
The processor 580 is the control center of the handset. It connects the various parts of the entire handset using various interfaces and lines, and performs the various functions and data processing of the handset by running or executing the software programs and/or modules stored in the memory 520 and invoking the data stored in the memory 520, thereby monitoring the handset as a whole. Optionally, the processor 580 may include one or more processing units; optionally, the processor 580 may integrate an application processor, which mainly handles the operating system, user interfaces, applications, and the like, and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor may also not be integrated into the processor 580.
The handset further includes a power supply 590 (such as a battery) for powering the various components. Optionally, the power supply may be logically connected to the processor 580 via a power management system, so that functions such as charging, discharging, and power consumption management are implemented through the power management system.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which will not be described herein.
The steps performed by the terminal device in the above-described embodiments may be based on the terminal device structure shown in fig. 16.
The embodiments of the present application further provide another speaker unit detection apparatus and another audio detection apparatus, which may be deployed in a server. Fig. 17 is a schematic diagram of a server structure provided in an embodiment of the present application. The server 600 may vary considerably in configuration or performance, and may include one or more central processing units (central processing units, CPU) 622 (e.g., one or more processors), a memory 632, and one or more storage media 630 (e.g., one or more mass storage devices) storing application programs 642 or data 644. The memory 632 and the storage medium 630 may be transitory or persistent storage. The program stored on the storage medium 630 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the central processing unit 622 may be configured to communicate with the storage medium 630 and execute, on the server 600, the series of instruction operations in the storage medium 630.
The server 600 may also include one or more power supplies 626, one or more wired or wireless network interfaces 650, one or more input/output interfaces 658, and/or one or more operating systems 641, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 17.
Also provided in embodiments of the present application is a computer-readable storage medium having a computer program stored therein, which when run on a computer, causes the computer to perform the methods as described in the foregoing embodiments.
Also provided in embodiments of the present application is a computer program product comprising a program which, when run on a computer, causes the computer to perform the methods described in the foregoing embodiments.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such an understanding, the essence of the technical solution of the present application, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (read-only memory, ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disc.
The above embodiments are merely for illustrating the technical solution of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (17)

1. A speaker unit detection method, characterized by comprising:
obtaining audio data to be tested corresponding to a speaker unit to be tested, wherein the audio data to be tested is audio data played by the speaker unit to be tested;
obtaining N spectrum images according to the audio data to be tested, wherein N is an integer greater than or equal to 1;
performing feature extraction on each spectrum image in the N spectrum images to obtain a target audio feature corresponding to each spectrum image, wherein the target audio feature comprises at least one type of audio feature;
obtaining, through an audio detection model, a target class label corresponding to each spectrum image based on the target audio feature corresponding to each spectrum image;
and determining a detection result of the speaker unit to be tested according to the target class label corresponding to each spectrum image.
2. The detection method according to claim 1, wherein the obtaining N spectrum images according to the audio data to be tested comprises:
converting the audio data to be tested into an audio spectrogram;
dividing the audio spectrogram according to a first preset duration to obtain the N spectrum images;
wherein the performing feature extraction on each spectrum image in the N spectrum images to obtain the target audio feature corresponding to each spectrum image comprises:
dividing each spectrum image in the N spectrum images according to a second preset duration to obtain M spectrum sub-images of each spectrum image, wherein the second preset duration is less than the first preset duration, and M is an integer greater than 1;
obtaining, for each spectrum sub-image of each spectrum image, a to-be-processed unit audio feature of the spectrum sub-image;
and normalizing, for each spectrum sub-image of each spectrum image, the to-be-processed unit audio feature of the spectrum sub-image to obtain a unit audio feature of the spectrum sub-image, wherein the unit audio feature is included in the target audio feature.
3. The detection method according to claim 2, wherein the acquiring the to-be-processed unit audio feature of the spectrum sub-image comprises:
extracting a to-be-processed mel spectrum feature of the spectrum sub-image using a first fast Fourier transform (FFT) size;
extracting a to-be-processed mel-frequency cepstral coefficient feature of the spectrum sub-image using a second FFT size;
extracting a to-be-processed zero-crossing-rate feature of the spectrum sub-image using a third FFT size;
extracting a to-be-processed spectral flatness feature of the spectrum sub-image using a fourth FFT size;
and extracting a to-be-processed spectral centroid feature of the spectrum sub-image using a fifth FFT size.
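The distinguishing point of claim 3 is that each of the five feature types is extracted with its own FFT size. A minimal sketch of that dispatch is below; the extractor functions are stubs and the FFT sizes are invented for illustration (a real pipeline might wrap a signal-processing library such as librosa), so only the "one n_fft per feature type" structure reflects the claim:

```python
def extract_unit_features(sub_image, extractors):
    """Run every registered extractor on one spectrum sub-image, threading
    the extractor's own FFT size through, and collect named features."""
    return {name: fn(sub_image, n_fft=n_fft)
            for name, (n_fft, fn) in extractors.items()}

def _stub(sub_image, n_fft):
    # Placeholder extractor: a real one would window the signal with
    # n_fft samples and compute the named feature. Returning n_fft
    # just makes the dispatch observable.
    return float(n_fft)

# Illustrative registry: five feature types, five distinct FFT sizes.
EXTRACTORS = {
    "mel_spectrum":       (2048, _stub),
    "mfcc":               (1024, _stub),
    "zero_crossing_rate": (512,  _stub),
    "spectral_flatness":  (256,  _stub),
    "spectral_centroid":  (128,  _stub),
}
```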
4. The detection method according to claim 1, wherein the method further comprises:
acquiring format data corresponding to the audio data to be detected, wherein the format data comprises a bit depth and a sampling rate;
and encoding the format data to obtain a bit depth feature and a sampling rate feature;
wherein the acquiring, through an audio detection model, the target class label corresponding to each spectrum image based on the target audio feature corresponding to each spectrum image comprises:
acquiring the target class label corresponding to each spectrum image through the audio detection model based on the target audio feature, the bit depth feature, and the sampling rate feature corresponding to each spectrum image.
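Claim 4 only states that the format data is "encoded" into features; one plausible reading is a one-hot encoding over a vocabulary of known bit depths and sampling rates. A sketch under that assumption (the vocabularies below are invented for illustration):

```python
def encode_format(bit_depth, sample_rate,
                  depth_vocab=(8, 16, 24, 32),
                  rate_vocab=(8000, 16000, 44100, 48000)):
    """One-hot encode the format data of the audio under test.
    Returns (bit_depth_feature, sampling_rate_feature); unknown values
    encode to all-zero vectors."""
    depth_feat = [1.0 if bit_depth == d else 0.0 for d in depth_vocab]
    rate_feat = [1.0 if sample_rate == r else 0.0 for r in rate_vocab]
    return depth_feat, rate_feat
```

These two vectors would then be concatenated with the target audio feature before being fed to the detection model.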
5. The detection method according to claim 1, wherein each spectrum image comprises M spectrum sub-images, the target audio feature comprises M unit audio features, the unit audio features correspond to the spectrum sub-images, and M is an integer greater than 1;
wherein the acquiring, through an audio detection model, the target class label corresponding to each spectrum image based on the target audio feature corresponding to each spectrum image comprises:
for each spectrum image, acquiring a first feature map of each spectrum sub-image through a convolutional network included in the audio detection model based on the unit audio feature of each spectrum sub-image;
for each spectrum image, acquiring a second feature map of each spectrum sub-image through an activation network included in the audio detection model based on the first feature map of each spectrum sub-image;
for each spectrum image, acquiring a feature vector of each spectrum sub-image through a time-sequence network included in the audio detection model based on the second feature map of each spectrum sub-image;
for each spectrum image, acquiring a class probability distribution of each spectrum sub-image through a fully connected layer included in the audio detection model based on the feature vector of each spectrum sub-image;
and for each spectrum image, determining the target class label corresponding to the spectrum image according to the class probability distribution of each spectrum sub-image.
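The per-sub-image pipeline of claim 5 (convolutional network → activation network → time-sequence network → fully connected layer → class probability distribution) can be illustrated with toy pure-Python stages. The kernel, weights, and the mean-over-time stand-in for the time-sequence network are placeholders, not the patent's actual model:

```python
import math

def conv1d(seq, kernel):
    """Valid 1-D convolution over a feature sequence (first feature map)."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]

def relu(xs):
    """Activation network stand-in (second feature map)."""
    return [max(0.0, x) for x in xs]

def temporal_mean(xs):
    """Stand-in for the time-sequence network: collapse the time axis
    of the second feature map into a single value."""
    return sum(xs) / len(xs)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classify_sub_image(features, kernel=(0.5, 0.5),
                       w=((1.0,), (-1.0,)), b=(0.0, 0.0)):
    """Return a [p_normal, p_abnormal] distribution for one sub-image."""
    fmap1 = conv1d(features, kernel)   # convolutional network
    fmap2 = relu(fmap1)                # activation network
    vec = temporal_mean(fmap2)         # time-sequence network
    logits = [w[c][0] * vec + b[c] for c in range(2)]
    return softmax(logits)             # fully connected layer + softmax
```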
6. The detection method according to claim 1, wherein each spectrum image comprises M spectrum sub-images, the target audio feature comprises M unit audio features, the unit audio features correspond to the spectrum sub-images, and M is an integer greater than 1;
wherein the acquiring, through an audio detection model, the target class label corresponding to each spectrum image based on the target audio feature corresponding to each spectrum image comprises:
for each spectrum image, acquiring a first feature map of each spectrum sub-image through a convolutional network included in the audio detection model based on the unit audio feature of each spectrum sub-image;
for each spectrum image, acquiring a second feature map of each spectrum sub-image through an activation network included in the audio detection model based on the first feature map of each spectrum sub-image;
for each spectrum image, acquiring a class probability distribution of each spectrum sub-image through a fully connected layer included in the audio detection model based on the second feature map of each spectrum sub-image;
and for each spectrum image, determining the target class label corresponding to the spectrum image according to the class probability distribution of each spectrum sub-image.
7. The detection method according to claim 5 or 6, wherein the method further comprises:
acquiring an audio data sample, wherein the audio data sample corresponds to an annotated class label;
obtaining P spectrum image samples according to the audio data sample, wherein P is an integer greater than or equal to 1;
dividing each of the P spectrum image samples according to a second preset duration to obtain M spectrum sub-image samples corresponding to each spectrum image sample;
for each spectrum sub-image sample of each spectrum image sample, acquiring a to-be-processed unit audio feature of the spectrum sub-image sample;
for each spectrum sub-image sample of each spectrum image sample, normalizing the to-be-processed unit audio feature of the spectrum sub-image sample to obtain a unit audio feature of the spectrum sub-image sample;
for each spectrum image sample, acquiring a class probability distribution of each spectrum sub-image sample through the audio detection model based on the unit audio features of the M spectrum sub-image samples;
and for each spectrum image sample, updating model parameters of the audio detection model according to the annotated class label and the class probability distribution of each spectrum sub-image sample.
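Claim 7 leaves the parameter update unspecified beyond using the annotated label and the predicted distributions; a common choice for such a classifier is softmax cross-entropy minimized by gradient descent. A two-class, scalar-feature sketch under that assumption:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(probs, label_idx):
    """Negative log-likelihood of the annotated class."""
    return -math.log(max(probs[label_idx], 1e-12))

def train_step(weights, biases, feature, label_idx, lr=0.1):
    """One SGD step for a 2-class linear model on a scalar feature.
    Mutates weights/biases in place and returns the pre-update loss."""
    logits = [weights[c] * feature + biases[c] for c in range(2)]
    probs = softmax(logits)
    loss = cross_entropy(probs, label_idx)
    for c in range(2):
        # gradient of softmax cross-entropy w.r.t. logit c
        grad = probs[c] - (1.0 if c == label_idx else 0.0)
        weights[c] -= lr * grad * feature
        biases[c] -= lr * grad
    return loss
```

Repeated steps on the same annotated sample drive the loss down, which is the behavior the model-update clause requires.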
8. The detection method according to claim 5 or 6, wherein the determining, for each spectrum image, the target class label corresponding to the spectrum image according to the class probability distribution of each spectrum sub-image comprises:
for each spectrum image, determining a classification label of each spectrum sub-image according to the class probability distribution of each spectrum sub-image;
for each spectrum image, determining a number of first normal labels according to the classification label of each spectrum sub-image;
for each spectrum image, determining a first normal label proportion according to the number of first normal labels;
for each spectrum image, if the first normal label proportion is greater than or equal to a first proportion threshold, determining that the target class label corresponding to the spectrum image is a second normal label;
and for each spectrum image, if the first normal label proportion is less than the first proportion threshold, determining that the target class label corresponding to the spectrum image is a second abnormal label.
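The aggregation above is a majority-style vote: count the sub-images classified as normal and compare their proportion against a threshold. A sketch with an illustrative threshold value (the patent does not fix one):

```python
def image_label(sub_image_distributions, ratio_threshold=0.5):
    """Vote sub-image labels up to one spectrum-image label.

    Each distribution is [p_normal, p_abnormal]; a sub-image counts as
    normal when p_normal is the larger entry."""
    labels = ["normal" if d[0] >= d[1] else "abnormal"
              for d in sub_image_distributions]
    normal_ratio = labels.count("normal") / len(labels)
    return "normal" if normal_ratio >= ratio_threshold else "abnormal"
```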
9. The detection method according to claim 5 or 6, wherein the determining, for each spectrum image, the target class label corresponding to the spectrum image according to the class probability distribution of each spectrum sub-image comprises:
for each spectrum image, determining a classification label of each spectrum sub-image according to the class probability distribution of each spectrum sub-image;
for each spectrum image, determining a number of first normal labels according to the classification label of each spectrum sub-image;
for each spectrum image, determining a first normal label proportion according to the number of first normal labels;
for each spectrum image, determining an average confidence of belonging to the first normal label according to the class probability distribution of each spectrum sub-image;
if the first normal label proportion is greater than or equal to a first proportion threshold and the average confidence is greater than or equal to a confidence threshold, determining that the target class label corresponding to the spectrum image is a second normal label;
if the first normal label proportion is greater than or equal to the first proportion threshold and the average confidence is less than the confidence threshold, determining that the target class label corresponding to the spectrum image is a second abnormal label;
if the first normal label proportion is less than the first proportion threshold and the average confidence is greater than or equal to the confidence threshold, determining that the target class label corresponding to the spectrum image is a second abnormal label;
and if the first normal label proportion is less than the first proportion threshold and the average confidence is less than the confidence threshold, determining that the target class label corresponding to the spectrum image is a second abnormal label.
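The four branches of claim 9 collapse to a single conjunction: the spectrum image gets the second normal label only when both the normal-label proportion and the average normal-class confidence clear their thresholds; every other combination yields the second abnormal label. A sketch with illustrative threshold values:

```python
def image_label_with_confidence(dists, ratio_threshold=0.5,
                                conf_threshold=0.7):
    """Claim-9-style aggregation: majority vote AND mean confidence.

    dists is a list of [p_normal, p_abnormal] per sub-image."""
    is_normal = [d[0] >= d[1] for d in dists]
    normal_ratio = sum(is_normal) / len(is_normal)
    avg_conf = sum(d[0] for d in dists) / len(dists)
    if normal_ratio >= ratio_threshold and avg_conf >= conf_threshold:
        return "normal"
    return "abnormal"
```

Compared with claim 8, the extra confidence test rejects images whose sub-images are nominally normal but only by a narrow margin.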
10. The detection method according to claim 1, wherein the acquiring, through an audio detection model, the target class label corresponding to each spectrum image based on the target audio feature corresponding to each spectrum image comprises:
acquiring a first feature map of each spectrum image through a convolutional network included in the audio detection model based on the target audio feature corresponding to each spectrum image;
acquiring a second feature map of each spectrum image through an activation network included in the audio detection model based on the first feature map of each spectrum image;
and acquiring the target class label corresponding to each spectrum image through a fully connected layer included in the audio detection model based on the second feature map of each spectrum image.
11. The detection method according to claim 1, wherein the determining the detection result of the loudspeaker unit to be detected according to the target class label corresponding to each spectrum image comprises:
determining a number of second normal labels according to the target class label corresponding to each spectrum image;
determining a second normal label proportion according to the number of second normal labels;
if the second normal label proportion is greater than or equal to a second proportion threshold, determining that the detection result of the loudspeaker unit to be detected is a normal result;
and if the second normal label proportion is less than the second proportion threshold, determining that the detection result of the loudspeaker unit to be detected is an abnormal result.
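Claim 11 repeats the same proportion test one level up, turning the per-image target class labels into a pass/fail verdict for the whole loudspeaker unit; the threshold value below is illustrative:

```python
def unit_verdict(image_labels, ratio_threshold=0.5):
    """Aggregate per-spectrum-image labels into a detection result for
    the loudspeaker unit under test."""
    normal_ratio = (sum(1 for lab in image_labels if lab == "normal")
                    / len(image_labels))
    return "normal" if normal_ratio >= ratio_threshold else "abnormal"
```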
12. An audio detection method, comprising:
acquiring audio data to be detected;
acquiring N spectrum images according to the audio data to be detected, wherein N is an integer greater than or equal to 1;
extracting features from each of the N spectrum images to obtain a target audio feature corresponding to each spectrum image, wherein the target audio feature comprises at least one type of audio feature;
acquiring, through an audio detection model, a target class label corresponding to each spectrum image based on the target audio feature corresponding to that spectrum image;
and determining a detection result of the audio data to be detected according to the target class label corresponding to each spectrum image.
13. A loudspeaker unit detection apparatus, comprising:
an acquisition module, configured to acquire audio data to be detected corresponding to a loudspeaker unit to be detected, wherein the audio data to be detected is audio data played by the loudspeaker unit to be detected;
the acquisition module being further configured to acquire N spectrum images according to the audio data to be detected, wherein N is an integer greater than or equal to 1;
an extraction module, configured to extract features from each of the N spectrum images to obtain a target audio feature corresponding to each spectrum image, wherein the target audio feature comprises at least one type of audio feature;
the acquisition module being further configured to acquire, through an audio detection model, a target class label corresponding to each spectrum image based on the target audio feature corresponding to that spectrum image;
and a determining module, configured to determine a detection result of the loudspeaker unit to be detected according to the target class label corresponding to each spectrum image.
14. An audio detection apparatus, comprising:
an acquisition module, configured to acquire audio data to be detected;
the acquisition module being further configured to acquire N spectrum images according to the audio data to be detected, wherein N is an integer greater than or equal to 1;
an extraction module, configured to extract features from each of the N spectrum images to obtain a target audio feature corresponding to each spectrum image, wherein the target audio feature comprises at least one type of audio feature;
the acquisition module being further configured to acquire, through an audio detection model, a target class label corresponding to each spectrum image based on the target audio feature corresponding to that spectrum image;
and a determining module, configured to determine a detection result of the audio data to be detected according to the target class label corresponding to each spectrum image.
15. A computer device, comprising: a memory, a processor, and a bus system;
wherein the memory is configured to store a program;
the processor is configured to execute the program in the memory and, according to instructions in program code, to perform the detection method of any one of claims 1 to 11 or the method of claim 12;
and the bus system is configured to connect the memory and the processor so that the memory and the processor communicate with each other.
16. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the detection method of any one of claims 1 to 11, or the method of claim 12.
17. A computer program product comprising a computer program and instructions which, when executed by a processor, implement the detection method of any one of claims 1 to 11 or the method of claim 12.
CN202111210712.2A 2021-10-18 2021-10-18 Method for detecting loudspeaker monomer, method for detecting audio frequency and related device Pending CN115995014A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111210712.2A CN115995014A (en) 2021-10-18 2021-10-18 Method for detecting loudspeaker monomer, method for detecting audio frequency and related device


Publications (1)

Publication Number Publication Date
CN115995014A true CN115995014A (en) 2023-04-21

Family

ID=85989020



Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (ref country code: HK; ref legal event code: DE; ref document number: 40084314)