CN115938354A - Audio identification method and device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN115938354A
Authority
CN
China
Prior art keywords
audio, data, processing, target, type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211521622.XA
Other languages
Chinese (zh)
Inventor
龙海
柳杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd
Priority: CN202211521622.XA
Publication: CN115938354A
Legal status: Pending

Landscapes

  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application discloses an audio recognition method and device, a storage medium, and an electronic device. The audio recognition method includes: acquiring audio data to be recognized; performing classification detection on the audio data to be recognized to obtain the audio types corresponding to it; processing the audio sub-data corresponding to each audio type based on the audio processing parameters corresponding to that type to obtain an audio sub-processing result corresponding to each audio sub-data; and recognizing the audio data to be recognized based on each audio sub-processing result to obtain a target audio recognition result. Because different types of audio sub-data are processed based on different audio processing parameters, the audio recognition effect and accuracy are improved.

Description

Audio identification method and device, storage medium and electronic equipment
Technical Field
The present application relates to the field of information processing technologies, and in particular, to an audio recognition method and apparatus, a storage medium, and an electronic device.
Background
With the development of audio processing technology, audio recognition has different requirements in different application scenarios. In scenarios such as online lectures and conferences, the audio data needs to be converted into text for the record; in film and television works, the characters' audio needs to be converted into real-time subtitles. Generally, the audio recognition process is as follows: the device side collects audio, encodes and compresses it, transmits it over the network to a service platform for recognition, and receives the recognized text result in return. However, the audio collected by the device side takes various forms, such as human voice, music, and other noise; when the main audio is recognized, the influence of the other audio reduces the recognition effect and accuracy of the voice.
Disclosure of Invention
In view of this, the present application provides the following technical solutions:
an audio recognition method, comprising:
acquiring audio data to be identified;
carrying out classification detection on the audio data to be identified to obtain an audio type corresponding to the audio data to be identified;
processing the audio sub-data corresponding to each audio type based on the audio processing parameters corresponding to each audio type to obtain an audio sub-processing result corresponding to each audio sub-data;
and identifying the audio data to be identified based on each audio sub-processing result to obtain a target audio identification result.
Optionally, the classifying and detecting the audio data to be recognized to obtain an audio type corresponding to the audio data to be recognized includes:
extracting the audio features of the audio data to be identified;
determining the audio type matching each of the audio features.
Optionally, the determining the audio type matching each of the audio features includes:
inputting the audio features into an audio classification model, and outputting to obtain an audio type matched with each audio feature;
the audio classification model is a model obtained by training a deep neural network on audio training data, and the audio types at least include the voice of a target object, music, and environmental noise.
Optionally, the processing the audio sub-data corresponding to each audio type based on the audio processing parameter corresponding to each audio type to obtain an audio sub-processing result corresponding to each audio sub-data includes:
determining a target audio type based on the audio type corresponding to the audio data to be identified;
determining a target audio processing parameter corresponding to the target audio type;
processing the audio sub-data corresponding to the target audio type based on the target audio processing parameter to obtain a first audio sub-processing result;
and processing the audio sub-data of the audio types other than the target audio type based on initial audio processing parameters to obtain a second audio sub-processing result.
Optionally, the determining a target audio processing parameter corresponding to the target audio type includes:
acquiring audio features corresponding to the target audio type and network features of an audio identification scene;
based on the audio features and the network features, target audio processing parameters are determined.
Optionally, the target audio processing parameter includes a target audio compression ratio and a target audio coding parameter, and the processing the audio sub-data corresponding to the target audio type based on the target audio processing parameter to obtain a first audio sub-processing result includes:
and performing compression coding processing on the audio sub-data corresponding to the target audio type based on the target audio compression ratio and the target audio coding parameters to obtain a first audio sub-processing result.
Optionally, the identifying, based on each audio sub-processing result, the audio data to be identified to obtain a target audio identification result, including:
obtaining audio coding data corresponding to each audio sub-processing result;
decoding the audio coded data based on the audio processing parameters to obtain audio decoded data;
and performing text recognition on the audio decoding data to obtain a target audio recognition result.
An audio recognition device, comprising:
the acquisition unit is used for acquiring audio data to be identified;
the detection unit is used for carrying out classification detection on the audio data to be identified to obtain an audio type corresponding to the audio data to be identified;
the processing unit is used for processing the audio sub-data corresponding to each audio type based on the audio processing parameters corresponding to each audio type to obtain an audio sub-processing result corresponding to each audio sub-data;
and the identification unit is used for identifying the audio data to be identified based on each audio sub-processing result to obtain a target audio identification result.
A storage medium having stored thereon a computer program which, when executed by a processor, implements an audio recognition method as claimed in any of the above.
An electronic device, comprising:
a memory for storing an application program and data generated by the application program running;
a processor for executing the application program to implement the audio recognition method as described in any of the above.
Through the above technical solution, the present application discloses an audio recognition method, an audio recognition device, a storage medium, and an electronic device. The method includes: acquiring audio data to be identified; performing classification detection on the audio data to be identified to obtain the audio types corresponding to it; processing the audio sub-data corresponding to each audio type based on the audio processing parameters corresponding to that type to obtain an audio sub-processing result corresponding to each audio sub-data; and identifying the audio data to be identified based on each audio sub-processing result to obtain a target audio identification result. Because different types of audio sub-data are processed based on different audio processing parameters, the audio recognition effect and accuracy are improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.
Fig. 1 is a schematic diagram of an application scenario provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of an audio recognition method according to an embodiment of the present application;
fig. 3 is a schematic processing flow diagram of an application scenario provided in an embodiment of the present application;
fig. 4 is a schematic structural diagram of an audio identification apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides an audio recognition method that can be applied to scenarios requiring audio recognition, for example converting long audio data into text, such as converting the audio data in film and television works into subtitles. The collected audio data to be recognized is processed based on different audio processing parameters and then transmitted to the recognition end for audio recognition, which improves the audio quality, recognition efficiency, and accuracy; it also reduces the delay of non-target audio, reduces interference at the recognition end, and lowers the recognition error rate.
Referring to fig. 1, a schematic diagram of an application scenario provided in the embodiment of the present application is shown. The audio generation end 101 is the end that generates audio data, such as a terminal that plays audio data, for example a video player, or a terminal that performs audio acquisition, such as a recorder. The audio acquisition end 102 is configured to collect the audio data of the audio generation end 101; it may collect all the audio data generated by the audio generation end 101, or only part of it, for example only the audio data within a target time period. After the audio acquisition end 102 acquires the audio data to be recognized, it processes that data and sends the processed audio data to the audio recognition end 103. The audio recognition end recognizes the processed audio data and outputs a target audio recognition result; for example, it may recognize the processed audio data as corresponding text for output.
Specifically, after the audio acquisition end 102 acquires the audio data to be recognized, the audio data may be encoded and compressed to facilitate transmission to the audio recognition end 103. Further, in order to reduce the transmission delay of the audio data and improve the accuracy of subsequent recognition, in the embodiment of the present application the audio acquisition end 102 may perform classification detection on the audio data to be recognized, process different types of audio sub-data with different audio processing parameters, and then output the results to the audio recognition end 103. This reduces the encoding delay and network transmission delay of non-target audio, reduces interference at the audio recognition end 103, and improves the accuracy of audio recognition. For the specific processing procedure, reference may be made to the description of the embodiment shown in fig. 2. It should be noted that the application scenario shown in fig. 1 is only an example; the devices may be selected or combined according to actual application requirements. For example, in practice the audio acquisition end and the audio recognition end may be integrated on the same server to facilitate overall processing of the audio data.
Referring to fig. 2, a flow chart of an audio recognition method provided in an embodiment of the present application is schematically illustrated, where the method may include the following steps:
s201, audio data to be identified are obtained.
The audio data to be recognized may be audio data sent by an audio generation end that needs to be recognized, or audio data obtained by monitoring the audio generation end in real time through an audio acquisition end. Specifically, the audio acquisition end may monitor the audio generation end in real time and determine all the audio data generated by the audio generation end as the audio data to be recognized, or it may acquire only the audio data that contains a target object or falls within a target time period and determine that part as the audio data to be recognized. The format of the audio data to be recognized may be a pure audio data format or a video data format; that is, the audio data to be recognized may be the audio contained in a video stream. For example, in one practical application scenario, the audio of an object A in a conference needs to be converted into text and recorded in the conference summary; the audio data to be recognized is then the audio data generated by object A. It should be noted that the audio output by object A may be affected by other participants or environmental sounds in the conference scene; therefore, in addition to the audio data generated by object A itself, the collected audio data to be recognized also contains other interleaved audio data.
S102, carrying out classification detection on the audio data to be identified, and obtaining the audio type corresponding to the audio data to be identified.
After the audio data to be recognized is obtained, the embodiment of the application neither transmits it directly to the audio recognition end for processing, nor transmits it to the audio recognition end after processing it as a whole. Instead, the audio data to be recognized is processed according to its audio types and then transmitted to the audio recognition end, so that the audio processing efficiency and the subsequent audio recognition accuracy can be improved in an efficient and targeted manner.
Performing classification detection on the audio data to be recognized to obtain the audio types corresponding to the audio data to be recognized includes: extracting the audio features of the audio data to be recognized; and determining the audio type matching each audio feature. It should be noted that the audio types of the audio data to be recognized are the audio types corresponding to each piece of sub-audio data within it. For example, the audio data to be recognized may be mixed audio data containing a speaker's voice, environmental sound, background music, and the like; its audio types then include voice, environmental sound, and music. The audio features may be audio semantic features of the audio to be recognized, or spectral features.
For example, if the audio feature is a spectral feature, the spectral feature of the audio data to be recognized may be obtained. A spectral feature is any feature of the spectrum of an audio or speech signal, and the spectral features of audio data generated by different sound sources or different objects differ; therefore, each audio type contained in the audio data to be recognized can be determined based on the spectral features in the spectrogram corresponding to that data. Specifically, the spectral features of the various audios may be obtained through a Fourier transform, a short-time Fourier transform, or a spectral feature extraction model. The audio to be recognized is thus classified according to the spectral features corresponding to each audio type, and each audio type is determined.
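As a toy illustration of the short-time Fourier option mentioned above, the sketch below frames a signal and computes each frame's magnitude spectrum with a naive pure-stdlib DFT; a pure tone then shows up as a single dominant spectral bin. The frame size, hop length, and helper names are illustrative assumptions, not taken from the application.

```python
# Sketch: short-time spectral features via a naive DFT (illustration only;
# real systems would use an FFT library).
import cmath
import math

def frame_spectrum(samples):
    """Magnitude spectrum of one audio frame via a naive DFT."""
    n = len(samples)
    return [abs(sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2)]

def stft_magnitudes(samples, frame_len=8, hop=4):
    """Split samples into overlapping frames and return each frame's spectrum."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, hop)]
    return [frame_spectrum(f) for f in frames]

# A pure tone (2 cycles per 8-sample frame) concentrates energy in bin 2.
tone = [math.sin(2 * math.pi * 2 * t / 8) for t in range(16)]
spec = stft_magnitudes(tone)[0]
print(max(range(len(spec)), key=lambda k: spec[k]))  # → 2
```

Different sources (voice, music, noise) produce different such spectra, which is what lets the classification step separate them.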
The audio features may also be the timbre, pitch, and volume features corresponding to different speech outputs. In order to obtain the audio type corresponding to each audio feature quickly and accurately, in an implementation of the embodiment of the present application, determining the audio type matching each audio feature includes: inputting the audio features into an audio classification model and obtaining as output the audio type matching each audio feature. The audio classification model is obtained by training a deep neural network on audio training data, and the audio types at least include the voice of a target object, music, and environmental noise.
In this embodiment, the audio features are classified by machine learning to obtain the audio type corresponding to each audio feature. When the audio classification model is trained on audio training data, the audio training data is generated first; it contains audio data of multiple audio types, and each piece of audio data carries its audio features together with a labeled audio type matching those features. The audio training data is then divided into a training set and a test set. An initial model architecture is trained on the training set to obtain a trained model, which is then evaluated on the test set: if, for the data in the test set, the audio type output by the model is consistent with the labeled audio type, the trained model is accurate; if not, the model parameters are adjusted according to the test set or newly added training samples until the error between the model's output and the actual labeled audio types falls within a preset range. The model obtained at that point is used as the audio classification model. Subsequently, when audio data is classified and detected, the audio features extracted from the audio data are input into the audio classification model, which outputs the corresponding audio type.
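The train/test workflow described above can be sketched on a toy dataset. A trivial nearest-centroid learner stands in for the application's deep neural network here; the dataset, feature values, and function names are all assumptions made purely for illustration.

```python
# Sketch of the train/test split workflow; nearest-centroid learner
# is a stand-in for the deep neural network the application describes.
import random

def train_centroids(training_set):
    """Average the feature vectors of each label to form per-type centroids."""
    sums, counts = {}, {}
    for feat, label in training_set:
        acc = sums.setdefault(label, [0.0] * len(feat))
        for i, v in enumerate(feat):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lb: tuple(s / counts[lb] for s in acc) for lb, acc in sums.items()}

def predict(centroids, feat):
    """Return the label whose centroid is closest to the feature vector."""
    return min(centroids,
               key=lambda lb: sum((a - b) ** 2
                                  for a, b in zip(centroids[lb], feat)))

def accuracy(centroids, test_set):
    hits = sum(predict(centroids, f) == lb for f, lb in test_set)
    return hits / len(test_set)

# Toy, well-separated 2-D features for two audio types.
data = ([((0.2 + random.random() * 0.1, 0.1), "speech") for _ in range(20)] +
        [((0.8 + random.random() * 0.1, 0.9), "music") for _ in range(20)])
random.shuffle(data)
train, test = data[:30], data[30:]   # divide into training set and test set
model = train_centroids(train)
print(accuracy(model, test))         # separable toy data → 1.0
```

When the measured accuracy falls outside the preset range, the loop described above retrains or adjusts parameters before the model is accepted.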
It should be noted that the model structure is not limited in the embodiment of the present application. For example, a target Voice Activity Detection (VAD) model may be used. A traditional voice activity detection model only distinguishes silent segments from non-silent segments, whereas the target VAD model in this embodiment incorporates audio features and can determine the audio types corresponding to different audio features; for example, it can distinguish a target speaker from ambient sound and output the two types of results.
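The traditional VAD contrasted above can be reduced to a frame-energy threshold. The sketch below is that classic baseline only, not the application's target VAD model; the threshold value is an arbitrary illustrative choice.

```python
# Classic energy-based voice-activity check: the silent / non-silent
# decision a traditional VAD model makes. Threshold is illustrative.
def is_voice_active(frame, threshold=0.01):
    """Mark a frame active when its mean energy exceeds the threshold."""
    energy = sum(s * s for s in frame) / len(frame)
    return energy > threshold

silence = [0.001] * 160                       # near-zero samples
speech_like = [0.3, -0.25, 0.28, -0.3] * 40   # high-energy samples
print(is_voice_active(silence), is_voice_active(speech_like))  # → False True
```

The target VAD model described above goes further: instead of one active/silent bit, it attaches an audio type (e.g. target speaker vs. ambient sound) to each active frame.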
In the embodiment of the application, the audio data to be recognized may subsequently be processed according to the audio types it contains. Specifically, the audio data to be recognized may be segmented according to audio type to obtain audio sub-data, and each piece of audio sub-data is then processed. For example, if the audio types include a first type and a second type, the audio data corresponding to the first type is determined as first audio sub-data, and the audio data corresponding to the second type as second audio sub-data. In this way, accurate processing of audio data based on different audio types can be achieved.
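The segmentation into per-type audio sub-data described above might look like the following sketch, where each frame has already been tagged with a classifier-assigned type; all names and the frame representation are hypothetical.

```python
# Sketch: group classified frames into per-type audio sub-data.
def split_by_type(classified_frames):
    """Group frames into {audio_type: [frames...]} buckets."""
    buckets = {}
    for frame, audio_type in classified_frames:
        buckets.setdefault(audio_type, []).append(frame)
    return buckets

# Hypothetical (frame, assigned type) pairs from the classification step.
frames = [([0.1, 0.2], "speech"), ([0.0, 0.0], "noise"),
          ([0.3, 0.1], "speech")]
sub_data = split_by_type(frames)
print(sorted(sub_data))  # → ['noise', 'speech']
```

Each bucket is then handed to the processing step with the parameters chosen for its type.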
S103, the audio sub-data corresponding to each audio type is processed based on the audio processing parameters corresponding to each audio type, and an audio sub-processing result corresponding to each audio sub-data is obtained.
And S104, identifying the audio data to be identified based on each audio sub-processing result to obtain a target audio identification result.
Usually, a constant code rate is adopted to encode the collected audio data to be recognized, and the encoded audio data is transmitted to the audio recognition end, which recognizes it to obtain an audio recognition result. However, if the constant code rate is high, transmitting the processed audio data occupies network bandwidth and degrades transmission performance; if it is low, the final recognition effect suffers. Therefore, in the embodiment of the present application, the audio processing parameters corresponding to each audio type are determined; that is, different audio processing parameters are adopted to process the audio data to be recognized, and the processed audio data is transmitted to the audio recognition end for recognition.
In one embodiment, processing the audio sub-data corresponding to each audio type based on the audio processing parameters corresponding to each audio type to obtain an audio sub-processing result corresponding to each audio sub-data includes: determining a target audio type based on the audio types corresponding to the audio data to be recognized; determining a target audio processing parameter corresponding to the target audio type; processing the audio sub-data corresponding to the target audio type based on the target audio processing parameter to obtain a first audio sub-processing result; and processing the audio sub-data of the audio types other than the target audio type based on initial audio processing parameters to obtain a second audio sub-processing result.
In this embodiment, the target audio type may be determined from the detected audio types of the audio to be recognized according to the actual audio recognition requirement. For example, when the method is applied in a conference-summary scenario where the learning content shared by user A needs to be converted into text and recorded, the audio type corresponding to the audio spoken by user A is the target audio type, namely the human voice type. To facilitate transmission and recognition of the audio, audio processing parameters with a good processing effect should be used for the audio sub-data of the target audio type. For example, coding parameters with a higher code rate may be used, so that the encoded audio sub-data matches the actual audio sub-data more closely, improving the recognition accuracy of that part of the audio. Moreover, besides determining the target audio type from the actual application requirement, in a scenario where the recognition requirement is unknown, the audio type with the largest proportion among the detected audio types may be determined as the target audio type. For example, in a multi-person voice interaction scenario, the voices output by different objects are determined as different audio types, the object corresponding to the audio type with the highest speaking proportion is regarded as the main audio output, and the voice data of that object is processed with the target audio processing parameters.
Correspondingly, the initial audio processing parameter may be a default processing parameter with an ordinary processing effect, such as a parameter with a low encoding rate, which eases network bandwidth occupation and reduces network transmission delay. The audio sub-data of the audio types other than the target audio type is therefore processed according to the initial audio processing parameters to obtain the corresponding second audio sub-processing result. In this embodiment, the audio data to be recognized is divided into only two classes: the audio type that requires emphasis, namely the target audio type, and all other audio types.
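The two-tier scheme above — target audio processing parameters for the target type, initial (default) parameters for everything else — can be sketched as a simple lookup. The concrete bitrate and compression-ratio values are illustrative assumptions, not taken from the application.

```python
# Sketch: per-type parameter selection. Values are illustrative only.
TARGET_PARAMS = {"bitrate_kbps": 128, "compression_ratio": 4}    # high quality
INITIAL_PARAMS = {"bitrate_kbps": 24, "compression_ratio": 12}   # default

def params_for(audio_type, target_type="speech"):
    """Target type gets the high-quality parameters, others the default."""
    return TARGET_PARAMS if audio_type == target_type else INITIAL_PARAMS

print(params_for("speech")["bitrate_kbps"],
      params_for("music")["bitrate_kbps"])  # → 128 24
```

Only the target type pays the bandwidth cost of high-fidelity encoding; non-target types stay cheap to transmit.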
In another embodiment, the audio processing parameters matching each audio type may also be determined individually, so that the audio sub-data of each audio type is processed with its own parameters, ensuring accurate processing and recognition of every type. However, it is then necessary to decide, according to the actual network state and the processing capability of the processing end, which types of audio data are processed with dedicated parameters and which with default parameters, so as to guarantee the timeliness of audio processing.
Further, determining the target audio processing parameter corresponding to the target audio type includes: acquiring the audio features corresponding to the target audio type and the network features of the audio recognition scene; and determining the target audio processing parameter based on the audio features and the network features. The audio features corresponding to the target audio type may include information such as the pitch, intonation, energy, and rhythm variation of the voice, from which the frequency, amplitude, and phase of the voice may further be obtained. The network features of the audio recognition scene may include features of the audio data transmission process, such as the network transmission rate, or features of the audio recognition end during recognition, so that the target audio processing parameter can also take the processing performance of the recognition end into account. The target processing parameter so determined better meets the actual application requirement. For example, if the audio features of the target audio type indicate a low audio sampling rate and the current network features indicate a low transmission rate, the coding parameter among the target audio processing parameters may adopt a lower coding rate, ensuring that the encoded audio data is small, convenient to transmit, and low in transmission delay.
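One way the audio and network features above could drive parameter selection is a simple rule table, sketched below. The thresholds and bitrate values are invented for illustration; the application does not specify concrete numbers.

```python
# Sketch: derive a target encoding bitrate from audio + network features.
# All thresholds and return values are illustrative assumptions.
def pick_bitrate_kbps(sample_rate_hz, net_rate_kbps):
    """Lower bitrate for low-sample-rate audio or congested networks."""
    if sample_rate_hz <= 8000 or net_rate_kbps < 64:
        return 16       # conserve bandwidth, cut transmission delay
    if net_rate_kbps < 256:
        return 64       # middle ground
    return 128          # fast network, high-fidelity source

print(pick_bitrate_kbps(8000, 32), pick_bitrate_kbps(48000, 512))  # → 16 128
```

This mirrors the example in the text: a low sampling rate plus a slow network pushes the coding rate down so the encoded data stays small.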
For example, in an actual application scenario, the target audio processing parameter includes a target audio compression ratio and a target audio coding parameter, and the processing of the audio sub-data corresponding to the target audio type based on the target audio processing parameter to obtain a first audio sub-processing result includes: and performing compression coding processing on the audio sub-data corresponding to the target audio type based on the target audio compression ratio and the target audio coding parameters to obtain a first audio sub-processing result.
Audio compression refers to applying appropriate digital signal processing techniques to an original digital audio stream to reduce its code rate without losing useful information, or with only negligible loss. Audio data signals are correlated in both the time domain and the frequency domain, so data redundancy exists. Treating the audio as a source, the essence of audio coding is to reduce this redundancy; therefore, when determining the target processing parameters, the audio compression ratio and the audio coding parameters are mainly considered. The corresponding audio coding mode may be any general coding mode, which is not limited in the embodiments of the present application.
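As a toy picture of code-rate reduction (not any real codec the application might use), quantising float samples to one signed byte each cuts the data volume by a fixed ratio while keeping the signal approximately recoverable:

```python
# Toy "compression": quantise [-1, 1] float samples to signed bytes.
# Purely illustrative of rate reduction; not a real audio codec.
def encode_8bit(samples):
    """Quantise each sample in [-1, 1] to one signed byte."""
    return bytes((max(-127, min(127, round(s * 127))) & 0xFF)
                 for s in samples)

def decode_8bit(payload):
    """Invert the quantisation back to approximate float samples."""
    signed = [(b - 256 if b > 127 else b) for b in payload]
    return [v / 127 for v in signed]

original = [0.0, 0.5, -0.5, 1.0]
restored = decode_8bit(encode_8bit(original))
print(all(abs(a - b) < 0.01 for a, b in zip(original, restored)))  # → True
```

A real codec exploits the time- and frequency-domain correlation described above rather than quantising blindly, but the encode/decode symmetry is the same: the decoder must use parameters matching the encoder's.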
After the audio sub-processing result corresponding to each piece of audio sub-data is obtained by processing the audio sub-data of each audio type with the corresponding audio processing parameters, the audio data to be recognized is recognized based on each audio sub-processing result to obtain the target audio recognition result. That is, after each audio sub-processing result is obtained with different audio processing parameters, the results may be combined and sent to the audio recognition end, which recognizes each of them. During recognition, each received audio sub-processing result must first be restored based on its corresponding audio processing parameters to obtain the corresponding audio data, and the audio data is then recognized according to the recognition requirement to obtain the target audio recognition result. The recognition requirement may be to recognize the audio as text, in which case recognizing the audio data to be recognized is the process of converting it into text. Correspondingly, the recognition requirement may also be an audio translation requirement, in which case recognition is the process of converting the audio data to be recognized into the text or audio of a target language.
In an implementation manner of the embodiment of the present application, identifying the audio data to be identified based on each audio sub-processing result to obtain a target audio identification result includes: obtaining the audio coded data corresponding to each audio sub-processing result; decoding the audio coded data based on the audio processing parameters to obtain audio decoded data; and performing text recognition on the audio decoded data to obtain the target audio recognition result. Each audio sub-processing result is obtained by processing with its corresponding audio processing parameters, and the audio data is usually encoded with those parameters, so after the corresponding audio coded data is obtained, it must be decoded before audio identification. The audio processing parameters used for decoding match those used for encoding. Finally, text recognition is performed on the audio decoded data according to the corresponding recognition requirement, such as a text recognition requirement, to obtain the target audio recognition result. For convenience of processing, the audio decoded data can be input into a target recognition model for recognition to obtain the target audio recognition result.
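A minimal sketch of this decode-then-recognize flow, using zlib purely as a stand-in codec (a real implementation would use the matching audio codec, e.g. Opus or Speex); the `recognize_text` stub stands in for the target recognition model and is entirely illustrative:

```python
import zlib

def encode_sub_data(pcm_bytes: bytes, level: int) -> tuple[bytes, dict]:
    """Encode one piece of audio sub-data; return the coded data together
    with the processing parameters needed to restore it later."""
    return zlib.compress(pcm_bytes, level), {"codec": "zlib", "level": level}

def decode_sub_result(coded: bytes, params: dict) -> bytes:
    """Restore audio data from a sub-processing result; the decoding
    parameters must match the ones used at encoding time."""
    assert params["codec"] == "zlib"
    return zlib.decompress(coded)

def recognize_text(pcm_bytes: bytes) -> str:
    """Stub for the target recognition model (speech-to-text)."""
    return f"<transcript of {len(pcm_bytes)} PCM bytes>"

# Encode a sub-data chunk, restore it with matching parameters, recognize.
coded, params = encode_sub_data(b"\x00\x01" * 160, level=9)
restored = decode_sub_result(coded, params)
result = recognize_text(restored)
```

The key property the sketch demonstrates is that the parameters travel with each sub-processing result so the decoder side can restore the exact audio data before recognition.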
The embodiment of the application discloses an audio recognition method, which includes the following steps: acquiring audio data to be identified; performing classification detection on the audio data to be identified to obtain an audio type corresponding to the audio data to be identified; processing the audio sub-data corresponding to each audio type based on the audio processing parameters corresponding to each audio type to obtain an audio sub-processing result corresponding to each audio sub-data; and identifying the audio data to be identified based on each audio sub-processing result to obtain a target audio identification result. According to the method and the device, different types of audio sub-data are processed based on different audio processing parameters, which improves the effect and accuracy of audio recognition.
The following describes an embodiment of the present application with a scenario in which the speech of a character in a video to be recognized is used to generate subtitles. Referring to fig. 3, a processing flow diagram of an application scenario provided in the embodiment of the present application: first, the apparatus terminal records and acquires audio; the terminal playing the video to be recognized may record the audio in the video, obtaining a sound stream, which may be a PCM (Pulse Code Modulation) stream. Each frame of the sound stream is then detected to obtain its sound type; specifically, detection may be performed by a classification detection model built on a VAD (Voice Activity Detection) technique. Different parameters can then be set according to the sound type, such as different compression ratios and different complexities, and each frame of sound data is compressed and encoded with the parameters set for it. For example, the target object's speech is encoded at a high bit rate, while music or environmental noise is compressed with default or low-bit-rate parameters. A specific coding algorithm, such as Speex or Opus, may be used for encoding according to the coding parameters. The encoded byte stream obtained after compression coding is transmitted to a platform server for identification; the platform server determines the corresponding decoding parameters according to the received coding parameters and decodes the encoded byte stream. The decoded original speech is passed to a recognition engine for recognition, and the corrected recognition result is output and fed back to the client as subtitles.
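The per-frame part of this scenario can be sketched as follows. A real system would use a trained VAD-based classification model and an Opus or Speex encoder; the energy-threshold classifier, frame contents, and bit-rate values below are simplified stand-ins chosen for illustration only:

```python
def classify_frame(frame):
    """Toy stand-in for the VAD-based classification model: label a frame
    of 16-bit PCM samples by its mean absolute amplitude. A real system
    would use a trained model, not a fixed threshold."""
    energy = sum(abs(s) for s in frame) / len(frame)
    if energy > 1000:
        return "speech"
    if energy > 100:
        return "music"
    return "noise"

# Illustrative per-type bit rates: high rate for speech, low for noise.
BITRATE_KBPS = {"speech": 64, "music": 32, "noise": 8}

def plan_encoding(frames):
    """For each PCM frame, pick the bit rate matching its detected type,
    mimicking the per-frame parameter selection before compression coding."""
    plan = []
    for frame in frames:
        audio_type = classify_frame(frame)
        plan.append((audio_type, BITRATE_KBPS[audio_type]))
    return plan

# Three synthetic frames: loud speech-like, moderate music-like, quiet noise.
frames = [[2000] * 160, [300] * 160, [10] * 160]
plan = plan_encoding(frames)  # [('speech', 64), ('music', 32), ('noise', 8)]
```

The resulting plan is what would drive the actual compression coding step, frame by frame, before the encoded byte stream is sent to the platform server.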
In the embodiment of the present application, before the audio is encoded, voice activity detection is performed and the audio types are classified; for example, default or low-rate coding is used for music or noise, while high-rate coding is used for speech. This improves the sound quality of the audio and helps raise the recognition rate; it also reduces the coding delay and network transmission delay of non-primary audio, reduces interference with the audio recognition process, and lowers the error rate in text conversion.
In another embodiment of the present application, there is also provided an audio recognition apparatus, referring to fig. 4, the apparatus may include:
an obtaining unit 401, configured to obtain audio data to be identified;
a detecting unit 402, configured to perform classification detection on the audio data to be identified, so as to obtain an audio type corresponding to the audio data to be identified;
a processing unit 403, configured to process the audio sub-data corresponding to each audio type based on the audio processing parameter corresponding to each audio type, and obtain an audio sub-processing result corresponding to each audio sub-data;
the identifying unit 404 is configured to identify the audio data to be identified based on each audio sub-processing result, so as to obtain a target audio identifying result.
The embodiment of the application discloses an audio recognition apparatus, including: the acquisition unit, which acquires audio data to be identified; the detection unit, which performs classification detection on the audio data to be identified to obtain an audio type corresponding to the audio data to be identified; the processing unit, which processes the audio sub-data corresponding to each audio type based on the audio processing parameters corresponding to each audio type to obtain an audio sub-processing result corresponding to each audio sub-data; and the identification unit, which identifies the audio data to be identified based on each audio sub-processing result to obtain a target audio identification result. According to the method and the device, different types of audio sub-data are processed based on different audio processing parameters, which improves the effect and accuracy of audio recognition.
On the basis of the embodiment shown in fig. 4, the detection unit includes:
the extraction subunit is used for extracting the audio features of the audio data to be identified;
a first determining subunit, configured to determine an audio type matching each of the audio features.
In an embodiment, the first determining subunit is specifically configured to:
inputting the audio features into an audio classification model, and outputting to obtain an audio type matched with each audio feature;
the audio classification model is a model obtained by training a deep neural network on audio training data, and the audio types at least include the voice of a target object, music, and environmental noise.
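The application does not fix which audio features are fed to the classification model; as an illustrative sketch under that assumption, two common hand-crafted features for separating speech, music, and noise are short-time energy and zero-crossing rate:

```python
def extract_frame_features(frame):
    """Example audio features for one PCM frame: short-time energy and
    zero-crossing rate (ZCR). These are illustrative choices; a deployed
    classification model could use spectral or learned features instead."""
    # Mean squared amplitude of the frame.
    energy = sum(s * s for s in frame) / len(frame)
    # Fraction of adjacent sample pairs whose signs differ.
    zero_crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    zcr = zero_crossings / (len(frame) - 1)
    return {"energy": energy, "zcr": zcr}

# A maximally alternating frame: every adjacent pair crosses zero.
feats = extract_frame_features([100, -100, 100, -100])
```

Feature vectors of this kind, computed per frame, would form the input on which the deep neural network classifier is trained and run.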
Optionally, the processing unit comprises:
the second determining subunit is used for determining a target audio type based on the audio type corresponding to the audio data to be identified;
a third determining subunit, configured to determine a target audio processing parameter corresponding to the target audio type;
the first processing subunit is configured to process, based on the target audio processing parameter, the audio sub-data corresponding to the target audio type to obtain a first audio sub-processing result;
and the second processing subunit is used for processing the audio sub-data of the audio types other than the target audio type based on the initial audio processing parameters to obtain a second audio sub-processing result.
Further, the third determining subunit is specifically configured to:
acquiring audio features corresponding to the target audio type and network features of an audio identification scene;
based on the audio features and the network features, target audio processing parameters are determined.
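The combination of audio features and network features described by this subunit can be sketched as a parameter-selection function; the threshold values, feature names, and bit rates below are hypothetical assumptions for illustration:

```python
def determine_target_params(audio_features: dict, network_features: dict) -> dict:
    """Pick target audio processing parameters from the audio features of
    the target type and the network conditions of the recognition scene.
    All thresholds and rates below are illustrative assumptions."""
    # Start from a high bit rate for the target (speech) audio.
    bitrate_kbps = 64
    # Degrade gracefully when the network bandwidth is constrained.
    if network_features.get("bandwidth_kbps", 1000) < 128:
        bitrate_kbps = 24
    # Frames with higher spectral activity tolerate less compression.
    compression_ratio = 4 if audio_features.get("zcr", 0.0) > 0.1 else 8
    return {"bitrate_kbps": bitrate_kbps, "compression_ratio": compression_ratio}

# Active speech features on a constrained network.
params = determine_target_params({"zcr": 0.3}, {"bandwidth_kbps": 64})
```

The point of the sketch is the signature: both the content of the target audio and the state of the transmission channel influence the chosen parameters.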
Optionally, the target audio processing parameter includes a target audio compression ratio and a target audio coding parameter, where the first processing subunit is specifically configured to:
and performing compression coding processing on the audio sub-data corresponding to the target audio type based on the target audio compression ratio and the target audio coding parameters to obtain a first audio sub-processing result.
In one embodiment, the identification unit is specifically configured to:
obtaining audio coding data corresponding to each audio sub-processing result;
decoding the audio coded data based on the audio processing parameters to obtain audio decoded data;
and performing text recognition on the audio decoding data to obtain a target audio recognition result.
It should be noted that, in the present embodiment, reference may be made to the corresponding contents in the foregoing for specific implementations of each unit and sub-unit, and details are not described here.
In another embodiment of the present application, there is also provided a storage medium having a computer program stored thereon, which when executed by a processor, performs the steps of the audio recognition method as described in any one of the above.
In another embodiment of the present application, there is also provided an electronic device, referring to fig. 5, which may include:
a memory 501 for storing an application program and data generated by the operation of the application program;
a processor 502 for executing the application to implement:
acquiring audio data to be identified;
carrying out classification detection on the audio data to be identified to obtain an audio type corresponding to the audio data to be identified;
processing the audio sub-data corresponding to each audio type based on the audio processing parameters corresponding to each audio type to obtain an audio sub-processing result corresponding to each audio sub-data;
and identifying the audio data to be identified based on each audio sub-processing result to obtain a target audio identification result.
Optionally, the classifying and detecting the audio data to be recognized to obtain an audio type corresponding to the audio data to be recognized includes:
extracting the audio features of the audio data to be identified;
determining the audio type matching each of the audio features.
Optionally, the determining the audio type matching each of the audio features includes:
inputting the audio features into an audio classification model, and outputting to obtain an audio type matched with each audio feature;
the audio classification model is a model obtained by training a deep neural network on audio training data, and the audio types at least include the voice of a target object, music, and environmental noise.
Optionally, the processing the audio sub-data corresponding to each audio type based on the audio processing parameter corresponding to each audio type to obtain an audio sub-processing result corresponding to each audio sub-data includes:
determining a target audio type based on the audio type corresponding to the audio data to be identified;
determining a target audio processing parameter corresponding to the target audio type;
processing the audio sub-data corresponding to the target audio type based on the target audio processing parameter to obtain a first audio sub-processing result;
and processing the audio sub-data of the audio types except the target audio type based on the initial audio processing parameters to obtain a second audio sub-processing result.
Optionally, the determining a target audio processing parameter corresponding to the target audio type includes:
obtaining audio features corresponding to the target audio type and network features of an audio identification scene;
based on the audio features and the network features, target audio processing parameters are determined.
Optionally, the target audio processing parameter includes a target audio compression ratio and a target audio coding parameter, and the processing the audio sub-data corresponding to the target audio type based on the target audio processing parameter to obtain a first audio sub-processing result includes:
and performing compression coding processing on the audio sub-data corresponding to the target audio type based on the target audio compression ratio and the target audio coding parameters to obtain a first audio sub-processing result.
Optionally, the identifying, based on each audio sub-processing result, the audio data to be identified to obtain a target audio identification result, including:
obtaining audio coding data corresponding to each audio sub-processing result;
decoding the audio coded data based on the audio processing parameters to obtain audio decoded data;
and performing text recognition on the audio decoding data to obtain a target audio recognition result.
It should be noted that, the specific implementation of the processor in this embodiment may refer to the corresponding content in the foregoing, and is not described in detail here.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. An audio recognition method, comprising:
acquiring audio data to be identified;
carrying out classification detection on the audio data to be identified to obtain an audio type corresponding to the audio data to be identified;
processing the audio sub-data corresponding to each audio type based on the audio processing parameters corresponding to each audio type to obtain an audio sub-processing result corresponding to each audio sub-data;
and identifying the audio data to be identified based on each audio sub-processing result to obtain a target audio identification result.
2. The method according to claim 1, wherein the classifying and detecting the audio data to be identified to obtain an audio type corresponding to the audio data to be identified comprises:
extracting the audio features of the audio data to be identified;
determining the audio type matching each of the audio features.
3. The method of claim 2, the determining the type of audio that matches each of the audio features, comprising:
inputting the audio features into an audio classification model, and outputting to obtain an audio type matched with each audio feature;
the audio classification model is a model obtained by training a deep neural network on audio training data, and the audio types at least include the voice of a target object, music, and environmental noise.
4. The method of claim 1, wherein the processing the audio sub-data corresponding to each audio type based on the audio processing parameters corresponding to each audio type to obtain an audio sub-processing result corresponding to each audio sub-data comprises:
determining a target audio type based on the audio type corresponding to the audio data to be identified;
determining a target audio processing parameter corresponding to the target audio type;
processing audio sub-data corresponding to the target audio type based on the target audio processing parameters to obtain a first audio sub-processing result;
and processing the audio sub-data of the audio types other than the target audio type based on the initial audio processing parameters to obtain a second audio sub-processing result.
5. The method of claim 4, the determining a target audio processing parameter corresponding to the target audio type, comprising:
obtaining audio features corresponding to the target audio type and network features of an audio identification scene;
based on the audio features and the network features, target audio processing parameters are determined.
6. The method of claim 4, wherein the target audio processing parameters comprise a target audio compression ratio and a target audio coding parameter, and the processing the audio sub-data corresponding to the target audio type based on the target audio processing parameters to obtain a first audio sub-processing result comprises:
and performing compression coding processing on the audio sub-data corresponding to the target audio type based on the target audio compression ratio and the target audio coding parameters to obtain a first audio sub-processing result.
7. The method of claim 1, wherein the identifying the audio data to be identified based on each audio sub-processing result to obtain a target audio identification result comprises:
obtaining audio coding data corresponding to each audio sub-processing result;
decoding the audio coded data based on the audio processing parameters to obtain audio decoded data;
and performing text recognition on the audio decoding data to obtain a target audio recognition result.
8. An audio recognition device, comprising:
the acquisition unit is used for acquiring audio data to be identified;
the detection unit is used for carrying out classification detection on the audio data to be identified to obtain an audio type corresponding to the audio data to be identified;
the processing unit is used for processing the audio sub-data corresponding to each audio type based on the audio processing parameters corresponding to each audio type to obtain an audio sub-processing result corresponding to each audio sub-data;
and the identification unit is used for identifying the audio data to be identified based on each audio sub-processing result to obtain a target audio identification result.
9. A storage medium having stored thereon a computer program which, when executed by a processor, carries out the audio recognition method of any one of claims 1-7.
10. An electronic device, comprising:
a memory for storing an application program and data generated by the application program running;
a processor for executing the application program to implement the audio recognition method of any one of claims 1-7.
CN202211521622.XA 2022-11-30 2022-11-30 Audio identification method and device, storage medium and electronic equipment Pending CN115938354A (en)

Priority Applications (1)

Application Number: CN202211521622.XA · Priority/Filing Date: 2022-11-30 · Title: Audio identification method and device, storage medium and electronic equipment

Publications (1)

Publication Number: CN115938354A · Publication Date: 2023-04-07

Family

ID=86555252


Country Status (1)

Country Link
CN (1) CN115938354A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination