CN111429942B - Audio data processing method and device, electronic equipment and storage medium


Info

Publication number
CN111429942B
CN111429942B
Authority
CN
China
Prior art keywords
audio
slice
accent
sound
sample
Prior art date
Legal status
Active
Application number
CN202010198345.8A
Other languages
Chinese (zh)
Other versions
CN111429942A (en)
Inventor
范旭
祝豪
王妍
Current Assignee
Beijing Volcano Engine Technology Co Ltd
Original Assignee
Beijing Volcano Engine Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Volcano Engine Technology Co Ltd
Priority to CN202010198345.8A
Publication of CN111429942A
Application granted
Publication of CN111429942B
Legal status: Active
Anticipated expiration

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 — Techniques specially adapted for particular use
    • G10L25/51 — Techniques specially adapted for comparison or discrimination
    • G10L25/03 — Techniques characterised by the type of extracted parameters
    • G10L25/12 — Extracted parameters being prediction coefficients
    • G10L25/18 — Extracted parameters being spectral information of each sub-band
    • G10L25/78 — Detection of presence or absence of voice signals
    • G10L25/84 — Detection of presence or absence of voice signals for discriminating voice from noise

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

Embodiments of the disclosure provide an audio data processing method and device, an electronic device and a storage medium, which can improve the recognition accuracy of accents and tapping sounds. The method comprises: performing frequency domain processing on audio data to be processed to obtain a spectrogram of the audio data; slicing the spectrogram at a preset pixel interval to obtain an audio slice set; performing sound effect recognition on each audio slice in the set with a preset accent recognition model and a preset tapping sound recognition model to obtain a sound effect recognition result of the audio slice set, where the preset accent recognition model predicts accents from the spectral features of each audio slice and the preset tapping sound recognition model predicts tapping sounds from those features; and, based on the sound effect recognition result, converting the corresponding audio slices back to the time domain and merging them to obtain final audio data carrying sound effect marks.

Description

Audio data processing method and device, electronic equipment and storage medium
Technical Field
The disclosure relates to the field of software engineering, and in particular relates to an audio data processing method, an audio data processing device, electronic equipment and a storage medium.
Background
In existing techniques for recognizing accents and tapping sounds in an audio signal, some tapping sounds are inevitably misidentified as accents during accent recognition, and not all accents and tapping sounds in the signal can be clearly identified and distinguished at the same time, so the recognition accuracy for accents and tapping sounds is low.
Disclosure of Invention
Embodiments of the disclosure provide an audio data processing method and device, an electronic device and a storage medium, which can improve the recognition accuracy of accents and tapping sounds. The technical solution of the disclosure is implemented as follows:
in a first aspect, an embodiment of the present disclosure provides an audio data processing method, including:
performing frequency domain processing on audio data to be processed to obtain a spectrogram of the audio data to be processed;
slicing the spectrogram at a preset pixel interval to obtain an audio slice set;
performing sound effect recognition on each audio slice in the audio slice set with a preset accent recognition model and a preset tapping sound recognition model, respectively, to obtain a sound effect recognition result of the audio slice set, where the preset accent recognition model is a model that predicts accents from the spectral features of each audio slice and the preset tapping sound recognition model is a model that predicts tapping sounds from the spectral features of each audio slice; and
performing time domain conversion and merging on the corresponding audio slices in the audio slice set based on the sound effect recognition result to obtain final audio data with sound effect marks.
In the above scheme, performing sound effect recognition on each audio slice in the audio slice set with the preset accent recognition model and the preset tapping sound recognition model to obtain the sound effect recognition result of the audio slice set includes:
analyzing the sound features of each audio slice with the preset accent recognition model and predicting the accent confidence of each audio slice, where the accent confidence is the probability that the audio slice contains an accent;
analyzing the sound features of each audio slice with the preset tapping sound recognition model and predicting the tapping sound confidence of each audio slice, where the tapping sound confidence is the probability that the audio slice contains a tapping sound; and
taking the accent confidence and tapping sound confidence corresponding to each audio slice as the sound effect recognition result of the audio slice set.
In the above scheme, performing time domain conversion and merging on the corresponding audio slices in the audio slice set based on the sound effect recognition result to obtain the final audio data with sound effect marks includes:
performing time domain conversion on the audio slice set to obtain audio data to be merged corresponding to the audio slice set, where each audio slice corresponds to a segment of audio data in one time slice;
determining accent audio slices and tapping sound audio slices in the audio data to be merged according to the sound effect recognition result, where an accent audio slice is an audio slice whose accent confidence in the sound effect recognition result is higher than a preset accent threshold, and a tapping sound audio slice is an audio slice whose tapping sound confidence is higher than a preset tapping sound threshold;
marking an accent timestamp on the time slice corresponding to each accent audio slice, where the accent timestamp is the center time point of that time slice;
marking a tapping sound timestamp on the time slice corresponding to each tapping sound audio slice, where the tapping sound timestamp is the center time point of that time slice; and
merging the accent timestamps and tapping sound timestamps in the audio data to be merged to obtain the final audio data with sound effect marks, where the sound effect marks are the merged timestamps.
In the above scheme, merging the accent timestamps and tapping sound timestamps in the audio data to be merged to obtain the final audio data with sound effect marks includes:
merging the accent timestamps in the audio data to be merged with one another, and merging the tapping sound timestamps with one another, to obtain intermediate audio data; and
merging the accent timestamps and tapping sound timestamps contained in the intermediate audio data with each other to obtain the final audio data with sound effect marks, where the sound effect marks are the merged timestamps.
In the above scheme, merging the accent timestamps with one another and the tapping sound timestamps with one another in the audio data to be merged to obtain the intermediate audio data includes:
when at least two accent timestamps exist within a preset time interval in the audio data to be merged, merging them and keeping the accent timestamp corresponding to the audio slice with the highest confidence;
when at least two tapping sound timestamps exist within the preset time interval, merging them and keeping the tapping sound timestamp corresponding to the audio slice with the highest confidence; and
continuing to detect and merge over the audio data to be merged until no two accent timestamps and no two tapping sound timestamps fall within the preset time interval, thereby obtaining the intermediate audio data.
In the above scheme, merging the accent timestamps and tapping sound timestamps contained in the intermediate audio data with each other to obtain the final audio data with sound effect marks includes:
in the intermediate audio data, when the time interval between an accent timestamp and a tapping sound timestamp is smaller than the preset time interval, merging the two and keeping the accent timestamp or tapping sound timestamp corresponding to the audio slice with the higher confidence; and
continuing to detect and merge over the intermediate audio data until the time interval between any accent timestamp and any tapping sound timestamp is larger than the preset time interval, thereby obtaining the final audio data with sound effect marks.
In the above scheme, after performing time domain conversion and merging on the corresponding audio slices in the audio slice set based on the sound effect recognition result to obtain the final audio data with sound effect marks, the method further includes:
in the final audio data with sound effect marks, taking the timestamp of each sound effect mark as a center point and taking the maximum sound intensity within a preset duration range as the sound intensity corresponding to that sound effect mark; and
performing audio special effect processing based on the sound intensity corresponding to each sound effect mark to obtain sound intensity special effects for the accents and tapping sounds in the audio data to be processed.
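As an illustration of the intensity extraction just described, the following Python sketch takes the maximum absolute amplitude within a window centred on a mark's timestamp; the 50 ms half-window and the function name are illustrative assumptions, not values given by the disclosure.

```python
import numpy as np

def intensity_for_mark(samples: np.ndarray, sample_rate: int,
                       stamp_time: float, half_window: float = 0.05) -> float:
    """Sound intensity attached to one sound effect mark: the maximum
    absolute amplitude within a preset duration range centred on the
    mark's timestamp (an assumed 50 ms on each side)."""
    center = int(stamp_time * sample_rate)
    half = int(half_window * sample_rate)
    segment = samples[max(0, center - half):center + half]
    return float(np.max(np.abs(segment))) if segment.size else 0.0
```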
In the above scheme, before the frequency domain processing is performed on the audio data to be processed to obtain its spectrogram, the method further includes:
performing accent labeling and tapping sound labeling on each audio sample slice in an audio sample slice set to obtain, respectively, an accent labeling result and a tapping sound labeling result of the audio sample slice set, where the audio sample slice set is a slice set obtained by performing frequency domain processing and slicing on audio training sample data;
training an initial accent recognition model according to the audio sample slice set and the accent labeling result to obtain the preset accent recognition model; and
training an initial tapping sound recognition model according to the audio sample slice set and the tapping sound labeling result to obtain the preset tapping sound recognition model.
In the above scheme, performing accent labeling and tapping sound labeling on each audio sample slice in the audio sample slice set to obtain, respectively, the accent labeling result and the tapping sound labeling result of the audio sample slice set includes:
for an audio sample slice in the audio sample slice set, marking the audio sample slice as an accent positive sample when it contains an accent;
marking the audio sample slice as an accent negative sample when it contains no accent;
taking the accent positive sample or accent negative sample corresponding to each audio sample slice as the accent labeling result of the audio sample slice set;
for an audio sample slice in the audio sample slice set, marking the audio sample slice as a tapping sound positive sample when it contains a tapping sound;
marking the audio sample slice as a tapping sound negative sample when it contains no tapping sound; and
taking the tapping sound positive sample or tapping sound negative sample corresponding to each audio sample slice as the tapping sound labeling result of the audio sample slice set.
In the above scheme, performing accent labeling and tapping sound labeling on each audio sample slice in the audio sample slice set to obtain, respectively, the accent labeling result and the tapping sound labeling result of the audio sample slice set includes:
for an audio sample slice in the audio sample slice set, marking the audio sample slice as an accent positive sample when it contains an accent and the distance between the region where the accent is located and the center of the audio sample slice is smaller than a preset offset threshold;
marking the audio sample slice as an accent negative sample when it contains no accent;
taking the marked accent positive samples and accent negative samples as the accent labeling result of the audio sample slice set;
for an audio sample slice in the audio sample slice set, marking the audio sample slice as a tapping sound positive sample when it contains a tapping sound and the distance between the region where the tapping sound is located and the center of the audio sample slice is smaller than the preset offset threshold;
marking the audio sample slice as a tapping sound negative sample when it contains no tapping sound; and
taking the marked tapping sound positive samples and tapping sound negative samples as the tapping sound labeling result of the audio sample slice set.
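To make the center-offset criterion concrete, here is a minimal Python sketch of the positive/negative decision for one sample slice; the 50 ms offset threshold, the function name, and the representation of an event as a single time point are illustrative assumptions.

```python
from typing import Optional

def label_sample_slice(event_time: Optional[float], slice_start: float,
                       slice_end: float, max_offset: float = 0.05) -> bool:
    """Return True (positive sample) iff the slice contains the event
    (an accent or a tapping sound) and the event lies within
    `max_offset` seconds of the slice center; otherwise the slice is a
    negative sample. `event_time=None` means the slice contains no
    event. The 50 ms offset threshold is an assumed value."""
    if event_time is None or not (slice_start <= event_time <= slice_end):
        return False
    center = (slice_start + slice_end) / 2.0
    return abs(event_time - center) < max_offset
```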
In a second aspect, embodiments of the present disclosure provide an audio data processing apparatus, including a frequency domain processing unit, a slicing unit, a recognition unit and a merging unit, wherein,
the frequency domain processing unit is used for performing frequency domain processing on the audio data to be processed to obtain a spectrogram of the audio data to be processed;
the slicing unit is used for slicing the spectrogram according to a preset pixel interval to obtain an audio slice set;
the recognition unit is used for performing sound effect recognition on each audio slice in the audio slice set with a preset accent recognition model and a preset tapping sound recognition model, respectively, to obtain a sound effect recognition result of the audio slice set; the preset accent recognition model is a model that predicts accents from the spectral features of each audio slice; the preset tapping sound recognition model is a model that predicts tapping sounds from the spectral features of each audio slice;
and the merging unit is used for carrying out time domain conversion and merging on each corresponding audio slice in the audio slice set based on the sound effect identification result to obtain final audio data with sound effect marks.
In the above audio data processing apparatus, the recognition unit is specifically configured to analyze the sound features of each audio slice with the preset accent recognition model and predict the accent confidence of each audio slice, the accent confidence being the probability that the audio slice contains an accent; analyze the sound features of each audio slice with the preset tapping sound recognition model and predict the tapping sound confidence of each audio slice, the tapping sound confidence being the probability that the audio slice contains a tapping sound; and take the accent confidence and tapping sound confidence corresponding to each audio slice as the sound effect recognition result of the audio slice set.
In the above audio data processing apparatus, the merging unit includes a time domain conversion unit, a determination unit, a marking unit, and a merging subunit, wherein,
the time domain conversion unit is used for performing time domain conversion on the audio slice set to obtain audio data to be merged corresponding to the audio slice set, where each audio slice corresponds to a segment of audio data in one time slice;
the determining unit is used for determining accent audio slices and tapping sound audio slices in the audio data to be merged according to the sound effect recognition result; an accent audio slice is an audio slice whose accent confidence in the sound effect recognition result is higher than a preset accent threshold; a tapping sound audio slice is an audio slice whose tapping sound confidence is higher than a preset tapping sound threshold;
the marking unit is used for marking an accent timestamp on the time slice corresponding to each accent audio slice, the accent timestamp being the center time point of that time slice;
the marking unit is further used for marking a tapping sound timestamp on the time slice corresponding to each tapping sound audio slice, the tapping sound timestamp being the center time point of that time slice;
the merging subunit is configured to merge the accent timestamps and tapping sound timestamps in the audio data to be merged to obtain the final audio data with sound effect marks, where the sound effect marks are the merged timestamps.
In the above audio data processing apparatus, the merging subunit is specifically configured to merge the accent timestamps in the audio data to be merged with one another, and the tapping sound timestamps with one another, to obtain intermediate audio data;
and to merge the accent timestamps and tapping sound timestamps contained in the intermediate audio data with each other to obtain the final audio data with sound effect marks, where the sound effect marks are the merged timestamps.
In the above audio data processing apparatus, the merging subunit is specifically configured to, when at least two accent timestamps exist within a preset time interval in the audio data to be merged, merge them and keep the accent timestamp corresponding to the audio slice with the highest confidence; when at least two tapping sound timestamps exist within the preset time interval, merge them and keep the tapping sound timestamp corresponding to the audio slice with the highest confidence; and continue to detect and merge over the audio data to be merged until no two accent timestamps and no two tapping sound timestamps fall within the preset time interval, thereby obtaining the intermediate audio data.
In the above audio data processing apparatus, the merging subunit is specifically configured to, in the intermediate audio data, merge an accent timestamp and a tapping sound timestamp when the time interval between them is smaller than the preset time interval, keeping the accent timestamp or tapping sound timestamp corresponding to the audio slice with the higher confidence;
and to continue to detect and merge over the intermediate audio data until the time interval between any accent timestamp and any tapping sound timestamp is larger than the preset time interval, thereby obtaining the final audio data with sound effect marks.
In the above-mentioned audio data processing apparatus, the audio data processing apparatus further comprises an audio special effect unit, wherein,
the audio special effect unit is used for, after the final audio data with sound effect marks is obtained by time domain conversion and merging of the corresponding audio slices based on the sound effect recognition result, taking the timestamp of each sound effect mark in the final audio data as a center point and taking the maximum sound intensity within a preset duration range as the sound intensity corresponding to that sound effect mark; and performing audio special effect processing based on the sound intensity corresponding to each sound effect mark to obtain sound intensity special effects for the accents and tapping sounds in the audio data to be processed.
In the above-mentioned audio data processing device, the audio data processing device further comprises a training unit, wherein,
the training unit is used for, before the frequency domain processing is performed on the audio data to be processed to obtain its spectrogram, performing accent labeling and tapping sound labeling on each audio sample slice in an audio sample slice set to obtain, respectively, an accent labeling result and a tapping sound labeling result of the audio sample slice set, where the audio sample slice set is a slice set obtained by performing frequency domain processing and slicing on audio training sample data; training an initial accent recognition model according to the audio sample slice set and the accent labeling result to obtain the preset accent recognition model; and training an initial tapping sound recognition model according to the audio sample slice set and the tapping sound labeling result to obtain the preset tapping sound recognition model.
In the above audio data processing device, the training unit further comprises a first sample marking unit, wherein,
the first sample marking unit is used for, for an audio sample slice in the audio sample slice set, marking the audio sample slice as an accent positive sample when it contains an accent, and as an accent negative sample when it contains no accent; taking the accent positive sample or accent negative sample corresponding to each audio sample slice as the accent labeling result of the audio sample slice set; for an audio sample slice in the audio sample slice set, marking the audio sample slice as a tapping sound positive sample when it contains a tapping sound, and as a tapping sound negative sample when it contains no tapping sound; and taking the tapping sound positive sample or tapping sound negative sample corresponding to each audio sample slice as the tapping sound labeling result of the audio sample slice set.
In the above audio data processing device, the training unit further comprises a second sample marking unit, wherein,
the second sample marking unit is configured to, for an audio sample slice in the audio sample slice set, mark the audio sample slice as an accent positive sample when it contains an accent and the distance between the region where the accent is located and the center of the audio sample slice is smaller than a preset offset threshold, and as an accent negative sample when it contains no accent; take the marked accent positive samples and accent negative samples as the accent labeling result of the audio sample slice set; for an audio sample slice in the audio sample slice set, mark the audio sample slice as a tapping sound positive sample when it contains a tapping sound and the distance between the region where the tapping sound is located and the center of the audio sample slice is smaller than the preset offset threshold, and as a tapping sound negative sample when it contains no tapping sound; and take the marked tapping sound positive samples and tapping sound negative samples as the tapping sound labeling result of the audio sample slice set.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including a processor, a memory, and a communication bus, where the memory communicates with the processor through the communication bus, and the memory stores one or more programs executable by the processor, and when the one or more programs are executed, the processor performs an audio data processing method as provided by the embodiment of the present disclosure.
In a fourth aspect, the disclosed embodiments provide a storage medium storing one or more programs executable by one or more processors to implement an audio data processing method as provided by the disclosed embodiments.
The embodiments of the disclosure have the following beneficial effects: the audio data processing device analyzes and recognizes the sound features of accents and tapping sounds with separate recognition models for accents and for tapping sounds, obtaining more accurate sound effect recognition results for both; it then merges and de-duplicates the recognition results containing accents and tapping sounds to obtain final audio data with sound effect marks, further improving the recognition accuracy for accents and tapping sounds.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
Fig. 1 is a schematic structural diagram of an electronic device 100 implementing an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of an alternative architecture of an electronic device embodying embodiments of the present disclosure;
FIG. 3 is a schematic flow chart of an alternative method of audio data processing implementing an embodiment of the present disclosure;
FIG. 4 is a schematic flow chart of an alternative method of audio data processing implementing an embodiment of the present disclosure;
FIG. 5 is a schematic flow chart of an alternative method of audio data processing implementing an embodiment of the present disclosure;
FIG. 6 is a schematic flow chart of an alternative method of audio data processing implementing an embodiment of the present disclosure;
FIG. 7 is a schematic flow chart diagram of an alternative method of audio data processing implementing an embodiment of the present disclosure;
FIG. 8 is a schematic flow chart of an alternative method of audio data processing implementing an embodiment of the present disclosure;
FIG. 9 is a schematic flow chart of an alternative method of audio data processing implementing an embodiment of the present disclosure;
Fig. 10 is a schematic flow chart of an alternative method of processing audio data to implement an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that the present disclosure will be thorough and complete. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "one", "a plurality" and "a plurality" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that "one or more" is intended to be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
Referring now to fig. 1, fig. 1 is a schematic diagram of an electronic device 100 implementing an embodiment of the present disclosure. The electronic device may be any of various terminals, including mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (PADs), portable multimedia players (PMPs) and in-vehicle terminals (e.g., in-vehicle navigation terminals), as well as stationary terminals such as digital televisions (TVs) and desktop computers. The electronic device shown in fig. 1 is only an example and should not impose any limitation on the functionality and scope of use of the disclosed embodiments.
As shown in fig. 1, the electronic device 100 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 110 that can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 120 or a program loaded from a storage device 180 into a random access memory (RAM) 130. The RAM 130 also stores various programs and data required for the operation of the electronic device 100. The processing device 110, the ROM 120 and the RAM 130 are connected to one another by a bus 140. An input/output (I/O) interface 150 is also connected to the bus 140.
In general, the following devices may be connected to the I/O interface 150: input devices 160 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 170 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; storage devices 180 including, for example, magnetic tape, hard disk, etc.; and a communication device 190. The communication device 190 may allow the electronic device 100 to communicate wirelessly or by wire with other devices to exchange data. While fig. 1 shows an electronic device 100 having various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided; more or fewer devices may alternatively be implemented or provided.
In particular, according to embodiments of the present disclosure, the processes described by the provided flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 190, or installed from the storage device 180, or installed from the ROM 120. The functions in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 110.
It should be noted that, the computer readable medium, i.e., the storage medium, according to the embodiments of the present disclosure may be a computer readable signal medium, a computer readable storage medium, or any combination of the two. The computer readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a RAM, a ROM, an erasable programmable read-only memory (EPROM, erasable Programmable Read Only Memory), a flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In the disclosed embodiments, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the disclosed embodiments, the computer-readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including electrical wiring, optical fiber cable, radio Frequency (RF), the like, or any suitable combination of the foregoing.
The computer readable medium may be contained in the electronic device 100; or may exist alone without being assembled into the electronic device 100.
The computer readable medium carries one or more programs which, when executed by the electronic device 100, cause the electronic device to perform the audio data processing method provided by the embodiments of the present disclosure.
Computer program code for carrying out operations in embodiments of the present disclosure may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) and a wide area network (WAN), or may be connected to an external computer (e.g., through the internet using an internet service provider).
The flowcharts and block diagrams provided by the embodiments of the present disclosure illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. The name of the unit does not in any way constitute a limitation of the unit itself, for example the first acquisition unit may also be described as "unit acquiring at least two internet protocol addresses".
The functions described in the embodiments of the present disclosure may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), and the like.
In the context of the disclosed embodiments, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Units and/or modules in an audio data processing device are described below in connection with embodiments of the present disclosure. It will be appreciated that the units or modules in the apparatus may be implemented in the electronic device shown in fig. 1 in the form of software (e.g., a computer program as described above) or in the form of the hardware logic components described above (e.g., FPGAs, ASICs, ASSPs, SOCs and CPLDs).
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
Referring to fig. 2, fig. 2 is an alternative structural schematic diagram of an audio data processing device 200 implementing an embodiment of the present disclosure, showing the following modules: a frequency domain processing unit 210, a slicing unit 220, an identifying unit 230, and a combining unit 240.
It should be noted that the above classification of units does not constitute a limitation on the electronic device itself, for example, some units may be split into two or more sub-units, or some units may be combined into one new unit.
It should also be noted that the names of the above units do not constitute limitations on the units themselves in some cases, and for example, the above frequency domain processing unit 210 may also be described as a unit that "performs frequency domain processing on audio data to be processed to obtain a spectrogram of the audio data to be processed".
For the same reason, units and/or modules not described in detail for the electronic device do not indicate the absence of the corresponding units and/or modules, and any operations performed by the electronic device may be performed by the corresponding units and/or modules in it.
With continued reference to fig. 3, fig. 3 is a schematic flowchart of an alternative implementation of the audio data processing method according to the embodiment of the present disclosure. For example, when the processing device 110 loads a program from the read-only memory (ROM) 120 or from the storage device 180 into the random access memory (RAM) 130 and executes it, the audio data processing method shown in fig. 3 may be implemented. The steps shown in fig. 3 are described below.
S101, performing frequency domain processing on the audio data to be processed to obtain a spectrogram of the audio data to be processed.
In the embodiment of the disclosure, an audio data processing device firstly performs frequency domain processing on audio data to be processed to obtain a spectrogram of the audio data to be processed.
In the embodiment of the disclosure, the audio data to be processed is a segment of time-domain audio in which accents and tapping sounds need to be identified; the audio data processing device first converts the time-domain audio data into the frequency domain so that further recognition can be performed on its spectral features.
In the embodiment of the present disclosure, the frequency domain processing of the audio data to be processed may use a Fourier transform, or other methods; the embodiments of the present disclosure are not limited in this respect.
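As a concrete illustration of this step, the following Python sketch converts a time-domain signal into a log-magnitude spectrogram with a short-time Fourier transform; the library choice (scipy), FFT size and hop length are illustrative assumptions rather than values fixed by the disclosure.

```python
import numpy as np
from scipy.signal import stft

def to_spectrogram(samples: np.ndarray, sample_rate: int,
                   n_fft: int = 1024, hop: int = 256) -> np.ndarray:
    """Convert time-domain audio into a log-magnitude spectrogram.

    Returns an array of shape (n_freq_bins, n_time_frames); each column
    is one time frame, playing the role of one horizontal 'pixel' in
    the spectrogram image that is sliced in the next step.
    """
    _, _, z = stft(samples, fs=sample_rate, nperseg=n_fft,
                   noverlap=n_fft - hop)
    return np.log1p(np.abs(z))
```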
S102, slicing the spectrogram according to preset pixel intervals to obtain an audio slice set.
In the embodiment of the disclosure, after obtaining the spectrogram of the audio data to be processed, the audio data processing device slices the spectrogram at a preset pixel interval, dividing it into one or more audio slices of fixed width, thereby obtaining the audio slice set.
In some embodiments, the preset pixel interval may be 4 pixels, or another preset value selected according to the actual situation; the embodiments of the present disclosure are not limited in this respect.
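Continuing the sketch above, slicing can be illustrated as cutting the spectrogram array into fixed-width windows; treating the 4-pixel interval as the step between slice start positions, and the slice width itself, are assumptions, since the disclosure does not fix these details.

```python
import numpy as np

def slice_spectrogram(spec: np.ndarray, interval: int = 4,
                      width: int = 64) -> list:
    """Cut a spectrogram (freq_bins x time_frames) into fixed-width
    slices. `interval` is the preset pixel interval between slice start
    positions (4 in the example above); `width` is an assumed slice
    width, which the disclosure does not specify."""
    n_frames = spec.shape[1]
    return [spec[:, start:start + width]
            for start in range(0, n_frames - width + 1, interval)]
```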
S103, performing sound effect recognition on each audio slice in the audio slice set with a preset accent recognition model and a preset tapping sound recognition model, respectively, to obtain a sound effect recognition result of the audio slice set; the preset accent recognition model is a model that predicts accents from the spectral features of each audio slice; the preset tapping sound recognition model is a model that predicts tapping sounds from the spectral features of each audio slice.
In the embodiment of the disclosure, after the audio data processing device obtains the audio slice set, it may perform spectral analysis on each audio slice in the set with the preset accent recognition model to identify whether the slice contains an accent, and perform spectral analysis on each audio slice with the preset tapping sound recognition model to identify whether the slice contains a tapping sound, thereby obtaining the sound effect recognition result of the audio slice set.
In the embodiment of the disclosure, accents and tapping sounds differ from other sounds in their spectral features in a spectrogram. The preset accent recognition model is pre-trained on accent spectral features and can predict, from the spectral features of an audio slice, the likelihood that the slice contains an accent; the preset tapping sound recognition model is pre-trained on tapping sound spectral features and can predict, from the spectral features of an audio slice, the likelihood that the slice contains a tapping sound.
In the embodiment of the disclosure, after sound effect recognition is performed on each audio slice in the audio slice set with the preset accent recognition model and the preset tapping sound recognition model, each audio slice obtains a corresponding accent recognition result and tapping sound recognition result, from which the audio data processing device obtains the sound effect recognition result of the audio slice set.
In the embodiment of the disclosure, the sound effect recognition result characterizes the likelihood that each corresponding audio slice in the audio slice set contains an accent and a tapping sound, respectively.
S104, based on the sound effect recognition result, performing time domain conversion and merging on the corresponding audio slices in the audio slice set to obtain final audio data with sound effect marks.
In the embodiment of the disclosure, after obtaining the sound effect recognition result, the audio data processing device may perform time domain conversion and merging on the corresponding audio slices in the audio slice set based on that result to obtain the final audio data with sound effect marks.
In the embodiment of the disclosure, since the sound effect recognition results obtained for the individual audio slices may contain misrecognitions or repeated recognitions, the audio data processing device may first convert each audio slice in the audio slice set back to the time domain, merge the sound effect recognition results corresponding to the slices in the time domain, and remove low-likelihood or adjacent duplicate accent and tapping sound recognition results, finally obtaining the final audio data with sound effect marks.
It can be appreciated that, in the embodiment of the disclosure, the audio data processing device analyzes and recognizes the sound features of accents and tapping sounds with separate recognition models, obtaining more accurate recognition results for both, and then merges and de-duplicates those results to obtain the final audio data with sound effect marks, which further improves the recognition accuracy for accents and tapping sounds.
In the embodiment of the present disclosure, based on fig. 3, the sound effect recognition in S103, performed on each audio slice in the audio slice set with the preset accent recognition model and the preset tapping sound recognition model, may be implemented as shown in fig. 4, including S1031-S1033, as follows:
S1031, analyzing the sound features of each audio slice with the preset accent recognition model and predicting the accent confidence of each audio slice; the accent confidence is the probability that the audio slice contains an accent.
In the embodiment of the disclosure, the audio data processing device uses the preset accent recognition model to analyze the sound features, i.e., the spectral features, of each audio slice and predicts the probability that the slice contains an accent as its accent confidence.
S1032, analyzing the sound features of each audio slice with the preset tapping sound recognition model and predicting the tapping sound confidence of each audio slice; the tapping sound confidence is the probability that the audio slice contains a tapping sound.
In the embodiment of the disclosure, the audio data processing device uses the preset tapping sound recognition model to analyze the sound features, i.e., the spectral features, of each audio slice and predicts the probability that the slice contains a tapping sound as its tapping sound confidence.
S1033, taking the accent confidence and tapping sound confidence corresponding to each audio slice as the sound effect recognition result of the audio slice set.
In the embodiment of the disclosure, the audio data processing device takes the accent confidence and tapping sound confidence corresponding to each audio slice as the sound effect recognition result of the audio slice set.
It can be appreciated that, in the embodiment of the disclosure, because the respective sound features of accents and tapping sounds are analyzed and recognized separately, the recognition confusion between accents and tapping sounds that could arise from using a single shared model is avoided, improving the recognition accuracy for both.
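A minimal sketch of S1031-S1033 follows, assuming each recognizer is an already-trained binary classifier exposing a `predict` method that maps one spectrogram slice to a probability; the method name and model interface are assumptions for illustration.

```python
def recognize_effects(slices, accent_model, tap_model):
    """S1031-S1033: run both recognizers independently over every slice.

    Each model is assumed to map one spectrogram slice to the
    probability (confidence) that the slice contains an accent or a
    tapping sound, respectively. The returned list of
    (accent_confidence, tapping_confidence) pairs is the sound effect
    recognition result of the slice set."""
    result = []
    for s in slices:
        accent_conf = float(accent_model.predict(s))  # P(slice contains an accent)
        tap_conf = float(tap_model.predict(s))        # P(slice contains a tapping sound)
        result.append((accent_conf, tap_conf))
    return result
```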
In the embodiment of the disclosure, based on fig. 4, the time domain conversion and merging of the corresponding audio slices in S104, which yields the final audio data with sound effect marks, may be implemented as shown in fig. 5, including S1041-S1045, as follows:
S1041, performing time domain conversion on the audio slice set to obtain audio data to be merged corresponding to the audio slice set; each audio slice corresponds to a segment of audio data in one time slice.
In the embodiment of the disclosure, in order to merge the sound effect recognition results corresponding to the audio slice set, the audio data processing device first converts the audio slice set obtained by frequency domain division back to the time domain, where each audio slice corresponds to the audio data of one time segment, thereby obtaining the audio data to be merged corresponding to the audio slice set.
S1042, determining accent audio slices and tapping sound audio slices in the audio data to be merged according to the sound effect recognition result; an accent audio slice is an audio slice whose accent confidence in the sound effect recognition result is higher than a preset accent threshold; a tapping sound audio slice is an audio slice whose tapping sound confidence is higher than a preset tapping sound threshold.
In the embodiment of the present disclosure, according to the sound effect recognition result obtained in S103, the audio data processing device determines the accent audio slices and tapping sound audio slices in the audio data to be merged.
In the embodiment of the disclosure, the sound effect recognition result consists of the confidence, predicted by the preset accent recognition model, that each audio slice contains an accent, and the confidence, predicted by the preset tapping sound recognition model, that it contains a tapping sound. The audio data processing device therefore determines audio slices whose accent confidence is higher than the preset accent threshold as accent audio slices, and audio slices whose tapping sound confidence is higher than the preset tapping sound threshold as tapping sound audio slices.
In some embodiments, an audio slice may be determined to be an accent audio slice when its accent confidence in the sound effect recognition result is higher than 0.5, and a tapping sound audio slice when its tapping sound confidence is higher than 0.5; the preset accent threshold and preset tapping sound threshold may also be set to other values according to the specific situation, and the embodiments of the present disclosure are not limited in this respect.
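The threshold test can be sketched directly over the per-slice confidences produced by the previous sketch; the 0.5 values mirror the example thresholds above, while the function and variable names are assumptions.

```python
ACCENT_THRESHOLD = 0.5   # preset accent threshold (example value above)
TAPPING_THRESHOLD = 0.5  # preset tapping sound threshold (example value above)

def classify_slices(recognition_result):
    """Pick accent slices and tapping sound slices by confidence.

    `recognition_result` is the list of (accent_conf, tapping_conf)
    pairs from the previous sketch. A slice index may appear in both
    lists; that duplication is exactly what the later merging removes."""
    accent_idx = [i for i, (a, _) in enumerate(recognition_result)
                  if a > ACCENT_THRESHOLD]
    tapping_idx = [i for i, (_, t) in enumerate(recognition_result)
                   if t > TAPPING_THRESHOLD]
    return accent_idx, tapping_idx
```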
S1043, marking accent time stamps on the time slices corresponding to the accent audio slices; the accent timestamp is the center point in time of the time segment corresponding to the accent audio slice.
In an embodiment of the disclosure, the audio data processing device marks the accent time stamp on the time slice corresponding to the identified accent audio slice.
In the embodiment of the disclosure, one audio slice corresponds to one time slice in the time domain, and the audio data processing device marks the accent timestamp corresponding to the accent audio slice at the center time point of the time slice.
S1044, marking a tapping sound time stamp on a time slice corresponding to the tapping sound audio slice; the tapping sound time stamp is the center time point of the time slice corresponding to the tapping sound audio slice.
In the embodiment of the disclosure, the audio data processing device marks the time slice corresponding to the identified tapping sound audio slice with a tapping sound time stamp.
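The center point timestamping of S1043 and S1044 may be sketched as follows; the slice fields (start, end, conf) are illustrative assumptions of this sketch.

```python
def mark_timestamps(slices, label):
    """Stamp each slice at the center time point of its time slice,
    returning (time, label, confidence) tuples."""
    return [((s['start'] + s['end']) / 2.0, label, s['conf']) for s in slices]

# e.g. accent_marks = mark_timestamps(accent_slices, 'accent')
#      tap_marks    = mark_timestamps(tap_slices, 'tap')
```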
S1045, merging the accent time stamp and the knocking time stamp in the audio data to be merged to obtain final audio data with sound effect marks, wherein the sound effect marks are merged time stamps.
In the embodiment of the present disclosure, after the processing of S1043 and S1044, the accent timestamps and tapping sound timestamps have been marked in the audio data to be combined. Because accents and tapping sounds are recognized separately by the preset accent recognition model and the preset tapping sound recognition model, the resulting timestamps may partially overlap, with some sounds marked with both an accent timestamp and a tapping sound timestamp at the same time. The audio data processing apparatus therefore merges the accent timestamps and tapping sound timestamps in the audio data to be combined and removes the duplicate timestamp marks, obtaining the final audio data with sound effect marks.
In the disclosed embodiment, the sound effect marks are the merged timestamps.
It can be appreciated that, in the embodiment of the disclosure, the sound effect recognition results are converted into the time domain for timestamp marking, and the recognition results for accents and tapping sounds can be merged based on the timestamps, which reduces the probability of false recognition and repeated recognition and improves the recognition accuracy for accents and tapping sounds.
In the embodiment of the present disclosure, based on fig. 5, S1045, in which the accent timestamps and tapping sound timestamps in the audio data to be merged are merged to obtain the final audio data with sound effect marks, may be specifically implemented as shown in fig. 6, including S201-S202, as follows:
S201, merging accent timestamps with each other and merging tapping sound timestamps with each other in the audio data to be merged, to obtain intermediate audio data.
In the embodiment of the disclosure, when merging the accent timestamps and tapping sound timestamps in the audio data to be merged, the audio data processing device first merges accent timestamps only with other accent timestamps, and tapping sound timestamps only with other tapping sound timestamps; that is, each timestamp type is first merged within its own type.
In the embodiment of the disclosure, the audio data obtained after the audio processing device separately merges the accent timestamps and the tapping sound timestamps in the audio data to be merged is taken as the intermediate audio data.
In the embodiment of the disclosure, based on fig. 6, S201 may specifically be shown in fig. 7, including S2011-S2013, as follows:
S2011, when at least two accent timestamps exist within a preset time interval in the audio data to be merged, merging the at least two accent timestamps and reserving the accent timestamp corresponding to the audio slice with the highest confidence.
In the embodiment of the disclosure, the audio data processing device scans the audio data to be merged over windows of the preset time interval, and when two or more accent timestamps appear within one preset time interval, the audio data processing device merges those accent timestamps.
In the embodiment of the disclosure, a specific merging strategy for the audio data processing apparatus may be to reserve the accent timestamp corresponding to the audio slice with the highest confidence and delete the remaining accent timestamps, so as to reduce the false recognition rate of accents.
S2012, when at least two tapping sound timestamps exist within the preset time interval in the audio data to be merged, merging the at least two tapping sound timestamps and reserving the tapping sound timestamp corresponding to the audio slice with the highest confidence.
In the embodiment of the disclosure, the audio data processing device likewise scans the audio data to be merged over windows of the preset time interval, and when two or more tapping sound timestamps appear within one preset time interval, the audio data processing device merges those tapping sound timestamps.
In the embodiment of the disclosure, a specific merging strategy may be to reserve the tapping sound timestamp corresponding to the audio slice with the highest confidence and delete the remaining tapping sound timestamps, so as to reduce the false recognition rate of tapping sounds.
S2013, continuing to detect and merge the audio data to be merged at the preset time interval until no two or more accent timestamps and no two or more tapping sound timestamps exist within any preset time interval, thereby obtaining the intermediate audio data.
In the embodiment of the disclosure, the audio data processing apparatus continues to detect and merge the audio data to be merged at the preset time interval using the methods of S2011-S2012, until no two or more accent timestamps and no two or more tapping sound timestamps are detected within any preset time interval; the audio data at that point is taken as the intermediate audio data.
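One possible realization of S2011-S2013 is a greedy pass, repeated until stable, that keeps only the highest confidence timestamp of one type within any window of the preset time interval; the sketch below assumes timestamps as (time_sec, confidence) tuples and a 200 ms interval, both illustrative.

```python
def merge_same_type(stamps, interval=0.2):
    """Within one timestamp type (accent or tapping sound), whenever two
    stamps fall within `interval` seconds, keep the higher confidence one."""
    stamps = sorted(stamps)  # sort by time
    merged = []
    for t, conf in stamps:
        if merged and t - merged[-1][0] < interval:
            if conf > merged[-1][1]:
                merged[-1] = (t, conf)  # replace with higher confidence stamp
        else:
            merged.append((t, conf))
    return merged

def merge_until_stable(stamps, interval=0.2):
    """Repeat the pass until no two stamps lie within the interval,
    mirroring the repeated detection of S2013."""
    prev = None
    while prev != stamps:
        prev, stamps = stamps, merge_same_type(stamps, interval)
    return stamps
```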
S202, merging the accent time stamp and the knocking time stamp contained in the intermediate audio data again to obtain final audio data with sound effect marks, wherein the sound effect marks are merged time stamps.
In the embodiment of the disclosure, after obtaining the intermediate audio data, the audio data processing device brings the accent timestamps and tapping sound timestamps contained in the intermediate audio data into a common merging range and merges them again, and the accent timestamps and tapping sound timestamps carried in the merged audio data are used as the sound effect marks, thereby obtaining the final audio data with sound effect marks.
In the embodiment of the disclosure, based on fig. 7, S202 may specifically be as shown in fig. 8, including S2021-S2022, as follows:
S2021, in the intermediate audio data, when the time interval between an accent timestamp and a tapping sound timestamp is smaller than the preset time interval, merging the corresponding accent timestamp and tapping sound timestamp, and reserving the accent timestamp or tapping sound timestamp corresponding to the audio slice with the highest confidence.
In the embodiment of the disclosure, the audio data processing device checks whether the time interval between the accent time stamp and the knocking time stamp is smaller than a preset time interval in the intermediate audio data, and when the time interval is smaller than the preset time interval, the audio data processing device merges the corresponding accent time stamp and the knocking time stamp, and only retains the accent time stamp or the knocking time stamp corresponding to the audio slice with the highest confidence level.
In some embodiments, suppose the preset time interval is 200 ms, the intermediate audio data carries an accent timestamp with a confidence of 0.7 at 1.2 s, and a tapping sound timestamp with a confidence of 0.8 at 1.3 s; the audio data processing device then deletes the accent timestamp with confidence 0.7 at 1.2 s and retains the tapping sound timestamp with confidence 0.8 at 1.3 s, thereby merging the corresponding accent timestamp and tapping sound timestamp.
S2022, continuing to detect and merge the intermediate audio data at the preset time interval until the time interval between every accent timestamp and every tapping sound timestamp is larger than the preset time interval, thereby obtaining the final audio data with sound effect marks.
In the embodiment of the disclosure, the audio data processing device repeatedly detects and merges the intermediate audio data at the preset time interval until the time intervals between all accent timestamps and all tapping sound timestamps are larger than the preset time interval, thereby obtaining the final audio data with sound effect marks.
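The cross-type merging of S2021-S2022 can be sketched in the same way, now comparing accent stamps against tapping sound stamps and keeping whichever has the higher confidence; the (time_sec, confidence, kind) tuple layout is an assumption of this sketch. Applied to the example above (interval 0.2 s, an accent stamp (1.2, 0.7, 'accent') and a tapping stamp (1.3, 0.8, 'tap')), it keeps only the tapping stamp.

```python
def merge_cross_type(accent_stamps, tap_stamps, interval=0.2):
    """Merge the two stamp lists: when an accent stamp and a tapping
    sound stamp are closer than `interval` seconds, keep only the one
    with the higher confidence."""
    stamps = sorted(accent_stamps + tap_stamps)  # ordered by time
    merged = []
    for t, conf, kind in stamps:
        if merged and t - merged[-1][0] < interval and kind != merged[-1][2]:
            if conf > merged[-1][1]:
                merged[-1] = (t, conf, kind)
        else:
            merged.append((t, conf, kind))
    return merged
```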
It can be appreciated that, in the embodiment of the disclosure, merging is first performed among accent timestamps and among tapping sound timestamps separately, and the two merged results are then merged again with each other, which further optimizes the recognition result, reduces the probability of false recognition and repeated recognition, and improves the recognition accuracy for accents and tapping sounds.
In the embodiment of the present disclosure, based on fig. 3, after S104 the method may further include S105-S106, as shown in fig. 9, as follows:
S105, in the final audio data with sound effect marks, taking the timestamp of each sound effect mark as a center point, and taking the maximum sound intensity within a preset duration range as the sound intensity corresponding to each sound effect mark.
In the embodiment of the present disclosure, after obtaining the final audio data with the sound effect marks, the audio processing apparatus may use the timestamp of each sound effect mark as a center point, and use the maximum sound intensity data within the preset duration range as the corresponding sound intensity at each sound effect mark.
In some embodiments, the preset duration range may be [-frame_size/2, frame_size/2], where frame_size is the duration occupied by one audio frame; other preset duration ranges may also be set and are selected according to practical situations, which the embodiments of the present disclosure do not limit.
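Under the assumption that sound intensity is read from the absolute sample amplitude of the final waveform (the embodiment leaves the intensity measure open), S105 may be sketched as follows; the frame size value is illustrative.

```python
import numpy as np

def intensity_at_marks(waveform, sr, mark_times, frame_size_sec=0.032):
    """For each sound effect mark, take the maximum absolute amplitude
    within [t - frame_size/2, t + frame_size/2] as its sound intensity."""
    half = frame_size_sec / 2.0
    intensities = []
    for t in mark_times:
        lo = max(0, int((t - half) * sr))
        hi = min(len(waveform), int((t + half) * sr))
        intensities.append(float(np.max(np.abs(waveform[lo:hi]))) if hi > lo else 0.0)
    return intensities
```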
S106, performing audio special effect processing based on the corresponding sound intensity of each sound effect mark to obtain the sound intensity special effect of accent and knocking sound in the audio data to be processed.
In the embodiment of the disclosure, the audio processing device may perform audio special effect processing based on the corresponding sound intensity at each sound effect marker, so as to obtain the sound intensity special effects of accent and knocking sound in the audio data to be processed.
It can be appreciated that, in the embodiment of the disclosure, after accent and tapping sound recognition has been performed on the audio data to be processed to obtain the final audio data with sound effect marks, the sound intensity at each sound effect mark can be obtained for further audio special effect processing.
In the embodiment of the disclosure, based on the final audio data with sound effect marks, the time period in which each sound effect mark is located may also be obtained, and further special effect processing may be performed based on that time period; the specific application depends on the actual situation and is not limited by the embodiments of the present disclosure.
In the embodiment of the disclosure, based on fig. 3, before S101 the method may further include S301-S303, as shown in fig. 10, as follows:
S301, respectively performing accent labeling and tapping sound labeling on each audio sample slice in an audio sample slice set, to respectively obtain the accent labeling results and tapping sound labeling results of the audio sample slice set; the audio sample slice set is a slice set obtained by performing frequency domain processing and slicing on audio training sample data.
In the embodiment of the disclosure, in order to obtain a preset accent recognition model and a preset tapping sound recognition model, an audio data processing device firstly respectively performs accent labeling and tapping sound labeling on each audio sample slice in an audio sample slice set to respectively obtain accent labeling results and tapping sound labeling results of the audio sample slice set.
In an embodiment of the disclosure, the audio sample slice set is a slice set obtained by performing frequency domain processing and slicing on audio training sample data, where adjacent audio sample slices in the set have a fixed overlapping area.
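The overlapped slicing can be sketched as below; the slice width and hop are illustrative pixel counts, chosen only to show that a hop smaller than the slice width yields a fixed overlapping area between adjacent slices.

```python
import numpy as np

def slice_spectrogram(spec, slice_width=64, hop=32):
    """Cut a spectrogram (n_bins, n_frames) into fixed width slices along
    the time axis; hop < slice_width gives adjacent slices a fixed overlap."""
    return [spec[:, start:start + slice_width]
            for start in range(0, spec.shape[1] - slice_width + 1, hop)]
```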
In the embodiment of the present disclosure, a specific implementation of S301 may be as follows:
For one audio sample slice in the set of audio sample slices, when the audio sample slice contains an accent, marking the audio sample slice as an accent positive sample; when the audio sample slice contains no accent, marking the audio sample slice as an accent negative sample; and taking the accent positive sample or accent negative sample corresponding to each audio sample slice as the accent labeling result of the audio sample slice set. Likewise, for one audio sample slice in the set, when the audio sample slice contains a tapping sound, marking the audio sample slice as a tapping sound positive sample; when the audio sample slice contains no tapping sound, marking the audio sample slice as a tapping sound negative sample; and taking the tapping sound positive sample or tapping sound negative sample corresponding to each audio sample slice as the tapping sound labeling result of the audio sample slice set.
In the embodiment of the present disclosure, another specific implementation of S301 may be as follows:
For one audio sample slice in the audio sample slice set, when the audio sample slice contains an accent and the distance between the region where the accent is located and the center of the audio sample slice is smaller than a preset offset threshold, marking the audio sample slice as an accent positive sample; when the audio sample slice contains no accent, marking the audio sample slice as an accent negative sample; and taking the marked accent positive samples and accent negative samples as the accent labeling result of the audio sample slice set. Likewise, for one audio sample slice in the set, when the audio sample slice contains a tapping sound and the distance between the region where the tapping sound is located and the center of the audio sample slice is smaller than the preset offset threshold, marking the audio sample slice as a tapping sound positive sample; when the audio sample slice contains no tapping sound, marking the audio sample slice as a tapping sound negative sample; and taking the marked tapping sound positive samples and negative samples as the tapping sound labeling result of the audio sample slice set.
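A sketch of this second, center constrained labeling strategy follows; the slice fields and the offset threshold value are illustrative, and slices that contain an accent but fail the center distance test are simply left unlabeled here, which is one possible reading of the strategy.

```python
def label_slice_for_accent(slice_info, offset_threshold=0.05):
    """Label one sample slice for accent training: positive only when the
    slice contains an accent whose region lies close to the slice center."""
    if not slice_info['has_accent']:
        return 0                                  # accent negative sample
    center = (slice_info['start'] + slice_info['end']) / 2.0
    if abs(slice_info['accent_pos'] - center) < offset_threshold:
        return 1                                  # accent positive sample
    return None                                   # off-center: left unlabeled
```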
S302, training an initial accent recognition model according to the audio sample slice set and the accent labeling result, to obtain the preset accent recognition model.
In the embodiment of the disclosure, the audio data processing device trains the initial accent recognition model according to the audio sample slice set and the accent labeling result; the accuracy of the model in recognizing accents gradually improves and its predictions gradually approach the accent labeling result, and the preset accent recognition model is obtained when training finishes.
S303, training an initial tapping sound recognition model according to the audio sample slice set and the tapping sound labeling result, to obtain the preset tapping sound recognition model.
In the embodiment of the disclosure, the audio data processing device trains the initial tapping sound recognition model according to the audio sample slice set and the tapping sound labeling result; the accuracy of the model in recognizing tapping sounds gradually improves and its predictions gradually approach the tapping sound labeling result, and the preset tapping sound recognition model is obtained when training finishes.
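A minimal training sketch for one of the two recognizers is given below, treating spectrogram slices as single channel images with binary labels; the architecture, optimizer and hyperparameters are placeholders of this sketch and are not specified by the embodiment.

```python
import torch
import torch.nn as nn

class SliceClassifier(nn.Module):
    """Toy binary classifier over (1, n_bins, n_frames) spectrogram slices."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(8, 1),   # one logit: does the slice contain the event?
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def train_step(model, optimizer, slices, labels):
    """One optimization step on a batch of labeled slices."""
    criterion = nn.BCEWithLogitsLoss()
    optimizer.zero_grad()
    loss = criterion(model(slices), labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()

# e.g. model = SliceClassifier()
#      optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```

The same sketch would be instantiated twice, once on the accent labeled samples and once on the tapping sound labeled samples, matching the separate training of S302 and S303.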
It can be appreciated that, in the embodiment of the disclosure, in addition to labeling accents, the audio data processing apparatus also labels tapping sounds, and trains the initial accent recognition model and the initial tapping sound recognition model separately on the labeled samples. This ensures that the trained preset accent recognition model and preset tapping sound recognition model can clearly distinguish accents from tapping sounds, reducing the false recognition rate and thereby improving the recognition accuracy for accents and tapping sounds.
According to one or more embodiments of the present disclosure, there is provided an audio data processing apparatus, including a frequency domain processing unit, a slicing unit, an identification unit and a merging unit, wherein,
the frequency domain processing unit is used for performing frequency domain processing on the audio data to be processed to obtain a spectrogram of the audio data to be processed;
the slicing unit is used for slicing the spectrogram according to a preset pixel interval to obtain an audio slice set;
the recognition unit is used for respectively carrying out sound effect recognition on each audio slice in the audio slice set by using a preset accent recognition model and a preset knocking sound recognition model to obtain a sound effect recognition result of the audio slice set; the preset accent recognition model is a model for predicting accents according to the frequency spectrum characteristics of each audio slice; the preset knocking sound identification model is a model for predicting knocking sound according to the frequency spectrum characteristics of each audio slice;
and the merging unit is used for carrying out time domain conversion and merging on each corresponding audio slice in the audio slice set based on the sound effect identification result to obtain final audio data with sound effect marks.
In some embodiments, the identifying unit is specifically configured to analyze the sound feature of each audio slice using the preset stress identifying model, and predict a stress confidence level of each audio slice; the stress confidence is the probability that each audio slice contains stress; analyzing the sound characteristics of each audio slice by using the preset knocking sound identification model, and predicting the knocking sound confidence of each audio slice; the knocking sound confidence is the probability that each audio slice contains knocking sound; and taking the stress confidence coefficient and the knocking sound confidence coefficient corresponding to each audio slice as the sound effect identification result of the audio slice set.
In some embodiments, the merging unit comprises a time domain conversion unit, a determination unit, a marking unit and a merging subunit, wherein,
the time domain conversion unit is used for performing time domain conversion on the audio slice set to obtain audio data to be combined corresponding to the audio slice set; each audio slice is a section of audio data corresponding to the time slice;
the determining unit is used for determining accent audio slices and knocking audio slices in the audio data to be combined according to the sound effect identification result; the accent audio slice is an audio slice with accent confidence coefficient higher than a preset accent threshold value in the sound effect identification result; the tapping sound audio slice is an audio slice with the tapping sound confidence coefficient higher than a preset tapping sound threshold value in the sound effect identification result;
The marking unit is used for marking accent time stamps on the time slices corresponding to the accent audio slices; the accent time stamp is the center time point of the time slice corresponding to the accent audio slice;
the marking unit is further used for marking a tapping sound time stamp on the time slice corresponding to the tapping sound audio slice; the tapping sound time stamp is the center time point of the time slice corresponding to the tapping sound audio slice;
the merging subunit is configured to merge the accent timestamp and the tapping timestamp in the audio data to be merged to obtain the final audio data with the sound effect label, where the sound effect label is the merged timestamp.
In some embodiments, the merging subunit is specifically configured to merge accent timestamps with each other and tapping sound timestamps with each other in the audio data to be merged, to obtain intermediate audio data;
and merging the accent time stamp and the knocking time stamp contained in the intermediate audio data again to obtain the final audio data with the sound effect mark, wherein the sound effect mark is the merged time stamp.
In some embodiments, the merging subunit is specifically configured to, when at least two accent timestamps exist within a preset time interval in the audio data to be merged, merge the at least two accent timestamps and reserve the accent timestamp corresponding to the audio slice with the highest confidence; when at least two tapping sound timestamps exist within the preset time interval in the audio data to be merged, merge the at least two tapping sound timestamps and reserve the tapping sound timestamp corresponding to the audio slice with the highest confidence; and continue to detect and merge the audio data to be merged at the preset time interval until at least two accent timestamps no longer exist within the preset time interval and at least two tapping sound timestamps no longer exist within the preset time interval, thereby obtaining the intermediate audio data.
In some embodiments, the merging subunit is specifically configured to merge, in the intermediate audio data, the corresponding accent timestamp and the tapping sound timestamp when the time interval between the accent timestamp and the tapping sound timestamp is smaller than the preset time interval, and reserve the accent timestamp or the tapping sound timestamp corresponding to the audio slice with the highest confidence level;
And continuously detecting and merging the intermediate audio data at preset time intervals until the time interval between the accent time stamp and the knocking time stamp is larger than the preset time interval, so as to obtain the final audio data with the sound effect marks.
In some embodiments, the apparatus further includes an audio special effect unit configured to, after the time domain conversion and merging of each corresponding audio slice in the audio slice set based on the sound effect recognition result yields the final audio data with sound effect marks, take the timestamp of each sound effect mark as a center point in the final audio data with sound effect marks and take the maximum sound intensity within a preset duration range as the sound intensity corresponding to each sound effect mark; and perform audio special effect processing based on the sound intensity corresponding to each sound effect mark, to obtain the sound intensity special effects of accents and tapping sounds in the audio data to be processed.
In some embodiments, the apparatus further includes a training unit configured to, before the frequency domain processing is performed on the audio data to be processed to obtain its spectrogram, respectively perform accent labeling and tapping sound labeling on each audio sample slice in an audio sample slice set to obtain the accent labeling result and the tapping sound labeling result of the audio sample slice set, the audio sample slice set being a slice set obtained by performing frequency domain processing and slicing on audio training sample data; train an initial accent recognition model according to the audio sample slice set and the accent labeling result to obtain the preset accent recognition model; and train an initial tapping sound recognition model according to the audio sample slice set and the tapping sound labeling result to obtain the preset tapping sound recognition model.
In some embodiments, the training unit further comprises a first sample marking unit, wherein,
The first sample marking unit is configured to, for one audio sample slice in the audio sample slice set, mark the audio sample slice as an accent positive sample when the audio sample slice contains an accent, and mark it as an accent negative sample when it contains no accent, taking the accent positive sample or accent negative sample corresponding to each audio sample slice as the accent labeling result of the audio sample slice set; and, for one audio sample slice in the set, mark the audio sample slice as a tapping sound positive sample when it contains a tapping sound, and mark it as a tapping sound negative sample when it contains no tapping sound, taking the tapping sound positive sample or tapping sound negative sample corresponding to each audio sample slice as the tapping sound labeling result of the audio sample slice set.
In some embodiments, the training unit further comprises a second sample marking unit, wherein,
The second sample marking unit is configured to, for one audio sample slice in the audio sample slice set, mark the audio sample slice as an accent positive sample when the audio sample slice contains an accent and the distance between the region where the accent is located and the center of the audio sample slice is smaller than a preset offset threshold, and mark it as an accent negative sample when it contains no accent, taking the marked accent positive samples and accent negative samples as the accent labeling result of the audio sample slice set; and, for one audio sample slice in the set, mark the audio sample slice as a tapping sound positive sample when it contains a tapping sound and the distance between the region where the tapping sound is located and the center of the audio sample slice is smaller than the preset offset threshold, and mark it as a tapping sound negative sample when it contains no tapping sound, taking the marked tapping sound positive samples and negative samples as the tapping sound labeling result of the audio sample slice set.
According to one or more embodiments of the present disclosure, there is provided an electronic device including a processor, a memory, and a communication bus, the memory being in communication with the processor through the communication bus, the memory storing one or more programs executable by the processor, the processor performing an audio data processing method as provided by the embodiments of the present disclosure when the one or more programs are executed.
According to one or more embodiments of the present disclosure, there is provided a storage medium storing one or more programs executable by one or more processors to implement an audio data processing method as provided by the embodiments of the present disclosure.
The foregoing description is only illustrative of the embodiments of the present disclosure and the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to the specific combinations of the features described above, but also covers other embodiments formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, embodiments formed by replacing the above features with technical features having similar functions disclosed in the present disclosure (but not limited thereto).
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (13)

1. A method of processing audio data, comprising:
carrying out frequency domain processing on audio data to be processed to obtain a spectrogram of the audio data to be processed;
slicing the spectrogram according to preset pixel intervals to obtain an audio slice set;
respectively carrying out sound effect recognition on each audio slice in an audio slice set by using a preset accent recognition model and a preset knocking sound recognition model to obtain a sound effect recognition result of the audio slice set; the preset accent recognition model is a model for predicting accents according to the frequency spectrum characteristics of each audio slice; the preset knocking sound identification model is a model for predicting knocking sound according to the frequency spectrum characteristics of each audio slice;
and carrying out time domain conversion and combination on each corresponding audio slice in the audio slice set based on the sound effect identification result to obtain final audio data with sound effect marks.
2. The method of claim 1, wherein the performing the sound effect recognition on each audio slice in the audio slice set using the preset accent recognition model and the preset tap tone recognition model to obtain the sound effect recognition result of the audio slice set includes:
analyzing the sound characteristics of each audio slice by using the preset accent recognition model, and predicting accent confidence of each audio slice; the stress confidence is the probability that each audio slice contains stress;
analyzing the sound characteristics of each audio slice by using the preset knocking sound identification model, and predicting the knocking sound confidence coefficient of each audio slice; the knocking sound confidence is the probability that each audio slice contains knocking sound;
and taking the stress confidence coefficient and the knocking sound confidence coefficient corresponding to each audio slice as the sound effect identification result of the audio slice set.
3. The method of claim 2, wherein the performing time-domain conversion and merging on each corresponding audio slice in the audio slice set based on the sound effect recognition result to obtain final audio data with sound effect marks includes:
Performing time domain conversion on the audio slice set to obtain audio data to be combined corresponding to the audio slice set; each audio slice is a section of audio data corresponding to the time slice;
determining accent audio slices and knocking audio slices in the audio data to be combined according to the sound effect identification result; the accent audio slice is an audio slice with accent confidence coefficient higher than a preset accent threshold value in the sound effect identification result; the tapping sound audio slice is an audio slice with the tapping sound confidence coefficient higher than a preset tapping sound threshold value in the sound effect identification result;
marking accent time stamps on the time slices corresponding to the accent audio slices; the accent time stamp is the center time point of the time slice corresponding to the accent audio slice;
marking a tapping sound time stamp on the time slice corresponding to the tapping sound audio slice; the tapping sound time stamp is the center time point of the time slice corresponding to the tapping sound audio slice;
and merging the accent time stamp and the knocking time stamp in the audio data to be merged to obtain the final audio data with the sound effect mark, wherein the sound effect mark is the merged time stamp.
4. A method according to claim 3, wherein said merging the accent time stamp and the tapping time stamp in the audio data to be merged to obtain the final audio data with the sound effect flag comprises:
combining stress time stamps in the audio data to be combined and combining knocking time stamps to obtain intermediate audio data;
and merging the accent time stamp and the knocking time stamp contained in the intermediate audio data again to obtain the final audio data with the sound effect mark, wherein the sound effect mark is the merged time stamp.
5. The method of claim 4, wherein merging between accent timestamps and merging between tapping timestamps in the audio data to be merged to obtain intermediate audio data comprises:
when at least two accent time stamps exist in the audio data to be combined within a preset time interval, combining the at least two accent time stamps, and reserving the accent time stamp corresponding to the audio slice with the highest confidence level;
when at least two tapping sound time stamps exist in the audio data to be combined within the preset time interval, combining the at least two tapping sound time stamps, and reserving the tapping sound time stamp corresponding to the audio slice with the highest confidence coefficient;
And continuously detecting and merging the audio data to be combined at the preset time interval until at least two accent time stamps no longer exist within the preset time interval and at least two knocking time stamps no longer exist within the preset time interval, thereby obtaining the intermediate audio data.
6. The method of claim 5, wherein the re-merging between the accent time stamp and the tapping time stamp contained in the intermediate audio data to obtain the final audio data with the sound effect mark comprises:
in the intermediate audio data, when the time interval between the accent time stamp and the knocking time stamp is smaller than the preset time interval, merging the corresponding accent time stamp and the knocking time stamp, and reserving the accent time stamp or the knocking time stamp corresponding to the audio slice with the highest confidence;
and continuously detecting and merging the intermediate audio data at preset time intervals until the time interval between the accent time stamp and the knocking time stamp is larger than the preset time interval, so as to obtain the final audio data with the sound effect marks.
7. The method of claim 1, wherein the performing time-domain conversion and merging on each corresponding audio slice in the audio slice set based on the sound effect recognition result, after obtaining final audio data with sound effect marks, the method further comprises:
in the final audio data with sound effect marks, taking the time stamp of each sound effect mark as a center point, and taking the maximum sound intensity in a preset duration range as the corresponding sound intensity of each sound effect mark;
and performing audio special effect processing based on the corresponding sound intensity of each sound effect mark to obtain the sound intensity special effect of accent and knocking sound in the audio data to be processed.
8. The method of claim 1, wherein prior to performing frequency domain processing on the audio data to be processed to obtain a spectrogram of the audio data to be processed, the method further comprises:
respectively carrying out accent labeling and knocking sound labeling on each audio sample slice in an audio sample slice set to respectively obtain accent labeling results and knocking sound labeling results of the audio sample slice set; the audio sample slice set is a slice set obtained by performing frequency domain processing and slicing on audio training sample data;
Training an initial accent recognition model according to the audio sample slice set and the accent labeling result to obtain the preset accent recognition model;
training an initial knocking sound identification model according to the audio sample slice set and the knocking sound labeling result to obtain the preset knocking sound identification model.
9. The method of claim 8, wherein the respectively performing accent labeling and knocking sound labeling on each audio sample slice in the set of audio sample slices to respectively obtain accent labeling results and knocking sound labeling results for the set of audio sample slices comprises:
for one audio sample slice in the set of audio sample slices, marking the audio sample slice as a positive accent sample when the audio sample slice contains accents;
marking the audio sample slice as a negative accent sample when no accent is included in the audio sample slice;
taking the accent positive sample or the accent negative sample corresponding to each audio sample slice as an accent labeling result of the audio sample slice set;
for one audio sample slice in the set of audio sample slices, marking the audio sample slice as a positive tapping sound sample when the audio sample slice contains a tapping sound;
Marking the audio sample slice as a tapping negative sample when the audio sample slice does not contain a tapping sound;
and taking the corresponding positive tapping sound sample or negative tapping sound sample of each audio sample slice as the knocking sound labeling result of the audio sample slice set.
10. The method of claim 8, wherein the respectively performing accent labeling and knocking sound labeling on each audio sample slice in the set of audio sample slices to respectively obtain accent labeling results and knocking sound labeling results for the set of audio sample slices comprises:
for one audio sample slice in the audio sample slice set, when the audio sample slice contains accents and the distance between the area where the accents are located and the center of the audio sample slice is smaller than a preset offset threshold value, marking the audio sample slice as an accent positive sample;
marking the audio sample slice as a negative accent sample when the audio sample slice does not contain accents;
taking the marked accent positive sample and accent negative sample as accent labeling results of the audio sample slice set;
for one audio sample slice in the audio sample slice set, when the audio sample slice comprises a knocking sound and the distance between a knocking sound area and the center of the audio sample slice is smaller than a preset offset threshold value, marking the audio sample slice as a knocking sound positive sample;
Marking the audio sample slice as a tapping negative sample when the audio sample slice does not include a tapping sound;
and taking the marked positive and negative tapping sound samples as the knocking sound labeling results of the audio sample slice set.
11. An audio data processing device is characterized by comprising a frequency domain processing unit, a slicing unit, an identification unit and a merging unit, wherein,
the frequency domain processing unit is used for performing frequency domain processing on the audio data to be processed to obtain a spectrogram of the audio data to be processed;
the slicing unit is used for slicing the spectrogram according to a preset pixel interval to obtain an audio slice set;
the recognition unit is used for respectively carrying out sound effect recognition on each audio slice in the audio slice set by using a preset accent recognition model and a preset knocking sound recognition model to obtain a sound effect recognition result of the audio slice set; the preset accent recognition model is a model for predicting accents according to the frequency spectrum characteristics of each audio slice; the preset knocking sound identification model is a model for predicting knocking sound according to the frequency spectrum characteristics of each audio slice;
And the merging unit is used for carrying out time domain conversion and merging on each corresponding audio slice in the audio slice set based on the sound effect identification result to obtain final audio data with sound effect marks.
12. An electronic device, comprising: a processor, a memory and a communication bus, the memory being in communication with the processor via the communication bus, the memory storing one or more programs executable by the processor, the processor performing the method of any of claims 1-10 when the one or more programs are executed.
13. A storage medium storing one or more programs executable by one or more processors to implement the method of any of claims 1-10.