CN117275461B - Multitasking audio processing method, system, storage medium and electronic equipment - Google Patents

Multitasking audio processing method, system, storage medium and electronic equipment

Info

Publication number: CN117275461B
Application number: CN202311571465.8A
Authority: CN (China)
Prior art keywords: audio, information, voice, field, tag
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Other versions: CN117275461A
Inventor: 孔欧
Original and current assignee: Shanghai Mido Technology Co., Ltd. (the listed assignees may be inaccurate)
Application filed by Shanghai Mido Technology Co., Ltd.
Priority to CN202311571465.8A


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/5038 — Allocation of resources (e.g. of the central processing unit [CPU]) to service a request, the resource being a machine (e.g. CPUs, servers, terminals), considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/05 — Word boundary detection
    • G10L 15/063 — Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/26 — Speech to text systems


Abstract

The invention provides a multitasking audio processing method, system, storage medium, and electronic device. The method comprises the following steps: acquiring audio for training; acquiring tag information of the audio, where the tag information includes the audio's language information, a speech-to-text mode (speech translation or speech transcription), the speech start and end times, and the text corresponding to the audio; and training a multitasking audio processing model based on the audio and the tag information, so that tag information of audio to be processed can be obtained from the trained model. The multitasking audio processing method, system, storage medium, and electronic device can complete multiple audio processing tasks with a single compatible model, effectively reducing hardware resource usage and processing time.

Description

Multitasking audio processing method, system, storage medium and electronic equipment
Technical Field
The invention belongs to the technical field of audio processing, and particularly relates to a multitasking audio processing method, system, storage medium, and electronic device.
Background
Audio processing tasks generally include language classification, voice endpoint detection, speech recognition, and speech translation. In the prior art, a separate method or neural network model is designed for each task. For example, speech recognition first requires voice endpoint detection to determine the start and end times of speech, then a language classification model to determine the language, and finally a speech recognition model matched to that language.
However, different audio processing tasks often use different network architectures, and deploying multiple models consumes more hardware resources and processing time, which hinders practical adoption.
Disclosure of Invention
In view of the above drawbacks of the prior art, an object of the present invention is to provide a multitasking audio processing method, system, storage medium, and electronic device that can complete multiple audio processing tasks with one compatible model, effectively reducing hardware resource usage and processing time.
In a first aspect, the present invention provides a multitasking audio processing method comprising the following steps: acquiring audio for training; acquiring tag information of the audio, where the tag information includes the audio's language information, a speech-to-text mode (speech translation or speech transcription), speech start and end times, and the text corresponding to the audio; and training a multitasking audio processing model based on the audio and the tag information, so that tag information of audio to be processed can be obtained from the trained model.
In an implementation of the first aspect, acquiring tag information of the audio includes the following steps:
setting a tag start field;
when the audio contains no voice information, setting a no-speech field and a tag end field; in this case the tag information consists only of the tag start field, the no-speech field, and the tag end field;
when the audio contains voice information, determining the language information of the audio;
determining the speech-to-text mode according to the language information;
when the audio is entirely voice information, setting a no-timestamp field, acquiring the text corresponding to the audio, and setting a tag end field; the tag information then consists of the tag start field, the language information, the speech-to-text mode, the no-timestamp field, the text corresponding to the audio, and the tag end field;
when the audio contains multiple segments of voice information, determining for each segment its start time, corresponding text, and end time, and setting a tag end field; the tag information then consists of the tag start field, the language information, the speech-to-text mode, each segment's start time, text, and end time, and the tag end field.
In an implementation of the first aspect, the multitasking audio processing model comprises a feature extraction layer, an encoding layer, a decoding layer, a fully connected layer, and a softmax, connected in sequence;
the feature extraction layer extracts the audio features of the audio;
the encoding layer encodes the audio features to obtain encoding features;
the decoding layer decodes the encoding features to obtain decoding features;
the fully connected layer and the softmax output the tag information of the audio based on the decoding features.
In an implementation of the first aspect, the feature extraction layer extracts the features of the audio to be processed by:
converting the waveform of the audio to be processed into audio features based on a log-Mel spectrogram.
In one implementation of the first aspect, the coding layer includes a plurality of cascaded coding modules, the coding modules including a cascaded self-attention mechanism and a multi-layer perceptron; the decoding layer includes a plurality of cascaded decoding modules including a cascaded self-attention mechanism, a cross-attention mechanism, and a multi-layer perceptron.
In an implementation of the first aspect, the number of encoding modules equals the number of decoding modules.
In an implementation of the first aspect, the encoding modules and decoding modules correspond one-to-one in order, and each encoding module is connected to its corresponding decoding module.
In a second aspect, the present invention provides a multitasking audio processing system, comprising a first acquisition module, a second acquisition module, and a training module;
the first acquisition module is used for acquiring audio for training;
the second acquisition module is used for acquiring the label information of the audio; the label information comprises audio language information, a voice-to-text mode, voice start-stop time and audio corresponding text information, and the voice-to-text mode comprises voice translation and voice transcription;
the training module is used for training the multi-task audio processing model based on the audio and the tag information so as to acquire the tag information of the audio to be processed based on the trained multi-task audio processing model.
In a third aspect, the present invention provides an electronic device comprising: a processor and a memory;
the memory is used for storing a computer program;
the processor is configured to execute the computer program stored in the memory, so that the electronic device executes the above-mentioned multitasking audio processing method.
In a fourth aspect, the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by an electronic device, implements the above-described multitasking audio processing method.
As described above, the multitasking audio processing method, system, storage medium, and electronic device of the present invention have the following advantages.
(1) Multiple audio processing tasks can be completed with a single, unified method.
(2) No separate network architecture needs to be built for each audio processing task, which effectively reduces hardware resource usage and processing time.
(3) The approach is highly automated and highly practical.
Drawings
Fig. 1 is a schematic view of an electronic device according to an embodiment of the invention.
Fig. 2 is a flowchart of a method for processing multi-task audio according to an embodiment of the invention.
Fig. 3 is a schematic diagram showing the tag information acquisition according to an embodiment of the invention.
Fig. 4 is a schematic diagram of a multi-task audio processing system according to an embodiment of the invention.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
Other advantages and effects of the present invention will become readily apparent to those skilled in the art from the following disclosure, which describes embodiments of the invention with reference to specific examples. The invention may also be practiced or applied through other, different embodiments, and the details in this description may be modified or varied in various respects without departing from the spirit of the invention. It should be noted that, where there is no conflict, the following embodiments and the features in them may be combined with each other.
It should be noted that the illustrations provided in the following embodiments merely illustrate the basic concept of the invention in a schematic way. The drawings show only the components related to the invention, rather than the number, shape, and size of components in an actual implementation; the form, number, and proportion of components in an actual implementation may vary arbitrarily, and the component layout may be more complicated.
The following embodiments of the present invention provide a multitasking audio processing method that can be applied to an electronic device as shown in Fig. 1. The electronic device in the present invention may include a mobile phone 11, a tablet computer 12, a notebook computer 13, a wearable device, a vehicle-mounted device, an Augmented Reality (AR)/Virtual Reality (VR) device, an Ultra-Mobile Personal Computer (UMPC), a netbook, a Personal Digital Assistant (PDA), and the like with a wireless charging function; the embodiments of the present invention do not limit the specific type of the electronic device.
For example, the electronic device may be a Station (ST) in a wireless-charging-enabled WLAN, a wireless-charging-enabled cellular telephone, a cordless telephone, a Session Initiation Protocol (SIP) telephone, a Wireless Local Loop (WLL) station, a Personal Digital Assistant (PDA) device, a wireless-charging-enabled handheld device, a computing device or other processing device, a computer, a laptop computer, a handheld communication device, a handheld computing device, and/or another device for communicating over a wireless system, as well as a device in next-generation communication systems, such as a mobile terminal in a 5G network, a mobile terminal in a future evolved Public Land Mobile Network (PLMN), or a mobile terminal in a future evolved Non-Terrestrial Network (NTN), etc.
For example, the electronic device may communicate with networks and other devices via wireless communications. The wireless communications may use any communication standard or protocol, including but not limited to Global System for Mobile Communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), BT, GNSS, WLAN, NFC, FM, and/or IR techniques, among others. The GNSS may include the Global Positioning System (GPS), the Global Navigation Satellite System (GLONASS), the BeiDou Navigation Satellite System (BDS), the Quasi-Zenith Satellite System (QZSS), and/or Satellite Based Augmentation Systems (SBAS).
The following describes the technical solution in the embodiment of the present invention in detail with reference to the drawings in the embodiment of the present invention.
As shown in fig. 2, in an embodiment, the method for processing multi-tasking audio of the present invention includes steps S1 to S3.
Step S1, acquiring audio for training.
Specifically, audio is acquired through audio capture devices, the Internet, and the like, to serve as the training set for model training.
Step S2: acquiring tag information of the audio, where the tag information includes the audio's language information, a speech-to-text mode (speech translation or speech transcription), speech start and end times, and the text corresponding to the audio.
Specifically, the corresponding tag information is obtained by analyzing the audio. Because the tag information covers language, mode, timing, and text, the four audio processing tasks of language classification, voice endpoint detection, speech recognition, and speech translation can all be carried out from the tag information simultaneously.
In one embodiment, as shown in fig. 3, the step of obtaining the tag information of the audio includes the following steps.
S21) Set a tag start field SOS.
S22) When the audio contains no speech information, set the no-speech field NO SPEECH and the tag end field EOS. In this case the tag information consists only of the tag start field SOS, the no-speech field NO SPEECH, and the tag end field EOS.
S23) When the audio contains speech information, determine the language information LANGUAGE TAG of the audio.
S24) Determine the speech-to-text mode according to the language information LANGUAGE TAG. The speech-to-text modes are speech translation (TRANSLATE) and speech transcription (TRANSCRIBE). For example, when the default language is Chinese: if the language information LANGUAGE TAG is English, the speech-to-text mode is speech translation TRANSLATE, i.e., the English speech is translated into the corresponding Chinese text; if the language information LANGUAGE TAG is Chinese, the speech-to-text mode is speech transcription TRANSCRIBE, i.e., the Chinese speech is transcribed into the corresponding Chinese text.
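The mode-selection rule described above can be sketched as a small function. The language codes, the default language, and the function name are illustrative assumptions; the patent does not fix concrete identifiers.

```python
TRANSCRIBE = "TRANSCRIBE"
TRANSLATE = "TRANSLATE"

def choose_mode(language_tag: str, default_language: str = "zh") -> str:
    """Pick the speech-to-text mode from the detected language.

    If the audio's language matches the default (target) language, the
    speech is transcribed verbatim; otherwise it is translated into the
    default language.
    """
    return TRANSCRIBE if language_tag == default_language else TRANSLATE
```

For instance, with Chinese as the default language, Chinese audio would select transcription and English audio would select translation.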
S25) When the audio is entirely speech, set the no-timestamp field NO TIMESTAMPS, acquire the text corresponding to the audio, and set the tag end field EOS. The tag information then consists of the tag start field SOS, the language information LANGUAGE TAG, the speech-to-text mode TRANSCRIBE/TRANSLATE, the no-timestamp field NO TIMESTAMPS, the text corresponding to the audio, and the tag end field EOS.
S26) When the audio contains multiple segments of speech, determine each segment's start time (begin time), corresponding text, and end time, and set the tag end field EOS. The tag information then consists of the tag start field SOS, the language information LANGUAGE TAG, the speech-to-text mode TRANSCRIBE/TRANSLATE, each segment's begin time, text, and end time, and the tag end field EOS.
For example, suppose an audio clip contains several speech segments, such as "I am Chinese" and "I love my country", with per-segment start and end times [0.0, 1.0] and [1.2, 2.5]. The language is known to be Chinese, so the speech-to-text mode is speech transcription. The tag information for this audio can then be expressed as: SOS + Chinese (LANGUAGE TAG) + TRANSCRIBE + 0.0 (begin time) + I am Chinese (text) + 1.0 (end time) + 1.2 (begin time) + I love my country (text) + 2.5 (end time) + EOS, where the content in brackets indicates the meaning of a field and is not part of the tag information.
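The construction of such a tag sequence can be sketched as a helper that linearizes one clip's label information. The token spellings (`<SOS>`, `<EOS>`, `<NO_SPEECH>`, `<NO_TIMESTAMPS>`) and the function signature are illustrative assumptions, not identifiers from the patent.

```python
SOS, EOS = "<SOS>", "<EOS>"
NO_SPEECH, NO_TIMESTAMPS = "<NO_SPEECH>", "<NO_TIMESTAMPS>"

def build_tag_sequence(language=None, mode=None, segments=None):
    """Linearize label information into one tag-token sequence.

    segments: list of (begin_time, text, end_time) tuples; a begin_time
    of None marks the whole-clip-is-speech case (no timestamps needed).
    """
    if not segments:                       # case S22: no speech at all
        return [SOS, NO_SPEECH, EOS]
    seq = [SOS, language, mode]
    if len(segments) == 1 and segments[0][0] is None:
        # case S25: the entire clip is speech, so emit no timestamps
        seq += [NO_TIMESTAMPS, segments[0][1]]
    else:                                  # case S26: several timed segments
        for begin, text, end in segments:
            seq += [begin, text, end]
    return seq + [EOS]
```

Building the two-segment example above would give a sequence of the form [`<SOS>`, "Chinese", "TRANSCRIBE", 0.0, "I am Chinese", 1.0, 1.2, "I love my country", 2.5, `<EOS>`].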
Step S3: training a multitasking audio processing model based on the audio and the tag information, so that tag information of audio to be processed can be obtained from the trained model.
Specifically, the multitasking audio processing model simultaneously performs language classification, voice endpoint detection, speech recognition, and speech translation on audio. The model is trained on the audio together with its corresponding tag information. After training, audio to be processed is input to the model, which outputs the corresponding tag information; from that tag information, language classification, voice endpoint detection, speech recognition, and speech translation are all obtained at once, with no need to train a separate model for each task. This improves overall model utility and reduces system power consumption.
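Conversely, the four task outputs can be read back out of a predicted tag sequence. This is a sketch under the same kind of assumption as elsewhere in this rewrite: the token spellings (`<SOS>`, `<EOS>`, `<NO_SPEECH>`, `<NO_TIMESTAMPS>`) and the dictionary layout are illustrative, not fixed by the patent.

```python
def parse_tag_sequence(seq):
    """Recover language, mode, endpoints, and text from a tag sequence."""
    assert seq[0] == "<SOS>" and seq[-1] == "<EOS>"
    body = seq[1:-1]
    if body == ["<NO_SPEECH>"]:            # clip contained no speech
        return {"has_speech": False}
    language, mode, rest = body[0], body[1], body[2:]
    result = {"has_speech": True, "language": language,
              "mode": mode, "segments": []}
    if rest and rest[0] == "<NO_TIMESTAMPS>":
        # whole clip is speech: only text, no endpoint times
        result["segments"].append({"text": rest[1]})
    else:
        # timed segments come in (begin, text, end) triples
        for begin, text, end in zip(rest[0::3], rest[1::3], rest[2::3]):
            result["segments"].append(
                {"begin": begin, "text": text, "end": end})
    return result
```

The `language` field answers the language-classification task, the segment begin/end times answer voice endpoint detection, and the `mode` plus `text` fields cover recognition and translation.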
In an embodiment, the multitasking audio processing model comprises a feature extraction layer, an encoding layer, a decoding layer, a fully connected layer, and a softmax, connected in sequence.
The feature extraction layer extracts the audio features of the audio by converting the waveform of the audio to be processed into audio features based on a log-Mel spectrogram.
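The log-Mel conversion can be sketched as follows. The frame length, hop size, sample rate, and number of Mel bins are illustrative assumptions (the patent does not specify them), and a production system would typically rely on an audio library rather than this hand-rolled filterbank.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(wave, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Turn a 1-D waveform into a (frames x n_mels) log-Mel feature matrix."""
    # Short-time power spectrum with a Hann window.
    window = np.hanning(n_fft)
    frames = [np.abs(np.fft.rfft(wave[s:s + n_fft] * window)) ** 2
              for s in range(0, len(wave) - n_fft + 1, hop)]
    power = np.array(frames)                      # (T, n_fft // 2 + 1)

    # Triangular Mel filterbank spaced evenly on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)

    # Log compression; the small offset avoids log(0).
    return np.log10(power @ fbank.T + 1e-10)      # (T, n_mels)
```

With a 16 kHz waveform, a 400-sample window, and a 160-sample hop, one second of audio yields 98 frames of 80 Mel features.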
The encoding layer is used for encoding the audio features to obtain encoding features.
The decoding layer is used for decoding the coding features to obtain decoding features;
The fully connected layer and the softmax output the tag information of the audio based on the decoding features.
Specifically, the encoding layer comprises several cascaded encoding modules, each consisting of a self-attention mechanism followed by a multi-layer perceptron (MLP). The decoding layer comprises several cascaded decoding modules, each consisting of a self-attention mechanism, a cross-attention mechanism, and a multi-layer perceptron (MLP). The number of encoding modules equals the number of decoding modules; they correspond one-to-one in order, and each encoding module is connected to its corresponding decoding module: the first encoding module to the first decoding module, the second to the second, the third to the third, and so on.
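The data flow through one encoding module and one decoding module, followed by the fully connected layer and softmax, can be sketched in numpy. This is a toy forward pass only: random matrices stand in for trained weights, attention is single-head, layer normalization is omitted, and all dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_in, kv_in, d):
    """Single-head scaled dot-product attention; the projection
    matrices are random placeholders for learned parameters."""
    Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
    q, k, v = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
    return softmax(q @ k.T / np.sqrt(d)) @ v

def mlp(x, d):
    W1 = rng.standard_normal((d, 4 * d)) * 0.1
    W2 = rng.standard_normal((4 * d, d)) * 0.1
    return np.maximum(x @ W1, 0.0) @ W2           # two-layer ReLU perceptron

def encoding_module(x, d):
    x = x + attention(x, x, d)                    # self-attention + residual
    return x + mlp(x, d)

def decoding_module(y, enc_out, d):
    y = y + attention(y, y, d)                    # self-attention
    y = y + attention(y, enc_out, d)              # cross-attention to encoder
    return y + mlp(y, d)

def output_head(y, d, vocab):
    """Fully connected layer + softmax over the tag-token vocabulary."""
    W = rng.standard_normal((d, vocab)) * 0.1
    return softmax(y @ W)

# Toy forward pass: 98 audio-feature frames, 5 decoded tag positions.
d, vocab = 32, 100
enc_out = encoding_module(rng.standard_normal((98, d)), d)
dec_out = decoding_module(rng.standard_normal((5, d)), enc_out, d)
probs = output_head(dec_out, d, vocab)            # (5, vocab) distributions
```

The cross-attention call is where a decoding module consumes its corresponding encoding module's output, matching the one-to-one connection described above.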
The protection scope of the multitasking audio processing method according to the embodiments of the present invention is not limited to the order of the steps listed here; any scheme obtained by adding, removing, or replacing steps according to the prior art while following the principles of the present invention falls within the protection scope of the invention.
The embodiments of the present invention also provide a multitasking audio processing system capable of implementing the multitasking audio processing method of the present invention. The implementation apparatus of the system is not limited to the structure listed in these embodiments; any structural modification or substitution of the prior art made according to the principles of the present invention falls within the protection scope of the invention.
As shown in fig. 4, in an embodiment, the multi-task audio processing system of the present invention includes a first obtaining module 41, a second obtaining module 42 and a training module 43.
The first acquisition module 41 is configured to acquire audio for training.
The second obtaining module 42 is configured to obtain tag information of the audio; the label information comprises audio language information, a voice-to-text mode, voice start-stop time and audio corresponding text information, and the voice-to-text mode comprises voice translation and voice transcription.
The training module 43 is connected to the first obtaining module 41 and the second obtaining module 42, and is configured to train a multitasking audio processing model based on the audio and the tag information, so as to obtain tag information of the audio to be processed based on the trained multitasking audio processing model.
The structures and principles of the first acquisition module 41, the second acquisition module 42, and the training module 43 correspond one-to-one to the steps of the multitasking audio processing method described above, so they are not repeated here.
In the several embodiments provided in the present invention, it should be understood that the disclosed system, apparatus, or method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules/units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple modules or units may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules or units, which may be in electrical, mechanical or other forms.
The modules/units illustrated as separate components may or may not be physically separate, and components shown as modules/units may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules/units may be selected according to actual needs to achieve the objectives of the embodiments of the present invention. For example, functional modules/units in various embodiments of the invention may be integrated into one processing module, or each module/unit may exist alone physically, or two or more modules/units may be integrated into one module/unit.
Those of ordinary skill in the art will further appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are implemented in hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functionality differently for each particular application, but such implementations should not be considered beyond the scope of the present invention.
The embodiment of the invention also provides a computer readable storage medium. Those of ordinary skill in the art will appreciate that all or part of the steps in a method implementing the above embodiments may be implemented by a program to instruct a processor, where the program may be stored in a computer readable storage medium, where the storage medium is a non-transitory (non-transitory) medium, such as a random access memory, a read only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape (magnetic tape), a floppy disk (floppy disk), an optical disk (optical disk), and any combination thereof. The storage media may be any available media that can be accessed by a computer or a data storage device such as a server, data center, or the like that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital video disc (digital video disc, DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), or the like.
The embodiment of the invention also provides electronic equipment. The electronic device includes a processor and a memory.
The memory is used for storing a computer program.
The memory includes: various media capable of storing program codes, such as ROM, RAM, magnetic disk, U-disk, memory card, or optical disk.
The processor is connected with the memory and is used for executing the computer program stored in the memory so as to enable the electronic equipment to execute the multi-task audio processing method.
Preferably, the processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, abbreviated as CPU), a network processor (Network Processor, abbreviated as NP), etc.; but also digital signal processors (Digital Signal Processor, DSP for short), application specific integrated circuits (Application Specific Integrated Circuit, ASIC for short), field programmable gate arrays (Field Programmable Gate Array, FPGA for short) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
As shown in FIG. 5, the electronic device of the present invention is embodied in the form of a general purpose computing device. Components of an electronic device may include, but are not limited to: one or more processors or processing units 51, a memory 52, a bus 53 that connects the various system components, including the memory 52 and the processing unit 51.
Bus 53 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Electronic devices typically include a variety of computer system readable media. Such media can be any available media that can be accessed by the electronic device and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 52 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 521 and/or cache memory 522. The electronic device may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 523 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 5, commonly referred to as a "hard disk drive"). Although not shown in fig. 5, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be coupled to bus 53 through one or more data medium interfaces. Memory 52 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.
A program/utility 524 having a set (at least one) of program modules 5241 may be stored in, for example, memory 52, such program modules 5241 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 5241 generally perform the functions and/or methods in the described embodiments of the invention.
The electronic device may also communicate with one or more external devices (e.g., a keyboard, pointing device, or display), with one or more devices that enable a user to interact with the electronic device, and/or with any device (e.g., a network card or modem) that enables the electronic device to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 54. The electronic device may also communicate with one or more networks, such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet, through the network adapter 55. As shown in FIG. 5, the network adapter 55 communicates with the other modules of the electronic device over the bus 53. It should be appreciated that, although not shown, other hardware and/or software modules may be used in conjunction with the electronic device, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The above embodiments merely illustrate the principles of the present invention and its effectiveness, and are not intended to limit the invention. Those skilled in the art may modify or vary the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications and variations that a person of ordinary skill in the art can accomplish without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of the present invention.

Claims (7)

1. A method of multitasking audio processing, the method comprising the steps of:
acquiring audio for training;
acquiring tag information of the audio; the tag information comprises audio language information, a voice-to-text mode, voice start and stop times, and text information corresponding to the audio, and the voice-to-text mode comprises voice translation and voice transcription;
training a multitasking audio processing model based on the audio and the tag information to obtain tag information of the audio to be processed based on the trained multitasking audio processing model;
the multi-task audio processing model comprises a feature extraction layer, an encoding layer, a decoding layer, a fully connected layer, and a softmax layer, which are sequentially connected;
the feature extraction layer is used for extracting audio features of the audio;
the encoding layer is used for encoding the audio features to obtain encoding features;
the decoding layer is used for decoding the encoding features to obtain decoding features;
the fully connected layer and the softmax layer are used for outputting the tag information of the audio based on the decoding features;
the encoding layer comprises a plurality of cascaded encoding modules, wherein each encoding module comprises a cascaded self-attention mechanism and multi-layer perceptron; the decoding layer comprises a plurality of cascaded decoding modules, wherein each decoding module comprises a cascaded self-attention mechanism, cross-attention mechanism, and multi-layer perceptron;
the step of obtaining the tag information of the audio comprises the following steps:
setting a tag start field;
when the audio does not contain voice information, setting a no-voice field and a tag end field; the tag information comprises only the tag start field, the no-voice field, and the tag end field;
when the audio contains voice information, determining language information of the audio;
determining a voice-to-text mode according to the language information;
when the entire audio is a single piece of voice information, setting a no-voice-endpoint-detection field, acquiring the text information corresponding to the audio, and setting a tag end field, wherein the tag information comprises the tag start field, the language information, the voice-to-text mode, the no-voice-endpoint-detection field, the text information corresponding to the audio, and the tag end field;
when the audio contains multiple pieces of voice information, determining the audio start time, the corresponding text information, and the audio end time of each piece of voice information, and setting a tag end field, wherein the tag information comprises the tag start field, the language information, the voice-to-text mode, the audio start time, corresponding text information, and audio end time of each piece of voice information, and the tag end field.
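The three tag-construction branches recited above (no voice, a single piece of voice, multiple pieces of voice) can be illustrated with a small sketch in the style of Whisper-like multitask label sequences. All token spellings here (`<|start|>`, `<|nospeech|>`, `<|notimestamps|>`, and so on) are hypothetical stand-ins; the patent does not disclose its literal field encodings.

```python
# Hedged sketch of the tag construction steps in claim 1.
# All token names are illustrative assumptions, not the patent's fields.

def build_tag(segments, language=None, mode=None):
    """segments: list of (start_time, text, end_time) voice segments;
    a single segment with start_time=None means the whole audio is voice.
    language: e.g. "en"; mode: "transcribe" or "translate".
    Returns the label token sequence as a list of strings."""
    tag = ["<|start|>"]                        # tag start field
    if not segments:                           # audio contains no voice
        tag += ["<|nospeech|>", "<|end|>"]     # no-voice field + tag end field
        return tag
    tag += [f"<|{language}|>", f"<|{mode}|>"]  # language + voice-to-text mode
    if len(segments) == 1 and segments[0][0] is None:
        # entire audio is one piece of voice: no endpoint detection needed
        tag += ["<|notimestamps|>", segments[0][1]]
    else:
        # multiple voice pieces: start time, text, end time for each piece
        for start, text, end in segments:
            tag += [f"<|{start:.2f}|>", text, f"<|{end:.2f}|>"]
    tag.append("<|end|>")                      # tag end field
    return tag
```

For example, `build_tag([])` yields only the start, no-speech, and end fields, while a multi-segment input interleaves timestamps and text between the mode field and the end field.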
2. The multi-tasking audio processing method of claim 1 wherein: the feature extraction layer extracts features of the audio to be processed, including:
converting the waveform of the audio to be processed into audio features based on a Log-Mel spectrogram.
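A minimal sketch of the Log-Mel conversion in claim 2, assuming the standard HTK-style mel scale and typical speech parameters (16 kHz sample rate, 400-sample window, 160-sample hop, 80 mel bands); the patent states only that the waveform is converted "based on a Log-Mel spectrogram", so these values are assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(wave, sr=16000, n_fft=400, hop=160, n_mels=80):
    # short-time power spectrum over Hann-windowed frames
    window = np.hanning(n_fft)
    frames = [wave[i:i + n_fft] * window
              for i in range(0, len(wave) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2    # (T, n_fft//2+1)

    # triangular mel filterbank, equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)

    mel = power @ fb.T                                   # (T, n_mels)
    return np.log10(np.maximum(mel, 1e-10))
```

In practice a library routine such as `librosa.feature.melspectrogram` would replace this hand-rolled filterbank; the sketch only shows the shape of the computation the feature extraction layer performs.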
3. The multi-tasking audio processing method of claim 1 wherein: the number of encoding modules is the same as the number of decoding modules.
4. A method of multitasking audio processing according to claim 3, characterized in that: the encoding modules and the decoding modules are in one-to-one correspondence in sequence, and each encoding module is connected to its corresponding decoding module.
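The module pairing of claims 3 and 4 can be sketched structurally as follows. The attention and perceptron bodies are identity stubs (a real model would use learned layers, e.g. `torch.nn.MultiheadAttention`), and the wiring shown, where the i-th decoder cross-attends to the i-th encoder's output, is one plausible reading of "each coding module is connected with the corresponding decoding module", not a disclosed implementation.

```python
# Structural sketch: N cascaded encoder modules, N cascaded decoder
# modules, paired one-to-one. All computation bodies are stubs.

class EncoderModule:
    def __call__(self, x):
        x = self.self_attention(x)            # cascaded self-attention
        return self.mlp(x)                    # then multi-layer perceptron
    def self_attention(self, x): return x     # stub
    def mlp(self, x): return x                # stub

class DecoderModule:
    def __call__(self, y, enc_out):
        y = self.self_attention(y)            # self-attention
        y = self.cross_attention(y, enc_out)  # cross-attention to encoder
        return self.mlp(y)                    # multi-layer perceptron
    def self_attention(self, y): return y             # stub
    def cross_attention(self, y, enc_out): return y   # stub
    def mlp(self, y): return y                        # stub

class MultiTaskAudioModel:
    def __init__(self, n_modules=6):
        # same number of encoder and decoder modules (claim 3)
        self.encoders = [EncoderModule() for _ in range(n_modules)]
        self.decoders = [DecoderModule() for _ in range(n_modules)]

    def forward(self, audio_features, tokens):
        enc_outs, x = [], audio_features
        for enc in self.encoders:             # cascaded encoding modules
            x = enc(x)
            enc_outs.append(x)
        y = tokens
        # one-to-one pairing (claim 4): decoder i uses encoder i's output
        for dec, enc_out in zip(self.decoders, enc_outs):
            y = dec(y, enc_out)
        return y                              # fed to FC layer + softmax
```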
5. A multitasking audio processing system, characterized in that the system comprises a first acquisition module, a second acquisition module and a training module;
the first acquisition module is used for acquiring audio for training;
the second acquisition module is used for acquiring the tag information of the audio; the tag information comprises audio language information, a voice-to-text mode, voice start and stop times, and text information corresponding to the audio, and the voice-to-text mode comprises voice translation and voice transcription;
the training module is used for training a multi-task audio processing model based on the audio and the tag information so as to acquire tag information of the audio to be processed based on the trained multi-task audio processing model;
the multi-task audio processing model comprises a feature extraction layer, an encoding layer, a decoding layer, a fully connected layer, and a softmax layer, which are sequentially connected;
the feature extraction layer is used for extracting audio features of the audio;
the encoding layer is used for encoding the audio features to obtain encoding features;
the decoding layer is used for decoding the encoding features to obtain decoding features;
the fully connected layer and the softmax layer are used for outputting the tag information of the audio based on the decoding features;
the encoding layer comprises a plurality of cascaded encoding modules, wherein each encoding module comprises a cascaded self-attention mechanism and multi-layer perceptron; the decoding layer comprises a plurality of cascaded decoding modules, wherein each decoding module comprises a cascaded self-attention mechanism, cross-attention mechanism, and multi-layer perceptron;
the step of obtaining the tag information of the audio comprises the following steps:
setting a tag start field;
when the audio does not contain voice information, setting a no-voice field and a tag end field; the tag information comprises only the tag start field, the no-voice field, and the tag end field;
when the audio contains voice information, determining language information of the audio;
determining a voice-to-text mode according to the language information;
when the entire audio is a single piece of voice information, setting a no-voice-endpoint-detection field, acquiring the text information corresponding to the audio, and setting a tag end field, wherein the tag information comprises the tag start field, the language information, the voice-to-text mode, the no-voice-endpoint-detection field, the text information corresponding to the audio, and the tag end field;
when the audio contains multiple pieces of voice information, determining the audio start time, the corresponding text information, and the audio end time of each piece of voice information, and setting a tag end field, wherein the tag information comprises the tag start field, the language information, the voice-to-text mode, the audio start time, corresponding text information, and audio end time of each piece of voice information, and the tag end field.
6. An electronic device, the electronic device comprising: a processor and a memory;
the memory is used for storing a computer program;
the processor is configured to execute the computer program stored in the memory, so that the electronic device performs the multitasking audio processing method of any one of claims 1 to 4.
7. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by an electronic device, implements the multitasking audio processing method of any of claims 1 to 4.
CN202311571465.8A 2023-11-23 2023-11-23 Multitasking audio processing method, system, storage medium and electronic equipment Active CN117275461B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311571465.8A CN117275461B (en) 2023-11-23 2023-11-23 Multitasking audio processing method, system, storage medium and electronic equipment


Publications (2)

Publication Number Publication Date
CN117275461A CN117275461A (en) 2023-12-22
CN117275461B true CN117275461B (en) 2024-03-15

Family

ID=89210930

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311571465.8A Active CN117275461B (en) 2023-11-23 2023-11-23 Multitasking audio processing method, system, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN117275461B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117746866B (en) * 2024-02-19 2024-05-07 上海蜜度科技股份有限公司 Multilingual voice conversion text method, multilingual voice conversion text system, storage medium and electronic equipment

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018213841A1 (en) * 2017-05-19 2018-11-22 Google Llc Multi-task multi-modal machine learning model
WO2019046463A1 (en) * 2017-08-29 2019-03-07 Zhoa Tiancheng System and method for defining dialog intents and building zero-shot intent recognition models
CN109523994A (en) * 2018-11-13 2019-03-26 四川大学 A kind of multitask method of speech classification based on capsule neural network
CN109785824A (en) * 2019-03-15 2019-05-21 科大讯飞股份有限公司 A kind of training method and device of voiced translation model
WO2021057038A1 (en) * 2019-09-24 2021-04-01 上海依图信息技术有限公司 Apparatus and method for speech recognition and keyword detection based on multi-task model
CN113920989A (en) * 2021-12-13 2022-01-11 中国科学院自动化研究所 End-to-end system and equipment for voice recognition and voice translation
US11222627B1 (en) * 2017-11-22 2022-01-11 Educational Testing Service Exploring ASR-free end-to-end modeling to improve spoken language understanding in a cloud-based dialog system
CN114492796A (en) * 2022-02-09 2022-05-13 厦门大学 Multitask learning sign language translation method based on syntax tree
CN116741206A (en) * 2023-07-31 2023-09-12 华中师范大学 Speech emotion recognition method and system based on multitask learning
WO2023185563A1 (en) * 2022-03-29 2023-10-05 北京有竹居网络技术有限公司 Training method and apparatus for speech translation model, speech translation method and apparatus, and device
CN116913286A (en) * 2023-08-10 2023-10-20 腾讯音乐娱乐科技(深圳)有限公司 Audio recognition method and multitasking audio recognition model training method
CN116959424A (en) * 2023-06-16 2023-10-27 平安科技(深圳)有限公司 Speech recognition method, speech recognition system, computer device, and storage medium
CN116978364A (en) * 2023-02-27 2023-10-31 腾讯科技(深圳)有限公司 Audio data processing method, device, equipment and medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11257481B2 (en) * 2018-10-24 2022-02-22 Tencent America LLC Multi-task training architecture and strategy for attention-based speech recognition system


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Personalized Speech Synthesis Based on Deep Speech Representations; Liu Dawei; Wanfang; pp. 27-28, 32 *


Similar Documents

Publication Publication Date Title
CN117275461B (en) Multitasking audio processing method, system, storage medium and electronic equipment
US20240046933A1 (en) Speech recognition with selective use of dynamic language models
CN111667814B (en) Multilingual speech synthesis method and device
US11586831B2 (en) Speech translation method electronic device and computer-readable storage medium using SEQ2SEQ for determining alternative translated speech segments
CN108922521B (en) Voice keyword retrieval method, device, equipment and storage medium
CN110494841B (en) Contextual language translation
CN111435592B (en) Voice recognition method and device and terminal equipment
CN111261144A (en) Voice recognition method, device, terminal and storage medium
US9530103B2 (en) Combining of results from multiple decoders
CN111833865B (en) Man-machine interaction method, terminal and computer readable storage medium
CN115858839B (en) Cross-modal LOGO retrieval method, system, terminal and storage medium
CN117746866B (en) Multilingual voice conversion text method, multilingual voice conversion text system, storage medium and electronic equipment
CN117351973A (en) Tone color conversion method, system, storage medium and electronic equipment
CN117079643A (en) Speech recognition method, system, storage medium and electronic equipment
CN116701708B (en) Multi-mode enhanced video classification method, system, storage medium and electronic equipment
CN116912353B (en) Multitasking image processing method, system, storage medium and electronic device
CN103294370A (en) Method and equipment for triggering keystroke operation
CN116361493A (en) LOGO identification method, LOGO identification system, storage medium and electronic equipment
CN117975941A (en) Multi-attention multi-feature voice recognition method, system, storage medium and electronic equipment
CN116630633B (en) Automatic labeling method and system for semantic segmentation, storage medium and electronic equipment
CN118072715A (en) Voice keyword detection method, system, storage medium and electronic equipment
CN111899738A (en) Dialogue generating method, device and storage medium
CN118038899A (en) Voice quality assessment method, system, storage medium and electronic equipment
CN118098265A (en) Video processing method, system, medium and electronic equipment
CN116108147A (en) Cross-modal retrieval method, system, terminal and storage medium based on feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant