CN112562712A - Recording data processing method and system, electronic equipment and storage medium - Google Patents
- Publication number
- CN112562712A CN112562712A CN202011549737.0A CN202011549737A CN112562712A CN 112562712 A CN112562712 A CN 112562712A CN 202011549737 A CN202011549737 A CN 202011549737A CN 112562712 A CN112562712 A CN 112562712A
- Authority
- CN
- China
- Prior art keywords
- sound
- audio
- track
- data processing
- speaker
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B20/00—Signal processing not specific to the method of recording or reproducing; Circuits therefor
- G11B20/10—Digital recording or reproducing
- G11B20/10527—Audio or video recording; Data buffering arrangements
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B20/00—Signal processing not specific to the method of recording or reproducing; Circuits therefor
- G11B20/10—Digital recording or reproducing
- G11B20/10527—Audio or video recording; Data buffering arrangements
- G11B2020/10537—Audio or video recording
- G11B2020/10546—Audio or video recording specifically adapted for audio data
Abstract
The invention provides a recording data processing method, a system, electronic equipment and a storage medium. The method comprises: a first sound pickup step of using a first pickup device worn by a first speaker to collect the first speaker's sound, transmit it to a second pickup device, and store it as a first audio track; a second sound pickup step of collecting the sound of a second speaker with the second pickup device and storing it as a second audio track; an audio generation step of processing the first and second audio tracks into an intermediate audio file with the second pickup device; and an audio separation step of performing voice role separation on the intermediate audio file. The invention addresses the high cost and poor results of existing recording processing methods.
Description
Technical Field
The invention belongs to the field of audio processing, and particularly relates to a recording data processing method and system applicable to on-site recording, as well as electronic equipment and a storage medium.
Background
In current commercial recording scenarios, multi-microphone recording devices (e.g., three- or four-microphone recorders) are used to distinguish the wearer from the other speaker during recording, and each role's voice is extracted through recording separation, so that the recording process achieves more effective recording and lower noise.
In such methods, non-target environmental sound is noise that easily causes heavy interference with the separation of audio roles, so errors readily occur after separating the wearer's and the interlocutor's recording roles. Existing solutions address this with more expensive and more complex technology, which greatly raises the cost of use in commercial scenarios.
Disclosure of Invention
The embodiment of the application provides a recording data processing method, a recording data processing system, electronic equipment and a storage medium, and aims to at least solve the problems of high cost and poor effect of the existing recording processing method.
In a first aspect, an embodiment of the present application provides a recording data processing method, including: a first sound pickup step of using a first pickup device worn by a first speaker to collect the first speaker's sound, transmit it to a second pickup device, and store it as a first audio track; a second sound pickup step of collecting the sound of a second speaker with the second pickup device and storing it as a second audio track; an audio generation step of processing the first audio track and the second audio track into an intermediate audio file with the second pickup device; and an audio separation step of performing voice role separation on the intermediate audio file.
Preferably, the intermediate audio file is one dual-track audio file or two single-track audio files.
Preferably, the audio separating step further comprises: and eliminating the sound of a person who is not a first speaker in the first audio track in the intermediate audio file, and eliminating the sound of a person who is not a second speaker in the second audio track.
Preferably, the audio separating step further comprises: and performing voice role separation on the intermediate audio file by using a voice separation algorithm.
In a second aspect, an embodiment of the present application provides a recording data processing system, which is suitable for the above recording data processing method, and includes: a first sound pickup unit: collecting the sound of a first speaker by using a first sound pickup device worn on the first speaker, transmitting the sound to a second sound pickup device, and storing the sound as a first sound track; the second sound pickup unit is used for collecting the sound of a second conversation party by using the second sound pickup equipment and storing the sound as a second sound track; an audio generation unit that processes the first audio track and the second audio track into an intermediate audio file using the second sound pickup apparatus; and the audio separation unit is used for separating voice roles of the intermediate audio file.
In some of these embodiments, the intermediate audio file is one dual-track audio file or two single-track audio files.
In some of these embodiments, the audio separation unit further comprises: and eliminating the sound of a person who is not a first speaker in the first audio track in the intermediate audio file, and eliminating the sound of a person who is not a second speaker in the second audio track.
In some of these embodiments, the audio separation unit further comprises: and performing voice role separation on the intermediate audio file by using a voice separation algorithm.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the processor implements the sound recording data processing method according to the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program implements a recording data processing method as described in the first aspect.
Compared with the related art, the recording data processing method provided by the embodiments of the present application can obtain more accurate recordings of each audio source at lower cost and with lower technical difficulty.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flow chart of a recording data processing method according to the present invention;
FIG. 2 is a block diagram of a recorded data processing system of the present invention;
FIG. 3 is a block diagram of an electronic device of the present invention;
in the above figures:
1. a first sound pickup unit; 2. a second sound pickup unit; 3. an audio generation unit; 4. an audio separation unit; 60. a bus; 61. a processor; 62. a memory; 63. a communication interface.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In a real environment, the speech signal of interest is usually corrupted by noise, which severely degrades speech intelligibility and reduces speech recognition performance. Front-end speech separation is one of the most common countermeasures against noise. A good front-end speech separation module can greatly improve speech intelligibility and the recognition performance of an automatic speech recognition system.
From a signal processing point of view, many methods estimate the power spectrum of the noise or an ideal Wiener filter, such as spectral subtraction and Wiener filtering. The Wiener filter is the optimal filter for separating clean speech in the least-mean-square-error sense: given noisy speech and prior distributions of speech and noise, it infers the spectral coefficients of the speech. Signal-processing-based methods typically assume that the noise is stationary or slowly varying, and they achieve good separation performance when those assumptions hold. Compared with signal processing methods, model-based methods use the clean signals before mixing to build separate models of speech and noise, and achieve important performance improvements at low signal-to-noise ratios. Among model-based speech separation methods, non-negative matrix factorization is a common modeling technique that can mine local basis representations in non-negative data and is now widely applied to speech separation. Computational auditory scene analysis is another important speech separation technique that attempts to solve the problem by simulating how the human ear processes sound. Its basic computational goal is to estimate an ideal binary mask, which achieves speech separation based on the auditory masking properties of the human ear. Compared with other speech separation methods, computational auditory scene analysis makes no assumptions about the noise and generalizes better.
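As an illustrative sketch (not part of the patented method), the spectral subtraction approach mentioned above can be written in a few lines; the over-subtraction factor and spectral floor here are commonly used but assumed values:

```python
import numpy as np

def spectral_subtraction(noisy_stft, noise_mag, alpha=1.0, floor=0.02):
    """Subtract an estimated noise magnitude spectrum from each frame.

    noisy_stft: complex STFT of the noisy signal, shape (frames, bins)
    noise_mag:  estimated noise magnitude spectrum, shape (bins,)
    alpha:      over-subtraction factor
    floor:      spectral floor, keeps magnitudes from going negative
    """
    mag = np.abs(noisy_stft)
    phase = np.angle(noisy_stft)
    clean_mag = mag - alpha * noise_mag            # subtract the noise estimate
    clean_mag = np.maximum(clean_mag, floor * mag)  # apply the spectral floor
    return clean_mag * np.exp(1j * phase)           # reuse the noisy phase
```

The noisy phase is reused because spectral subtraction, like most magnitude-domain methods, does not attempt to estimate the clean phase.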
Speech separation aims to separate the useful signal from an interfered speech signal, a process that can naturally be cast as a supervised learning problem. A typical supervised speech separation system learns a mapping function from noisy features to a separation target, such as an ideal mask or the magnitude spectrum of the speech of interest, through a supervised learning algorithm such as a deep neural network.
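The ideal binary mask target mentioned above can be sketched as follows (a toy illustration; the 0 dB local-SNR threshold is a conventional but assumed choice, and real systems compute it per time-frequency unit of an STFT or gammatone decomposition):

```python
import numpy as np

def ideal_binary_mask(speech_mag, noise_mag, snr_db_threshold=0.0):
    """1 where the local speech-to-noise ratio exceeds the threshold, else 0.

    speech_mag, noise_mag: magnitude spectrograms, shape (frames, bins)
    """
    eps = 1e-12  # avoid division by zero / log of zero
    local_snr_db = 20.0 * np.log10((speech_mag + eps) / (noise_mag + eps))
    return (local_snr_db > snr_db_threshold).astype(np.float32)
```

Multiplying a noisy magnitude spectrogram by this mask keeps only the units dominated by speech, which is the separation target a supervised model is trained to predict.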
Embodiments of the invention are described in detail below with reference to the accompanying drawings:
fig. 1 is a flowchart of a recording data processing method according to the present invention, and referring to fig. 1, the recording data processing method according to the present invention includes the following steps:
s1: the first sound pickup equipment worn by the first speaker is used for collecting the sound of the first speaker, transmitting the sound to the second sound pickup equipment and storing the sound as the first sound track.
In a specific implementation, the recording participants are divided into a wearer and a corresponding interlocutor, and the first sound pickup device is worn on the wearer's body; optionally, the first sound pickup device may be an earphone. The first sound pickup device collects the wearer's sound within a certain distance; optionally, that distance may be 0.2 m.
In a specific implementation, after the sound of the wearer is collected, the recorded sound data is transmitted to a second sound pickup device, and optionally, the second sound pickup device may be a sound recorder; the second sound-collecting apparatus saves the received sound of the wearer in the form of one sound track.
In this step, a pickup device worn by the first speaker is used to achieve targeted close-range pickup.
S2: and collecting the sound of a second speaker by using the second sound pickup equipment, and storing the sound as a second sound track.
In a specific implementation, the second sound pickup device is used for collecting the sound of an interlocutor corresponding to the wearer, and optionally, the second sound pickup device is arranged within a certain distance radius of the interlocutor; alternatively, the certain distance may be 2 meters.
In an implementation, the second sound pickup device saves the collected sound of the interlocutor in the form of one audio track.
S3: processing the first audio track and the second audio track into an intermediate audio file using the second pickup apparatus.
Optionally, the intermediate audio file is one dual-track audio file or two single-track audio files.
In a specific implementation, the wearer's sound and the interlocutor's sound are stored as two separate audio tracks, and the second sound pickup device can process the two tracks in different forms: optionally, the two tracks may be combined into one audio file with two tracks; alternatively, the two tracks may be saved as two separate single-track audio files.
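The first of these two storage choices — merging the wearer's and interlocutor's tracks into one dual-track file — can be sketched with Python's standard-library wave module (the 16 kHz sample rate and 16-bit format are assumptions, not requirements of the method):

```python
import wave

def merge_to_stereo(track_a, track_b, path, rate=16000):
    """Interleave two equal-length 16-bit mono tracks into one dual-track WAV.

    track_a, track_b: sequences of signed 16-bit integer samples
    """
    assert len(track_a) == len(track_b)
    frames = bytearray()
    for left, right in zip(track_a, track_b):
        # one stereo frame = left sample then right sample, little-endian
        frames += left.to_bytes(2, "little", signed=True)
        frames += right.to_bytes(2, "little", signed=True)
    with wave.open(path, "wb") as f:
        f.setnchannels(2)   # dual track: wearer on left, interlocutor on right
        f.setsampwidth(2)   # 16-bit samples
        f.setframerate(rate)
        f.writeframes(bytes(frames))
```

The second choice is simpler still: write each track with `setnchannels(1)` into its own file.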
S4: and carrying out voice role separation on the intermediate audio file.
Optionally, the audio separating step further includes: and eliminating the sound of a person who is not a first speaker in the first audio track in the intermediate audio file, and eliminating the sound of a person who is not a second speaker in the second audio track.
Optionally, the audio separating step further includes: and performing voice role separation on the intermediate audio file by using a voice separation algorithm.
In a specific implementation, the intermediate audio file is separated by a voice separation algorithm: the audio of non-wearers is eliminated from the wearer's track, and the audio of non-interlocutors is eliminated from the interlocutor's track, yielding an audio file with less noise.
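One simple way to realize the per-track elimination described above (an illustrative heuristic, not the patent's algorithm): because the wearer's microphone is much closer to the wearer than to the interlocutor, frames where the first track carries more energy likely belong to the wearer, and frames where the second track dominates likely belong to the interlocutor. The frame length of 160 samples (10 ms at 16 kHz) is an assumed value:

```python
import numpy as np

def split_by_energy(track1, track2, frame=160):
    """Zero out frames that belong to the other speaker, per track."""
    n = min(len(track1), len(track2)) // frame * frame
    a = np.asarray(track1[:n], dtype=float).reshape(-1, frame)
    b = np.asarray(track2[:n], dtype=float).reshape(-1, frame)
    wearer_frames = (a ** 2).sum(axis=1) >= (b ** 2).sum(axis=1)
    a[~wearer_frames] = 0.0   # remove non-wearer audio from track 1
    b[wearer_frames] = 0.0    # remove non-interlocutor audio from track 2
    return a.ravel(), b.ravel()
```

A real system would combine such energy cues with a learned separation model rather than rely on them alone, since overlapping speech defeats a pure energy comparison.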
In a specific implementation, the embodiment of the present application takes a fixed human voice as the main separation object. Therefore, optionally, the voice separation algorithm may use a spectrum-mapping-based method, in which a model learns, through supervised learning, a mapping from an interfered spectrum to an interference-free (clean speech) spectrum; the model may be a DNN, CNN, LSTM, or even a GAN.
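The spectrum-mapping idea can be sketched with a toy linear model in plain NumPy (the patent does not specify an architecture, so this linear stand-in, the step count, and the learning rate are all assumptions; a DNN/CNN/LSTM would replace the single matrix):

```python
import numpy as np

def learn_spectrum_mapping(noisy, clean, steps=500, lr=0.5):
    """Learn a linear map W so that noisy @ W approximates clean (MSE).

    noisy, clean: paired magnitude spectra, shape (examples, bins)
    """
    rng = np.random.default_rng(0)
    W = rng.standard_normal((noisy.shape[1], clean.shape[1])) * 0.1
    for _ in range(steps):
        pred = noisy @ W
        # gradient of mean squared error with respect to W
        W -= lr * noisy.T @ (pred - clean) / len(noisy)
    return W
```

Training minimizes the distance between the predicted and the clean spectrum, which is exactly the supervised mapping from interfered to interference-free spectra that the method describes.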
It should be noted that the steps illustrated in the above-described flow diagrams or in the flow diagrams of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flow diagrams, in some cases, the steps illustrated or described may be performed in an order different than here.
The embodiment of the application provides a recording data processing system, which is suitable for the recording data processing method. As used below, the terms "unit," "module," and the like may implement a combination of software and/or hardware of predetermined functions. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware or a combination of software and hardware is also possible and contemplated.
FIG. 2 is a block diagram of a recording data processing system according to the present invention, referring to FIG. 2, including:
first sound pickup unit 1: the first sound pickup equipment worn by the first speaker is used for collecting the sound of the first speaker, transmitting the sound to the second sound pickup equipment and storing the sound as the first sound track.
In a specific implementation, the recording participants are divided into a wearer and a corresponding interlocutor, and the first sound pickup device is worn on the wearer's body; optionally, the first sound pickup device may be an earphone. The first sound pickup device collects the wearer's sound within a certain distance; optionally, that distance may be 0.2 m.
In a specific implementation, after the sound of the wearer is collected, the recorded sound data is transmitted to a second sound pickup device, and optionally, the second sound pickup device may be a sound recorder; the second sound-collecting apparatus saves the received sound of the wearer in the form of one sound track.
In this unit, a pickup device worn by the first speaker is used to achieve targeted close-range pickup.
Second sound pickup unit 2: and collecting the sound of a second speaker by using the second sound pickup equipment, and storing the sound as a second sound track.
In a specific implementation, the second sound pickup device is used for collecting the sound of an interlocutor corresponding to the wearer, and optionally, the second sound pickup device is arranged within a certain distance radius of the interlocutor; alternatively, the certain distance may be 2 meters.
In an implementation, the second sound pickup device saves the collected sound of the interlocutor in the form of one audio track.
The audio generation unit 3: processing the first audio track and the second audio track into an intermediate audio file using the second pickup apparatus.
Optionally, the intermediate audio file is one dual-track audio file or two single-track audio files.
In a specific implementation, the wearer's sound and the interlocutor's sound are stored as two separate audio tracks, and the second sound pickup device can process the two tracks in different forms: optionally, the two tracks may be combined into one audio file with two tracks; alternatively, the two tracks may be saved as two separate single-track audio files.
The audio separation unit 4: and carrying out voice role separation on the intermediate audio file.
Optionally, the audio separation unit 4 further includes: and eliminating the sound of a person who is not a first speaker in the first audio track in the intermediate audio file, and eliminating the sound of a person who is not a second speaker in the second audio track.
Optionally, the audio separation unit 4 further includes: and performing voice role separation on the intermediate audio file by using a voice separation algorithm.
In a specific implementation, the intermediate audio file is separated by a voice separation algorithm: the audio of non-wearers is eliminated from the wearer's track, and the audio of non-interlocutors is eliminated from the interlocutor's track, yielding an audio file with less noise.
In a specific implementation, the embodiment of the present application takes a fixed human voice as the main separation object. Therefore, optionally, the voice separation algorithm may use a spectrum-mapping-based method, in which a model learns, through supervised learning, a mapping from an interfered spectrum to an interference-free (clean speech) spectrum; the model may be a DNN, CNN, LSTM, or even a GAN.
In addition, a recording data processing method described in conjunction with fig. 1 may be implemented by an electronic device. Fig. 3 is a block diagram of an electronic device of the present invention.
The electronic device may comprise a processor 61 and a memory 62 in which computer program instructions are stored.
Specifically, the processor 61 may include a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
The memory 62 may be used to store or cache various data files that need to be processed and/or used for communication, as well as possible computer program instructions executed by the processor 61.
The processor 61 realizes any one of the sound recording data processing methods in the above-described embodiments by reading and executing computer program instructions stored in the memory 62.
In some of these embodiments, the electronic device may also include a communication interface 63 and a bus 60. As shown in fig. 3, the processor 61, the memory 62, and the communication interface 63 are connected via a bus 60 to complete communication therebetween.
The communication interface 63 enables data communication with external components such as external equipment, image/data acquisition equipment, databases, external storage, and image/data processing workstations.
The bus 60 includes hardware, software, or both, coupling the components of the electronic device to one another. Bus 60 includes, but is not limited to, at least one of the following: a data bus, an address bus, a control bus, an expansion bus, and a local bus. By way of example, and not limitation, bus 60 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front-Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), another suitable bus, or a combination of two or more of these. Bus 60 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the application, any suitable buses or interconnects are contemplated by the application.
The electronic device can execute the recording data processing method in the embodiment of the application.
In addition, in combination with the recording data processing method in the foregoing embodiments, the embodiments of the present application may provide a computer-readable storage medium to implement the recording data processing method. The computer readable storage medium having stored thereon computer program instructions; the computer program instructions, when executed by a processor, implement any of the sound recording data processing methods in the above embodiments.
The aforementioned storage media include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (10)
1. A method for processing recorded data, comprising:
a first pickup step, using a first pickup device worn by a first speaker to pick up the sound of the first speaker, transmitting the sound to a second pickup device, and storing the sound as a first sound track;
a second sound pickup step of picking up the sound of a second party by using the second sound pickup apparatus and storing the sound as a second sound track;
an audio generating step of processing the first audio track and the second audio track into an intermediate audio file using the second sound pickup apparatus;
and an audio separation step of performing voice role separation on the intermediate audio file.
2. The recording data processing method of claim 1, wherein the intermediate audio file is one dual-track audio file or two single-track audio files.
3. The recorded sound data processing method of claim 1, wherein the audio separating step further comprises: and eliminating the sound of a person who is not a first speaker in the first audio track in the intermediate audio file, and eliminating the sound of a person who is not a second speaker in the second audio track.
4. The recording data processing method of claim 1 or 3, wherein the audio separation step further comprises: performing speaker role separation on the intermediate audio file using a speech separation algorithm.
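The method of claims 1 to 4 can be illustrated with a minimal sketch (an illustrative assumption, not the patented implementation): the two picked-up tracks are stacked into one dual-track intermediate, and speaker roles are then separated by muting, frame by frame, whichever track is clearly dominated by the other, treating it as cross-talk bleed in the sense of claim 3.

```python
import numpy as np

def make_intermediate(track1: np.ndarray, track2: np.ndarray) -> np.ndarray:
    """Stack the two picked-up mono tracks into one dual-track array of shape (2, n)."""
    n = max(len(track1), len(track2))
    t1 = np.pad(track1, (0, n - len(track1)))  # zero-pad the shorter track
    t2 = np.pad(track2, (0, n - len(track2)))
    return np.stack([t1, t2])

def separate_roles(audio: np.ndarray, frame: int = 1024) -> np.ndarray:
    """Crude speaker role separation: in each frame, mute the quieter track
    when it is clearly dominated by the other (assumed to be cross-talk bleed)."""
    out = audio.copy()
    n = audio.shape[1]
    for start in range(0, n, frame):
        seg = audio[:, start:start + frame]
        energy = (seg ** 2).sum(axis=1)           # per-track frame energy
        quiet = int(np.argmin(energy))
        if energy[quiet] < 0.25 * energy[1 - quiet]:
            out[quiet, start:start + frame] = 0.0  # treat as bleed and mute
    return out
```

A real implementation would apply a trained speech separation algorithm as in claim 4; the per-frame energy comparison here only shows why dedicating one track per speaker makes role separation tractable.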
5. A recording data processing system, comprising:
a first sound pickup unit configured to collect the sound of a first speaker with a first sound pickup device worn by the first speaker, transmit the sound to a second sound pickup device, and store it as a first audio track;
a second sound pickup unit configured to collect the sound of a second speaker with the second sound pickup device and store it as a second audio track;
an audio generation unit configured to process the first audio track and the second audio track into an intermediate audio file using the second sound pickup device;
and an audio separation unit configured to perform speaker role separation on the intermediate audio file.
6. The recording data processing system of claim 5, wherein the intermediate audio file is one dual-track audio file or two single-track audio files.
7. The recording data processing system of claim 5, wherein the audio separation unit is further configured to eliminate, in the first audio track of the intermediate audio file, the sound of any person other than the first speaker, and to eliminate, in the second audio track, the sound of any person other than the second speaker.
8. The recording data processing system of claim 5 or 7, wherein the audio separation unit is further configured to perform speaker role separation on the intermediate audio file using a speech separation algorithm.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the recording data processing method according to any one of claims 1 to 4.
10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the recording data processing method according to any one of claims 1 to 4.
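As an illustration of the alternative in claims 2 and 6, the intermediate audio file could be written either as one dual-track file or as two single-track files. The sketch below uses the standard-library wave module; the 16 kHz sample rate and 16-bit PCM format are assumptions for illustration, not requirements of the claims.

```python
import struct
import wave

def write_mono(path: str, samples: list, sample_rate: int = 16000) -> None:
    """Write one single-track (mono) 16-bit PCM WAV file."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(struct.pack("<%dh" % len(samples), *samples))

def write_dual_track(path: str, track1: list, track2: list,
                     sample_rate: int = 16000) -> None:
    """Write one dual-track (two-channel) WAV file: track1 left, track2 right."""
    assert len(track1) == len(track2)
    # Interleave the two tracks sample by sample, as WAV frames require
    interleaved = [s for pair in zip(track1, track2) for s in pair]
    with wave.open(path, "wb") as w:
        w.setnchannels(2)
        w.setsampwidth(2)
        w.setframerate(sample_rate)
        w.writeframes(struct.pack("<%dh" % len(interleaved), *interleaved))
```

Keeping the two tracks in one dual-track file preserves their sample alignment, which simplifies the per-track elimination of claims 3 and 7; two mono files achieve the same if their start times match.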
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011549737.0A CN112562712A (en) | 2020-12-24 | 2020-12-24 | Recording data processing method and system, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112562712A true CN112562712A (en) | 2021-03-26 |
Family
ID=75033282
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011549737.0A Pending CN112562712A (en) | 2020-12-24 | 2020-12-24 | Recording data processing method and system, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112562712A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113470687A (en) * | 2021-06-29 | 2021-10-01 | 北京明略昭辉科技有限公司 | Audio acquisition and transmission device, audio processing system and audio acquisition and transmission method |
CN113706844A (en) * | 2021-08-31 | 2021-11-26 | 上海明略人工智能(集团)有限公司 | Method and device for early warning of voice acquisition equipment, voice acquisition equipment and storage medium |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2007322899A (en) * | 2006-06-02 | 2007-12-13 | D & M Holdings Inc | Sound recording device |
CN104123950A (en) * | 2014-07-17 | 2014-10-29 | 深圳市中兴移动通信有限公司 | Sound recording method and device |
JP2017011754A (en) * | 2016-09-14 | 2017-01-12 | ソニー株式会社 | Auricle mounted sound collecting apparatus, signal processing apparatus, and sound collecting method |
JP2018013742A (en) * | 2016-07-22 | 2018-01-25 | 富士通株式会社 | Speech summary creation assist device, speech summary creation assist method, and speech summary creation assist program |
US20180096705A1 (en) * | 2016-10-03 | 2018-04-05 | Nokia Technologies Oy | Method of Editing Audio Signals Using Separated Objects And Associated Apparatus |
US20190304437A1 (en) * | 2018-03-29 | 2019-10-03 | Tencent Technology (Shenzhen) Company Limited | Knowledge transfer in permutation invariant training for single-channel multi-talker speech recognition |
CN111128197A (en) * | 2019-12-25 | 2020-05-08 | 北京邮电大学 | Multi-speaker voice separation method based on voiceprint features and generation confrontation learning |
CN111243579A (en) * | 2020-01-19 | 2020-06-05 | 清华大学 | Time domain single-channel multi-speaker voice recognition method and system |
CN111586050A (en) * | 2020-05-08 | 2020-08-25 | 上海明略人工智能(集团)有限公司 | Audio file transmission method and device, storage medium and electronic equipment |
CN111833898A (en) * | 2020-07-24 | 2020-10-27 | 上海明略人工智能(集团)有限公司 | Multi-source data processing method and device and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110600017B (en) | Training method of voice processing model, voice recognition method, system and device | |
CN107452389B (en) | Universal single-track real-time noise reduction method | |
CN110970057B (en) | Sound processing method, device and equipment | |
CN109788400B (en) | Neural network howling suppression method, system and storage medium for digital hearing aid | |
US9378754B1 (en) | Adaptive spatial classifier for multi-microphone systems | |
CN113129917A (en) | Speech processing method based on scene recognition, and apparatus, medium, and system thereof | |
CN111868823B (en) | Sound source separation method, device and equipment | |
CN112562712A (en) | Recording data processing method and system, electronic equipment and storage medium | |
CN104505099A (en) | Method and equipment for removing known interference in voice signal | |
CN116403592A (en) | Voice enhancement method and device, electronic equipment, chip and storage medium | |
TWI581255B (en) | Front-end audio processing system | |
CN113205803A (en) | Voice recognition method and device with adaptive noise reduction capability | |
CN112309417A (en) | Wind noise suppression audio signal processing method, device, system and readable medium | |
CN114333896A (en) | Voice separation method, electronic device, chip and computer readable storage medium | |
WO2017045512A1 (en) | Voice recognition method and apparatus, terminal, and voice recognition device | |
CN114302286A (en) | Method, device and equipment for reducing noise of call voice and storage medium | |
CN111933140B (en) | Method, device and storage medium for detecting voice of earphone wearer | |
CN113039601B (en) | Voice control method, device, chip, earphone and system | |
CN111009259B (en) | Audio processing method and device | |
CN108899041B (en) | Voice signal noise adding method, device and storage medium | |
CN105491336A (en) | Image identification module with low power consumption | |
TWI761018B (en) | Voice capturing method and voice capturing system | |
CN115293205A (en) | Anomaly detection method, self-encoder model training method and electronic equipment | |
Birnie et al. | Noise retf estimation and removal for low snr speech enhancement | |
CN111028851B (en) | Sound playing device and noise reducing method thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||