CN111640450A - Multi-person audio processing method, device, equipment and readable storage medium

Info

Publication number: CN111640450A
Application number: CN202010401608.0A
Authority: CN (China)
Prior art keywords: audio, voice, frequency domain, person, preset
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 黄族良, 陈昊亮
Original Assignee: Guangzhou Speakin Intelligent Technology Co ltd
Current Assignee: Guangzhou Speakin Intelligent Technology Co ltd
Application filed by Guangzhou Speakin Intelligent Technology Co ltd
Priority to CN202010401608.0A

Classifications

    • G10L 21/0272 Speech enhancement: voice signal separating
    • G06F 40/289 Natural language analysis: phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/30 Natural language analysis: semantic analysis
    • G06N 3/045 Neural networks: combinations of networks
    • G06N 3/08 Neural networks: learning methods
    • G10L 25/18 Speech or voice analysis: extracted parameters being spectral information of each sub-band
    • G10L 25/21 Speech or voice analysis: extracted parameters being power information
    • G10L 25/30 Speech or voice analysis: analysis technique using neural networks

Abstract

The invention discloses a multi-person voice audio processing method, device, equipment and readable storage medium. The method segments the audio to be detected, which facilitates subsequent operations and improves processing efficiency; it preliminarily screens the multi-person voice paragraphs among the audio segments using the segments' feature information; and it further screens out the invalid paragraphs within the initial multi-person voice paragraphs by combining their semantic recognition results, so that the valid single-person voice parts of the audio to be detected are largely retained and the effectiveness and utilization rate of the remaining audio are improved.

Description

Multi-person audio processing method, device, equipment and readable storage medium
Technical Field
The present invention relates to the field of speech processing technologies, and in particular to a multi-person voice audio processing method, apparatus, device, and readable storage medium.
Background
In an actual acoustic environment, several different human voices and other noises are often present at the same time. This mixing of multiple voices causes many difficulties for speech recognition and audio processing. In particular, when voice audio is archived and stored, mixed audio containing multi-person voice is treated as unqualified. Because existing techniques for separating a target voice from multi-person voice are not yet mature, in practical audio processing work the entire mixed audio containing multi-person voice is usually discarded, which causes the technical problem of a low utilization rate of multi-person voice audio.
The above is intended only to assist understanding of the technical solution of the present invention and does not constitute an admission that it is prior art.
Disclosure of Invention
The invention mainly aims to provide a multi-person voice audio processing method, aiming at solving the technical problem of the low utilization rate of multi-person voice audio.
In order to achieve the above object, the present invention provides a multi-person voice audio processing method, which is applied to a multi-person voice audio processing device and includes the following steps:
acquiring audio to be detected, and dividing the audio to be detected into a plurality of audio segments according to a preset time interval, wherein the audio to be detected comprises a multi-person sound part and a single-person sound part;
acquiring a plurality of feature information corresponding to a plurality of audio segments, and identifying initial multi-voice paragraphs in the plurality of audio segments according to preset multi-voice feature conditions and the feature information;
and obtaining a semantic recognition result of the initial multi-voice paragraph, and determining and separating a target multi-voice paragraph in the initial multi-voice paragraph according to the semantic recognition result.
Optionally, the step of identifying the initial multi-person voice paragraphs in the plurality of audio segments according to the preset multi-person voice feature condition and the plurality of feature information includes:
performing Fourier transform on the plurality of audio segments to acquire the plurality of frequency domain information;
respectively judging whether the maximum frequency domain amplitude values in the plurality of frequency domain information meet the preset multi-person audio domain condition;
if so, taking the audio segment corresponding to the current maximum frequency domain amplitude as an initial multi-person sound segment;
after the step of respectively determining whether the maximum frequency domain amplitude values in the plurality of frequency domain information satisfy the preset multi-person sound frequency domain condition, the method further includes:
and if not, taking the audio segment corresponding to the current maximum frequency domain amplitude as a single voice paragraph.
Optionally, the step of respectively judging whether the maximum frequency domain amplitudes in the plurality of frequency domain information satisfy the preset multi-person voice frequency-domain condition includes:
judging, in time order, whether the difference between the maximum frequency domain amplitude of each piece of frequency domain information and the mean of the maximum frequency domain amplitudes of the preceding or following frequency domain information exceeds a preset threshold;
if the difference exceeds the preset threshold, judging that the maximum frequency domain amplitude meets the preset multi-person voice frequency-domain condition;
and if the difference does not exceed the preset threshold, judging that the maximum frequency domain amplitude does not meet the preset multi-person voice frequency-domain condition.
Optionally, the step of obtaining a semantic recognition result of the initial multi-voice passage, and determining and separating a target multi-voice passage in the initial multi-voice passage according to the semantic recognition result includes:
inputting the initial multi-person voice paragraph into a preset semantic recognition model to obtain a semantic recognition result;
determining semantic segmentation points in the initial multi-person voice paragraph according to the semantic recognition result;
and taking the voice paragraphs divided by the semantic segmentation points as the target multi-voice paragraphs, and separating the target multi-voice paragraphs from the initial multi-voice paragraphs.
Optionally, before the step of obtaining the audio to be detected and dividing the audio to be detected into a plurality of audio segments according to a preset time interval, the method further includes:
performing word segmentation processing on preset text data to obtain an attribute sequence of words in the preset text data;
vectorizing the attribute sequence to obtain a word vector corresponding to the attribute sequence;
splicing the word vector with a text vector of corresponding text data to generate input data;
and training the input data and the corresponding semantic input result as a training data set to obtain the preset semantic recognition model.
Optionally, after the step of obtaining the semantic recognition result of the initial multi-person voice paragraph and determining and separating the target multi-person voice paragraph according to the semantic recognition result, the method further includes:
and splicing, in time order, the audio segments that remain after the target multi-person voice paragraphs are separated out, to generate the target single-person voice audio.
Optionally, before the step of obtaining the audio to be detected and dividing the audio to be detected into a plurality of audio segments according to a preset time interval, the method further includes:
and acquiring initial audio, and performing noise reduction processing on the initial audio by using a preset convolutional neural network model to generate the audio to be detected.
In order to achieve the above object, the present invention provides a multi-person audio processing device including:
the audio clip segmentation module is used for acquiring audio to be detected and dividing the audio to be detected into a plurality of audio clips according to a preset time interval, wherein the audio to be detected comprises a multi-voice part and a single-voice part;
the initial paragraph identification module is used for acquiring a plurality of feature information corresponding to a plurality of audio segments and identifying initial multi-voice paragraphs in the plurality of audio segments according to preset multi-voice feature conditions and the plurality of feature information;
and the target paragraph generation module is used for acquiring the semantic recognition result of the initial multi-voice paragraph, and determining and separating the target multi-voice paragraph in the initial multi-voice paragraph according to the semantic recognition result.
Optionally, the initial paragraph identifying module includes:
a frequency-domain information obtaining unit, configured to perform a Fourier transform on the plurality of audio segments to obtain the plurality of frequency domain information;
the frequency domain condition judging unit is used for respectively judging whether the maximum frequency domain amplitude in the plurality of frequency domain information meets the preset multi-person audio frequency domain condition;
an initial paragraph generating unit, configured to take the audio segment corresponding to the current maximum frequency domain amplitude as an initial multi-person voice paragraph if the condition is met;
the initial paragraph identification module further comprises:
and a single-person paragraph generating unit, configured to take the audio segment corresponding to the current maximum frequency domain amplitude as a single-person voice paragraph if the condition is not met.
Optionally, the initial paragraph identifying module includes:
an amplitude difference judging unit, configured to judge, in time order, whether the difference between the maximum frequency domain amplitude of each piece of frequency domain information and the mean of the maximum frequency domain amplitudes of the preceding or following frequency domain information exceeds a preset threshold;
a first condition judging unit, configured to judge that the maximum frequency domain amplitude meets the preset multi-person voice frequency-domain condition if the difference exceeds the preset threshold;
and a second condition judging unit, configured to judge that the maximum frequency domain amplitude does not meet the preset multi-person voice frequency-domain condition if the difference does not exceed the preset threshold.
Optionally, the target paragraph generation module includes:
a semantic recognition obtaining unit, configured to input the initial multiple voice paragraphs into a preset semantic recognition model, and obtain a semantic recognition result;
a semantic segmentation determining unit, configured to determine semantic segmentation points in the initial multi-voice passage according to the semantic recognition result;
and the target paragraph separating unit is used for taking the voice paragraphs divided by the semantic segmentation points as the target multi-voice paragraphs and separating the target multi-voice paragraphs from the initial multi-voice paragraphs.
Optionally, the multi-person sound audio processing apparatus further includes:
the attribute sequence acquisition module is used for carrying out word segmentation processing on preset text data to acquire an attribute sequence of words in the preset text data;
the word vector acquisition module is used for carrying out vectorization processing on the attribute sequence to acquire a word vector corresponding to the attribute sequence;
the input data generation module is used for splicing the word vector and the text vector of the corresponding text data to generate input data;
and the semantic model generating module is used for training the input data and the corresponding semantic input result as a training data set to acquire the preset semantic recognition model.
Optionally, the multi-person sound audio processing apparatus further includes:
and the single-person audio acquisition module is used for splicing, in time order, the audio segments that remain after the target multi-person voice paragraphs are separated out, to generate the target single-person voice audio.
Optionally, the multi-person sound audio processing apparatus further includes:
and the audio noise reduction processing module is used for acquiring initial audio, and performing noise reduction processing on the initial audio by using a preset convolutional neural network model to generate the audio to be detected.
Further, to achieve the above object, the present invention also provides a multi-person voice audio processing apparatus, comprising: a processor, a memory, and a multi-person voice audio processing program stored on the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the multi-person voice audio processing method described above.
In addition, to achieve the above object, the present invention further provides a computer readable storage medium having a multi-person audio processing program stored thereon, which, when executed by a processor, implements the steps of the multi-person audio processing method as described above.
The invention provides a multi-person voice audio processing method, device, equipment and computer-readable storage medium. The method acquires the audio to be detected and divides it into a plurality of audio segments according to a preset time interval, the audio to be detected comprising a multi-person voice part and a single-person voice part; acquires a plurality of feature information corresponding to the plurality of audio segments, and identifies the initial multi-person voice paragraphs among the audio segments according to a preset multi-person voice feature condition and the feature information; and obtains a semantic recognition result of the initial multi-person voice paragraphs, and determines and separates the target multi-person voice paragraphs according to the semantic recognition result. In this way, segmenting the audio to be detected facilitates subsequent operations and improves processing efficiency; the multi-person voice paragraphs among the audio segments are preliminarily screened by the segments' feature information; and the invalid paragraphs within the initial multi-person voice paragraphs are further screened out by combining their semantic recognition results, so that the valid single-person voice parts of the audio to be detected are retained to a great extent, the effectiveness and utilization rate of the remaining audio are improved, and the technical problem of the low utilization rate of multi-person voice audio is thereby solved.
Drawings
FIG. 1 is a schematic diagram of an apparatus architecture of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart of a first embodiment of the multi-person voice audio processing method of the present invention;
FIG. 3 is a flowchart of a second embodiment of the multi-person voice audio processing method of the present invention;
fig. 4 is a functional block diagram of an embodiment of the apparatus of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, fig. 1 is a schematic terminal structure diagram of a hardware operating environment according to an embodiment of the present invention.
The terminal of the embodiment of the invention can be a PC, and can also be a mobile terminal device with a display function, such as a smart phone, a tablet computer, a portable computer and the like.
As shown in fig. 1, the terminal may include: a processor 1001, such as a CPU, a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory), and may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the terminal structure shown in fig. 1 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and a multi-person voice audio processing program.
In the terminal shown in fig. 1, the network interface 1004 is mainly used for connecting to a backend server and performing data communication with the backend server; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; and the processor 1001 may be configured to call the multi-person audio processing program stored in the memory 1005 and perform the following operations:
acquiring audio to be detected, and dividing the audio to be detected into a plurality of audio segments according to a preset time interval, wherein the audio to be detected comprises a multi-person sound part and a single-person sound part;
acquiring a plurality of feature information corresponding to a plurality of audio segments, and identifying initial multi-voice paragraphs in the plurality of audio segments according to preset multi-voice feature conditions and the feature information;
and obtaining a semantic recognition result of the initial multi-voice paragraph, and determining and separating a target multi-voice paragraph in the initial multi-voice paragraph according to the semantic recognition result.
Further, the processor 1001 may call the multi-person audio processing program stored in the memory 1005, and also perform the following operations:
performing Fourier transform on the plurality of audio segments to acquire the plurality of frequency domain information;
respectively judging whether the maximum frequency domain amplitude values in the plurality of frequency domain information meet the preset multi-person audio domain condition;
if so, taking the audio segment corresponding to the current maximum frequency domain amplitude as an initial multi-person sound segment;
after the step of determining whether the maximum frequency domain amplitude values in the plurality of frequency domain information satisfy the preset multi-person sound frequency domain condition, respectively, the processor 1001 may call the multi-person sound audio processing program stored in the memory 1005, and further perform the following operations:
and if not, taking the audio segment corresponding to the current maximum frequency domain amplitude as a single voice paragraph.
Further, the processor 1001 may call the multi-person audio processing program stored in the memory 1005, and also perform the following operations:
judging, in time order, whether the difference between the maximum frequency domain amplitude of each piece of frequency domain information and the mean of the maximum frequency domain amplitudes of the preceding or following frequency domain information exceeds a preset threshold;
if the difference exceeds the preset threshold, judging that the maximum frequency domain amplitude meets the preset multi-person voice frequency-domain condition;
and if the difference does not exceed the preset threshold, judging that the maximum frequency domain amplitude does not meet the preset multi-person voice frequency-domain condition.
Further, the processor 1001 may call the multi-person audio processing program stored in the memory 1005, and also perform the following operations:
inputting the initial multi-person voice paragraph into a preset semantic recognition model to obtain a semantic recognition result;
determining semantic segmentation points in the initial multi-person voice paragraph according to the semantic recognition result;
and taking the voice paragraphs divided by the semantic segmentation points as the target multi-voice paragraphs, and separating the target multi-voice paragraphs from the initial multi-voice paragraphs.
Further, the processor 1001 may call the multi-person audio processing program stored in the memory 1005, and also perform the following operations:
performing word segmentation processing on preset text data to obtain an attribute sequence of words in the preset text data;
vectorizing the attribute sequence to obtain a word vector corresponding to the attribute sequence;
splicing the word vector with a text vector of corresponding text data to generate input data;
and training the input data and the corresponding semantic input result as a training data set to obtain the preset semantic recognition model.
Further, the processor 1001 may call the multi-person audio processing program stored in the memory 1005, and also perform the following operations:
and splicing, in time order, the audio segments that remain after the target multi-person voice paragraphs are separated out, to generate the target single-person voice audio.
Further, the processor 1001 may call the multi-person audio processing program stored in the memory 1005, and also perform the following operations:
and acquiring initial audio, and performing noise reduction processing on the initial audio by using a preset convolutional neural network model to generate the audio to be detected.
Based on the hardware structure, the invention provides various embodiments of the multi-person audio processing method.
In an actual acoustic environment, several different human voices and other noises are often present at the same time. This mixing of multiple voices causes many difficulties for speech recognition and audio processing; in particular, when voice audio is archived and stored, mixed audio containing multi-person voice is treated as unqualified. Because existing techniques for separating a target voice from multi-person voice are not yet mature, in practical audio processing work the entire mixed audio containing multi-person voice is usually discarded, which causes the technical problem of a low utilization rate of multi-person voice audio.
In order to solve the above problem, the present invention provides a multi-person voice audio processing method: segmenting the audio to be detected facilitates subsequent operations and improves processing efficiency; the multi-person voice paragraphs among the audio segments are preliminarily screened by the segments' feature information; and the invalid paragraphs within the initial multi-person voice paragraphs are further screened out by combining their semantic recognition results, so that the valid single-person voice parts of the audio to be detected are retained to a great extent and the effectiveness and utilization rate of the remaining audio are improved, thereby solving the technical problem of the low utilization rate of multi-person voice audio. The multi-person voice audio processing method is applied to the terminal described above.
Referring to fig. 2, fig. 2 is a flowchart of the first embodiment of the multi-person voice audio processing method.
The first embodiment of the present invention provides a multi-person voice audio processing method, including:
step S10, acquiring audio to be detected, and dividing the audio to be detected into a plurality of audio segments according to a preset time interval, wherein the audio to be detected comprises a multi-voice part and a single-voice part;
in this embodiment, the preset time interval may be flexibly set according to an actual situation, which is not specifically limited in this embodiment. The audio to be detected is a mixed audio including both a single-person voice part and a multi-person voice part, and the number of the audio to be detected is not limited in the embodiment. The audio to be detected can be input to the terminal by a user in real time, and can also be automatically acquired by the terminal according to a preset program. Specifically, when an audio detection instruction sent by a user is received, the terminal acquires the audio to be detected with a duration of 5 minutes in the audio detection instruction. If the preset time interval is 10 seconds, the terminal divides the audio to be detected for 5 minutes into 30 audio segments according to the time interval of 10 seconds, and the audio segments are sequentially arranged and displayed according to the time sequence.
Step S20, acquiring a plurality of feature information corresponding to a plurality of audio segments, and identifying initial multi-voice paragraphs in the plurality of audio segments according to preset multi-voice feature conditions and the plurality of feature information;
in this embodiment, the feature information may be frequency domain information, voiceprint information, or the like. The initial human voice passage is an audio segment containing a plurality of human voice parts. The terminal acquires characteristic information corresponding to a plurality of audio segments divided by the audio to be detected respectively, and identifies a multi-person sound voice part in the audio segments according to a preset multi-person sound characteristic condition. Specifically, the setting of the specific embodiment in step S10 is continued, and the feature information is taken as the voiceprint information, and the preset multi-person voice recognition condition is taken as the multi-person voiceprint condition as an example. The computer inputs 30 audio segments into a preset trained voiceprint recognition model to obtain a preliminary prediction result corresponding to the 30 audio segments. And the computer judges whether the prediction result has a voiceprint type which is more than or equal to two prediction results. If yes, the multi-person voice part exists in the audio segment corresponding to the prediction result. In addition, the starting and ending positions of the multi-voice part in the audio segment can be further determined and marked as the segmentation points of the initial multi-voice paragraph. In further embodiments, the characteristic information is taken as frequency domain information, and the predetermined multi-user voice recognition condition is taken as a multi-user audio domain condition. The computer can perform Fourier transform on a plurality of audio segments to obtain a plurality of frequency domain information, because the frequency domain amplitude of the single audio varies with different speakers, but the maximum frequency domain amplitude still falls within a rough range. However, the maximum frequency domain amplitude of the multi-person audio is greatly different from the maximum frequency domain amplitude of the single-person audio. Therefore, the computer can judge whether the maximum frequency domain amplitude value in each frequency domain information meets the preset multi-voice frequency domain condition or not. And if so, taking the audio segment corresponding to the current maximum frequency domain amplitude as the initial multi-person sound segment by the computer.
Step S30, obtaining a semantic recognition result of the initial multi-voice passage, and determining and separating a target multi-voice passage in the initial multi-voice passage according to the semantic recognition result.
In this embodiment, the terminal may input the initial multi-person voice paragraph into a preset semantic recognition model to obtain the corresponding semantic recognition result, or may receive a semantic recognition result entered manually after human recognition and judgment. The terminal determines the start and end positions of the genuinely invalid speech parts according to the semantic recognition result and removes them from the initial multi-person voice paragraphs as the target multi-person voice paragraphs. Specifically, continuing the example in step S10: suppose the computer determines that the 21st to 25th of the 30 time-ordered audio segments are initial multi-person voice paragraphs, and determines from the voiceprints that the 22nd to 24th segments contain only multi-person voice and no single-person voice; these three segments can be removed directly as target multi-person voice paragraphs. After the semantics of the 21st and 25th segments are analyzed, suppose seconds 0 to 6 of the 21st segment are single-person speech and seconds 7 to 10 are multi-person speech, with the single-person speech in seconds 4 to 6 only weakly associated with the preceding single-person speech and therefore invalid; likewise, seconds 0 to 4 of the 25th segment are multi-person speech and seconds 5 to 10 are single-person speech, with the single-person speech in seconds 5 to 7 only weakly associated with the preceding single-person speech and also invalid. Therefore the speech from second 4 of the 21st segment to second 7 of the 25th segment can be separated from the original audio as the target multi-person voice paragraph, i.e. the invalid speech paragraph.
In this embodiment, the audio to be detected is acquired and divided into a plurality of audio segments according to a preset time interval, the audio comprising a multi-person voice part and a single-person voice part; a plurality of feature information corresponding to the audio segments is acquired, and the initial multi-person voice paragraphs among the segments are identified according to a preset multi-person voice feature condition and the feature information; and a semantic recognition result of the initial multi-person voice paragraphs is obtained, and the target multi-person voice paragraphs are determined and separated according to it. In this way, segmenting the audio to be detected facilitates subsequent operations and improves processing efficiency; the multi-person voice paragraphs among the audio segments are preliminarily screened by the segments' feature information; and the invalid paragraphs within the initial multi-person voice paragraphs are further screened out by combining their semantic recognition results, so that the valid single-person voice parts are retained to a great extent, the effectiveness and utilization rate of the remaining audio are improved, and the technical problem of the low utilization rate of multi-person voice audio is solved.
Referring to fig. 3, fig. 3 is a flowchart of the second embodiment of the multi-person voice audio processing method. In the second embodiment, step S20 includes:
step S21, performing fourier transform on the plurality of audio segments to obtain the plurality of frequency domain information;
in this embodiment, the feature information is frequency domain information, and the multi-human voice feature condition is a frequency domain condition. And the terminal performs Fourier transform on a plurality of audio segments into which the mixed voice audio to be detected is divided. The purpose of the fourier transform is to transform the time domain signal into a frequency domain signal, i.e. to obtain corresponding frequency domain segments from a plurality of audio segments. And after the terminal performs Fourier transform on the plurality of audio segments, acquiring corresponding frequency domain information such as maximum and minimum frequency domain amplitudes. It should be noted that, by using fourier transform, the frequency domain amplitude of each frequency domain segment can be directly calculated. This technique is prior art and will not be described further herein. In particular, the fourier transform of an audio piece changes the frequency and the component size at this frequency-in practice-there is usually a large difference in the components of the frequency.
Step S22, respectively judging whether the maximum frequency domain amplitude in the plurality of frequency domain information meets the preset multi-person audio frequency domain condition;
in this embodiment, the maximum frequency domain amplitude in the audio segment with the multi-person audio domain condition as the existence of the multi-person sound is preset as the mutation value. When the terminal acquires the frequency domain amplitude values of the frequency domain segments, whether the maximum frequency domain amplitude value corresponding to each frequency domain segment is the mutation value in all the frequency domain segments is sequentially judged according to the time sequence. Specifically, it may be determined whether the maximum frequency-domain amplitude size in the current frequency-domain segment is a preset multiple or divisor of the mean of the maximum amplitudes of all the frequency-domain segments before or after the current frequency-domain segment in time order.
Step S23, if yes, the audio frequency section corresponding to the current maximum frequency domain amplitude is used as the initial multi-person sound section;
in this embodiment, the initial multi-voice passage is an audio segment in which multi-voice exists in the audio segment. And if the terminal judges that the maximum frequency domain amplitude value in the currently detected frequency domain segment is a mutation value in all the frequency domain segments, judging that the audio segment corresponding to the maximum frequency domain amplitude value is an initial multi-person sound segment.
After the step of respectively determining whether the maximum frequency domain amplitude values in the plurality of frequency domain information satisfy the preset multi-person sound frequency domain condition, the method further includes:
and step S24, if not, taking the audio segment corresponding to the current maximum frequency domain amplitude as a single voice paragraph.
In this embodiment, a single-person voice paragraph is an audio segment in which only a single person's voice exists. If the terminal judges that the maximum frequency-domain amplitude of the currently detected frequency-domain segment is not an abrupt value among all the frequency-domain segments, it judges that the corresponding audio segment is a single-person voice paragraph.
Further, not shown in the figure, in the present embodiment, the step S22 includes:
step a, judging, in time order, whether the difference between the maximum frequency domain amplitude of each piece of frequency domain information and the mean of the maximum frequency domain amplitudes of the preceding or following frequency domain information exceeds a preset threshold;
in this embodiment, the computer sequentially determines, according to the time sequence, that the difference between the maximum frequency domain amplitude in each frequency domain segment and the average of the maximum frequency domain amplitudes of all the previous frequency domain segments or all the next frequency domain segments exceeds the preset threshold. The preset threshold value pair is preferably an integer multiple of the maximum frequency domain amplitude mean value, and can be flexibly set according to actual conditions, which is not specifically limited in this embodiment.
step b, if the difference exceeds the preset threshold, judging that the maximum frequency-domain amplitude meets the preset multi-person voice frequency-domain condition;
In this embodiment, if the computer judges that the difference between the maximum frequency-domain amplitude of the currently detected frequency-domain segment and the mean of the maximum frequency-domain amplitudes of all preceding or following segments exceeds the preset threshold, it can further judge that the current maximum frequency-domain amplitude meets the preset multi-person voice frequency-domain condition. Specifically, suppose the preset threshold is 100 and the frequency-domain amplitudes of the current frequency-domain segment are 78, 69, 71, 87, 93, 180, 200, 230, 202 and 299, so the maximum frequency-domain amplitude is 299. If the mean of the preceding segments' maximum frequency-domain amplitudes is 105 and that of the following segments is 159, the differences (194 and 140) both exceed the threshold, so the current maximum frequency-domain amplitude can be judged to meet the preset multi-person voice frequency-domain condition.
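The check and the worked numbers above can be reproduced with a short sketch; only the amplitude values and means come from the text, the function itself is illustrative:

    def meets_multi_voice_condition(current_max, neighbor_means, threshold):
        """True when the current segment's maximum frequency-domain amplitude
        differs from a neighboring mean by more than the preset threshold."""
        return any(abs(current_max - m) > threshold for m in neighbor_means)

    # Worked example from the text: threshold 100, maximum amplitude 299,
    # preceding mean 105 (difference 194), following mean 159 (difference 140).
    assert meets_multi_voice_condition(299, [105, 159], threshold=100)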
and step c, if the difference does not exceed the preset threshold, judging that the maximum frequency-domain amplitude does not meet the preset multi-person voice frequency-domain condition.
In this embodiment, if the computer judges that the difference between the maximum frequency-domain amplitude of the currently detected frequency-domain segment and the mean of the maximum frequency-domain amplitudes of all preceding or following segments does not exceed the preset threshold, it can further judge that the current maximum frequency-domain amplitude does not meet the preset multi-person voice frequency-domain condition.
In this embodiment, whether a multi-person voice part exists in each audio segment is judged from the frequency domain information corresponding to that segment, so the approximate position of the multi-person voice can be obtained by fairly simple calculation, which reduces computational complexity and improves the efficiency of locating multi-person voice; and by further judging whether the difference between the maximum frequency-domain amplitude and the mean of the maximum frequency-domain amplitudes exceeds a threshold, the judgment result can be obtained more quickly, further improving the locating efficiency.
Further, not shown in the drawings, a third embodiment of the multi-person voice audio processing method of the present invention is proposed based on the first embodiment shown in fig. 2. In this embodiment, step S30 includes:
step d, inputting the initial multi-voice paragraph into a preset semantic recognition model to obtain the semantic recognition result;
In this embodiment, it should be understood that the semantic recognition model must already have been trained on the terminal before step d. The terminal inputs the audio segments judged to be initial multi-person voice paragraphs into the pre-trained semantic recognition model to obtain the corresponding semantic recognition results.
Step e, determining semantic segmentation points in the initial multi-voice paragraph according to the semantic recognition result;
In this embodiment, a semantic segmentation point divides the initial multi-person voice paragraph at positions where the semantic association between the speech before and after the point is lower than a certain threshold. The initial multi-person voice paragraph may contain valid single-person speech, invalid single-person speech and multi-person speech. According to the semantic recognition result predicted by the model, the terminal determines the segmentation points between the valid single-person speech on the one hand and the invalid single-person speech and multi-person speech on the other.
And f, taking the voice paragraphs divided by the semantic segmentation points as the target multi-voice paragraphs, and separating the target multi-voice paragraphs from the initial multi-voice paragraphs.
In this embodiment, the terminal takes the invalid single-person speech and the multi-person speech delimited by the semantic segmentation points as the target multi-person voice paragraph, and cuts it out of the original audio segments so as to screen out their invalid parts.
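A sketch of this excision, reusing the boundaries from the first embodiment's example (second 4 of the 21st clip to second 7 of the 25th clip); the sample-index arithmetic and 1-based clip numbering are illustrative assumptions:

    import numpy as np

    def excise_span(samples, sample_rate, clip_s, start_clip, start_s, end_clip, end_s):
        """Cut out the target multi-person voice paragraph running from start_s
        seconds inside 1-based clip start_clip to end_s seconds inside end_clip."""
        begin = int(((start_clip - 1) * clip_s + start_s) * sample_rate)
        end = int(((end_clip - 1) * clip_s + end_s) * sample_rate)
        return np.concatenate([samples[:begin], samples[end:]])

    # Example: remove second 4 of clip 21 through second 7 of clip 25.
    audio = np.zeros(5 * 60 * 16000, dtype=np.float32)
    remaining = excise_span(audio, 16000, 10, 21, 4, 25, 7)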
Further, in this embodiment, before step S10, the method further includes:
step g, performing word segmentation processing on preset text data to obtain an attribute sequence of words in the preset text data;
In this embodiment, the computer acquires several pieces of preset text data and performs coarse-grained word segmentation on each piece. It labels each coarse-grained word with attribute information according to a mapping table between attribute information and words, then performs fine-grained word segmentation on each coarse-grained word and labels each fine-grained word in the same way, obtaining the attribute sequence.
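A toy sketch of this labeling step; the attribute table, the labels, and the pre-segmented input are all hypothetical, since the patent names neither a segmenter nor a concrete mapping table:

    # Hypothetical word-to-attribute mapping table; the patent's actual table
    # and segmenter are unspecified.
    ATTRIBUTE_TABLE = {"今天": "TIME", "天气": "NOUN", "很好": "ADJ"}

    def attribute_sequence(words):
        """Label each (coarse- or fine-grained) word with its attribute,
        defaulting to OTHER for words missing from the table."""
        return [ATTRIBUTE_TABLE.get(w, "OTHER") for w in words]

    print(attribute_sequence(["今天", "天气", "很好"]))  # ['TIME', 'NOUN', 'ADJ']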
Step h, vectorizing the attribute sequence to obtain a word vector corresponding to the attribute sequence;
In this embodiment, the computer replaces each piece of attribute information in the attribute sequence with its position index in the attribute dictionary, obtaining the id file of the attribute sequence. The computer then converts the id file into an attribute-sequence matrix of shape (number of training texts) x (maximum text length) x (total number of attribute types), and converts that matrix into a word-vector matrix.
Step i, splicing the word vector and the text vector of the corresponding text data to generate input data;
In this embodiment, the computer obtains the word2vec text-vector matrices corresponding to the several pieces of text data, the word2vec text vectors having been obtained by separately training a neural probabilistic language model. The computer splices the word-vector matrix with the word2vec text-vector matrices corresponding to the text data to obtain the input data.
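A sketch of the vectorization and splicing steps with NumPy; the matrix axes follow the text (texts x maximum length x attribute-type count), while the concrete dimensions and random placeholders are assumptions:

    import numpy as np

    num_texts, max_len, num_attr_types, w2v_dim = 1000, 50, 30, 128  # assumed sizes

    # "id file": position indices of each attribute in the attribute dictionary,
    # expanded here into a one-hot attribute-sequence (word-vector) matrix.
    attr_ids = np.random.randint(0, num_attr_types, size=(num_texts, max_len))
    attr_vectors = np.eye(num_attr_types, dtype=np.float32)[attr_ids]   # (1000, 50, 30)

    # Separately trained word2vec text vectors for the same texts (placeholder).
    w2v_vectors = np.random.randn(num_texts, max_len, w2v_dim).astype(np.float32)

    # Splice along the feature axis to form the model's input data.
    input_data = np.concatenate([attr_vectors, w2v_vectors], axis=-1)   # (1000, 50, 158)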
And j, training the input data and the corresponding semantic input result as a training data set to obtain the preset semantic recognition model.
In this embodiment, the computer takes the input data together with the preset corresponding standard semantic results as the training data set for the semantic recognition model, thereby obtaining the preset semantic recognition model. Model training itself belongs to the prior art and is not described further here.
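A minimal training sketch, assuming a Keras sequence classifier; the patent leaves the architecture open, so every layer and hyperparameter below is an illustrative choice:

    import numpy as np
    import tensorflow as tf

    input_data = np.random.randn(1000, 50, 158).astype("float32")  # spliced vectors
    labels = np.random.randint(0, 2, size=(1000,))                 # placeholder semantic targets

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(50, 158)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
        tf.keras.layers.Dense(2, activation="softmax"),            # e.g. valid / invalid speech
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.fit(input_data, labels, epochs=2, batch_size=32)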
Further, in this embodiment, after step S30, the method further includes:
and step k, splicing, in time order, the audio segments that remain after the target multi-person voice paragraphs are separated out, to generate the target single-person voice audio.
In this embodiment, after the target multi-person voice paragraphs have been removed, the terminal splices the remaining audio segments in time order to generate the target single-person voice audio, so that an audio administrator can archive and store it. The target single-person voice audio is audio of valid single-person speech. In practice, blank speech parts in the audio can also be removed to reduce the resources occupied by invalid content.
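A sketch of this final splice, assuming the clips are NumPy arrays and the indices of the removed target paragraphs are known; the blank-speech trimming mentioned above is left out for brevity:

    import numpy as np

    def splice_remaining(clips, removed_indices):
        """Rejoin, in time order, the clips left after the target multi-person
        voice paragraphs have been separated out."""
        removed = set(removed_indices)
        return np.concatenate([c for i, c in enumerate(clips) if i not in removed])

    # E.g. drop the 22nd-24th clips (0-based indices 21-23) and rejoin the rest.
    clips = [np.zeros(160000, dtype=np.float32) for _ in range(30)]
    single_voice_audio = splice_remaining(clips, removed_indices=[21, 22, 23])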
Further, in this embodiment, before step S10, the method further includes:
and step l, acquiring an initial audio, and performing noise reduction processing on the initial audio by using a preset convolutional neural network model to generate the audio to be detected.
In this embodiment, upon receiving an audio detection instruction sent by a user, the terminal acquires the initial audio carried in the instruction and performs noise reduction on it with a preset convolutional neural network (CNN) model to generate the audio to be detected, thereby reducing errors.
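A small sketch of a 1-D convolutional denoiser in the spirit of the preset CNN model; the patent fixes no architecture, so this network and its hyperparameters are assumptions:

    import tensorflow as tf

    def build_denoiser(clip_len=160000):
        """Tiny 1-D convolutional network mapping noisy audio to denoised audio."""
        inputs = tf.keras.Input(shape=(clip_len, 1))
        x = tf.keras.layers.Conv1D(16, 31, padding="same", activation="relu")(inputs)
        x = tf.keras.layers.Conv1D(16, 31, padding="same", activation="relu")(x)
        outputs = tf.keras.layers.Conv1D(1, 31, padding="same")(x)
        model = tf.keras.Model(inputs, outputs)
        model.compile(optimizer="adam", loss="mse")  # trained on (noisy, clean) pairs
        return model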
In this embodiment, the target multi-person voice paragraphs that finally need to be removed are located accurately by further combining semantics, improving the accuracy of multi-person voice localization; the semantic recognition model is trained on a large number of training samples, improving the accuracy of semantic recognition; the audio segments that remain after the target multi-person voice paragraphs are separated out are spliced together, so the valid audio parts are managed centrally and the utilization rate of the audio material is improved; and denoising the initial audio removes noise interference and improves the accuracy of the recognition results.
The invention also provides a multi-person audio processing device.
The multi-person audio processing device includes:
the audio clip segmentation module is used for acquiring audio to be detected and dividing the audio to be detected into a plurality of audio clips according to a preset time interval, wherein the audio to be detected comprises a multi-voice part and a single-voice part;
the initial paragraph identification module is used for acquiring a plurality of feature information corresponding to a plurality of audio segments and identifying initial multi-voice paragraphs in the plurality of audio segments according to preset multi-voice feature conditions and the plurality of feature information;
and the target paragraph generation module is used for acquiring the semantic recognition result of the initial multi-voice paragraph, and determining and separating the target multi-voice paragraph in the initial multi-voice paragraph according to the semantic recognition result.
The invention also provides a multi-person audio processing device.
The multi-person audio processing device comprises a processor, a memory and a multi-person audio processing program which is stored on the memory and can run on the processor, wherein when the multi-person audio processing program is executed by the processor, the steps of the multi-person audio processing method are realized.
For the specific implementation when the multi-person voice audio processing program is executed, reference may be made to the embodiments of the multi-person voice audio processing method of the present invention, which are not repeated here.
The invention also provides a computer readable storage medium.
The computer readable storage medium of the present invention stores therein a multi-person audio processing program, which when executed by a processor implements the steps of the multi-person audio processing method as described above.
For the specific implementation when the multi-person voice audio processing program is executed, reference may be made to the embodiments of the multi-person voice audio processing method of the present invention, which are not repeated here.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A multi-person sound audio processing method, comprising:
acquiring audio to be detected, and dividing the audio to be detected into a plurality of audio segments according to a preset time interval, wherein the audio to be detected comprises a multi-person sound part and a single-person sound part;
acquiring a plurality of feature information corresponding to a plurality of audio segments, and identifying initial multi-voice paragraphs in the plurality of audio segments according to preset multi-voice feature conditions and the feature information;
and obtaining a semantic recognition result of the initial multi-voice paragraph, and determining and separating a target multi-voice paragraph in the initial multi-voice paragraph according to the semantic recognition result.
2. The multi-person sound audio processing method according to claim 1, wherein the feature information is frequency domain information, the preset multi-person sound feature condition is a multi-person sound frequency domain condition, and the step of acquiring the plurality of feature information corresponding to the plurality of audio segments and identifying the initial multi-person sound paragraphs in the plurality of audio segments according to the preset multi-person sound feature condition and the plurality of feature information comprises:
performing Fourier transform on the plurality of audio segments to acquire the plurality of frequency domain information;
respectively judging whether the maximum frequency domain amplitude values in the plurality of frequency domain information meet the preset multi-person audio domain condition;
if so, taking the audio segment corresponding to the current maximum frequency domain amplitude as an initial multi-person sound segment;
after the step of respectively determining whether the maximum frequency domain amplitude values in the plurality of frequency domain information satisfy the preset multi-person sound frequency domain condition, the method further includes:
and if not, taking the audio segment corresponding to the current maximum frequency domain amplitude as a single voice paragraph.
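A minimal sketch of claim 2's frequency-domain step, assuming NumPy; reading the "maximum frequency domain amplitude" as the peak magnitude of a real-input FFT is one plausible interpretation, not a definitive implementation.

```python
import numpy as np

def max_frequency_amplitude(segment: np.ndarray) -> float:
    # Fourier-transform one audio segment and take the maximum
    # frequency-domain amplitude, to be tested against the preset
    # multi-person voice frequency domain condition.
    spectrum = np.fft.rfft(segment)       # frequency domain information
    return float(np.abs(spectrum).max())  # peak magnitude across bins
```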
3. The multi-person audio processing method according to claim 2, wherein the step of respectively judging whether the maximum frequency domain amplitude in each piece of frequency domain information meets the preset multi-person voice frequency domain condition comprises:
respectively judging, in time order, whether the difference between the maximum frequency domain amplitude of each piece of frequency domain information and the average of the maximum frequency domain amplitudes of the previous or subsequent pieces of frequency domain information exceeds a preset threshold;
if the difference exceeds the preset threshold, judging that the maximum frequency domain amplitude meets the preset multi-person voice frequency domain condition;
and if the difference does not exceed the preset threshold, judging that the maximum frequency domain amplitude does not meet the preset multi-person voice frequency domain condition.
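A sketch of the threshold test in claim 3, under the assumption that each segment's maximum amplitude is compared with the average of the maxima of its immediate temporal neighbours; the one-previous/one-subsequent window is an assumption, as the patent does not fix the window size.

```python
def meets_multi_person_condition(max_amps: list, i: int,
                                 threshold: float) -> bool:
    # Baseline: average of the previous and subsequent segments' maxima.
    neighbours = max_amps[max(0, i - 1):i] + max_amps[i + 1:i + 2]
    if not neighbours:   # a lone segment has no baseline to compare against
        return False
    baseline = sum(neighbours) / len(neighbours)
    # The preset condition is met when the difference exceeds the threshold.
    return abs(max_amps[i] - baseline) > threshold
```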
4. The multi-person audio processing method according to claim 1, wherein the step of obtaining the semantic recognition result of the initial multi-person voice paragraph, and determining and separating the target multi-person voice paragraph in the initial multi-person voice paragraph according to the semantic recognition result, comprises:
inputting the initial multi-person voice paragraph into a preset semantic recognition model to obtain the semantic recognition result;
determining semantic segmentation points in the initial multi-person voice paragraph according to the semantic recognition result;
and taking the voice paragraphs divided by the semantic segmentation points as the target multi-person voice paragraphs, and separating the target multi-person voice paragraphs from the initial multi-person voice paragraph.
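A sketch of the separation step in claim 4. The patent does not disclose the model's interface, so `semantic_model` below is a hypothetical callable assumed to return sample offsets at which the recognized meaning indicates a boundary (the semantic segmentation points).

```python
def separate_by_semantics(paragraph, semantic_model) -> list:
    # `paragraph` is the initial multi-person voice paragraph (an array of
    # samples); `semantic_model` is a hypothetical stand-in for the preset
    # semantic recognition model.
    cut_points = semantic_model(paragraph)  # semantic segmentation points
    bounds = [0, *sorted(cut_points), len(paragraph)]
    # The voice paragraphs divided by the segmentation points are the targets.
    return [paragraph[a:b] for a, b in zip(bounds, bounds[1:])]
```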
5. The multi-person audio processing method according to claim 4, wherein before the step of acquiring the audio to be detected and dividing the audio to be detected into the plurality of audio segments according to the preset time interval, the method further comprises:
performing word segmentation on preset text data to obtain an attribute sequence of the words in the preset text data;
vectorizing the attribute sequence to obtain a word vector corresponding to the attribute sequence;
splicing the word vector with a text vector of the corresponding text data to generate input data;
and training on the input data and the corresponding semantic recognition results as a training data set to obtain the preset semantic recognition model.
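A sketch of claim 5's training-data preparation. The four helpers (`tokenize`, `pos_tags`, `embed`, `text_vector`) are hypothetical stand-ins passed in by the caller, since the patent names no concrete tools; treating part-of-speech tags as the word "attribute sequence" is likewise an assumption.

```python
import numpy as np

def build_input_data(text: str, tokenize, pos_tags, embed, text_vector):
    tokens = tokenize(text)    # word segmentation of the preset text data
    attrs = pos_tags(tokens)   # attribute sequence of the words (assumed POS)
    word_vec = embed(attrs)    # word vector for the attribute sequence
    # Splice the word vector with the text vector to form the input data.
    return np.concatenate([word_vec, text_vector(text)])
```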
6. The multi-person audio processing method according to claim 1, wherein after the step of determining and separating the target multi-person voice paragraph in the initial multi-person voice paragraph according to the semantic recognition result, the method further comprises:
and splicing, in time order, the audio segments that remain after the target multi-person voice paragraphs are separated out, to generate target single-person voice audio.
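A sketch of the splicing step in claim 6, assuming the segments carry boolean flags from the earlier identification steps; NumPy concatenation in original order stands in for "splicing in time order".

```python
import numpy as np

def splice_single_person_audio(segments: list, is_multi: list) -> np.ndarray:
    # Keep only the segments left after the target multi-person voice
    # paragraphs have been separated out, preserving time order.
    kept = [seg for seg, multi in zip(segments, is_multi) if not multi]
    return np.concatenate(kept) if kept else np.empty(0)
```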
7. The multi-person audio processing method according to any one of claims 1 to 6, wherein before the step of acquiring the audio to be detected and dividing the audio to be detected into the plurality of audio segments according to the preset time interval, the method further comprises:
and acquiring initial audio, and performing noise reduction on the initial audio by using a preset convolutional neural network model to generate the audio to be detected.
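The patent does not describe the architecture of the preset convolutional neural network, so the PyTorch module below is purely an assumed stand-in: a small 1-D convolutional denoiser mapping a noisy waveform to a cleaned one; every layer count and size is hypothetical.

```python
import torch
import torch.nn as nn

class DenoiseCNN(nn.Module):
    # Assumed stand-in for the "preset convolutional neural network model".
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(16, 16, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(16, 1, kernel_size=9, padding=4),
        )

    def forward(self, noisy: torch.Tensor) -> torch.Tensor:
        # (batch, 1, samples) noisy initial audio -> audio to be detected
        return self.net(noisy)
```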
8. A multi-person audio processing device, comprising:
an audio segment division module, configured to acquire audio to be detected and divide the audio to be detected into a plurality of audio segments according to a preset time interval, wherein the audio to be detected comprises a multi-person voice part and a single-person voice part;
an initial paragraph identification module, configured to acquire a plurality of pieces of feature information corresponding to the plurality of audio segments and identify initial multi-person voice paragraphs in the plurality of audio segments according to a preset multi-person voice feature condition and the plurality of pieces of feature information;
and a target paragraph generation module, configured to obtain a semantic recognition result of the initial multi-person voice paragraph, and determine and separate a target multi-person voice paragraph in the initial multi-person voice paragraph according to the semantic recognition result.
9. Multi-person audio processing equipment, characterized in that the equipment comprises: a memory, a processor, and a multi-person audio processing program stored on the memory and executable on the processor, wherein the multi-person audio processing program, when executed by the processor, implements the steps of the multi-person audio processing method according to any one of claims 1 to 7.
10. A computer-readable storage medium, having stored thereon a multi-person audio processing program which, when executed by a processor, implements the steps of the multi-person audio processing method according to any one of claims 1 to 7.
CN202010401608.0A 2020-05-13 2020-05-13 Multi-person audio processing method, device, equipment and readable storage medium Pending CN111640450A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010401608.0A CN111640450A (en) 2020-05-13 2020-05-13 Multi-person audio processing method, device, equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN111640450A true CN111640450A (en) 2020-09-08

Family

ID=72332018

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010401608.0A Pending CN111640450A (en) 2020-05-13 2020-05-13 Multi-person audio processing method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111640450A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
PL406948A1 (en) * 2014-01-27 2015-08-03 Adam Pluta Method and system for decomposition acoustic signal into sound objects, the sound object and its application
US9961435B1 (en) * 2015-12-10 2018-05-01 Amazon Technologies, Inc. Smart earphones
CN109859770A (en) * 2019-01-04 2019-06-07 平安科技(深圳)有限公司 Music separation method, device and computer readable storage medium
CN109658951A (en) * 2019-01-08 2019-04-19 北京雷石天地电子技术有限公司 Mixed signal detection method and system
CN110444223A (en) * 2019-06-26 2019-11-12 平安科技(深圳)有限公司 Speaker's separation method and device based on Recognition with Recurrent Neural Network and acoustic feature
CN110827849A (en) * 2019-11-11 2020-02-21 广州国音智能科技有限公司 Human voice separation method and device for database building, terminal and readable storage medium
CN110853666A (en) * 2019-12-17 2020-02-28 科大讯飞股份有限公司 Speaker separation method, device, equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112702659A (en) * 2020-12-24 2021-04-23 成都新希望金融信息有限公司 Video subtitle processing method and device, electronic equipment and readable storage medium
CN113035199A (en) * 2021-02-01 2021-06-25 深圳创维-Rgb电子有限公司 Audio processing method, device, equipment and readable storage medium

Similar Documents

Publication Title
CN106683680B (en) Speaker recognition method and device, computer equipment and computer readable medium
CN108428446A (en) Audio recognition method and device
WO2017206661A1 (en) Voice recognition method and system
CN110110038B (en) Telephone traffic prediction method, device, server and storage medium
CN108509416B (en) Sentence meaning identification method and device, equipment and storage medium
CN106847305B (en) Method and device for processing recording data of customer service telephone
CN108039181B (en) Method and device for analyzing emotion information of sound signal
CN110674385A (en) Method and device for matching customer service in customer service upgrading scene
CN113223560A (en) Emotion recognition method, device, equipment and storage medium
CN111462758A (en) Method, device and equipment for intelligent conference role classification and storage medium
CN111159987A (en) Data chart drawing method, device, equipment and computer readable storage medium
CN111640450A (en) Multi-person audio processing method, device, equipment and readable storage medium
CN111639157B (en) Audio marking method, device, equipment and readable storage medium
CN113436634A (en) Voice classification method and device based on voiceprint recognition and related equipment
CN110875036A (en) Voice classification method, device, equipment and computer readable storage medium
CN107680584B (en) Method and device for segmenting audio
CN114155853A (en) Rejection method, device, equipment and storage medium
CN112201275A (en) Voiceprint segmentation method, voiceprint segmentation device, voiceprint segmentation equipment and readable storage medium
CN106710588B (en) Speech data sentence recognition method, device and system
CN110750626B (en) Scene-based task-driven multi-turn dialogue method and system
JP2010032865A (en) Speech recognizer, speech recognition system, and program
CN111627453B (en) Public security voice information management method, device, equipment and computer storage medium
CN113593565B (en) Intelligent home device management and control method and system
CN113012680B (en) Speech technology synthesis method and device for speech robot
CN115019788A (en) Voice interaction method, system, terminal equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200908