CN109616097B - Voice data processing method, device, equipment and storage medium - Google Patents

Voice data processing method, device, equipment and storage medium Download PDF

Info

Publication number
CN109616097B
Authority
CN
China
Prior art keywords
voice data
target
voice
data segment
segment
Prior art date
Legal status
Active
Application number
CN201910018423.9A
Other languages
Chinese (zh)
Other versions
CN109616097A (en
Inventor
刘博卿
贾雪丽
王健宗
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910018423.9A priority Critical patent/CN109616097B/en
Publication of CN109616097A publication Critical patent/CN109616097A/en
Priority to PCT/CN2019/088976 priority patent/WO2020140374A1/en
Application granted granted Critical
Publication of CN109616097B publication Critical patent/CN109616097B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/08 Speech classification or search
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/42 Systems providing special services or facilities to subscribers
    • H04M3/50 Centralised arrangements for answering calls; Centralised arrangements for recording messages for absent or busy subscribers
    • H04M3/51 Centralised call answering arrangements requiring operator intervention, e.g. call or contact centers for telemarketing
    • H04M3/5166 Centralised call answering arrangements requiring operator intervention, e.g. call or contact centers for telemarketing, in combination with interactive voice response systems or voice portals, e.g. as front-ends

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Embodiments of the invention disclose a voice data processing method, apparatus, device, and storage medium. The method comprises the following steps: acquiring voice data to be processed, wherein the voice data to be processed consists of voice data segments of a plurality of objects; processing the voice data to be processed according to a preset processing rule to obtain target voice data; dividing the target voice data to obtain a plurality of voice data segments; inputting the voice data segments into a voice network model for prediction to obtain a prediction label for each voice data segment; and determining boundary points of the target voice data according to the prediction labels of the voice data segments, so that the voice data segment of each object can be segmented from the target voice data according to the boundary points. The boundary points of the voice data are thus acquired automatically, and the accuracy with which they are acquired is improved.

Description

Voice data processing method, device, equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for processing voice data.
Background
Voice data received by a call center is often a mixture of speech fragments from multiple people, so it must undergo voice segmentation (speaker diarization) before speech analysis can be performed on the target speech fragments. Voice segmentation means separating each person's speech by locating the speech boundary points between every two speakers (a boundary point is the transition point at which one speaker stops and another starts speaking). In practice, these boundary points are obtained by analyzing the audio manually, so both the efficiency and the accuracy of voice segmentation are low.
Disclosure of Invention
The embodiments of the invention provide a voice data processing method, apparatus, device, and storage medium that can automatically detect the boundary points of voice data and improve the efficiency and accuracy of voice segmentation.
In a first aspect, an embodiment of the present invention provides a method for processing voice data, including:
acquiring voice data to be processed, wherein the voice data to be processed consists of voice data segments of a plurality of objects;
processing the voice data to be processed according to a preset processing rule to obtain target voice data, wherein the preset processing rule comprises a data filtering rule and/or a data format processing rule;
dividing the target voice data to obtain a plurality of voice data segments; inputting the voice data segments into a voice network model for prediction to obtain a prediction label of each voice data segment, wherein the prediction label comprises the probability that the voice data segment is a boundary point;
and determining boundary points of the target voice data according to the prediction labels of the voice data segments, so as to segment the voice data segment of each object from the target voice data according to the boundary points.
In a second aspect, an embodiment of the present invention provides a voice data processing apparatus, including:
an acquisition unit, used for acquiring voice data to be processed, wherein the voice data to be processed consists of voice data segments of a plurality of objects;
a processing unit, used for processing the voice data to be processed according to a preset processing rule to obtain target voice data, wherein the preset processing rule comprises a data filtering rule and/or a data format processing rule;
a prediction unit, used for dividing the target voice data to obtain a plurality of voice data segments, and for inputting the voice data segments into a voice network model for prediction to obtain a prediction label of each voice data segment, wherein the prediction label comprises the probability that the voice data segment is a boundary point; and
a segmentation unit, used for determining boundary points of the target voice data according to the prediction labels of the voice data segments, so as to segment the voice data segment of each object from the target voice data according to the boundary points.
In a third aspect, an embodiment of the present invention provides an electronic device, including:
a processor adapted to implement one or more instructions; and
a computer-readable storage medium storing one or more instructions adapted to be loaded by the processor to perform the following steps:
acquiring voice data to be processed, wherein the voice data to be processed consists of voice data segments of a plurality of objects;
processing the voice data to be processed according to a preset processing rule to obtain target voice data, wherein the preset processing rule comprises a data filtering rule and/or a data format processing rule;
dividing the target voice data to obtain a plurality of voice data segments; inputting the voice data segments into a voice network model for prediction to obtain a prediction label of each voice data segment, wherein the prediction label comprises the probability that the voice data segment is a boundary point;
and determining boundary points of the target voice data according to the prediction labels of the voice data segments, so as to segment the voice data segment of each object from the target voice data according to the boundary points.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium storing one or more instructions adapted to be loaded by a processor to perform the following steps:
acquiring voice data to be processed, wherein the voice data to be processed consists of voice data segments of a plurality of objects;
processing the voice data to be processed according to a preset processing rule to obtain target voice data, wherein the preset processing rule comprises a data filtering rule and/or a data format processing rule;
dividing the target voice data to obtain a plurality of voice data segments; inputting the voice data segments into a voice network model for prediction to obtain a prediction label of each voice data segment, wherein the prediction label comprises the probability that the voice data segment is a boundary point;
and determining boundary points of the target voice data according to the prediction labels of the voice data segments, so as to segment the voice data segment of each object from the target voice data according to the boundary points.
In the embodiments of the invention, the target voice data is obtained by applying format conversion and/or filtering to the voice data to be processed, which avoids interference from environmental noise and the like in subsequent processing and improves processing efficiency. The voice data segments of the target voice data are then input into the voice network model for prediction to obtain a prediction label for each segment, the boundary points of the target voice data are determined from the prediction labels, and the target voice data is segmented at those boundary points into the voice data segment of each object. Because the boundary points of the target voice data are obtained automatically by the voice network model rather than by manual work, a great amount of manpower is saved, the accuracy and efficiency of boundary-point acquisition are improved, and users' demands for intelligent, automated voice data processing are met.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for describing the embodiments are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person skilled in the art may obtain other drawings from them without inventive effort.
Fig. 1 is a schematic flow chart of a voice data processing method according to an embodiment of the present invention;
Fig. 2 is a schematic flow chart of another voice data processing method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a voice data processing apparatus according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the prior art, boundary points of voice data are determined by manual analysis, so the efficiency and accuracy of voice segmentation are low. To address this, the embodiments of the invention provide an automated voice data processing method that can be executed by an electronic device such as an intelligent terminal, a server, a computer, or a detector. In this method, the voice data is predicted by a voice network model to obtain prediction labels, the boundary points of the voice data are determined from those labels, and the voice data segment of each object is segmented according to the boundary points. This saves a great amount of manpower, improves detection accuracy, and meets users' demands for intelligent, automated voice segmentation.
Please refer to fig. 1, which is a flowchart illustrating a voice data processing method according to an embodiment of the present invention, where the method according to the embodiment of the present invention may be performed by the above-mentioned electronic device. In this embodiment, the voice data processing method includes the following steps.
S101, acquiring voice data to be processed, wherein the voice data to be processed consists of voice data segments of a plurality of objects.
In the embodiments of the invention, the specific content of the voice data to be processed differs across application scenarios. For example, in a conference scenario, the voice data to be processed may be a recording of several people speaking; in a call scenario, it may be the call data of several people received by a call center. An object here is typically a speaking person, but it may also be an animal.
S102, processing the voice data to be processed according to preset processing rules to obtain target voice data, wherein the preset processing rules comprise data filtering rules and/or data format processing rules.
In the embodiments of the invention, the voice data to be processed contains redundant data such as silence or non-human sound, where non-human sound includes device sounds (e.g., a camera shutter) and environmental sounds (e.g., traffic). To avoid processing this redundant data and to improve processing efficiency, the voice data to be processed can be preprocessed. Specifically, the electronic device may filter the voice data to be processed to obtain the target voice data, where the filter may be a high-pass filter or a band-pass filter. Additionally or alternatively, to make the voice data easier for the voice network model to predict, the electronic device may perform format conversion on the voice data to be processed.
In one embodiment, the preset processing rule includes a data filtering rule, and step S102 includes: dividing the voice data to be processed to obtain a plurality of original voice data segments, obtaining the energy value of each original voice data segment in the plurality of original voice data segments, deleting the original voice data segments with the energy value smaller than or equal to a preset energy value in the plurality of original voice data segments, and merging the original voice data segments with the energy value larger than the preset energy value in the plurality of original voice data segments to obtain the target voice data.
The energy of useful speech data is usually greater than that of interfering data (e.g., silence or non-human sound), so the electronic device may filter the voice data to be processed based on its energy. Specifically, the electronic device may divide the voice data to be processed into a plurality of original voice data segments of a preset length and apply a time-frequency transform to each segment to obtain its frequency-domain information, which describes the relationship between frequency and energy in that segment. The energy value of each original voice data segment is then computed from its frequency-domain information. A segment with a small energy value is more likely to be interfering data, while a segment with a large energy value is more likely to be useful speech; therefore, the original voice data segments whose energy value is less than or equal to the preset energy value are deleted, and the remaining segments, whose energy value exceeds the preset energy value, are merged to obtain the target voice data. A sketch of this step follows.
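As an illustration of this filtering step, the following is a minimal Python sketch, assuming the voice data is already available as a NumPy array of mono samples; the segment length, the FFT-based energy estimate, and the threshold value are illustrative assumptions rather than values given in the patent.

```python
import numpy as np

def filter_by_energy(samples: np.ndarray, sample_rate: int = 16000,
                     segment_seconds: float = 0.5,
                     energy_threshold: float = 1e-3) -> np.ndarray:
    """Delete low-energy segments (silence, noise) and merge the rest."""
    seg_len = int(segment_seconds * sample_rate)
    kept = []
    for start in range(0, len(samples), seg_len):
        segment = samples[start:start + seg_len]
        # Time-frequency transform: the squared magnitude spectrum gives
        # the segment's energy (by Parseval's theorem it tracks the
        # time-domain energy).
        spectrum = np.fft.rfft(segment)
        energy = np.sum(np.abs(spectrum) ** 2) / max(len(segment), 1)
        if energy > energy_threshold:  # keep only likely-useful speech
            kept.append(segment)
    # Merging the surviving segments yields the target voice data.
    if not kept:
        return np.array([], dtype=samples.dtype)
    return np.concatenate(kept)
```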
In another embodiment, the preset processing rule includes a data format processing rule, and step S102 includes: and acquiring the data format of the voice data to be processed, and when the data format of the voice data to be processed is different from a preset data format, performing format conversion processing on the voice data to be processed according to the preset data format to obtain the target voice data.
To make the voice data easier for the voice network model to predict, the electronic device may obtain the data format of the voice data to be processed. When that format differs from the preset data format, the voice data is not suitable for prediction by the voice network model. The preset data format is a format suitable for such prediction, for example pulse-code modulation (PCM) or the Audio Interchange File Format (AIFF). The voice data to be processed is then converted into the preset data format, and the resulting target voice data is in the preset data format.
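For the format-conversion rule, a sketch using the pydub library (an assumption; the patent does not name any library) could convert an arbitrary recording into PCM stored in a WAV container; the 16 kHz mono 16-bit parameters are likewise illustrative.

```python
from pydub import AudioSegment  # pydub decodes via ffmpeg

def to_pcm_wav(src_path: str, dst_path: str) -> None:
    """Convert input audio of any supported format to PCM WAV
    (16 kHz, mono, 16-bit) so the voice network model can consume it."""
    audio = AudioSegment.from_file(src_path)
    audio = audio.set_frame_rate(16000).set_channels(1).set_sample_width(2)
    audio.export(dst_path, format="wav")  # WAV stores raw PCM samples

# e.g. to_pcm_wav("call_recording.mp3", "call_recording.wav")
```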
S103, dividing the target voice data to obtain a plurality of voice data segments, inputting the voice data segments into a voice network model for prediction to obtain a prediction label of each voice data segment, wherein the prediction label comprises the probability that the voice data segment is a boundary point.
In the embodiments of the invention, the electronic device may divide the target voice data into a plurality of voice data segments that do not overlap, which improves prediction efficiency, or into a plurality of voice data segments that overlap one another, which improves prediction accuracy. The voice data segments are then input into the voice network model, which analyzes the feature points in each segment and predicts a label for it.
In one embodiment, the electronic device may divide the target voice data into a plurality of non-overlapping voice data segments. Specifically, the electronic device may divide the target voice data into segments of a preset voice length. For example, if the preset voice length is 10 s and the target voice data is 100 s long, the electronic device may divide it into 10 voice data segments: the first segment is 0–10 s, the second 10–20 s, the third 20–30 s, and so on.
In another embodiment, if the voice data segments do not overlap, a boundary point is easily missed when the speech transitions quickly from one object to another. For example, suppose the first and second voice data segments are adjacent and non-overlapping, the first segment is the voice data of object A, and the second is the voice data of object B. Because the feature points within the first segment are similar to one another, and likewise within the second segment, neither segment is predicted to be a boundary point, even though the transition lies exactly at their junction, so the boundary point goes undetected. To avoid missing boundary points, the electronic device may divide the target voice data into a plurality of mutually overlapping voice data segments. Specifically, the target voice data is divided with a preset step size, and the length of each resulting voice data segment is greater than the preset step size. The segment lengths may be the same or different; if they are the same, then with a preset step of 8 s and a segment length of 10 s, the first segment is 0–10 s, the second 8–18 s, and the third 16–26 s, so that every two adjacent segments share overlapping voice data. A sketch of this windowing appears below.
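A minimal sketch of this overlapping division, using the 10 s window and 8 s step from the example above (representing the audio as a NumPy sample array is an assumption):

```python
import numpy as np

def split_with_overlap(samples: np.ndarray, sample_rate: int,
                       window_seconds: float = 10.0,
                       step_seconds: float = 8.0) -> list:
    """Divide target voice data into windows longer than the step size,
    so every two adjacent segments share an overlapping region."""
    window = int(window_seconds * sample_rate)
    step = int(step_seconds * sample_rate)
    segments = []
    for start in range(0, max(len(samples) - window, 0) + 1, step):
        segments.append(samples[start:start + window])
    return segments
```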
In one embodiment, inputting the plurality of voice data segments into the voice network model for prediction to obtain a prediction label for each segment works as follows: the voice network model analyzes the similarity of the feature points in each voice data segment, computes a similarity sum for the segment from those similarities, and determines the segment's prediction label from its similarity sum. The larger the similarity sum, the more likely the segment is voice data of a single object and the less likely it contains a boundary point; the smaller the similarity sum, the less likely the segment belongs to a single object and the more likely it contains a boundary point. For example, suppose the first voice data segment contains five feature points x1, x2, x3, x4, and x5. The electronic device may compute the similarity between each feature point and all the others (for example between x1 and each of x2, x3, x4, and x5), accumulate these to obtain the similarity sum of x1, compute the similarity sums of x2 through x5 in the same way, and accumulate the per-point sums to obtain the similarity sum of the first segment, from which the segment's prediction label is determined and output. The prediction label contains the probability that the voice data segment is a boundary point, a value in [0, 1]: the larger the value, the more likely the segment (or a feature point within it) is a boundary point, and the smaller the value, the less likely. In particular, a probability of 0 means the segment (or feature point) is not a boundary point, and a probability of 1 means it is.
In one embodiment, the feature points of a voice data segment may describe any one or more of energy, pitch, and timbre, where energy is the intensity (loudness) of the sound, pitch is how high or low the sound is, and timbre is the character of the sound. When the feature points include energy, pitch, and timbre, the similarity sum of each voice data segment is computed as follows: a first similarity sum is computed from the similarities between the energy feature points of the segment, a second similarity sum from the similarities between the pitch feature points, and a third similarity sum from the similarities between the timbre feature points; the three sums are then combined by weighted summation to obtain the similarity sum of the segment, as in the sketch below. The weights of the first, second, and third similarity sums may be set by the user or chosen by the electronic device according to the application scenario. For example, in a scenario where the objects differ greatly in loudness, the weight of the first similarity sum may be set large to emphasize the difference in sound energy; where they differ greatly in pitch, the weight of the second similarity sum may be set large; and where they differ greatly in timbre, the weight of the third similarity sum may be set large.
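A sketch of the weighted similarity sum follows; the use of cosine similarity between feature points and the specific weight values are assumptions made for illustration, not choices stated in the patent.

```python
import numpy as np

def similarity_sum(feats: np.ndarray) -> float:
    """Sum of pairwise similarities between the feature points of one
    segment (rows of `feats`); cosine similarity is an assumed choice."""
    unit = feats / np.clip(np.linalg.norm(feats, axis=1, keepdims=True),
                           1e-8, None)
    sim = unit @ unit.T                       # pairwise cosine similarities
    return float(sim.sum() - np.trace(sim))   # exclude self-similarity

def weighted_similarity_sum(energy_feats, pitch_feats, timbre_feats,
                            w_energy=0.3, w_pitch=0.3, w_timbre=0.4):
    """Weighted sum of the first (energy), second (pitch), and third
    (timbre) similarity sums; the weights are scenario-dependent."""
    return (w_energy * similarity_sum(energy_feats)
            + w_pitch * similarity_sum(pitch_feats)
            + w_timbre * similarity_sum(timbre_feats))
```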
In one embodiment, the voice network model may consist of two bidirectional long short-term memory layers (Bi-LSTMs) and a multi-layer neural network connected to the output of the second Bi-LSTM layer. Each Bi-LSTM layer comprises a forward LSTM network layer and a backward LSTM network layer whose outputs are concatenated before being passed to the next layer, so the Bi-LSTM processes each voice data segment in both the forward and the reverse direction and can exploit both past and future context. The multi-layer neural network consists of three fully connected layers; the activation function of the first two layers may be tanh, and that of the last layer may be sigmoid, so the model outputs a probability (the prediction label) in the range 0 to 1. A sketch of this architecture follows.
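The description above maps naturally onto the following PyTorch sketch; the feature dimension, hidden size, and fully connected widths are illustrative assumptions, as the patent does not specify them.

```python
import torch
import torch.nn as nn

class BoundaryNet(nn.Module):
    """Two stacked Bi-LSTM layers followed by three fully connected
    layers with tanh, tanh, and sigmoid activations."""

    def __init__(self, feat_dim: int = 40, hidden: int = 128):
        super().__init__()
        # num_layers=2 stacks two bidirectional LSTMs; forward and
        # backward hidden states are concatenated at every time step.
        self.bilstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(2 * hidden, 64), nn.Tanh(),
            nn.Linear(64, 32), nn.Tanh(),
            nn.Linear(32, 1), nn.Sigmoid(),  # probability in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim) -> boundary probability per step
        out, _ = self.bilstm(x)
        return self.fc(out).squeeze(-1)      # (batch, time)
```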
S104, determining boundary points of the target voice data according to the prediction labels of the voice data segments so as to segment the voice data segments of each object from the target voice data according to the boundary points.
In the embodiments of the invention, to segment the voice data segment of each object, the electronic device may determine the boundary points of the target voice data from the prediction labels of the voice data segments; for example, it may treat the segments whose probability exceeds the preset probability as boundary points of the target voice data, or treat some of the feature points within such segments as boundary points. The voice data segment of each object is then segmented out of the target voice data at those boundary points so that the voice data of a particular object can be analyzed. A boundary point of the target voice data is a transition point between the voice data segments of different objects; for example, if the fourth voice data segment of the target voice data is a boundary point, then the first through fourth segments belong to a first object (e.g., a first person), and the segments after the fourth belong to a second object (e.g., a second person).
In the embodiments of the invention, the target voice data is obtained by applying format conversion and/or filtering to the voice data to be processed, which avoids interference from environmental noise and the like in subsequent processing and improves processing efficiency. The voice data segments of the target voice data are then input into the voice network model for prediction to obtain a prediction label for each segment, the boundary points of the target voice data are determined from the prediction labels, and the target voice data is segmented at those boundary points into the voice data segment of each object. Because the boundary points of the target voice data are obtained automatically by the voice network model rather than by manual work, a great amount of manpower is saved, the accuracy and efficiency of boundary-point acquisition are improved, and users' demands for intelligent, automated voice data processing are met.
Fig. 2 is a schematic flow chart of another voice data processing method according to an embodiment of the present invention, and the method according to the embodiment of the present invention may be performed by the above-mentioned electronic device. In this embodiment, the voice data processing method includes the following steps.
S201, acquiring a training sample set, wherein the training sample set comprises a plurality of sample voice data segments of sample audio data and labeling labels of each sample voice data segment, and the labeling labels comprise probabilities of the sample voice data segments as boundary points.
S202, taking the plurality of sample voice data segments as the input of the voice network model and the labeling label of each sample voice data segment as the training target of the voice network model, and iteratively training the voice network model.
In steps S201 and S202, the voice network model may be trained and optimized to improve its prediction accuracy. Specifically, the electronic device may collect a training sample set comprising a plurality of sample voice data segments of sample audio data together with a labeling label for each segment. The labeling label contains the probability that the sample segment is a boundary point and may be assigned by manually annotating the sample. To improve the applicability of the voice network model, the sample audio data may consist of voice data segments of objects from different regions and/or of different ages. The plurality of sample voice data segments are then used as the input of the voice network model, and the labeling label of each segment is used as its training target. When the prediction label the model outputs for a segment is the same as, or close to, the segment's labeling label, the model's prediction accuracy is high and the iterative training ends; when the prediction label differs greatly from the labeling label, the model's accuracy is low, so the network parameters are adjusted and the iterative training continues.
In one embodiment, step S202 includes: inputting the plurality of sample voice data segments into the voice network model for prediction to obtain a prediction label for each sample segment; determining the prediction error of the voice network model from the prediction label of each sample segment and the corresponding labeling label; if the prediction error is greater than the preset error value, adjusting the network parameters of the voice network model; and if the prediction error is less than or equal to the preset error value, ending the iterative training. A sketch of this training loop follows.
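The sketch below assumes the `BoundaryNet` model above, a PyTorch DataLoader yielding (features, labels) batches, and a binary cross-entropy loss matching the sigmoid output; the error threshold, learning rate, and epoch cap are illustrative assumptions.

```python
import torch

def train_until_threshold(model, loader, max_epochs=50,
                          error_threshold=0.05, lr=1e-3):
    """Iterate until the prediction error drops to the preset value."""
    loss_fn = torch.nn.BCELoss()   # matches the sigmoid output layer
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(max_epochs):
        total, n = 0.0, 0
        for feats, labels in loader:       # sample segments + labels
            opt.zero_grad()
            loss = loss_fn(model(feats), labels)
            loss.backward()                # adjust network parameters
            opt.step()
            total += loss.item() * feats.size(0)
            n += feats.size(0)
        if total / max(n, 1) <= error_threshold:
            break                          # prediction accuracy is high
    return model
```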
The electronic device optimizes the voice network model by adjusting its network parameters. Specifically, the plurality of sample voice data segments are input into the voice network model for prediction to obtain a prediction label for each sample segment, and the prediction error of the model is determined from each segment's prediction label and its labeling label. If the prediction error is greater than the preset error value, the model's prediction accuracy is low, so the network parameters are adjusted and iterative training continues; if the prediction error is less than or equal to the preset error value, the accuracy is high and the iterative training can end. For example, suppose the target voice data comprises T voice data segments, represented as X = (x1, x2, ..., xT), where each segment consists of a plurality of feature points; the labeling label of the i-th segment is y_i and its prediction label is f(X)_i. The prediction error L of the voice network model may then be expressed as in formula (1).
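Formula (1) itself is not reproduced in this text. Given per-segment labels $y_i \in [0,1]$ and sigmoid outputs $f(X)_i$, a loss consistent with this setup is the binary cross-entropy, reconstructed below as an assumption rather than as the patent's verbatim formula:

```latex
L = -\frac{1}{T}\sum_{i=1}^{T}\Big[\, y_i \log f(X)_i + \big(1 - y_i\big)\log\big(1 - f(X)_i\big) \Big] \tag{1}
```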
In one embodiment, the electronic device receives an instruction to set the labeling label of a target sample voice data segment to a first-type labeling label, where a first-type labeling label is a label whose probability of the sample segment being a boundary point is greater than the preset probability, and the target sample voice data segment is the segment in which a boundary point of the sample audio data lies. The labeling labels of the sample segments whose time interval from the target sample segment is less than or equal to a preset time interval are also set to first-type labeling labels, and the labeling labels of the sample segments whose time interval from the target sample segment is greater than the preset time interval are set to second-type labeling labels, where a second-type labeling label is a label whose probability of the sample segment being a boundary point is less than or equal to the preset probability.
Voice data contains boundary points and non-boundary points, and because boundary points are relatively rare, only a few voice data segments have a boundary probability above the preset threshold while the vast majority fall below it. The imbalance between these two classes causes problems when training the voice network model, such as reduced prediction accuracy. The number of positive samples, i.e., segments whose boundary probability is greater than the preset probability, is therefore increased in the vicinity of each true boundary point. Specifically, the electronic device may receive an instruction to set the labeling label of the target sample voice data segment, which is the segment containing a real boundary point of the sample audio data, to a first-type labeling label. To mitigate the imbalance, the labeling labels of all sample segments within the preset time interval of the target sample segment are also set to first-type labeling labels, i.e., the segments near the target are all labeled as containing boundary points (positive samples), while the segments farther than the preset time interval from the target are labeled with second-type labeling labels, i.e., as containing no boundary point. For example, with a preset time interval of 50 ms, the electronic device labels every sample voice data segment within 50 ms of the target sample segment with a first-type labeling label; a sketch follows.
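A sketch of this label-balancing step, representing each sample segment by its start time in milliseconds (an illustrative simplification):

```python
def balance_labels(segment_times_ms, boundary_times_ms, max_gap_ms=50):
    """Assign a first-type (positive) label to every sample segment
    within max_gap_ms of a true boundary point, and a second-type
    (negative) label to all the others."""
    labels = []
    for t in segment_times_ms:
        near_boundary = any(abs(t - b) <= max_gap_ms
                            for b in boundary_times_ms)
        labels.append(1 if near_boundary else 0)
    return labels

# e.g. boundaries at 4000 ms and 9000 ms, one segment every 25 ms:
# labels = balance_labels(range(0, 12000, 25), [4000, 9000])
```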
S203, acquiring voice data to be processed, wherein the voice data to be processed consists of voice data segments of a plurality of objects.
S204, processing the voice data to be processed according to preset processing rules to obtain target voice data, wherein the preset processing rules comprise data filtering rules and/or data format processing rules.
S205, dividing the target voice data to obtain a plurality of voice data segments, inputting the voice data segments into a voice network model for prediction to obtain a prediction label of each voice data segment, wherein the prediction label comprises the probability that the voice data segment is a boundary point.
S206, determining boundary points of the target voice data according to the prediction labels of the voice data segments, so as to segment the voice data segments of each object from the target voice data according to the boundary points.
In one embodiment, each voice data segment includes a plurality of feature points, every two adjacent segments share common feature points, and the prediction label of each segment serves as the prediction label of each feature point within it. Step S206 then includes: counting, from the prediction labels of the voice data segments, the average probability that each feature point is a boundary point, and taking all feature points whose average probability exceeds the preset probability as the boundary points of the target voice data.
Because adjacent voice data segments overlap, every two adjacent segments share feature points, so the same feature point can receive two probabilities; the electronic device therefore determines the boundary points of the voice data from the average probability of each feature point. For example, suppose the target voice data includes a first and a second voice data segment, the first containing feature points x1, x2, x3, x4, and x5 with a prediction label of 0.2, so that x1 through x5 each receive probability 0.2, and the second containing feature points x4, x5, x6, x7, and x8 with a prediction label of 0.5, so that x4 through x8 each receive probability 0.5. The average label (average probability) of each feature point is then determined: x1, x2, and x3 average 0.2; x4 and x5 average 0.35; and the averages of x6, x7, and x8 additionally depend on the prediction label of the third segment. All feature points whose average probability exceeds the preset probability are taken as the boundary points of the target voice data; a sketch of this averaging appears below.
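A sketch of this frame-wise averaging, assuming equal-length segments whose prediction label is shared by every feature point (frame) they contain:

```python
import numpy as np

def average_frame_probabilities(segment_probs, starts, seg_len, total_len):
    """Average the per-segment probabilities over every frame they
    cover; overlapping segments contribute jointly to shared frames."""
    prob_sum = np.zeros(total_len)
    count = np.zeros(total_len)
    for p, s in zip(segment_probs, starts):
        prob_sum[s:s + seg_len] += p
        count[s:s + seg_len] += 1
    return prob_sum / np.maximum(count, 1)

# Frames whose average probability exceeds the preset probability are
# the boundary points, e.g.: boundaries = np.where(avg > 0.5)[0]
```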
In the embodiments of the invention, the target voice data is obtained by applying format conversion and/or filtering to the voice data to be processed, which avoids interference from environmental noise and the like in subsequent processing and improves processing efficiency. The voice data segments of the target voice data are then input into the voice network model for prediction to obtain a prediction label for each segment, the boundary points of the target voice data are determined from the prediction labels, and the target voice data is segmented at those boundary points into the voice data segment of each object. Because the boundary points of the target voice data are obtained automatically by the voice network model rather than by manual work, a great amount of manpower is saved, the accuracy and efficiency of boundary-point acquisition are improved, and users' demands for intelligent, automated voice data processing are met.
Fig. 3 is a schematic structural diagram of a voice data processing apparatus according to an embodiment of the present invention, where the apparatus according to the embodiment of the present invention may be disposed in the above-mentioned electronic device. In this embodiment, the apparatus includes:
an obtaining unit 301, configured to obtain to-be-processed voice data, where the to-be-processed voice data is composed of voice data segments of a plurality of objects.
a processing unit 302, configured to process the voice data to be processed according to a preset processing rule to obtain target voice data, where the preset processing rule includes a data filtering rule and/or a data format processing rule.
a prediction unit 303, configured to divide the target voice data to obtain a plurality of voice data segments, and to input the voice data segments into a voice network model for prediction to obtain a prediction label of each voice data segment, where the prediction label includes the probability that the voice data segment is a boundary point.
a segmentation unit 304, configured to determine boundary points of the target voice data according to the prediction labels of the voice data segments, so as to segment the voice data segment of each object from the target voice data according to the boundary points.
Optionally, the prediction unit 303 is configured to divide the target voice data according to a preset step size to obtain a plurality of voice data segments, where the length of each voice data segment is greater than the preset step size.
Optionally, a training unit 305 is configured to obtain a training sample set, where the training sample set includes a plurality of sample voice data segments of sample audio data and a labeling label for each sample voice data segment, the labeling label including the probability that the sample voice data segment is a boundary point; and to take the plurality of sample voice data segments as the input of the voice network model and the labeling label of each segment as the training target of the voice network model, iteratively training the voice network model.
Optionally, the training unit 305 is configured to input the plurality of sample voice data segments into the voice network model for prediction to obtain a prediction label for each sample segment; determine the prediction error of the voice network model from the prediction label of each sample segment and the corresponding labeling label; adjust the network parameters of the voice network model if the prediction error is greater than a preset error value; and end the iterative training if the prediction error is less than or equal to the preset error value.
Optionally, the training unit 305 is configured to receive an instruction to set the labeling label of a target sample voice data segment to a first-type labeling label, where a first-type labeling label is a label whose probability of the sample voice data segment being a boundary point is greater than a preset probability, and the target sample voice data segment is the voice data segment in which a boundary point of the sample audio data lies; set the labeling labels of the sample voice data segments whose time interval from the target sample voice data segment is less than or equal to a preset time interval to first-type labeling labels; and set the labeling labels of the sample voice data segments whose time interval from the target sample voice data segment is greater than the preset time interval to second-type labeling labels, where a second-type labeling label indicates that the probability of the sample voice data segment being a boundary point is less than or equal to the preset probability.
Optionally, each voice data segment includes a plurality of feature points, every two adjacent voice data segments share common feature points, and the prediction label of each voice data segment is the prediction label of each feature point in that segment; the segmentation unit 304 is configured to count, according to the prediction labels of the voice data segments, the average probability that each feature point is a boundary point, and to take all feature points whose average probability is greater than the preset probability as boundary points of the target voice data.
Optionally, the preset processing rule includes a data filtering rule, and the processing unit 302 is configured to divide the voice data to be processed to obtain a plurality of original voice data segments; acquire an energy value of each of the original voice data segments; delete the original voice data segments whose energy value is less than or equal to a preset energy value; and merge the original voice data segments whose energy value is greater than the preset energy value to obtain the target voice data.
In the embodiments of the invention, the target voice data is obtained by applying format conversion and/or filtering to the voice data to be processed, which avoids interference from environmental noise and the like in subsequent processing and improves processing efficiency. The voice data segments of the target voice data are then input into the voice network model for prediction to obtain a prediction label for each segment, the boundary points of the target voice data are determined from the prediction labels, and the target voice data is segmented at those boundary points into the voice data segment of each object. Because the boundary points of the target voice data are obtained automatically by the voice network model rather than by manual work, a great amount of manpower is saved, the accuracy and efficiency of boundary-point acquisition are improved, and users' demands for intelligent, automated voice data processing are met.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. The electronic device in the illustrated embodiment may include one or more processors 401, one or more input devices 402, one or more output devices 403, and a memory 404, with the processor 401, input device 402, output device 403, and memory 404 connected via a bus.
The processor 401 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.
The input device 402 may include a touch pad, a fingerprint sensor (for collecting fingerprint information of a user and direction information of a fingerprint), a microphone, etc., the output device 403 may include a display (LCD, etc.), a speaker, etc., and the output device 403 may output prompt information, which may be used to prompt a boundary point of target voice data.
The memory 404 may include read-only memory and random access memory and provides instructions and data to the processor 401; a portion of the memory 404 may also comprise non-volatile random access memory. The memory 404 is adapted to store a computer program comprising program instructions, and the processor 401 is adapted to execute the program instructions stored by the memory 404 to perform the voice data processing method, i.e., to perform the following operations:
acquiring voice data to be processed, wherein the voice data to be processed consists of voice data segments of a plurality of objects;
processing the voice data to be processed according to a preset processing rule to obtain target voice data, wherein the preset processing rule comprises a data filtering rule and/or a data format processing rule;
dividing the target voice data to obtain a plurality of voice data segments; inputting the voice data segments into a voice network model for prediction to obtain a prediction label of each voice data segment, wherein the prediction label comprises the probability that the voice data segment is a boundary point;
and determining boundary points of the target voice data according to the prediction labels of the voice data segments, so as to segment the voice data segment of each object from the target voice data according to the boundary points.
Optionally, the processor 401 is configured to execute program instructions stored in the memory 404, and is configured to perform the following operations:
dividing the target voice data according to a preset step size to obtain a plurality of voice data segments, wherein the length of each voice data segment is greater than the preset step size.
Optionally, the processor 401 is configured to execute program instructions stored in the memory 404, and is configured to perform the following operations:
acquiring a training sample set, wherein the training sample set comprises a plurality of sample voice data segments of sample audio data and a labeling label for each sample voice data segment, the labeling label comprising the probability that the sample voice data segment is a boundary point;
and taking the plurality of sample voice data segments as the input of the voice network model and the labeling label of each sample voice data segment as the training target of the voice network model, iteratively training the voice network model.
Optionally, the processor 401 is configured to execute program instructions stored in the memory 404, and is configured to perform the following operations:
inputting the plurality of sample voice data segments into the voice network model for prediction to obtain a prediction label of each sample voice data segment;
determining a prediction error of the voice network model according to the prediction label of each sample voice data segment and the labeling label of the corresponding sample voice data segment;
if the prediction error of the voice network model is greater than a preset error value, adjusting the network parameters of the voice network model;
and if the prediction error of the voice network model is less than or equal to the preset error value, ending the iterative training of the voice network model.
Optionally, the processor 401 is configured to execute program instructions stored in the memory 404, and is configured to perform the following operations:
receiving an instruction to set the labeling label of a target sample voice data segment to a first-type labeling label, wherein a first-type labeling label is a label whose probability of the sample voice data segment being a boundary point is greater than a preset probability, and the target sample voice data segment is the voice data segment in which a boundary point of the sample audio data lies;
setting the labeling labels of the sample voice data segments whose time interval from the target sample voice data segment is less than or equal to a preset time interval to first-type labeling labels;
and setting the labeling labels of the sample voice data segments whose time interval from the target sample voice data segment is greater than the preset time interval to second-type labeling labels, wherein a second-type labeling label indicates that the probability of the sample voice data segment being a boundary point is less than or equal to the preset probability.
Optionally, the processor 401 is configured to execute program instructions stored in the memory 404, and is configured to perform the following operations:
counting, according to the prediction label of each voice data segment, the average probability that each feature point is a boundary point;
and taking all the feature points whose average probability is greater than the preset probability as boundary points of the target voice data.
Optionally, the processor 401 is configured to execute program instructions stored in the memory 404, and is configured to perform the following operations:
dividing the voice data to be processed to obtain a plurality of original voice data segments;
acquiring an energy value of each original voice data segment in the plurality of original voice data segments;
deleting the original voice data segments with the energy value smaller than or equal to a preset energy value in the plurality of original voice data segments;
and merging the original voice data segments whose energy value is greater than the preset energy value to obtain the target voice data.
The processor 401, the input device 402, and the output device 403 described in the embodiments of the present invention may execute the implementation manners described in the first embodiment and the second embodiment of the voice data processing method provided in the embodiments of the present invention, and may also execute the implementation manner of the electronic device described in the embodiments of the present invention, which is not described herein again.
In an embodiment of the present invention, a computer readable storage medium is provided, where a computer program is stored, where the computer program includes program instructions, which when executed by a processor, implement the method for processing speech data shown in the embodiments of fig. 1 and 2 of the present invention.
The computer-readable storage medium may be an internal storage unit of the control device according to any of the foregoing embodiments, such as a hard disk or a memory of the control device. The computer-readable storage medium may also be an external storage device of the control device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the control device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the control device. The computer-readable storage medium is used to store the computer program and the other programs and data required by the control device, and may also be used to temporarily store data that has been output or is to be output.
Those of ordinary skill in the art will appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein may be implemented in electronic hardware, in computer software, or in a combination of the two. To clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functions. Whether these functions are implemented in hardware or in software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. It will also be clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the control apparatus and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
In the several embodiments provided by the present application, it should be understood that the disclosed control apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division of the units is merely a division by logical function, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed.
The foregoing is merely a specific embodiment of the present invention, and the present invention is not limited thereto. Any equivalent modification or substitution readily conceivable by those skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (9)

1. A voice data processing method, comprising:
acquiring voice data to be processed, wherein the voice data to be processed consists of voice data segments of a plurality of objects;
Processing the voice data to be processed according to preset processing rules to obtain target voice data, wherein the preset processing rules comprise a data filtering rule and a data format processing rule; the data filtering rule refers to deleting, from the voice data to be processed, original voice data segments whose energy values are less than or equal to a preset energy value, and the data format processing rule refers to converting the data format of the voice data to be processed into a preset data format, the preset data format being a data format suitable for voice data prediction by a voice network model;
Dividing the target voice data to obtain a plurality of voice data segments; accumulating, through a voice network model, the similarity between any feature point in a target voice data segment and each other feature point in the target voice data segment to obtain a similarity sum of that feature point, accumulating the similarity sums of all feature points in the target voice data segment to obtain a similarity sum of the target voice data segment, and determining a prediction label of the target voice data segment according to the similarity sum of the target voice data segment, wherein the prediction label comprises the probability that the target voice data segment is a boundary point; the similarity sum is negatively correlated with the probability in the prediction label, and the target voice data segment is any one of the plurality of voice data segments; the feature points of the target voice data segment comprise a plurality of items among energy, pitch, and timbre; each of the plurality of voice data segments comprises a plurality of feature points, every two adjacent voice data segments comprise some of the same feature points, and the prediction label of each of the plurality of voice data segments is a prediction label of each feature point in the corresponding voice data segment;
And taking the voice data segments whose probability is greater than a preset probability among the plurality of voice data segments as boundary points of the target voice data, or taking some feature points in the voice data segments whose probability is greater than the preset probability as boundary points of the target voice data, or counting, according to the prediction label of each of the plurality of voice data segments, the average probability of each feature point being a boundary point and taking all feature points whose average probability is greater than the preset probability as boundary points of the target voice data, so as to segment the voice data segment of each object from the target voice data according to the boundary points.
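For orientation, a minimal sketch of the similarity-sum computation recited in claim 1, using cosine similarity and a linear mapping from similarity sum to probability; the claim fixes only the negative correlation, so both of these choices are assumptions:

```python
import numpy as np

def segment_boundary_probability(features):
    """features: (n, d) array with one d-dimensional feature vector per
    feature point of the target voice data segment (e.g. energy, pitch,
    and timbre descriptors). Returns a boundary probability that is
    negatively correlated with the segment's similarity sum."""
    features = np.asarray(features, dtype=float)
    n = len(features)
    if n < 2:
        return 0.0
    # Cosine similarity between every pair of feature points
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    # Similarity sum of each feature point: its similarity to all the
    # other feature points in the segment
    point_sums = sim.sum(axis=1) - 1.0          # drop self-similarity
    # Similarity sum of the whole target voice data segment
    segment_sum = float(point_sums.sum())
    # Negative correlation: a homogeneous segment (one speaker) has a
    # large similarity sum and hence a small boundary probability
    return 1.0 - float(np.clip(segment_sum, 0.0, None)) / (n * (n - 1))
```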
2. The method of claim 1, wherein dividing the target voice data to obtain a plurality of voice data segments comprises:
Dividing the target voice data according to a preset step length to obtain the plurality of voice data segments, wherein the length of each of the plurality of voice data segments is greater than the preset step length.
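A one-function sketch of this overlapping division, with illustrative segment and step lengths:

```python
def divide_with_overlap(feature_points, segment_len=32, step=16):
    """Divide the target voice data (here a sequence of feature points)
    by a preset step length. Because segment_len exceeds the step, every
    two adjacent segments share segment_len - step feature points."""
    return [feature_points[i:i + segment_len]
            for i in range(0, len(feature_points) - segment_len + 1, step)]
```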
3. The method according to claim 1, wherein the method further comprises:
Acquiring a training sample set, wherein the training sample set comprises a plurality of sample voice data segments of sample audio data and labeling labels of each sample voice data segment, and the labeling labels comprise probabilities of the sample voice data segments as boundary points;
And taking the plurality of sample voice data segments as input of the voice network model, taking the labeling label of each sample voice data segment as a training target of the voice network model, and carrying out iterative training on the voice network model.
4. The method of claim 3, wherein iteratively training the voice network model with the plurality of sample voice data segments as inputs to the voice network model and the labeling label of each sample voice data segment as a training target for the voice network model comprises:
inputting the plurality of sample voice data segments into the voice network model for prediction to obtain a prediction label of each sample voice data segment;
Determining a prediction error of the voice network model according to the prediction label of each sample voice data segment and the labeling label of the corresponding sample voice data segment;
If the prediction error of the voice network model is larger than a preset error value, adjusting network parameters of the voice network model;
and if the prediction error of the voice network model is smaller than or equal to the preset error value, ending the iterative training of the voice network model.
5. The method according to claim 3 or 4, characterized in that the method further comprises:
Receiving an instruction to set the labeling label of a target sample voice data segment as a first-type labeling label, wherein the first-type labeling label indicates that the probability of a sample voice data segment being a boundary point is greater than a preset probability, and the target sample voice data segment is the voice data segment in which a boundary point of the sample audio data is located;
Setting, as the first-type labeling label, the labeling label of each sample voice data segment whose time interval from the target sample voice data segment is less than or equal to a preset time interval;
And setting, as a second-type labeling label, the labeling label of each sample voice data segment whose time interval from the target sample voice data segment is greater than the preset time interval, wherein the second-type labeling label indicates that the probability of the sample voice data segment being a boundary point is less than or equal to the preset probability.
6. The method according to claim 1, wherein the preset processing rules include data filtering rules, and the processing the voice data to be processed according to the preset processing rules to obtain target voice data includes:
dividing the voice data to be processed to obtain a plurality of original voice data segments;
acquiring an energy value of each original voice data segment in the plurality of original voice data segments;
deleting, from the plurality of original voice data segments, the original voice data segments whose energy values are less than or equal to a preset energy value;
And merging the original voice data segments whose energy values are greater than the preset energy value to obtain the target voice data.
7. A voice data processing apparatus, comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring voice data to be processed, and the voice data to be processed consists of voice data segments of a plurality of objects;
The processing unit is used for processing the voice data to be processed according to preset processing rules to obtain target voice data, wherein the preset processing rules comprise data filtering rules and data format processing rules; the data filtering rule refers to deleting an original voice data segment with the energy value smaller than or equal to a preset energy value in the voice data to be processed, the data format processing rule refers to converting the data format of the voice data to be processed into a preset data format, and the preset data format is a data format suitable for voice network model voice data prediction;
The prediction unit is used for dividing the target voice data to obtain a plurality of voice data segments; accumulating, through a voice network model, the similarity between any feature point in a target voice data segment and each other feature point in the target voice data segment to obtain a similarity sum of that feature point, accumulating the similarity sums of all feature points in the target voice data segment to obtain a similarity sum of the target voice data segment, and determining a prediction label of the target voice data segment according to the similarity sum of the target voice data segment, wherein the prediction label comprises the probability that the target voice data segment is a boundary point; the similarity sum is negatively correlated with the probability in the prediction label, and the target voice data segment is any one of the plurality of voice data segments; the feature points of the target voice data segment comprise a plurality of items among energy, pitch, and timbre; each of the plurality of voice data segments comprises a plurality of feature points, every two adjacent voice data segments comprise some of the same feature points, and the prediction label of each of the plurality of voice data segments is a prediction label of each feature point in the corresponding voice data segment;
the segmentation unit is used for taking the voice data segments whose probability is greater than a preset probability among the plurality of voice data segments as boundary points of the target voice data, or taking some feature points in the voice data segments whose probability is greater than the preset probability as boundary points of the target voice data, or counting, according to the prediction label of each of the plurality of voice data segments, the average probability of each feature point being a boundary point and taking all feature points whose average probability is greater than the preset probability as boundary points of the target voice data, so as to segment the voice data segment of each object from the target voice data according to the boundary points.
8. An electronic device, comprising:
a processor adapted to implement at least one instruction; and
A computer-readable storage medium storing at least one instruction adapted to be loaded by the processor and to perform the voice data processing method according to any one of claims 1-6.
9. A computer-readable storage medium storing at least one instruction adapted to be loaded by a processor and to perform the voice data processing method according to any one of claims 1-6.
CN201910018423.9A 2019-01-04 2019-01-04 Voice data processing method, device, equipment and storage medium Active CN109616097B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910018423.9A CN109616097B (en) 2019-01-04 2019-01-04 Voice data processing method, device, equipment and storage medium
PCT/CN2019/088976 WO2020140374A1 (en) 2019-01-04 2019-05-29 Voice data processing method, apparatus and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910018423.9A CN109616097B (en) 2019-01-04 2019-01-04 Voice data processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109616097A CN109616097A (en) 2019-04-12
CN109616097B true CN109616097B (en) 2024-05-10

Family

ID=66018221

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910018423.9A Active CN109616097B (en) 2019-01-04 2019-01-04 Voice data processing method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN109616097B (en)
WO (1) WO2020140374A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109616097B (en) * 2019-01-04 2024-05-10 平安科技(深圳)有限公司 Voice data processing method, device, equipment and storage medium
CN111145765B (en) * 2019-12-31 2022-04-15 思必驰科技股份有限公司 Audio processing method and device, electronic equipment and storage medium
CN111312224B (en) * 2020-02-20 2023-04-21 北京声智科技有限公司 Training method and device of voice segmentation model and electronic equipment
CN111312223B (en) * 2020-02-20 2023-06-30 北京声智科技有限公司 Training method and device of voice segmentation model and electronic equipment
CN111785302A (en) * 2020-06-23 2020-10-16 北京声智科技有限公司 Speaker separation method and device and electronic equipment
CN113035234B (en) * 2021-03-10 2024-02-09 湖南快乐阳光互动娱乐传媒有限公司 Audio data processing method and related device
CN113593528B (en) * 2021-06-30 2022-05-17 北京百度网讯科技有限公司 Training method and device of voice segmentation model, electronic equipment and storage medium


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080189109A1 (en) * 2007-02-05 2008-08-07 Microsoft Corporation Segmentation posterior based boundary point determination
JP6303971B2 (en) * 2014-10-17 2018-04-04 富士通株式会社 Speaker change detection device, speaker change detection method, and computer program for speaker change detection
US10535000B2 (en) * 2016-08-08 2020-01-14 Interactive Intelligence Group, Inc. System and method for speaker change detection
CN109616097B (en) * 2019-01-04 2024-05-10 平安科技(深圳)有限公司 Voice data processing method, device, equipment and storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004157337A (en) * 2002-11-06 2004-06-03 Nippon Telegr & Teleph Corp <Ntt> Method, device and program for topic boundary determination
CN104409080A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Voice end node detection method and device
CN106782507A (en) * 2016-12-19 2017-05-31 平安科技(深圳)有限公司 The method and device of voice segmentation
CN107342077A (en) * 2017-05-27 2017-11-10 国家计算机网络与信息安全管理中心 A kind of speaker segmentation clustering method and system based on factorial analysis
CN108257592A (en) * 2018-01-11 2018-07-06 广州势必可赢网络科技有限公司 Human voice segmentation method and system based on long-term and short-term memory model
CN108417201A (en) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 The more speaker's identity recognition methods of single channel and system
CN108766418A (en) * 2018-05-24 2018-11-06 百度在线网络技术(北京)有限公司 Sound end recognition methods, device and equipment
CN108986844A (en) * 2018-08-06 2018-12-11 东北大学 A kind of sound end detecting method based on speaker's phonetic feature

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ruiqing Yin et al., "Speaker Change Detection in Broadcast TV using Bidirectional Long Short-Term Memory Networks," Proc. Interspeech 2017, pp. 3827-3830 *

Also Published As

Publication number Publication date
WO2020140374A1 (en) 2020-07-09
CN109616097A (en) 2019-04-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant