CN111312224A - Training method and device of voice segmentation model and electronic equipment

Training method and device of voice segmentation model and electronic equipment

Info

Publication number
CN111312224A
CN111312224A (application number CN202010106843.5A)
Authority
CN
China
Prior art keywords
voice
segmentation model
speech
original
information
Prior art date
Legal status
Granted
Application number
CN202010106843.5A
Other languages
Chinese (zh)
Other versions
CN111312224B (en)
Inventor
王超
冯大航
陈孝良
Current Assignee
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd
Priority to CN202010106843.5A
Publication of CN111312224A
Application granted
Publication of CN111312224B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/04: Segmentation; Word boundary detection
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the disclosure provide a training method and apparatus for a voice segmentation model, an electronic device, and a computer-readable storage medium. The training method of the voice segmentation model comprises: acquiring an original voice map of a sample voice file; acquiring the labeling information of the target voice in the original voice map; initializing model parameters; inputting the original voice map into the voice segmentation model to obtain the prediction information of the target voice, wherein the prediction information of the target voice is obtained from a plurality of feature maps of different scales output by the voice segmentation model; calculating the error between the prediction information and the labeling information according to an objective function and updating the parameters of the voice segmentation model; and inputting the original voice map into the voice segmentation model with the updated parameters to iterate the parameter-updating process. By training the voice segmentation model on the original voice map, the method addresses the technical problem of inaccurate voice segmentation caused in the prior art by the complexity of the voice signal.

Description

Training method and device of voice segmentation model and electronic equipment
Technical Field
The present disclosure relates to the field of speech segmentation, and in particular, to a method and an apparatus for training a speech segmentation model, an electronic device, and a computer-readable storage medium.
Background
As a means of human-computer interaction, speech recognition plays an important role in freeing people's hands. With the emergence of all kinds of smart speakers, voice interaction has become a new source of value at the internet entrance; more and more smart devices support speech recognition, and voice interaction has become a bridge between people and devices. Speech segmentation is a branch of speech recognition that divides a segment of speech into different categories by time period; segmenting a segment of speech into the voices of speakers who do not speak at the same time, detecting speech endpoints, and aligning wake-up words all fall within the scope of speech segmentation.
Speech segmentation in the prior art is divided into methods based on segmentation-unit alignment and methods based on segmentation-unit boundary detection. Alignment-based methods require prior knowledge about the phones or syllables of the speech to be segmented, for example the number of phones/syllables involved, which limits their applicability. Most boundary-detection methods use only features extracted from the speech signal itself to detect boundaries.
However, the characteristics of the speech signal are complex, so the accuracy of speech segmentation remains relatively low; this is the problem to be solved.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, an embodiment of the present disclosure provides a method for training a speech segmentation model, including:
acquiring an original voice graph of a sample voice file;
acquiring the labeling information of the target voice in the original voice graph;
initializing model parameters of a voice segmentation model;
inputting the original voice graph into the voice segmentation model to obtain the prediction information of the target voice output by the voice segmentation model, wherein the prediction information of the target voice is obtained through a plurality of first feature graphs with different scales output by the voice segmentation model;
calculating the error of the prediction information and the labeling information according to an objective function;
updating parameters of the voice segmentation model according to the error;
and inputting the original voice graph into the voice segmentation model after the parameters are updated so as to iterate the parameter updating process until the error is smaller than a first threshold value.
Further, the obtaining of the original voice map of the sample voice file includes:
acquiring a sample voice file;
dividing the sample voice file into a plurality of voice frames;
and extracting the voice feature points in the plurality of voice frames to generate an original voice graph of the sample voice file.
Further, the dividing the sample voice file into a plurality of voice frames includes:
acquiring a voice frame length α and a voice frame moving interval β;
extracting a plurality of speech frames starting from the beginning of the sample voice file, wherein each speech frame has a length of α and the start points of two adjacent speech frames are spaced apart by β.
Further, the extracting of the voice feature points in the plurality of voice frames to generate an original voice map of the sample voice file includes:
carrying out short-time Fourier transform on each of the plurality of voice frames to obtain a plurality of frequency characteristic points;
and forming the plurality of frequency characteristic points into an original voice graph according to the sequence of the voice frames.
Further, the annotation information includes: labeling position information and labeling category information, wherein the prediction information comprises: predicted position information and prediction category information.
Further, the labeling position information includes a pair of labeling boundary points of the target voice in the original voice graph, and the labeling category information includes labeling probabilities of the target voice in multiple categories; the predicted position information comprises position information of a plurality of pairs of predicted boundary points of the target voice in the original voice graph, and the predicted category information comprises predicted probabilities of the target voice in a plurality of categories.
Further, the inputting the original voice map into the voice segmentation model to obtain the prediction information of the target voice output by the voice segmentation model includes:
inputting the original voice graph into the voice segmentation model;
the voice segmentation model outputs a plurality of first feature maps with different scales;
generating a plurality of pairs of default boundary points on the first feature map;
and performing convolution calculation on the first feature map to obtain a plurality of one-dimensional vectors, wherein each element in the one-dimensional vectors corresponds to a pair of default boundary points, and the value of each element is the prediction information of the target voice.
Further, the generating a plurality of pairs of default boundary points on the first feature map includes:
and generating a plurality of pairs of default boundary points by taking each group of pixel points with the same abscissa on the first feature map as a center line, wherein the number of pairs of default boundary points corresponding to each center line is the same.
Further, the performing convolution calculation on the first feature map to obtain a plurality of one-dimensional vectors includes:
and convolving the first feature map with (C+2) × K convolution kernels to obtain (C+2) × K one-dimensional vectors of size 1 × N, wherein K is the number of pairs of default boundary points corresponding to each center line, C is the number of categories of the target voice, and N is the number of center lines.
Further, the calculating an error between the prediction information and the labeling information according to an objective function includes:
and inputting the values of the elements in the one-dimensional vector and the labeling information into the objective function to calculate the error of the prediction information and the labeling information.
In a second aspect, an embodiment of the present disclosure provides a speech segmentation method, including:
acquiring a voice file to be recognized;
dividing the voice file to be recognized into a plurality of voice frames;
extracting voice feature points in the voice frames to generate an original voice graph of a voice file to be recognized;
inputting the original voice map of the voice file to be recognized into a voice segmentation model, wherein the voice segmentation model is trained using the training method of the voice segmentation model according to the first aspect;
and the voice segmentation model outputs the position information and the category information of the target voice in the original voice graph.
In a third aspect, an embodiment of the present disclosure provides a training apparatus for a speech segmentation model, including:
the original voice graph acquisition module is used for acquiring an original voice graph of the sample voice file;
the annotation information acquisition module is used for acquiring annotation information of the target voice in the original voice graph;
the parameter initialization module is used for initializing model parameters of the voice segmentation model;
the prediction information acquisition module is used for inputting the original voice image into the voice segmentation model to obtain prediction information of the target voice output by the voice segmentation model, wherein the prediction information of the target voice is obtained through a plurality of first feature images with different scales output by the voice segmentation model;
the error calculation module is used for calculating the errors of the prediction information and the labeling information according to an objective function;
the parameter updating module is used for updating the parameters of the voice segmentation model according to the errors;
and the parameter iteration module is used for inputting the original voice graph into the voice segmentation model after the parameters are updated so as to iterate the parameter updating process until the error is smaller than a first threshold value.
Further, the original voice map obtaining module is further configured to:
acquiring a sample voice file;
dividing the sample voice file into a plurality of voice frames;
and extracting the voice feature points in the plurality of voice frames to generate an original voice map of the sample voice file.
Further, the dividing the sample voice file into a plurality of voice frames includes:
acquiring a voice frame length α and a voice frame moving interval β;
extracting a plurality of speech frames starting from the beginning of the sample voice file, wherein each speech frame has a length of α and the start points of two adjacent speech frames are spaced apart by β.
Further, the extracting of the voice feature points in the plurality of voice frames to generate an original voice map of the sample voice file includes:
carrying out short-time Fourier transform on each of the plurality of voice frames to obtain a plurality of frequency characteristic points;
and forming the plurality of frequency characteristic points into an original voice graph according to the sequence of the voice frames.
Further, the annotation information includes: labeling position information and labeling category information, wherein the prediction information comprises: predicted position information and prediction category information.
Further, the labeling position information includes a pair of labeling boundary points of the target voice in the original voice graph, and the labeling category information includes labeling probabilities of the target voice in multiple categories; the predicted position information comprises position information of a plurality of pairs of predicted boundary points of the target voice in the original voice graph, and the predicted category information comprises predicted probabilities of the target voice in a plurality of categories.
Further, the prediction information obtaining module is further configured to:
inputting the original voice graph into the voice segmentation model;
the voice segmentation model outputs a plurality of first feature maps with different scales;
generating a plurality of pairs of default boundary points on the first feature map;
and performing convolution calculation on the first feature map to obtain a plurality of one-dimensional vectors, wherein each element in the one-dimensional vectors corresponds to a pair of default boundary points, and the value of each element is the prediction information of the target voice.
Further, the generating, by the prediction information obtaining module, a plurality of pairs of default boundary points on the first feature map includes:
and generating a plurality of pairs of default boundary points by taking each group of pixel points with the same abscissa on the first feature map as a center line, wherein the number of pairs of default boundary points corresponding to each center line is the same.
Further, the obtaining, by the prediction information obtaining module, a plurality of one-dimensional vectors by performing convolution calculation on the first feature map includes:
and convolving the first feature map with (C+2) × K convolution kernels to obtain (C+2) × K one-dimensional vectors of size 1 × N, wherein K is the number of pairs of default boundary points corresponding to each center line, C is the number of categories of the target voice, and N is the number of center lines.
Further, the error calculation module is further configured to:
and inputting the values of the elements in the one-dimensional vector and the labeling information into the objective function to calculate the error of the prediction information and the labeling information.
In a fourth aspect, an embodiment of the present disclosure provides a speech segmentation apparatus, including:
the voice file acquisition module is used for acquiring a voice file to be recognized;
the framing module is used for dividing the voice file to be recognized into a plurality of voice frames;
the original voice graph generating module is used for extracting voice feature points in the voice frames to generate an original voice graph of the voice file to be recognized;
the input module is used for inputting the original voice map of the voice file to be recognized into a voice segmentation model, wherein the voice segmentation model is trained using the training method of the voice segmentation model described above;
and the output module is used for outputting the position information and the category information of the target voice in the original voice graph by the voice segmentation model.
In a fifth aspect, an embodiment of the present disclosure provides an electronic device, including: a memory for storing computer readable instructions; and
a processor for executing the computer readable instructions, such that the processor when executing implements the method of the first or second aspect.
In a sixth aspect, the disclosed embodiments provide a non-transitory computer-readable storage medium storing computer-readable instructions which, when executed by a computer, cause the computer to perform the method of the first or second aspect.
The embodiment of the disclosure discloses a training method and device of a voice segmentation model, electronic equipment and a computer readable storage medium. The training method of the voice segmentation model comprises the following steps: acquiring an original voice graph of a sample voice file; acquiring the labeling information of the target voice in the original voice graph; initializing model parameters of a voice segmentation model; inputting the original voice graph into the voice segmentation model to obtain the prediction information of the target voice output by the voice segmentation model, wherein the prediction information of the target voice is obtained through a plurality of first feature graphs with different scales output by the voice segmentation model; calculating the error of the prediction information and the labeling information according to an objective function; updating parameters of the voice segmentation model according to the error; and inputting the original voice graph into the voice segmentation model after the parameters are updated so as to iterate the parameter updating process until the error is smaller than a first threshold value. The method solves the technical problem of inaccurate voice segmentation caused by complex voice signals in the prior art by converting the voice signals into the original voice images and training the voice segmentation model by using the original voice images.
The foregoing is a summary of the present disclosure, and for the purposes of promoting a clear understanding of the technical means of the present disclosure, the present disclosure may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
Fig. 1 is a schematic view of an application scenario of an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart illustrating a method for training a speech segmentation model according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram illustrating an embodiment of step S201 in a training method of a speech segmentation model according to an embodiment of the present disclosure;
FIG. 4 is a diagram illustrating a plurality of divided speech frames in a speech file according to an embodiment of the present disclosure;
FIGS. 5a-5b are schematic diagrams illustrating generation of an original phonetic diagram of a training method for a speech segmentation model according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram illustrating labeled location information in a training method of a speech segmentation model according to an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of a speech segmentation model provided by an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of default boundary points generated in a training method of a speech segmentation model according to an embodiment of the present disclosure;
FIG. 9 is a diagram illustrating feature vectors output by a speech segmentation model according to an embodiment of the present disclosure;
FIG. 10 is a schematic structural diagram of an embodiment of a training apparatus for a speech segmentation model provided in an embodiment of the present disclosure;
fig. 11 is a schematic structural diagram of an electronic device provided according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Fig. 1 is a schematic view of an application scenario of an embodiment of the present disclosure. As shown in fig. 1, a user 101 inputs voice to a terminal device 102. The terminal device 102 may be any device capable of receiving natural-language input, such as a smartphone, a smart speaker, or a smart home appliance. The terminal device 102 is connected through a network to a voice recognition device 103, which may be a computer, a smart terminal, or the like; the network over which the terminal device 102 communicates with the voice recognition device 103 may be a wireless network, such as a 5G or Wi-Fi network, or a wired network, such as an optical fiber network. In this application scenario, the user 101 speaks, the terminal device 102 collects the voice and sends it to the voice recognition device 103, and if the voice recognition device 103 segments and recognizes a target voice, the terminal device 102 executes the function corresponding to the target voice.
It will be appreciated that the speech recognition device 103 and the terminal device 102 may be arranged together, i.e. the terminal device 102 may incorporate speech recognition functionality, such that a user's speech input may be recognized directly in the terminal device 102. After the voice is recognized, the terminal device 102 may perform a function related to the voice according to the voice.
Fig. 2 is a flowchart of an embodiment of a method for training a speech segmentation model provided in this disclosure, where the method for training the speech segmentation model provided in this embodiment may be performed by a training apparatus of the speech segmentation model, the training apparatus of the speech segmentation model may be implemented as software, or implemented as a combination of software and hardware, and the training apparatus of the speech segmentation model may be integrated in a certain device in a training system of the speech segmentation model, such as a training server of the speech segmentation model or a training terminal device of the speech segmentation model. As shown in fig. 2, the method comprises the steps of:
step S201, obtaining an original voice picture of a sample voice file;
in this embodiment, the sample voice file needs to be processed into an image before the voice segmentation model is trained, so that speech segmentation can be performed using image features.
Optionally, as shown in fig. 3, the step S201 further includes:
step S301, obtaining a sample voice file;
step S302, dividing the sample voice file into a plurality of voice frames;
step S303, extracting the voice feature points in the plurality of voice frames to generate an original voice map of the sample voice file.
In this disclosure, the sample speech file is a speech file in a sample set, and the plurality of speech frames may be a plurality of non-overlapping speech frames connected end to end or a plurality of partially overlapping speech frames.
Optionally, the step S302 includes:
acquiring a voice frame length α and a voice frame moving interval β;
extracting a plurality of speech frames starting from the beginning of the sample voice file, wherein each speech frame has a length of α and the start points of two adjacent speech frames are spaced apart by β.
Fig. 4 shows an example of a plurality of speech frames. As shown in fig. 4, A and F are the two endpoints of the voice file: A is the beginning and F is the end. In this example the voice file is divided by the framing operation into 3 speech frames, AB, CD and EF, where AB and CD overlap in the region CB, and CD and EF overlap in the region ED. The length of a voice file is generally expressed in milliseconds (ms), and an input speech signal is generally considered short-term stationary within 10 ms to 30 ms. In one example, the speech frame length is 25 ms and the frame shift is 10 ms; corresponding to the example in fig. 4, a voice file with AB = CD = EF = 25 ms, AC = CE = 10 ms and AF = 45 ms is divided into 3 speech frames of length 25 ms with a shift of 10 ms.
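As an illustration of the framing described above (not part of the patent text; the function and parameter names are assumed), a minimal Python sketch that splits a sampled waveform into overlapping frames with a 25 ms frame length and a 10 ms frame shift might look like this:

```python
import numpy as np

def split_into_frames(signal, sample_rate, frame_ms=25, shift_ms=10):
    """Split a 1-D waveform into (possibly overlapping) speech frames.

    frame_ms corresponds to the frame length alpha, shift_ms to the frame shift beta."""
    frame_len = int(sample_rate * frame_ms / 1000)   # e.g. 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)       # e.g. 160 samples at 16 kHz
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += shift
    return np.stack(frames) if frames else np.empty((0, frame_len))

# A 45 ms file at 16 kHz yields 3 frames, matching the AB/CD/EF example of fig. 4.
frames = split_into_frames(np.zeros(16000 * 45 // 1000), sample_rate=16000)
print(frames.shape)  # (3, 400)
```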
In step S303, a feature of each of the divided speech frames is extracted to form a feature point, where the feature point may include information of frequency, amplitude, and the like of speech in the speech frame.
Optionally, the step S303 includes: carrying out short-time Fourier transform on each of the plurality of voice frames to obtain a plurality of frequency characteristic points; and forming the plurality of frequency characteristic points into an original voice graph according to the sequence of the voice frames.
Each of the plurality of speech frames represents a speech signal, which can be rendered as a set of pixel points on a two-dimensional image in an XY coordinate space. The audio signal itself is a one-dimensional array whose length is determined by the audio duration and the sampling rate; for example, a sampling rate of 16 kHz means 16000 samples per second, so a 10 s speech signal contains 160000 values, each value being the amplitude of the speech. A short-time Fourier transform converts the speech signal from the time domain to the frequency domain. When using the short-time Fourier transform, a parameter H determines how many points the transform uses; if H = 512, a 512-point short-time Fourier transform is performed on each speech frame. Because the Fourier transform is symmetric, H/2 + 1 points are kept when H is even and (H + 1)/2 points when H is odd; for H = 512 this finally yields 257 values. These points form a one-dimensional column vector, which becomes one column of the original voice map. Performing this short-time Fourier transform on every speech frame yields a plurality of column vectors, and arranging these column vectors in the order of the speech frames yields the original voice map. On this two-dimensional image in the XY coordinate space, the abscissa represents the order of the speech frames, i.e. time; the ordinate represents the frequency of a feature point; and the gray value of a feature point represents its amplitude. The process is shown in figs. 5a-5b: fig. 5a illustrates the short-time Fourier transform of one speech frame, converting a time-domain signal image into a frequency-domain signal image, which is then turned into a gray-scale column whose height represents frequency, each small square being a feature point whose gray value is its amplitude; fig. 5b shows the original voice map generated after all speech frames have undergone the short-time Fourier transform, where the abscissa represents time (the order of the speech frames in the voice file), the ordinate represents the frequency of the feature points, and the gray value of a feature point represents its amplitude.
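A minimal sketch, assuming a plain magnitude spectrum without windowing, of how the original voice map could be assembled from the framed signal with an H = 512-point transform (function names are illustrative, not from the patent):

```python
import numpy as np

def frames_to_voice_map(frames, n_fft=512):
    """Build the original voice map: one spectral column per speech frame.

    Rows index frequency, columns index frame order (time); values are the
    magnitudes (amplitudes) of the frequency feature points."""
    columns = []
    for frame in frames:
        # Zero-pad or truncate the frame to n_fft points, then take the one-sided
        # FFT; rfft returns n_fft // 2 + 1 = 257 frequency bins for n_fft = 512.
        spectrum = np.fft.rfft(frame, n=n_fft)
        columns.append(np.abs(spectrum))
    # Shape: (257, number_of_frames); abscissa = time, ordinate = frequency.
    return np.stack(columns, axis=1)

voice_map = frames_to_voice_map(np.random.randn(3, 400))
print(voice_map.shape)  # (257, 3)
```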
Step S202, obtaining the labeling information of the target voice in the original voice graph;
after the feature map of the speech is obtained, the annotation information of the original speech map needs to be obtained, and the annotation information can uniquely label the target speech to be segmented in the original speech map.
Optionally, the labeling information at least includes labeling position information and labeling category information of the target voice in the original voice map, where the labeling position information includes a pair of labeled boundary points of the target voice in the original voice map, and the labeling category information includes the labeling probabilities of the target voice over multiple categories. Because each column of feature points of the original voice map corresponds to one speech frame, locating the target voice only requires two points on the X axis. As shown in fig. 6, the two points X1 and X2 on the X axis mark a target voice: the target voice is the set of speech frames represented between X1 and X2 on the original voice map, and X1 and X2 are a pair of boundary points of the target voice. In addition, the category of the target voice needs to be labeled, for example voice versus noise, or the voice of a first person versus the voice of a second person. The labeling probabilities over multiple categories can be represented by a vector whose number of elements equals the number of categories, each element corresponding to one category; the element of the category to which the target voice belongs has the value 1 and the other elements have the value 0.
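Purely as an illustration of this labeling format (the field names are hypothetical, not from the patent), a single annotation could be stored as:

```python
# Hypothetical annotation for one target voice segment: the boundary points (x1, x2)
# are frame indices on the X axis of the original voice map, and the category vector
# has one element per class (e.g. [speech, noise, speaker A, speaker B]).
annotation = {
    "boundary": (120, 175),    # x1, x2: frames 120..175 contain the target voice
    "category": [1, 0, 0, 0],  # one-hot: the true class is 1, the other elements are 0
}
```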
It can be understood that the labeling information may be automatically labeled or manually labeled, and the generation form of the labeling information is not limited in this disclosure and is not described herein again.
Step S203, initializing model parameters of a voice segmentation model;
it is understood that the voice segmentation model may be any of various network models, such as a convolutional neural network. The parameters may include the weights, biases and so on of the convolution kernels of the layers in the model. Illustratively, the voice segmentation model includes a plurality of convolutional layers, a plurality of pooling layers, and a plurality of fully connected layers.
Optionally, the initializing the model parameters of the speech segmentation model includes assigning the parameters to preset values or randomly generating initial values of the parameters.
Step S204, inputting the original voice picture into the voice segmentation model to obtain the prediction information of the target voice output by the voice segmentation model;
the prediction information of the target voice is obtained through a plurality of first feature maps with different scales output by the voice segmentation model.
Optionally, the voice segmentation model includes a base model and a plurality of convolutional layers, where the base model may be any convolutional neural network. Fig. 7 is a schematic diagram of the voice segmentation model: 701 is the original voice map obtained in step S201, which is input into the voice segmentation model. The model is divided into two parts. One part is the base model 702, which may be, for example, any convolutional neural network; the original voice map passes through the base model to obtain a feature map of the original voice map. This feature map is then input into convolutional layer 703 to obtain another feature map of the original voice map, and by analogy convolutional layers 704 and 705 each also produce a feature map of the original voice map. These are the first feature maps, and because the convolution kernels of the convolutional layers differ, 4 first feature maps with different scales are obtained, the scale being the dimensions of a first feature map. Illustratively, the first feature map output by the base model 702 is a 38 × 512 feature map, the first feature map output by convolutional layer 703 is 19 × 1024, the first feature map output by convolutional layer 704 is 5 × 256, and the first feature map output by convolutional layer 705 is 1 × 128. These feature maps are passed through the detection layer 706 to obtain the prediction information of the target voice. It can be understood that the structure of the voice segmentation model and the dimensions of the first feature maps are examples and do not limit the present disclosure; in practice any number of convolutional layers and feature maps of any dimensions may be used, which is not repeated here.
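A minimal PyTorch-style sketch of such a structure, assuming an arbitrary stand-in backbone rather than the patent's actual network; it only illustrates how a base model plus extra convolutional layers yield first feature maps of different scales:

```python
import torch
import torch.nn as nn

class SpeechSegmentationBackbone(nn.Module):
    """Illustrative backbone: each stage reduces the spatial size, so the
    collected feature maps have different scales."""
    def __init__(self):
        super().__init__()
        self.base = nn.Sequential(              # stands in for the base model (702)
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 512, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.extra1 = nn.Sequential(nn.Conv2d(512, 1024, 3, stride=2, padding=1), nn.ReLU())
        self.extra2 = nn.Sequential(nn.Conv2d(1024, 256, 3, stride=2, padding=1), nn.ReLU())
        self.extra3 = nn.Sequential(nn.Conv2d(256, 128, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, voice_map):               # voice_map: (batch, 1, freq, time)
        f1 = self.base(voice_map)
        f2 = self.extra1(f1)
        f3 = self.extra2(f2)
        f4 = self.extra3(f3)
        return [f1, f2, f3, f4]                 # first feature maps at 4 scales

feature_maps = SpeechSegmentationBackbone()(torch.randn(1, 1, 257, 304))
print([f.shape for f in feature_maps])
```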
In this embodiment, the prediction information includes at least prediction location information and prediction category information. The predicted position information comprises position information of a plurality of pairs of predicted boundary points of target voice in the original voice graph, and the predicted category information comprises predicted probabilities of the target voice in a plurality of categories.
Optionally, the step S204 includes: inputting the original voice graph into the voice segmentation model; the voice segmentation model outputs a plurality of first feature maps with different scales; generating a plurality of pairs of default boundary points on the first feature map; and performing convolution calculation on the first feature map to obtain a plurality of one-dimensional vectors, wherein each element in the one-dimensional vectors corresponds to a pair of default boundary points, and the value of each element is the prediction information of the target voice. The default boundary point is a point with preset position information, and the setting principle is as follows: the smaller the scale, the greater the distance between a pair of default boundary points on the first feature map. This is because the smaller scale feature maps can be used to identify longer target speech, while the larger scale feature maps can be used to identify shorter target speech. After generating a plurality of pairs of default boundary points, convolving the first feature map by using a plurality of convolution kernels to obtain a plurality of one-dimensional vectors, wherein each element in the one-dimensional vectors corresponds to a pair of default boundary points, and the value of each element is the prediction position information and the prediction category information corresponding to the corresponding pair of default boundary points.
Optionally, the generating a plurality of pairs of default boundary points on the first feature map includes: generating a plurality of pairs of default boundary points by taking each group of pixel points with the same abscissa on the first feature map as a center line, wherein the number of pairs of default boundary points corresponding to each center line is the same. Fig. 8 is a schematic diagram of generating default boundary points. Feature map 801 is the first feature map, and 802 is a group of pixel points in 801 with the same abscissa; together these pixel points form a vertical line segment. As shown at 803, multiple pairs of default boundary points are generated on both sides of this vertical line segment, the two points of each pair being equidistant from the line segment, and the line segments formed by the different pairs corresponding to one center line having different lengths. For every group of pixel points with the same abscissa in the first feature map, multiple pairs of default boundary points are generated with the same set of lengths. For example, as shown in fig. 8, 2 pairs of default boundary points are generated for each group of vertical pixel points, and since there are 5 groups of vertical pixel points, 10 pairs of default boundary points need to be generated in total. The above operation is performed on each first feature map to generate multiple pairs of default boundary points. As shown in fig. 8, the default boundary points can be mapped onto the original voice map by a proportional relationship, so that the position information of the default boundary points can serve as anchor information when generating predicted position information.
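A sketch of one way to generate the default boundary points, assuming a simple proportional mapping from feature-map columns back to the time axis of the original voice map (the widths used are illustrative):

```python
def default_boundary_points(feature_map_width, voice_map_width, widths=(8, 16)):
    """For each center line (column) of a first feature map, generate one pair of
    default boundary points per width in `widths`, mapped onto the time axis of the
    original voice map by a proportional relationship."""
    pairs = []
    scale = voice_map_width / feature_map_width
    for col in range(feature_map_width):          # each column is a center line
        center = (col + 0.5) * scale              # center position in frames
        for w in widths:                          # same number of pairs per center line
            pairs.append((center - w / 2.0, center + w / 2.0))
    return pairs

# 5 center lines with 2 pairs each yield 10 pairs, as in the example of fig. 8.
print(len(default_boundary_points(feature_map_width=5, voice_map_width=304)))  # 10
```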
Optionally, the performing convolution calculation on the first feature map to obtain a plurality of one-dimensional vectors includes:
convolving the first feature map with (C+2) × K convolution kernels to obtain (C+2) × K one-dimensional vectors of size 1 × N, where K is the number of pairs of default boundary points corresponding to each center line, C is the number of categories of the target voice, and N is the number of center lines. The 2 accounts for the position information of the two boundary points of a pair; specifically, this may be the coordinate of each boundary point on the horizontal axis, and since the vertical coordinate is 0, each point can be represented by a single horizontal coordinate. C is the number of categories in the classification information, and K is the number of pairs of default boundary points corresponding to each center line; K = 2 in the example above. Assuming the number of categories is 4 in that example, a total of (2 + 4) × 2 = 12 convolution kernels is required. Let the first feature map be an M × N × P feature map, where M, N and P are its height, width and depth; each convolution kernel is then an M × 3 × P kernel, and convolution with stride 1 and padding yields 12 one-dimensional vectors of size 1 × N. These 12 one-dimensional vectors can form a 1 × N × 12 feature vector, where each element of the feature vector corresponds to the two boundary points associated with a center line and the depth of each element is 12: 4 of the values in the depth direction represent the predicted position information of the two pairs of default boundary points, and the other 8 values represent the predicted category information of the two pairs of default boundary points. Fig. 9 is a schematic diagram of the prediction information represented by the feature vector: the prediction information is a one-dimensional feature vector whose elements in the depth direction respectively represent the predicted position information of the two predicted boundary points and the predicted category information of the categories represented by the two predicted boundary points.
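A sketch of the prediction head for a single first feature map under the (C+2) × K description above, with C = 4 and K = 2 assumed; the grouping of output channels into position offsets and class scores is illustrative:

```python
import torch
import torch.nn as nn

C, K = 4, 2                                    # categories and pairs per center line
feature_map = torch.randn(1, 256, 5, 19)       # (batch, P, M, N): depth P, height M, width N

# Each kernel spans the full height M and 3 columns, so the output height is 1:
# (C + 2) * K kernels produce (C + 2) * K one-dimensional vectors of length N.
head = nn.Conv2d(256, (C + 2) * K, kernel_size=(5, 3), padding=(0, 1))
out = head(feature_map)                        # shape (1, 12, 1, 19)

# Regroup the channels so that each pair of default boundary points per center line
# carries C + 2 values: 2 position offsets and C class scores (an assumed layout).
out = out.view(1, K, C + 2, -1)
print(out.shape)                               # torch.Size([1, 2, 6, 19])
```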
Step S205, calculating the error of the prediction information and the annotation information according to an objective function;
in this embodiment, an objective function is preset as a total loss function of the speech segmentation model, and an error between the prediction information and the labeling information is calculated through the objective function.
Optionally, when the one-dimensional vector is obtained in step S204, step S205 includes:
and inputting the values of the elements in the one-dimensional vector and the labeling information into the objective function to calculate the error of the prediction information and the labeling information.
Illustratively, the following function may be used as the objective function of the voice segmentation model. (The formulas themselves appear only as images in the published text; they are therefore referred to here by name.)
As can be seen from the objective function, it consists of two parts: a position objective function Lloc(x, l, g) and a category objective function Lconf(x, c). N is the number of positive samples among the default boundary point pairs; a positive sample is a default boundary point pair whose overlap with a labeled boundary point pair is greater than a preset threshold, the overlap being measured between the line segment connecting the default boundary point pair and the line segment connecting the labeled boundary point pair.
In Lloc(x, l, g), x_ij^p is an indicator parameter taking the value 0 or 1; x_ij^p = 1 means that the i-th pair of default boundary points is matched to the j-th pair of labeled boundary points and that the category of the labeled boundary points is p. Pos denotes the positive samples, i.e. the position error is calculated only over positive samples. cx1 and cx2 denote the abscissas of the two boundary points of a pair of boundary points; g_j^cx1 and g_j^cx2 are the two abscissas of the labeled boundary point pair, and d_i^cx1 and d_i^cx2 are the two abscissas of the default boundary point pair. l_i^cx1 and l_i^cx2 are the two predicted position values output by the voice segmentation model; from them, together with the default boundary points, the abscissas of the two predicted boundary points are calculated according to the corresponding formulas (likewise shown only as images in the published text).
In Lconf(x, c), the normalized score of class p represents the probability that the predicted category is p. α is a weight coefficient; illustratively, α = 1.
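Since the formulas above are reproduced only as images in the published text, their exact form cannot be recovered here. The surrounding description, however, closely matches the multi-task loss used by single-shot detectors, applied to boundary-point pairs instead of boxes; under that assumption (and it is only an assumption), a plausible LaTeX reconstruction is:

```latex
% Assumed SSD-style form of the objective; not recovered verbatim from the patent.
L(x, c, l, g) = \frac{1}{N}\Bigl( L_{conf}(x, c) + \alpha\, L_{loc}(x, l, g) \Bigr)

L_{loc}(x, l, g) = \sum_{i \in Pos} \sum_{m \in \{cx_1,\, cx_2\}}
                   x_{ij}^{p}\; \mathrm{smooth}_{L1}\!\bigl( l_i^{m} - \hat{g}_j^{m} \bigr),
\qquad
\mathrm{smooth}_{L1}(z) = \begin{cases} 0.5\, z^{2} & |z| < 1 \\ |z| - 0.5 & \text{otherwise} \end{cases}

L_{conf}(x, c) = -\sum_{i \in Pos} x_{ij}^{p} \log \hat{c}_i^{p}
                 - \sum_{i \in Neg} \log \hat{c}_i^{0},
\qquad
\hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})}
```

Here \hat{g}_j^{m} would be the labeled abscissas encoded relative to the default boundary points; the exact encoding and decoding formulas are also not recoverable from the text.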
Step S206, updating the parameters of the voice segmentation model according to the errors;
in this embodiment, when the error is greater than or equal to a preset first threshold, an adjustment value for the parameters is calculated, for example according to gradient descent, Newton's method, or the like; the adjustment value is then subtracted from the parameters to obtain the updated parameters.
Step S207, inputting the original voice map into the voice segmentation model after updating the parameters to iterate the above parameter updating process until the error is smaller than the first threshold.
In this step, the original speech map is continuously input into the speech segmentation model, the parameters in the speech segmentation model at this time are updated parameters, and the steps S204 to S206 are continuously executed until the error is smaller than the first threshold. At this time, the parameters of the speech segmentation model are the trained parameters, and the training of the speech segmentation model is finished.
It can be understood that the iteration number may also be directly set without setting the first threshold, and the condition that the training is finished at this time is that the iteration number reaches the preset number, which is not described herein again.
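A compact sketch of the iteration over steps S204 to S207 (illustrative only; it assumes a PyTorch optimizer and a loss_fn implementing the objective function, neither of which is prescribed by the patent):

```python
import torch

def train(model, loss_fn, voice_map, annotations, first_threshold=1e-3, max_iters=10000):
    """Repeat the forward pass, error computation and parameter update until the
    error drops below the first threshold (or a maximum iteration count is reached)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)   # gradient descent
    for _ in range(max_iters):
        prediction = model(voice_map)              # step S204: prediction information
        error = loss_fn(prediction, annotations)   # step S205: objective function
        if error.item() < first_threshold:         # stop condition of step S207
            break
        optimizer.zero_grad()
        error.backward()                           # step S206: update the parameters
        optimizer.step()
    return model
```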
The trained voice segmentation model predicts the positions of the boundary points of the target voice and the category of the target voice from the original voice map, so the method is applicable to scenarios such as segmenting the voices of speakers who do not speak simultaneously, detecting speech endpoints, and aligning wake-up words. Moreover, because an image-based model is used, the complexity is much lower than that of a model operating on the raw voice signal, and the accuracy is improved.
In one embodiment, the present disclosure also discloses a speech segmentation method, including:
acquiring a voice file to be recognized;
dividing the voice file to be recognized into a plurality of voice frames;
extracting voice feature points in the voice frames to generate an original voice graph of a voice file to be recognized;
inputting the original voice map of the voice file to be recognized into a voice segmentation model, wherein the voice segmentation model is trained by the training method of the voice segmentation model described above;
and the voice segmentation model outputs the position information and the category information of the target voice in the original voice graph.
In the present disclosure, the input voice file is acquired from an audio source. Optionally, the audio source in this step is an audio acquisition device, such as a microphone of any form, which captures the voice from the environment and converts it into a voice file; in that case the converted audio file is obtained from the audio acquisition device. Typically, as shown in fig. 1, the terminal device 102 includes an audio acquisition device, such as a microphone, through which the voice in the environment where the terminal device is located can be collected.
Optionally, the audio source in this step is a storage space for storing the voice file. The storage space may be a local storage space or a remote storage space, and in this optional embodiment, acquiring the input voice file requires first acquiring an address of the storage space, and then acquiring the voice file from the storage space.
In the above steps, the detailed process of obtaining the original voice map of the voice file to be recognized is the same as that in the above training process, and all the processes are to process the voice signal into the feature map of the voice, so as to input the feature map into the voice segmentation model based on the image to output the position and the category of the target voice. Specifically, refer to the descriptions in step S201 to step S203, which are not described herein again.
It is understood that, in use, the predicted boundary point positions output by the voice segmentation model may include many candidates; for example, the model may output the position information of Q pairs of predicted boundary points, in which case these Q pairs also need to be processed to obtain the final pair of predicted boundary points. The R pairs (R ≤ Q) of predicted boundary points whose category is P are sorted by their predicted probability values, and the top Z pairs are retained, where Z is a preset value. The pair of predicted boundary points with the largest probability value among the Z pairs is selected, the remaining pairs are traversed, and any remaining pair whose overlap with the highest-probability pair is greater than a preset threshold is deleted. The pair with the largest probability among the unprocessed pairs is then selected and the process is repeated; the pairs that remain at the end give the final predicted boundary points.
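A sketch of this post-processing as a one-dimensional non-maximum suppression (the function name, the use of intersection over union as the overlap measure, and the default values are assumptions, not taken from the patent):

```python
def suppress_boundary_pairs(pairs, scores, keep_top=200, overlap_threshold=0.5):
    """Keep at most `keep_top` pairs of predicted boundary points of one class, then
    greedily keep the highest-scoring pair and drop any remaining pair whose overlap
    (1-D intersection over union) with it exceeds the threshold."""
    def iou_1d(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    order = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)[:keep_top]
    kept = []
    while order:
        best = order.pop(0)
        kept.append(best)
        order = [i for i in order if iou_1d(pairs[i], pairs[best]) <= overlap_threshold]
    return [pairs[i] for i in kept]

print(suppress_boundary_pairs([(10, 40), (12, 42), (100, 130)], [0.9, 0.8, 0.7]))
# [(10, 40), (100, 130)]
```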
In the above, although the steps in the above method embodiments are described in the above sequence, it should be clear to those skilled in the art that the steps in the embodiments of the present disclosure are not necessarily performed in the above sequence, and may also be performed in other sequences such as reverse, parallel, and cross, and further, on the basis of the above steps, other steps may also be added by those skilled in the art, and these obvious modifications or equivalents should also be included in the protection scope of the present disclosure, and are not described herein again.
Fig. 10 is a schematic structural diagram of an embodiment of a training apparatus for a speech segmentation model provided in an embodiment of the present disclosure. As shown in fig. 10, the apparatus 1000 includes: an original voice map obtaining module 1001, a labeling information obtaining module 1002, a parameter initialization module 1003, a prediction information obtaining module 1004, an error calculation module 1005, a parameter updating module 1006 and a parameter iteration module 1007. Wherein:
an original voice map obtaining module 1001 configured to obtain an original voice map of a sample voice file;
a labeling information obtaining module 1002, configured to obtain labeling information of a target voice in the original voice graph;
a parameter initialization module 1003, configured to initialize model parameters of the speech segmentation model;
a prediction information obtaining module 1004, configured to input the original voice map into the voice segmentation model to obtain prediction information of the target voice output by the voice segmentation model, where the prediction information of the target voice is obtained through a plurality of first feature maps with different scales output by the voice segmentation model;
an error calculating module 1005, configured to calculate an error between the prediction information and the labeling information according to an objective function;
a parameter updating module 1006, configured to update parameters of the speech segmentation model according to the error;
a parameter iteration module 1007, configured to input the original speech map into the speech segmentation model after updating the parameters to iterate the above parameter updating process until the error is smaller than the first threshold.
Further, the original voice map obtaining module 1001 is further configured to:
acquiring a sample voice file;
dividing the sample voice file into a plurality of voice frames;
and extracting the voice feature points in the plurality of voice frames to generate an original voice map of the sample voice file.
Further, the dividing the sample voice file into a plurality of voice frames includes:
acquiring a voice frame length α and a voice frame moving interval β;
extracting a plurality of speech frames starting from the beginning of the sample voice file, wherein each speech frame has a length of α and the start points of two adjacent speech frames are spaced apart by β.
Further, the extracting of the voice feature points in the plurality of voice frames to generate an original voice map of the sample voice file includes:
carrying out short-time Fourier transform on each of the plurality of voice frames to obtain a plurality of frequency characteristic points;
and forming the plurality of frequency characteristic points into an original voice graph according to the sequence of the voice frames.
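A corresponding sketch of the transform step is given below; the Hann window and the use of magnitude spectra are assumptions, since the embodiment only specifies that each voice frame is transformed and the frequency characteristic points are arranged in frame order.

import numpy as np

def frames_to_voice_map(frames: np.ndarray) -> np.ndarray:
    # Apply a short-time Fourier transform to each voice frame and stack the
    # resulting frequency characteristic points in frame order, producing a
    # frequency-by-time image (the original voice map).
    window = np.hanning(frames.shape[1])
    spectra = np.abs(np.fft.rfft(frames * window, axis=1))   # one row of frequency points per frame
    return spectra.T                                          # rows: frequency bins, columns: frames in order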
Further, the labeling information includes: labeling position information and labeling category information, and the prediction information includes: predicted position information and predicted category information.
Further, the labeling position information includes a pair of labeling boundary points of the target voice in the original voice graph, and the labeling category information includes labeling probabilities of the target voice in multiple categories; the predicted position information comprises position information of a plurality of pairs of predicted boundary points of the target voice in the original voice graph, and the predicted category information comprises predicted probabilities of the target voice in a plurality of categories.
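For readability, the two kinds of information can be pictured as simple records; the field names below are illustrative assumptions only.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LabelInfo:
    boundary_points: Tuple[float, float]        # one labeled pair (left, right) of boundary points
    class_probs: List[float]                    # labeling probabilities over the categories

@dataclass
class PredictionInfo:
    boundary_points: List[Tuple[float, float]]  # several predicted pairs of boundary points
    class_probs: List[List[float]]              # predicted probabilities for each pair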
Further, the prediction information obtaining module 1004 is further configured to:
inputting the original voice graph into the voice segmentation model;
the voice segmentation model outputs a plurality of first feature maps with different scales;
generating a plurality of pairs of default boundary points on the first feature map;
and performing convolution calculation on the first feature map to obtain a plurality of one-dimensional vectors, wherein each element in the one-dimensional vectors corresponds to a pair of default boundary points, and the value of each element is the prediction information of the target voice.
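The multi-scale part of this flow can be pictured as a small convolutional trunk that returns several first feature maps of decreasing resolution; the number of stages and the layer sizes below are assumptions and are not taken from the embodiment.

import torch
import torch.nn as nn

class MultiScaleTrunk(nn.Module):
    # Returns several "first feature maps" of the original voice graph at
    # different scales; each map later feeds its own prediction branch.
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, voice_graph: torch.Tensor):
        f1 = self.stage1(voice_graph)   # finest scale
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)            # coarsest scale
        return [f1, f2, f3]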
Further, the prediction information obtaining module 1004 generates a plurality of pairs of default boundary points on the first feature map, including:
and generating a plurality of pairs of default boundary points by taking each group of pixel points with the same abscissa on the first feature map as a central line, wherein the number of pairs of default boundary points corresponding to each central line is the same.
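A sketch of this construction: every column of the first feature map (the pixel points sharing an abscissa) serves as a central line and receives the same number of pairs of default boundary points; the half-widths below are illustrative assumptions.

import numpy as np

def default_boundary_points(num_centerlines: int, half_widths=(2, 4, 8)) -> np.ndarray:
    # For each central line x, generate K = len(half_widths) pairs of default
    # boundary points (left, right) centred on x; every central line gets K pairs.
    pairs = [[(x - w, x + w) for w in half_widths] for x in range(num_centerlines)]
    return np.asarray(pairs)    # shape (N, K, 2)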
Further, the obtaining, by the prediction information obtaining module 1004, a plurality of one-dimensional vectors by performing convolution calculation on the first feature map includes:
and (C+2)×K convolution kernels are used to perform convolution on the first feature map to obtain (C+2)×K one-dimensional vectors of size 1×N, wherein K is the number of pairs of default boundary points corresponding to each central line, C is the number of categories of the target voice, and N is the number of central lines.
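A worked shape example under assumed values (C = 2 categories, K = 4 pairs per central line, a first feature map with 128 channels, height 16 and N = 64 central lines): a convolution kernel that spans the full height of the feature map yields one 1×N vector per kernel.

import torch
import torch.nn as nn

C, K, channels, height, N = 2, 4, 128, 16, 64              # assumed values, for illustration only
feature_map = torch.randn(1, channels, height, N)           # one first feature map

# (C + 2) * K convolution kernels, each covering the full feature-map height,
# so each kernel produces a single 1 x N vector.
head = nn.Conv2d(channels, (C + 2) * K, kernel_size=(height, 3), padding=(0, 1))
vectors = head(feature_map).squeeze(2)                      # shape (1, (C+2)*K, N) = (1, 16, 64)
print(vectors.shape)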
Further, the error calculation module 1005 is further configured to:
and inputting the values of the elements in the one-dimensional vector and the labeling information into the objective function to calculate the error of the prediction information and the labeling information.
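The objective function can be pictured as a combination of a localization term on the boundary-point positions and a classification term on the category scores; the matching of predictions to labels and the relative weighting are assumptions that the embodiment does not fix.

import torch.nn.functional as F

def objective_function(pred_offsets, pred_logits, label_offsets, label_classes, loc_weight=1.0):
    # Error between the prediction information and the labeling information:
    # smooth-L1 on the boundary-point positions plus cross-entropy on the categories.
    loc_error = F.smooth_l1_loss(pred_offsets, label_offsets)
    cls_error = F.cross_entropy(pred_logits, label_classes)
    return loc_weight * loc_error + cls_error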
The apparatus shown in fig. 10 can perform the methods of the embodiments shown in fig. 1 to fig. 9; for the detailed implementation process and technical effects, refer to the related descriptions of the embodiments shown in fig. 1 to fig. 9, which are not repeated here.
The embodiment of the present disclosure further discloses a speech segmentation apparatus, including:
the voice file acquisition module is used for acquiring a voice file to be recognized;
the framing module is used for dividing the voice file to be recognized into a plurality of voice frames;
the original voice graph generating module is used for extracting voice feature points in the voice frames to generate an original voice graph of the voice file to be recognized;
the input module is used for inputting the original voice graph of the voice file to be recognized into a voice segmentation model, wherein the voice segmentation model is obtained through the above training method of the voice segmentation model;
and the output module is used for outputting the position information and the category information of the target voice in the original voice graph by the voice segmentation model.
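Putting the modules of this speech segmentation apparatus together, an end-to-end sketch is given below; it reuses the hypothetical split_into_frames and frames_to_voice_map helpers sketched earlier, and the frame parameters and the form of the model output are likewise assumptions.

import numpy as np
import torch

def segment_speech(model, samples: np.ndarray, alpha: int = 400, beta: int = 160):
    # Frame the voice file to be recognized, build its original voice graph,
    # and run the trained voice segmentation model on it.
    frames = split_into_frames(samples, alpha, beta)         # framing module
    voice_graph = frames_to_voice_map(frames)                # original voice graph generating module
    x = torch.from_numpy(voice_graph).float()[None, None]    # (batch, channel, frequency, time)
    with torch.no_grad():
        positions, categories = model(x)                     # output module: position and category information
    return positions, categories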
Referring now to FIG. 11, shown is a schematic diagram of an electronic device 1100 suitable for use in implementing embodiments of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 11 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 11, the electronic device 1100 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 1101 that may perform various appropriate actions and processes in accordance with a program stored in a read-only memory (ROM) 1102 or a program loaded from a storage device 1108 into a random access memory (RAM) 1103. In the RAM 1103, various programs and data necessary for the operation of the electronic device 1100 are also stored. The processing device 1101, the ROM 1102, and the RAM 1103 are connected to each other by a bus 1104. An input/output (I/O) interface 1105 is also connected to the bus 1104.
Generally, the following devices may be connected to the I/O interface 1105: input devices 1106 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 1107 including, for example, Liquid Crystal Displays (LCDs), speakers, vibrators, and the like; storage devices 1108, including, for example, magnetic tape, hard disk, etc.; and a communication device 1109. The communication means 1109 may allow the electronic device 1100 to communicate wirelessly or wiredly with other devices to exchange data. While fig. 11 illustrates an electronic device 1100 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication device 1109, or installed from the storage device 1108, or installed from the ROM 1102. The computer program, when executed by the processing device 1101, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring an original voice graph of a sample voice file; acquiring the labeling information of the target voice in the original voice graph; initializing model parameters of a voice segmentation model; inputting the original voice graph into the voice segmentation model to obtain the prediction information of the target voice output by the voice segmentation model, wherein the prediction information of the target voice is obtained through a plurality of first feature graphs with different scales output by the voice segmentation model; calculating the error of the prediction information and the labeling information according to an objective function; updating parameters of the voice segmentation model according to the error; and inputting the original voice graph into the voice segmentation model after the parameters are updated so as to iterate the parameter updating process until the error is smaller than a first threshold value.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or any combination thereof, including, but not limited to, object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The name of a unit does not, in some cases, constitute a limitation on the unit itself.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing description is only a description of the preferred embodiments of the present disclosure and of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the particular combination of the features described above, and also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in the present disclosure.

Claims (15)

1. A method for training a speech segmentation model, the method comprising:
acquiring an original voice graph of a sample voice file;
acquiring the labeling information of the target voice in the original voice graph;
initializing model parameters of a voice segmentation model;
inputting the original voice graph into the voice segmentation model to obtain the prediction information of the target voice output by the voice segmentation model, wherein the prediction information of the target voice is obtained through a plurality of first feature graphs with different scales output by the voice segmentation model;
calculating the error of the prediction information and the labeling information according to an objective function;
updating parameters of the voice segmentation model according to the error;
and inputting the original voice graph into the voice segmentation model after the parameters are updated so as to iterate the parameter updating process until the error is smaller than a first threshold value.
2. The method for training a speech segmentation model according to claim 1, wherein the acquiring an original voice graph of a sample voice file comprises:
acquiring a sample voice file;
dividing the sample voice file into a plurality of voice frames;
and extracting the voice feature points in the plurality of voice frames to generate an original voice graph of the sample voice file.
3. The method for training a speech segmentation model according to claim 2, wherein the dividing the sample speech file into a plurality of speech frames comprises:
acquiring a voice frame length α and a voice frame moving interval β;
and extracting a plurality of voice frames starting from the beginning of the sample voice file, wherein each voice frame has a length of α and the starting points of two adjacent voice frames are spaced apart by β.
4. The method for training a speech segmentation model according to claim 2, wherein the extracting the voice feature points in the plurality of voice frames to generate an original voice graph of the sample voice file comprises:
carrying out short-time Fourier transform on each of the plurality of voice frames to obtain a plurality of frequency characteristic points;
and forming the plurality of frequency characteristic points into an original voice graph according to the sequence of the voice frames.
5. The method for training a speech segmentation model according to claim 1, wherein the labeling information includes: labeling position information and labeling category information, and the prediction information comprises: predicted position information and predicted category information.
6. The method for training a speech segmentation model according to claim 5, wherein the labeling position information includes a pair of labeling boundary points of the target voice in the original voice graph, and the labeling category information includes labeling probabilities of the target voice in a plurality of categories; the predicted position information comprises position information of a plurality of pairs of predicted boundary points of the target voice in the original voice graph, and the predicted category information comprises predicted probabilities of the target voice in a plurality of categories.
7. The method for training a speech segmentation model according to claim 1, wherein the inputting the original voice graph into the voice segmentation model to obtain the prediction information of the target voice output by the voice segmentation model comprises:
inputting the original voice graph into the voice segmentation model;
the voice segmentation model outputs a plurality of first feature maps with different scales;
generating a plurality of pairs of default boundary points on the first feature map;
and performing convolution calculation on the first feature map to obtain a plurality of one-dimensional vectors, wherein each element in the one-dimensional vectors corresponds to a pair of default boundary points, and the value of each element is the prediction information of the target voice.
8. The method for training a speech segmentation model according to claim 7, wherein the generating a plurality of pairs of default boundary points on the first feature map comprises:
and generating a plurality of pairs of default boundary points by taking each group of pixel points with the same abscissa on the first feature map as a central line, wherein the number of pairs of default boundary points corresponding to each central line is the same.
9. The method for training a speech segmentation model according to claim 8, wherein the performing convolution calculation on the first feature map to obtain a plurality of one-dimensional vectors comprises:
and (C+2)×K convolution kernels are used to perform convolution on the first feature map to obtain (C+2)×K one-dimensional vectors of size 1×N, wherein K is the number of pairs of default boundary points corresponding to each central line, C is the number of categories of the target voice, and N is the number of central lines.
10. The method for training a speech segmentation model according to claim 9, wherein the calculating the error of the prediction information and the labeling information according to an objective function comprises:
and inputting the values of the elements in the one-dimensional vector and the labeling information into the objective function to calculate the error of the prediction information and the labeling information.
11. A method of speech segmentation, comprising:
acquiring a voice file to be recognized;
dividing the voice file to be recognized into a plurality of voice frames;
extracting voice feature points in the voice frames to generate an original voice graph of a voice file to be recognized;
inputting an original voice graph of the voice file to be recognized into a voice segmentation model, wherein the voice segmentation model is trained by the method for training a speech segmentation model according to any one of claims 1 to 10;
and the voice segmentation model outputs the position information and the category information of the target voice in the original voice graph.
12. An apparatus for training a speech segmentation model, comprising:
the original voice graph acquisition module is used for acquiring an original voice graph of the sample voice file;
the annotation information acquisition module is used for acquiring annotation information of the target voice in the original voice graph;
the parameter initialization module is used for initializing model parameters of the voice segmentation model;
the prediction information acquisition module is used for inputting the original voice graph into the voice segmentation model to obtain prediction information of the target voice output by the voice segmentation model, wherein the prediction information of the target voice is obtained through a plurality of first feature maps with different scales output by the voice segmentation model;
the error calculation module is used for calculating the errors of the prediction information and the labeling information according to an objective function;
the parameter updating module is used for updating the parameters of the voice segmentation model according to the errors;
and the parameter iteration module is used for inputting the original voice graph into the voice segmentation model after the parameters are updated so as to iterate the parameter updating process until the error is smaller than a first threshold value.
13. A speech segmentation apparatus comprising:
the voice file acquisition module is used for acquiring a voice file to be recognized;
the framing module is used for dividing the voice file to be recognized into a plurality of voice frames;
the original voice graph generating module is used for extracting voice feature points in the voice frames to generate an original voice graph of the voice file to be recognized;
an input module, configured to input an original voice graph of the voice file to be recognized into a voice segmentation model, where the voice segmentation model is trained by the method for training a speech segmentation model according to any one of claims 1 to 10;
and the output module is used for outputting the position information and the category information of the target voice in the original voice graph by the voice segmentation model.
14. An electronic device, comprising:
a memory for storing computer readable instructions; and
a processor for executing the computer readable instructions, such that the processor when executing implements the method for training a speech segmentation model according to any one of claims 1 to 10 or the method for speech segmentation according to claim 11.
15. A non-transitory computer-readable storage medium storing computer-readable instructions which, when executed by a computer, cause the computer to perform the method of training a speech segmentation model according to any one of claims 1 to 10 or the method of speech segmentation according to claim 11.
CN202010106843.5A 2020-02-20 2020-02-20 Training method and device of voice segmentation model and electronic equipment Active CN111312224B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010106843.5A CN111312224B (en) 2020-02-20 2020-02-20 Training method and device of voice segmentation model and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010106843.5A CN111312224B (en) 2020-02-20 2020-02-20 Training method and device of voice segmentation model and electronic equipment

Publications (2)

Publication Number Publication Date
CN111312224A true CN111312224A (en) 2020-06-19
CN111312224B CN111312224B (en) 2023-04-21

Family

ID=71147202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010106843.5A Active CN111312224B (en) 2020-02-20 2020-02-20 Training method and device of voice segmentation model and electronic equipment

Country Status (1)

Country Link
CN (1) CN111312224B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120253812A1 (en) * 2011-04-01 2012-10-04 Sony Computer Entertainment Inc. Speech syllable/vowel/phone boundary detection using auditory attention cues
CN107393526A (en) * 2017-07-19 2017-11-24 腾讯科技(深圳)有限公司 Speech silence detection method, device, computer equipment and storage medium
CN107680611A (en) * 2017-09-13 2018-02-09 电子科技大学 Single channel sound separation method based on convolutional neural networks
WO2019113130A1 (en) * 2017-12-05 2019-06-13 Synaptics Incorporated Voice activity detection systems and methods
CN108053836A (en) * 2018-01-18 2018-05-18 成都嗨翻屋文化传播有限公司 A kind of audio automation mask method based on deep learning
CN108596184A (en) * 2018-04-25 2018-09-28 清华大学深圳研究生院 Training method, readable storage medium storing program for executing and the electronic equipment of image, semantic parted pattern
CN108510012A (en) * 2018-05-04 2018-09-07 四川大学 A kind of target rapid detection method based on Analysis On Multi-scale Features figure
CN109616097A (en) * 2019-01-04 2019-04-12 平安科技(深圳)有限公司 Voice data processing method, device, equipment and storage medium
CN110534123A (en) * 2019-07-22 2019-12-03 中国科学院自动化研究所 Sound enhancement method, device, storage medium, electronic equipment

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112951200A (en) * 2021-01-28 2021-06-11 北京达佳互联信息技术有限公司 Training method and device of speech synthesis model, computer equipment and storage medium
CN112951200B (en) * 2021-01-28 2024-03-12 北京达佳互联信息技术有限公司 Training method and device for speech synthesis model, computer equipment and storage medium
CN114203166A (en) * 2021-12-10 2022-03-18 零犀(北京)科技有限公司 Method, device and equipment for generating training data based on man-machine conversation
CN114203166B (en) * 2021-12-10 2023-03-31 零犀(北京)科技有限公司 Method, device and equipment for generating training data based on man-machine conversation

Also Published As

Publication number Publication date
CN111312224B (en) 2023-04-21

Similar Documents

Publication Publication Date Title
CN111933110B (en) Video generation method, generation model training method, device, medium and equipment
CN111583903B (en) Speech synthesis method, vocoder training method, device, medium, and electronic device
CN112183120A (en) Speech translation method, device, equipment and storage medium
CN110413812B (en) Neural network model training method and device, electronic equipment and storage medium
CN110826567B (en) Optical character recognition method, device, equipment and storage medium
CN113436620B (en) Training method of voice recognition model, voice recognition method, device, medium and equipment
CN111597825B (en) Voice translation method and device, readable medium and electronic equipment
CN112364860A (en) Training method and device of character recognition model and electronic equipment
CN113327599B (en) Voice recognition method, device, medium and electronic equipment
CN110427915B (en) Method and apparatus for outputting information
CN112883968B (en) Image character recognition method, device, medium and electronic equipment
CN110136715A (en) Audio recognition method and device
CN111312224B (en) Training method and device of voice segmentation model and electronic equipment
CN111968647A (en) Voice recognition method, device, medium and electronic equipment
CN115081616A (en) Data denoising method and related equipment
CN111312223B (en) Training method and device of voice segmentation model and electronic equipment
CN111128131B (en) Voice recognition method and device, electronic equipment and computer readable storage medium
CN114429658A (en) Face key point information acquisition method, and method and device for generating face animation
CN116913258B (en) Speech signal recognition method, device, electronic equipment and computer readable medium
CN113902838A (en) Animation generation method, animation generation device, storage medium and electronic equipment
CN113571044A (en) Voice information processing method and device and electronic equipment
CN112752118A (en) Video generation method, device, equipment and storage medium
CN111653261A (en) Speech synthesis method, speech synthesis device, readable storage medium and electronic equipment
CN113593527B (en) Method and device for generating acoustic features, training voice model and recognizing voice
CN114004229A (en) Text recognition method and device, readable medium and electronic equipment

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant