CN111710331A - Voice case filing method and device based on a multi-slice deep neural network - Google Patents

Voice case filing method and device based on a multi-slice deep neural network

Info

Publication number
CN111710331A
CN111710331A (application CN202010854664.XA)
Authority
CN
China
Prior art keywords
feature vector, deep neural network, slice, classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010854664.XA
Other languages
Chinese (zh)
Other versions
CN111710331B (en)
Inventor
蒋忆
郁强
沈瑶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CCI China Co Ltd
Original Assignee
CCI China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CCI China Co Ltd filed Critical CCI China Co Ltd
Priority to CN202010854664.XA
Publication of CN111710331A
Application granted
Publication of CN111710331B
Active legal status (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0635 Training updating or merging of old and new templates; Mean values; Weighting

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a voice case filing method and device based on a multi-slice deep neural network. The method comprises the following steps: acquiring an initial feature vector of a voice used for case filing; inputting the initial feature vector into a main deep neural network to obtain a main feature vector; splitting the main feature vector into at least three segmented feature vectors along the length dimension of the main feature vector; inputting the at least three segmented feature vectors into at least three slice deep neural networks, respectively, to obtain at least three sub-feature vectors, wherein the at least three slice deep neural networks are obtained through independent training; combining the at least three sub-feature vectors to obtain a feature vector for classification; and classifying the feature vector for classification with a Softmax classifier to obtain a classification result indicating whether to file a case based on the voice used for case filing. In this way, the classification accuracy is improved and the classification performance is more robust.

Description

Voice case filing method and device based on a multi-slice deep neural network
Technical Field
The present application relates to the field of artificial intelligence technology, and more particularly, to a voice case filing method and apparatus based on a multi-slice deep neural network, and to an electronic device.
Background
The smart city is an advanced form of city informatization that applies a new generation of information technology to all industries in the city. In modern city management, case filing occurs frequently and touches every aspect of city management. Filing a case is the initial stage of the whole litigation process and an indispensable step of criminal litigation. At present, case-filing work mainly depends on manpower; it is inefficient and limited in timeliness (no case can be filed once the relevant personnel are off duty).
At present, deep learning and neural networks have been widely applied in the fields of computer vision, natural language processing, speech signal processing, and the like. In addition, deep learning and neural networks also exhibit a level close to or even exceeding that of humans in the fields of image classification, object detection, semantic segmentation, text translation, and the like.
The development of deep learning and neural networks provides new solutions for case-filing work.
Disclosure of Invention
The present application is proposed to solve the above technical problems. The embodiments of the application provide a voice case filing method and apparatus based on a multi-slice deep neural network, and an electronic device, in which a single deep neural network model is partitioned into a main deep neural network in the depth direction and a plurality of slice deep neural networks in the height direction, and a plurality of sub-feature vectors are output from the plurality of slice deep neural networks, so that when the resulting sub-feature vectors are classified by a Softmax classification function, the classification accuracy is improved and the classification performance is more robust.
According to one aspect of the present application, there is provided a voice case filing method based on a multi-slice deep neural network, comprising:
acquiring an initial feature vector of a voice used for case filing;
inputting the initial feature vector into a main deep neural network to obtain a main feature vector;
splitting the main feature vector into three or more segmented feature vectors along the length dimension of the main feature vector;
inputting the three or more segmented feature vectors into three or more slice deep neural networks, respectively, to obtain three or more sub-feature vectors, wherein the three or more slice deep neural networks and the main deep neural network are partitions of a deep neural network model in height and in depth, respectively, and the three or more slice deep neural networks are obtained through independent training;
combining the three or more sub-feature vectors to obtain a feature vector for classification; and
classifying the feature vector for classification with a Softmax classifier to obtain a classification result of the feature vector for classification, the classification result indicating whether to file a case based on the voice used for case filing.
In the above voice case filing method based on a multi-slice deep neural network, splitting the main feature vector into three or more segmented feature vectors along the length dimension of the main feature vector comprises: splitting the main feature vector into three equal-length segmented feature vectors along the length dimension of the main feature vector.
In the above voice case filing method based on a multi-slice deep neural network, combining the three or more sub-feature vectors to obtain a feature vector for classification comprises: stacking the three sub-feature vectors in parallel to obtain a feature map; and max-pooling the feature map along the stacking direction to obtain the feature vector for classification.
In the above voice case filing method based on a multi-slice deep neural network, combining the three or more sub-feature vectors to obtain the feature vector for classification comprises: concatenating the three sub-feature vectors along the length direction of the sub-feature vectors to obtain the feature vector for classification.
In the above voice case filing method based on a multi-slice deep neural network, obtaining an initial feature vector of a voice used for case filing comprises: acquiring the voice used for case filing; converting the voice into text; and converting the text into the initial feature vector through a word embedding model.
In the above voice case filing method based on a multi-slice deep neural network, the training process of the main deep neural network and the three or more slice deep neural networks comprises:
acquiring a training feature vector of a case-filing voice used for training;
inputting the training feature vector into the main deep neural network to obtain a training main feature vector;
splitting the training main feature vector into three or more training segmented feature vectors along the length dimension;
during the training of each of the three or more slice deep neural networks:
inputting one of the three or more training segmented feature vectors into the slice deep neural network to obtain a training sub-feature vector;
passing the training sub-feature vector through a Softmax classifier to obtain a Softmax loss function; and
updating the parameters of the slice deep neural network through back propagation of gradient descent based on the Softmax loss function.
In the above voice case filing method based on a multi-slice deep neural network, the three or more slice deep neural networks are trained in parallel.
In the above voice case filing method based on a multi-slice deep neural network, acquiring the training feature vector of the case-filing voice used for training comprises: obtaining a voice data set of case-filing voices for training, wherein the voice data set comprises positive samples labeled as successfully filed and negative samples labeled as not filed; converting the case-filing voice of a positive sample and the case-filing voice of a negative sample in the voice data set into a positive sample feature vector and a negative sample feature vector, respectively; and splicing the positive sample feature vector and the negative sample feature vector into the training feature vector.
According to another aspect of the present application, there is provided a voice case filing apparatus based on a multi-slice deep neural network, comprising:
an initial feature vector acquisition unit, configured to acquire an initial feature vector of a voice used for case filing;
a main feature vector generation unit, configured to input the initial feature vector obtained by the initial feature vector acquisition unit into a main deep neural network to obtain a main feature vector;
a segmented feature vector generation unit, configured to split the main feature vector obtained by the main feature vector generation unit into three or more segmented feature vectors along the length dimension of the main feature vector;
a sub-feature vector generation unit, configured to input the three or more segmented feature vectors obtained by the segmented feature vector generation unit into three or more slice deep neural networks, respectively, to obtain three or more sub-feature vectors, wherein the three or more slice deep neural networks and the main deep neural network are partitions of a deep neural network model in height and in depth, respectively, and the three or more slice deep neural networks are obtained through independent training;
a classification feature vector generation unit, configured to combine the three or more sub-feature vectors obtained by the sub-feature vector generation unit to obtain a feature vector for classification; and
a classification unit, configured to classify the feature vector for classification obtained by the classification feature vector generation unit with a Softmax classifier to obtain a classification result of the feature vector for classification, the classification result indicating whether to file a case based on the voice used for case filing.
In the above voice case filing apparatus based on a multi-slice deep neural network, the segmented feature vector generation unit is further configured to split the main feature vector obtained by the main feature vector generation unit into three equal-length segmented feature vectors along the length dimension of the main feature vector.
In the above voice case filing apparatus based on a multi-slice deep neural network, the classification feature vector generation unit comprises:
a stacking subunit, configured to stack the three sub-feature vectors obtained by the sub-feature vector generation unit in parallel to obtain a feature map; and
a pooling subunit, configured to max-pool the feature map obtained by the stacking subunit along the stacking direction to obtain the feature vector for classification.
In the above voice case filing apparatus based on a multi-slice deep neural network, the classification feature vector generation unit is further configured to concatenate the three sub-feature vectors obtained by the sub-feature vector generation unit along the length direction of the sub-feature vectors to obtain the feature vector for classification.
In the above voice case filing apparatus based on a multi-slice deep neural network, the initial feature vector acquisition unit comprises:
a voice acquisition subunit, configured to acquire the voice used for case filing;
a text conversion subunit, configured to convert the voice obtained by the voice acquisition subunit into text; and
a vector conversion subunit, configured to convert the text obtained by the text conversion subunit into the initial feature vector through a word embedding model.
In the above voice case filing apparatus based on a multi-slice deep neural network, the apparatus further comprises a training unit, configured to:
acquire a training feature vector of a case-filing voice used for training;
input the training feature vector into the main deep neural network to obtain a training main feature vector;
split the training main feature vector into three or more training segmented feature vectors along the length dimension;
and, during the training of each of the three or more slice deep neural networks:
input one of the three or more training segmented feature vectors into the slice deep neural network to obtain a training sub-feature vector;
pass the training sub-feature vector through a Softmax classifier to obtain a Softmax loss function; and
update the parameters of the slice deep neural network through back propagation of gradient descent based on the Softmax loss function.
In the above voice case filing apparatus based on a multi-slice deep neural network, the three or more slice deep neural networks are trained in parallel.
In the above voice case filing apparatus based on a multi-slice deep neural network, the training unit is further configured to: obtain a voice data set of case-filing voices for training, wherein the voice data set comprises positive samples labeled as successfully filed and negative samples labeled as not filed; convert the case-filing voice of a positive sample and the case-filing voice of a negative sample in the voice data set into a positive sample feature vector and a negative sample feature vector, respectively; and splice the positive sample feature vector and the negative sample feature vector into the training feature vector.
According to still another aspect of the present application, there is provided an electronic apparatus, comprising: a processor; and a memory having stored therein computer program instructions which, when executed by the processor, cause the processor to perform the voice case filing method based on a multi-slice deep neural network as described above.
According to yet another aspect of the present application, there is provided a computer-readable medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the voice case filing method based on a multi-slice deep neural network as described above.
Compared with the prior art, the voice case filing method, apparatus, and electronic device based on a multi-slice deep neural network partition a single deep neural network model into a main deep neural network in the depth direction and a plurality of slice deep neural networks in the height direction, and output a plurality of sub-feature vectors from the plurality of slice deep neural networks, thereby improving the classification accuracy and the robustness of the classification performance.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.
Fig. 1 illustrates an application scenario diagram of a voice case filing method based on a multi-slice deep neural network according to an embodiment of the present application.
Fig. 2 illustrates a flowchart of a voice case filing method based on a multi-slice deep neural network according to an embodiment of the present application.
Fig. 3 illustrates a schematic diagram of the system architecture of a voice case filing method based on a multi-slice deep neural network according to an embodiment of the present application.
Fig. 4 illustrates a flowchart of a training method for the main deep neural network and the three or more slice deep neural networks in a voice case filing method based on a multi-slice deep neural network according to an embodiment of the present application.
Fig. 5 illustrates a block diagram of a voice case filing apparatus based on a multi-slice deep neural network according to an embodiment of the present application.
Fig. 6 illustrates a block diagram of the classification feature vector generation unit in a voice case filing apparatus based on a multi-slice deep neural network according to an embodiment of the present application.
Fig. 7 illustrates a block diagram of the initial feature vector acquisition unit in a voice case filing apparatus based on a multi-slice deep neural network according to an embodiment of the present application.
FIG. 8 illustrates a block diagram of an electronic device in accordance with an embodiment of the present application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein.
Overview of a scene
As described above, in modern city management, case filing occurs frequently and touches every aspect of city management; at present, case-filing work mainly depends on manpower and is inefficient and limited in timeliness. The development of deep learning and neural networks provides new solutions for case-filing work.
Specifically, filing a case by voice essentially amounts to classifying the voice used for filing, that is, performing a binary classification on the voice to decide whether or not to file a case based on it. In practical applications, however, the criteria for whether to file a case are relatively fuzzy and far from uniform; the scenes involved span every aspect of city management and may differ greatly from one another, so a statistical rule for deciding whether to file can hardly be extracted. Moreover, if the voice is simply converted into a feature vector and binary-classified with a plain deep neural network plus Softmax classifier structure, the misjudgment rate is high and the robustness of the judgment accuracy is low.
The inventors of the present application found that this loss of judgment accuracy is partly caused by the fuzziness of the filing rules themselves and by the differences among the rules in different scenes: in the high-dimensional feature space produced by the deep neural network, the class boundaries and the data manifolds intersect, so the feature points in the space cannot be cleanly separated by class boundaries. Because of the linear characteristics of the deep neural network, this intersection between class boundaries and data manifolds is passed on from layer to layer, so that when the voice is converted into a feature vector and classified by a Softmax classification function, the classification accuracy drops and the robustness of the classification performance is low.
In view of the above technical problems, the basic concept of the present application is to partition a single deep neural network model: a main deep neural network is first cut in the depth direction of the model, and a plurality of slice deep neural networks are then cut in the height direction of the model, the slice deep neural networks being trained independently. A plurality of sub-feature vectors are then output from the slice deep neural networks. Although each slice deep neural network is still linear in character, different slice deep neural networks tend to converge in different directions during independent training, so that, compared with feature vectors obtained from a single deep neural network, the resulting sub-feature vectors cluster within each set while the sets are decorrelated from one another. This is equivalent to shrinking the data sub-manifold corresponding to each sub-feature vector and increasing the distances between the sub-manifolds, so that when the resulting sub-feature vectors are classified by the Softmax classification function, the classification accuracy improves and the classification performance is more robust.
Based on this, the present application proposes a voice case filing method based on a multi-slice deep neural network, which comprises: acquiring an initial feature vector of a voice used for case filing; inputting the initial feature vector into a main deep neural network to obtain a main feature vector; splitting the main feature vector into three or more segmented feature vectors along the length dimension of the main feature vector; inputting the three or more segmented feature vectors into three or more slice deep neural networks, respectively, to obtain three or more sub-feature vectors, wherein the three or more slice deep neural networks and the main deep neural network are partitions of a deep neural network model in height and in depth, respectively, and the three or more slice deep neural networks are obtained through independent training; combining the three or more sub-feature vectors to obtain a feature vector for classification; and classifying the feature vector for classification with a Softmax classifier to obtain a classification result indicating whether to file a case based on the voice used for case filing.
Fig. 1 illustrates an application scenario diagram of a voice case filing method based on a multi-slice deep neural network according to an embodiment of the present application.
In the application scenario illustrated in Fig. 1, case-filing voices for training are collected first in the training phase; then, a main deep neural network and at least three slice deep neural networks deployed in a server (e.g., S as illustrated in Fig. 1) are trained on those voices. After training is completed, in the detection phase, the voice for case filing to be processed is input into the server, where the trained main deep neural network and the at least three slice deep neural networks classify it to obtain a classification result indicating whether or not to file a case based on that voice.
Having described the general principles of the present application, various non-limiting embodiments of the present application will now be described with reference to the accompanying drawings.
Exemplary method
Fig. 2 illustrates a flowchart of a voice case filing method based on a multi-slice deep neural network according to an embodiment of the present application.
As shown in Fig. 2, a voice case filing method based on a multi-slice deep neural network according to an embodiment of the present application comprises the steps of: S110, acquiring an initial feature vector of a voice used for case filing; S120, inputting the initial feature vector into a main deep neural network to obtain a main feature vector; S130, splitting the main feature vector into three or more segmented feature vectors along the length dimension of the main feature vector; S140, inputting the three or more segmented feature vectors into three or more slice deep neural networks, respectively, to obtain three or more sub-feature vectors, wherein the three or more slice deep neural networks and the main deep neural network are partitions of a deep neural network model in height and in depth, respectively, and the three or more slice deep neural networks are obtained through independent training; S150, combining the three or more sub-feature vectors to obtain a feature vector for classification; and S160, classifying the feature vector for classification with a Softmax classifier to obtain a classification result of the feature vector for classification, the classification result indicating whether to file a case based on the voice used for case filing.
Fig. 3 illustrates a schematic diagram of the system architecture of a voice case filing method based on a multi-slice deep neural network according to an embodiment of the present application. As shown in Fig. 3, the system architecture comprises a main deep neural network (e.g., DNNm as illustrated in Fig. 3) and at least three slice deep neural networks (in the example illustrated in Fig. 3, three slice deep neural networks, DNN1, DNN2, and DNN3). The main deep neural network processes the initial feature vector of the voice used for case filing (e.g., Vi as illustrated in Fig. 3) to obtain a main feature vector (e.g., Vm as illustrated in Fig. 3), which is then split into at least three segmented feature vectors along its length dimension (in the example illustrated in Fig. 3, three segmented feature vectors, Vs1, Vs2, and Vs3). The at least three slice deep neural networks process the segmented feature vectors to obtain at least three sub-feature vectors (in the example illustrated in Fig. 3, Vz1, Vz2, and Vz3), which are combined into a feature vector for classification (e.g., Vc as illustrated in Fig. 3). Finally, a Softmax classifier classifies the feature vector for classification to obtain a classification result indicating whether to file a case based on the voice used for case filing.
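To make the data flow above concrete, the following is a minimal PyTorch sketch of the forward pass. It is an illustration only: the layer sizes, the two-layer perceptron used as a stand-in for each network, and the names (EMBED_DIM, mlp, head, etc.) are assumptions of this sketch, not details given by the patent.

```python
# Minimal sketch of the multi-slice forward pass; all sizes are assumed.
import torch
import torch.nn as nn

EMBED_DIM = 768   # length of the initial feature vector Vi (assumed)
MAIN_DIM = 768    # length of the main feature vector Vm (assumed, divisible by 3)
SUB_DIM = 128     # length of each sub-feature vector Vz (assumed)

def mlp(in_dim: int, out_dim: int) -> nn.Module:
    # Generic stand-in for a deep neural network block.
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))

main_dnn = mlp(EMBED_DIM, MAIN_DIM)                         # DNNm
slice_dnns = nn.ModuleList(                                 # DNN1, DNN2, DNN3
    [mlp(MAIN_DIM // 3, SUB_DIM) for _ in range(3)])
head = nn.Linear(SUB_DIM, 2)                                # Softmax head: file / do not file

def forward(vi: torch.Tensor) -> torch.Tensor:
    vm = main_dnn(vi)                                       # main feature vector Vm
    vs = torch.chunk(vm, 3, dim=-1)                         # Vs1, Vs2, Vs3 (equal lengths)
    vz = [dnn(v) for dnn, v in zip(slice_dnns, vs)]         # Vz1, Vz2, Vz3
    vc = torch.stack(vz, dim=0).max(dim=0).values           # combine by max-pooling
    return torch.softmax(head(vc), dim=-1)                  # classification result

print(forward(torch.randn(EMBED_DIM)))                      # two class probabilities
```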
In step S110, an initial feature vector of the voice used for case filing is acquired. In the embodiment of the present application, this comprises: first acquiring the voice used for case filing (for example, capturing the voice of a target object through a sound sensor); then converting the voice into text (for example, converting the collected voice data into text data by a speech recognition technique); and then converting the text into the initial feature vector through a word embedding model.
It is worth mentioning that converting the case-filing voice into text facilitates further data processing, such as adding format information and text annotation. Of course, in other examples of the present application, the collected voice may also be converted directly into the initial feature vector and input into the deep neural network model for processing, which is not limited by the present application.
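A sketch of this preprocessing chain might look as follows. The transcribe function, the whitespace tokenizer, and the average-pooled embedding are deliberate placeholders and assumptions of this sketch, since the patent does not name a specific speech recognition engine, tokenizer, or word embedding model.

```python
# Illustrative preprocessing: voice -> text -> initial feature vector.
# transcribe() is a hypothetical placeholder for any ASR engine.
from typing import List
import torch
import torch.nn as nn

def transcribe(waveform: torch.Tensor) -> str:
    raise NotImplementedError("plug in any speech recognition engine here")

def tokenize(text: str) -> List[str]:
    # Whitespace split as a stand-in; a real system (especially for
    # Chinese text) would use a proper tokenizer here.
    return text.split()

def initial_feature_vector(text: str, vocab: dict, embedding: nn.Embedding) -> torch.Tensor:
    # Look up each token's embedding (unknown tokens map to index 0) and
    # average-pool into a single fixed-length initial feature vector.
    ids = torch.tensor([vocab.get(tok, 0) for tok in tokenize(text)])
    return embedding(ids).mean(dim=0)
```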
In step S120, the initial feature vector is input into the main deep neural network to obtain the main feature vector. Here, the main deep neural network is obtained by partitioning the deep neural network model in the depth direction, that is, by truncating the deep neural network model at a certain number of layers in the depth direction.
In step S130, the main feature vector is split into three or more segmented feature vectors along the length dimension of the main feature vector. Preferably, in the embodiment of the present application, the main feature vector is split into three equal-length segmented feature vectors: equal-length segments are computationally simpler to handle, and each segment contains the same number of features, so that when classification is finally performed on the feature vector for classification, all positions can be weighted equally, which improves the classification accuracy. That is, in the embodiment of the present application, splitting the main feature vector into three or more segmented feature vectors along the length dimension of the main feature vector comprises: splitting the main feature vector into three equal-length segmented feature vectors along the length dimension of the main feature vector.
Of course, in other examples of the present application, the main feature vector may be split into three or more segmented feature vectors along its length dimension in other manners, for example, into three or more segmented feature vectors of slightly different lengths, as in the sketch below.
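Both the equal-length split and the unequal variant are single tensor operations; a minimal sketch, assuming a main feature vector of length 768:

```python
# Splitting the main feature vector Vm along its length dimension.
import torch

vm = torch.randn(768)                        # main feature vector (length assumed)
vs1, vs2, vs3 = torch.chunk(vm, 3, dim=-1)   # three equal-length segments of 256
assert vs1.shape == vs2.shape == vs3.shape == (256,)

# Unequal-length variant mentioned above (sizes are arbitrary examples):
a, b, c = torch.split(vm, [250, 256, 262], dim=-1)
```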
In step S140, the three or more segmented feature vectors are input into three or more slice deep neural networks, respectively, to obtain three or more sub-feature vectors, where the three or more slice deep neural networks and the main deep neural network are partitions of the deep neural network model in height and in depth, respectively, and the three or more slice deep neural networks are obtained through independent training.
Accordingly, step S140 amounts to partitioning the deep neural network model into a plurality of slice deep neural networks in the height direction and outputting a plurality of sub-feature vectors from them. In particular, in the embodiments of the present application, the slice deep neural networks are trained individually, so that they tend to converge in different directions during their independent training. Thus, although each slice deep neural network is still linear in character, the different slice deep neural networks converge toward different directions, so that, compared with feature vectors obtained from a single deep neural network, the resulting sub-feature vectors cluster within each set while the sets are decorrelated from one another. This is equivalent to shrinking the data sub-manifold corresponding to each sub-feature vector and increasing the distances between the sub-manifolds, so that when the resulting sub-feature vectors are classified by the Softmax classification function, the classification accuracy improves and the classification performance is more robust.
Here, three or more slice deep neural networks are used in order to increase the randomness of the convergence directions: with only two slice deep neural networks, the convergence directions easily become symmetric, which degrades the classification effect of the final feature vector.
In step S150, the three or more sub-feature vectors are combined to obtain a feature vector for classification.
In a specific example of the present application, combining the three or more sub-feature vectors to obtain a feature vector for classification comprises: first stacking the three sub-feature vectors in parallel to obtain a feature map, and then max-pooling the feature map along the stacking direction to obtain the feature vector for classification.
Stacking the three sub-feature vectors into a feature map in parallel and then max-pooling along the stacking direction is equivalent to selecting, at each position along the length direction, the maximum value among the three sub-feature vectors as the feature value of the final vector, which further reduces the positionwise correlation of the resulting feature vector for classification.
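A minimal sketch of this combination, assuming three sub-feature vectors of length 128:

```python
# Stack the three sub-feature vectors into a feature map, then take the
# positionwise maximum over the stacking direction.
import torch

vz1, vz2, vz3 = torch.randn(128), torch.randn(128), torch.randn(128)
feature_map = torch.stack([vz1, vz2, vz3], dim=0)   # shape (3, 128)
vc = feature_map.max(dim=0).values                  # shape (128,)
```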
In another specific example of the present application, combining the three or more sub-feature vectors to obtain the feature vector for classification comprises: concatenating the three sub-feature vectors along their length direction to obtain the feature vector for classification.
This concatenation is computationally simple, and because the sub-feature vectors are joined end to end along the length direction, the feature values at corresponding positions of the different sub-feature vectors remain a fixed distance apart in the concatenated vector, which preserves the classification accuracy to a certain extent.
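The corresponding sketch, under the same length assumption as above:

```python
# Concatenate the three sub-feature vectors end to end along the length direction.
import torch

vz1, vz2, vz3 = torch.randn(128), torch.randn(128), torch.randn(128)
vc = torch.cat([vz1, vz2, vz3], dim=-1)   # shape (384,)
```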
In step S160, the feature vector for classification is classified with a Softmax classifier to obtain a classification result of the feature vector for classification, the classification result indicating whether to file a case based on the voice used for case filing. That is, the result of combining the at least three sub-feature vectors is classified by the Softmax classifier to decide whether or not to file the case.
In summary, the voice case filing method based on a multi-slice deep neural network according to an embodiment of the present application has been elucidated: a single deep neural network is partitioned into a main deep neural network in the depth direction and a plurality of slice deep neural networks in the height direction of the model, and a plurality of sub-feature vectors are output from the slice deep neural networks, so that when the resulting sub-feature vectors are classified by the Softmax classification function, the classification accuracy is improved and the classification performance is more robust.
It should be noted that, in the embodiment of the present application, the main deep neural network and the at least three slice deep neural networks in the system architecture of the voice case filing method are obtained by training on a voice data set of case-filing voices for training, where each voice in the data set carries a label indicating whether a case was filed.
Specifically, the main deep neural network and the at least three slice deep neural networks according to the embodiment of the present application may be trained as follows.
First, training feature vectors of the case-filing voices for training are acquired. Specifically, in this example, this comprises: obtaining a voice data set of case-filing voices for training, wherein the voice data set comprises positive samples labeled as successfully filed and negative samples labeled as not filed; converting the case-filing voice of a positive sample and the case-filing voice of a negative sample in the voice data set into a positive sample feature vector and a negative sample feature vector, respectively; and splicing the positive sample feature vector and the negative sample feature vector into the training feature vector. Obtaining the training feature vectors in this way allows, on the one hand, the relational features between positive and negative samples to be learned through joint training and, on the other hand, alleviates the scarcity of negative samples through the idea of resampling.
Then, the training feature vector is input into the main deep neural network to obtain a training main feature vector;
then, the training main feature vector is split into three or more training segmented feature vectors along the length dimension;
and, during the training of each of the three or more slice deep neural networks: one of the three or more training segmented feature vectors is first input into the slice deep neural network to obtain a training sub-feature vector; the training sub-feature vector is then passed through a Softmax classifier to obtain a Softmax loss function; and the parameters of the slice deep neural network are then updated through back propagation of gradient descent based on the Softmax loss function.
It should be understood that, because the three or more slice deep neural networks are trained separately, they tend to converge in different directions, so that the sub-feature vectors they produce cluster within each set while the sets are decorrelated from one another, compared with feature vectors obtained from a single deep neural network. Therefore, when the resulting sub-feature vectors are classified by the Softmax classification function, the classification accuracy improves and the classification performance is more robust.
Fig. 4 illustrates a flowchart of the training method for the main deep neural network and the three or more slice deep neural networks in a voice case filing method based on a multi-slice deep neural network according to an embodiment of the present application. As shown in Fig. 4, the training method comprises: S210, acquiring training feature vectors of the case-filing voices for training; S220, inputting the training feature vector into the main deep neural network to obtain a training main feature vector; S230, splitting the training main feature vector into three or more training segmented feature vectors along the length dimension; S240, inputting one of the three or more training segmented feature vectors into each slice deep neural network to obtain a training sub-feature vector; S250, passing the training sub-feature vector through a Softmax classifier to obtain a Softmax loss function; and S260, updating the parameters of each slice deep neural network through back propagation of gradient descent based on the Softmax loss function.
It should be noted that, in the embodiment of the present application, the three or more slice deep neural networks can be trained in parallel, that is, simultaneously, which saves training time. Of course, they can also be trained one by one, which is not limited by the present application. Moreover, to make it easier for the three or more slice deep neural networks to converge in different directions, the subsets of the voice data set used to train them may be different subsets. A sketch of one training step per slice is given below.
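The per-slice training step can be sketched as follows. The optimizer, learning rate, and the use of one classifier head per slice are assumptions of the sketch, not choices stated in the patent.

```python
# One independent training step for a single slice deep neural network,
# driven by its own Softmax (cross-entropy) loss. Hyperparameters are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_slice_step(slice_dnn: nn.Module, head: nn.Module,
                     segment: torch.Tensor, label: torch.Tensor,
                     optimizer: torch.optim.Optimizer) -> float:
    sub_vec = slice_dnn(segment)                        # training sub-feature vector
    loss = F.cross_entropy(head(sub_vec).unsqueeze(0),  # Softmax loss
                           label.view(1))
    optimizer.zero_grad()
    loss.backward()                                     # back propagation
    optimizer.step()                                    # gradient-descent update
    return loss.item()

# Example wiring for one slice (sizes follow the earlier sketch):
slice_dnn = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 128))
head = nn.Linear(128, 2)
opt = torch.optim.SGD(list(slice_dnn.parameters()) + list(head.parameters()), lr=1e-3)
loss = train_slice_step(slice_dnn, head, torch.randn(256), torch.tensor(1), opt)
```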
Exemplary devices
Fig. 5 illustrates a block diagram of a voice case filing apparatus based on a multi-slice deep neural network according to an embodiment of the present application.
As shown in Fig. 5, the voice case filing apparatus 500 according to the embodiment of the present application comprises: an initial feature vector acquisition unit 510, configured to acquire an initial feature vector of a voice used for case filing; a main feature vector generation unit 520, configured to input the initial feature vector obtained by the initial feature vector acquisition unit 510 into a main deep neural network to obtain a main feature vector; a segmented feature vector generation unit 530, configured to split the main feature vector obtained by the main feature vector generation unit 520 into three or more segmented feature vectors along the length dimension of the main feature vector; a sub-feature vector generation unit 540, configured to input the three or more segmented feature vectors obtained by the segmented feature vector generation unit 530 into three or more slice deep neural networks, respectively, to obtain three or more sub-feature vectors, wherein the three or more slice deep neural networks and the main deep neural network are partitions of a deep neural network model in height and in depth, respectively, and the three or more slice deep neural networks are obtained through independent training; a classification feature vector generation unit 550, configured to combine the three or more sub-feature vectors obtained by the sub-feature vector generation unit 540 to obtain a feature vector for classification; and a classification unit 560, configured to classify the feature vector for classification obtained by the classification feature vector generation unit 550 with a Softmax classifier to obtain a classification result of the feature vector for classification, the classification result indicating whether to file a case based on the voice used for case filing.
In one example, in the above voice case filing apparatus 500 based on a multi-slice deep neural network, the segmented feature vector generation unit 530 is further configured to split the main feature vector obtained by the main feature vector generation unit 520 into three equal-length segmented feature vectors along the length dimension of the main feature vector.
In one example, as shown in Fig. 6, in the above voice case filing apparatus 500 based on a multi-slice deep neural network, the classification feature vector generation unit 550 comprises: a stacking subunit 551, configured to stack the three sub-feature vectors obtained by the sub-feature vector generation unit 540 in parallel to obtain a feature map; and a pooling subunit 552, configured to max-pool the feature map obtained by the stacking subunit 551 along the stacking direction to obtain the feature vector for classification.
In one example, in the above voice case filing apparatus 500 based on a multi-slice deep neural network, the classification feature vector generation unit 550 is further configured to concatenate the three sub-feature vectors obtained by the sub-feature vector generation unit 540 along the length direction of the sub-feature vectors to obtain the feature vector for classification.
In one example, as shown in Fig. 7, in the above voice case filing apparatus 500 based on a multi-slice deep neural network, the initial feature vector acquisition unit 510 comprises: a voice acquisition subunit 511, configured to acquire the voice used for case filing; a text conversion subunit 512, configured to convert the voice obtained by the voice acquisition subunit 511 into text; and a vector conversion subunit 513, configured to convert the text obtained by the text conversion subunit 512 into the initial feature vector through a word embedding model.
In one example, the above voice case filing apparatus 500 based on a multi-slice deep neural network further comprises a training unit 570, configured to: acquire a training feature vector of a case-filing voice used for training; input the training feature vector into the main deep neural network to obtain a training main feature vector; split the training main feature vector into three or more training segmented feature vectors along the length dimension; and, during the training of each of the three or more slice deep neural networks: input one of the three or more training segmented feature vectors into the slice deep neural network to obtain a training sub-feature vector; pass the training sub-feature vector through a Softmax classifier to obtain a Softmax loss function; and update the parameters of the slice deep neural network through back propagation of gradient descent based on the Softmax loss function.
In one example, in the above voice case filing apparatus 500 based on a multi-slice deep neural network, the three or more slice deep neural networks are trained in parallel.
In one example, in the above voice case filing apparatus 500 based on a multi-slice deep neural network, the training unit 570 is further configured to: obtain a voice data set of case-filing voices for training, wherein the voice data set comprises positive samples labeled as successfully filed and negative samples labeled as not filed; convert the case-filing voice of a positive sample and the case-filing voice of a negative sample in the voice data set into a positive sample feature vector and a negative sample feature vector, respectively; and splice the positive sample feature vector and the negative sample feature vector into the training feature vector.
Here, it will be understood by those skilled in the art that the specific functions and operations of the respective units and modules in the above voice case filing apparatus 500 based on a multi-slice deep neural network have been described in detail in the above description of the voice case filing method with reference to Figs. 1 to 4, and a repetitive description thereof is therefore omitted.
As described above, the voice case filing apparatus 500 based on a multi-slice deep neural network according to the embodiment of the present application may be implemented in various terminal devices, such as a server for updating the neural network. In one example, the apparatus 500 may be integrated into a terminal device as a software module and/or a hardware module: it may be a software module in the operating system of the terminal device, an application developed for the terminal device, or one of the many hardware modules of the terminal device.
Alternatively, in another example, the voice case filing apparatus 500 and the terminal device may be separate devices, with the apparatus 500 connected to the terminal device through a wired and/or wireless network and exchanging information in an agreed data format.
Exemplary electronic device
Next, an electronic apparatus according to an embodiment of the present application is described with reference to fig. 8.
FIG. 8 illustrates a block diagram of an electronic device in accordance with an embodiment of the present application.
As shown in fig. 8, the electronic device 10 includes one or more processors 11 and memory 12.
The processor 11 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 10 to perform desired functions.
The memory 12 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, or flash memory. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 11 to implement the voice case filing method based on a multi-slice deep neural network of the various embodiments of the present application described above and/or other desired functions. Various contents such as the initial feature vector, the main feature vector, and the segmented feature vectors may also be stored in the computer-readable storage medium.
In one example, the electronic device 10 may further include: an input device 13 and an output device 14, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
The input device 13 may include, for example, a keyboard, a mouse, and the like.
The output device 14 can output various information including the classification result to the outside. The output devices 14 may include, for example, a display, speakers, a printer, and a communication network and its connected remote output devices, among others.
Of course, for simplicity, only some of the components of the electronic device 10 relevant to the present application are shown in fig. 8, and components such as buses, input/output interfaces, and the like are omitted. In addition, the electronic device 10 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage medium
In addition to the above-described methods and apparatus, embodiments of the present application may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps of the voice case filing method based on a multi-slice deep neural network according to the various embodiments of the present application described in the "Exemplary method" section of this specification.
The computer program product may be written with program code for performing the operations of embodiments of the present application in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as the "C" programming language. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps of the voice case filing method based on a multi-slice deep neural network according to the various embodiments of the present application described in the "Exemplary method" section of this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present application in conjunction with specific embodiments; however, it is noted that the advantages, effects, and the like mentioned in the present application are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present application. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be exhaustive or to limit the disclosure to the precise details disclosed.
The block diagrams of the devices, apparatuses, and systems referred to in this application are given only as illustrative examples and are not intended to require or imply that they must be connected, arranged, or configured in the manner shown; these devices, apparatuses, and systems may be connected, arranged, and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including", "comprising", and "having" are open-ended words that mean "including, but not limited to" and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, "and/or", unless the context clearly dictates otherwise. The phrase "such as" as used herein means, and is used interchangeably with, "such as but not limited to".
It should also be noted that in the devices, apparatuses, and methods of the present application, the components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application.

Claims (10)

1. A voice scheme setting method based on a multi-slice deep neural network, characterized by comprising the following steps:
acquiring an initial feature vector of a speech for case filing;
inputting the initial feature vector into a main deep neural network to obtain a main feature vector;
segmenting the main feature vector, along its length dimension, into three or more slice feature vectors;
inputting the three or more slice feature vectors respectively into three or more slice deep neural networks to obtain three or more sub-feature vectors, wherein the three or more slice deep neural networks and the main deep neural network are slices of a deep neural network model in its depth and height dimensions, respectively, and the three or more slice deep neural networks are trained independently of one another;
combining the three or more sub-feature vectors to obtain a feature vector for classification; and
classifying the feature vector for classification with a Softmax classifier to obtain a classification result, the classification result indicating whether a case is to be filed based on the speech for case filing.
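By way of non-limiting illustration only (this sketch is not part of the claims), the pipeline of claim 1 might be rendered in PyTorch roughly as follows; every module name, layer choice, and dimension here is invented for clarity, since the claim fixes none of them:

    import torch
    import torch.nn as nn

    class MultiSliceClassifier(nn.Module):
        # Hypothetical dimensions; the claim does not fix any of these.
        def __init__(self, in_dim=300, hidden=256, num_slices=3, num_classes=2):
            super().__init__()
            # "Main deep neural network": produces the main feature vector.
            self.main_net = nn.Sequential(
                nn.Linear(in_dim, hidden * num_slices), nn.ReLU())
            # Three or more independently trained "slice deep neural networks".
            self.slice_nets = nn.ModuleList(
                [nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
                 for _ in range(num_slices)])
            self.head = nn.Linear(hidden * num_slices, num_classes)

        def forward(self, x):
            main_vec = self.main_net(x)  # main feature vector
            # Segment the main feature vector along its length dimension
            # into equal-length slices (cf. claim 2).
            slices = torch.chunk(main_vec, len(self.slice_nets), dim=-1)
            # Each slice is processed by its own slice network.
            subs = [net(s) for net, s in zip(self.slice_nets, slices)]
            fused = torch.cat(subs, dim=-1)  # feature vector for classification
            return torch.softmax(self.head(fused), dim=-1)  # Softmax classification

    # Example call: probabilities for a batch of 8 initial feature vectors.
    probs = MultiSliceClassifier()(torch.randn(8, 300))

The point the sketch tries to capture is only the data flow: the main network's output is cut along its length, each slice passes through its own separately trained sub-network, and the pieces are recombined for a single Softmax decision.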
2. The voice scheme setting method based on a multi-slice deep neural network according to claim 1, wherein segmenting the main feature vector, along its length dimension, into three or more slice feature vectors comprises:
segmenting the main feature vector, along its length dimension, into three slice feature vectors of equal length.
3. The voice scheme setting method based on a multi-slice deep neural network according to claim 2, wherein combining the three or more sub-feature vectors to obtain the feature vector for classification comprises:
stacking the three sub-feature vectors in parallel to obtain a feature map; and
max-pooling the feature map along the stacking direction to obtain the feature vector for classification.
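As a hypothetical illustration of claim 3's combination step (again, not part of the claims), parallel stacking followed by max-pooling could look like this in PyTorch, with batch and length sizes assumed for concreteness:

    import torch

    batch, length = 8, 256
    sub1, sub2, sub3 = (torch.randn(batch, length) for _ in range(3))

    feature_map = torch.stack([sub1, sub2, sub3], dim=1)  # (batch, 3, length): parallel stacking
    class_vec = feature_map.max(dim=1).values             # max-pool along the stacking direction
    assert class_vec.shape == (batch, length)             # same length as one sub-feature vector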
4. The voice scheme setting method based on a multi-slice deep neural network according to claim 1 or 2, wherein combining the three or more sub-feature vectors to obtain the feature vector for classification comprises:
concatenating the three sub-feature vectors end to end, along their length dimension, to obtain the feature vector for classification.
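Claim 4's alternative combination is plain end-to-end concatenation; a minimal sketch with assumed shapes:

    import torch

    subs = [torch.randn(8, 256) for _ in range(3)]  # three sub-feature vectors
    class_vec = torch.cat(subs, dim=-1)             # spliced along the length dimension
    assert class_vec.shape == (8, 768)              # length triples instead of pooling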
5. The voice scheme setting method based on a multi-slice deep neural network according to claim 1, wherein acquiring the initial feature vector of the speech for case filing comprises:
acquiring the speech for case filing;
converting the speech into text; and
converting the text into the initial feature vector through a word embedding model.
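A minimal sketch of claim 5's front end, assuming a hypothetical asr object and a word2vec-style embeddings table (e.g. a dict mapping token to vector); neither is named by the claim, and mean pooling is merely one possible aggregation:

    import numpy as np

    def initial_feature_vector(audio_path, asr, embeddings):
        # asr and embeddings are invented stand-ins for an ASR engine
        # and a word embedding model.
        text = asr.transcribe(audio_path)   # speech -> text
        tokens = text.split()               # naive whitespace tokenization
        vecs = [embeddings[t] for t in tokens if t in embeddings]
        # Mean pooling is one simple choice; returns None if nothing matched.
        return np.mean(vecs, axis=0) if vecs else None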
6. The voice scheme setting method based on a multi-slice deep neural network according to claim 1, wherein the training process of the main deep neural network and the three or more slice deep neural networks comprises:
acquiring a training feature vector of a case-filing speech for training;
inputting the training feature vector into the main deep neural network to obtain a training main feature vector;
segmenting the training main feature vector, along its length dimension, into three or more training slice feature vectors; and,
during the training of each of the three or more slice deep neural networks:
inputting one of the three or more training slice feature vectors into the slice deep neural network to obtain a training sub-feature vector;
passing the training sub-feature vector through a Softmax classifier to obtain a Softmax loss; and
updating the parameters of the slice deep neural network by backpropagation with gradient descent based on the Softmax loss.
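One hedged reading of claim 6's per-slice update, sketched with PyTorch's CrossEntropyLoss (which fuses the Softmax with a negative log-likelihood loss); all function and parameter names are invented:

    import torch
    import torch.nn as nn

    def train_slice_step(slice_net, head, slice_batch, labels, lr=1e-3):
        # head: a Linear layer mapping the sub-feature vector to class logits.
        params = list(slice_net.parameters()) + list(head.parameters())
        opt = torch.optim.SGD(params, lr=lr)          # plain gradient descent
        sub = slice_net(slice_batch)                  # training sub-feature vector
        loss = nn.CrossEntropyLoss()(head(sub), labels)  # Softmax loss
        opt.zero_grad()
        loss.backward()   # backpropagation of the gradient
        opt.step()        # gradient-descent parameter update
        return loss.item()

Because each slice network has its own loss and its own update, the three training loops share no state, which is what makes the parallel training of claim 7 straightforward.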
7. The voice scheme setting method based on a multi-slice deep neural network according to claim 6, wherein the three or more slice deep neural networks are trained in parallel.
8. The voice scheme setting method based on a multi-slice deep neural network according to claim 7, wherein acquiring the training feature vector of the case-filing speech for training comprises:
acquiring a speech data set of case-filing speech for training, wherein the speech data set comprises positive samples labeled as successful case filings and negative samples labeled as failed case filings;
converting the case-filing speech of each positive sample and each negative sample in the speech data set into a positive sample feature vector and a negative sample feature vector, respectively; and
splicing the positive sample feature vectors and the negative sample feature vectors into the training feature vector.
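A toy illustration of claim 8's data preparation (random tensors stand in for real feature vectors, and the 1/0 label encoding is an assumption, not part of the claim):

    import torch

    pos = torch.randn(100, 300)  # feature vectors of samples labeled "filing succeeded"
    neg = torch.randn(100, 300)  # feature vectors of samples labeled "filing failed"

    features = torch.cat([pos, neg], dim=0)                  # spliced training feature vectors
    labels = torch.cat([torch.ones(100, dtype=torch.long),   # 1 = file the case
                        torch.zeros(100, dtype=torch.long)]) # 0 = do not file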
9. A voice scheme setting apparatus based on a multi-slice deep neural network, characterized by comprising:
an initial feature vector acquisition unit, configured to acquire an initial feature vector of a speech for case filing;
a main feature vector generation unit, configured to input the initial feature vector obtained by the initial feature vector acquisition unit into a main deep neural network to obtain a main feature vector;
a slice feature vector generation unit, configured to segment the main feature vector obtained by the main feature vector generation unit, along its length dimension, into three or more slice feature vectors;
a sub-feature vector generation unit, configured to input the three or more slice feature vectors obtained by the slice feature vector generation unit respectively into three or more slice deep neural networks to obtain three or more sub-feature vectors, wherein the three or more slice deep neural networks and the main deep neural network are slices of a deep neural network model in its depth and height dimensions, respectively, and the three or more slice deep neural networks are trained independently of one another;
a classification feature vector generation unit, configured to combine the three or more sub-feature vectors obtained by the sub-feature vector generation unit to obtain a feature vector for classification; and
a classification unit, configured to classify the feature vector for classification obtained by the classification feature vector generation unit with a Softmax classifier to obtain a classification result, the classification result indicating whether a case is to be filed based on the speech for case filing.
10. An electronic device, comprising:
a processor; and
a memory having stored therein computer program instructions that, when executed by the processor, cause the processor to perform the voice scheme setting method based on a multi-slice deep neural network according to any one of claims 1-8.
CN202010854664.XA 2020-08-24 2020-08-24 Voice scheme setting method and device based on multi-slice deep neural network Active CN111710331B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010854664.XA CN111710331B (en) 2020-08-24 2020-08-24 Voice scheme setting method and device based on multi-slice deep neural network

Publications (2)

Publication Number Publication Date
CN111710331A 2020-09-25
CN111710331B CN111710331B (en) 2020-11-24

Family

ID=72547364

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010854664.XA Active CN111710331B (en) 2020-08-24 2020-08-24 Voice scheme setting method and device based on multi-slice deep neural network

Country Status (1)

Country Link
CN (1) CN111710331B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109686382A (en) * 2018-12-29 2019-04-26 平安科技(深圳)有限公司 A kind of speaker clustering method and device
CN109739958A (en) * 2018-11-22 2019-05-10 普强信息技术(北京)有限公司 A kind of specification handbook answering method and system
US20190156817A1 (en) * 2017-11-22 2019-05-23 Baidu Usa Llc Slim embedding layers for recurrent neural language models
CN110379419A (en) * 2019-07-16 2019-10-25 湖南检信智能科技有限公司 Phonetic feature matching process based on convolutional neural networks
CN111341299A (en) * 2018-11-30 2020-06-26 阿里巴巴集团控股有限公司 Voice processing method and device

Similar Documents

Publication Publication Date Title
CN110188202B (en) Training method and device of semantic relation recognition model and terminal
CN108959246B (en) Answer selection method and device based on improved attention mechanism and electronic equipment
CN115203380B (en) Text processing system and method based on multi-mode data fusion
US11804069B2 (en) Image clustering method and apparatus, and storage medium
CN111950269A (en) Text statement processing method and device, computer equipment and storage medium
US20150310862A1 (en) Deep learning for semantic parsing including semantic utterance classification
JP7143456B2 (en) Medical Fact Verification Method and Verification Device, Electronic Device, Computer Readable Storage Medium, and Computer Program
CN110297888B (en) Domain classification method based on prefix tree and cyclic neural network
CN108664512B (en) Text object classification method and device
WO2022174496A1 (en) Data annotation method and apparatus based on generative model, and device and storage medium
CN114329029B (en) Object retrieval method, device, equipment and computer storage medium
CN113158656B (en) Ironic content recognition method, ironic content recognition device, electronic device, and storage medium
CN111061877A (en) Text theme extraction method and device
US20230114673A1 (en) Method for recognizing token, electronic device and storage medium
JP7005045B2 (en) Limit attack method against Naive Bayes classifier
CN113449084A (en) Relationship extraction method based on graph convolution
CN113434683A (en) Text classification method, device, medium and electronic equipment
CN115544303A (en) Method, apparatus, device and medium for determining label of video
CN113821616A (en) Domain-adaptive slot filling method, device, equipment and storage medium
CN113761868A (en) Text processing method and device, electronic equipment and readable storage medium
CN116994021A (en) Image detection method, device, computer readable medium and electronic equipment
CN112988964B (en) Text prosody boundary prediction method, device, equipment and storage medium
CN114547301A (en) Document processing method, document processing device, recognition model training equipment and storage medium
CN113870863A (en) Voiceprint recognition method and device, storage medium and electronic equipment
CN112364912A (en) Information classification method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant