CN112017638A - Voice semantic recognition model construction method, semantic recognition method, device and equipment

Info

Publication number
CN112017638A
CN112017638A (application CN202010938197.9A)
Authority
CN
China
Prior art keywords
voice
semantic
decoding
dimension
conditional probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010938197.9A
Other languages
Chinese (zh)
Inventor
符文君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202010938197.9A
Publication of CN112017638A
Legal status: Pending (current)

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G10L2015/0631 - Creating reference templates; Clustering


Abstract

The application relates to a voice semantic recognition model construction method, a semantic recognition method, a device, and equipment. The construction method comprises: extracting voice features from a voice sample signal; performing a masking operation on feature values randomly selected from the voice features according to a preset rule; after the masked voice features are encoded by the model, inputting the encoding result into a first decoding layer and a second decoding layer respectively for decoding; generating, based on the two decoding results, a first conditional probability and a second conditional probability that the voice features correspond to an i-th semantic tag; and determining that construction of the voice semantic recognition model is complete when the model meets a preset requirement according to the first and second conditional probabilities. The masking operation multiplies the number of samples that differ from the originals, reducing the overhead required for sample collection and labeling.

Description

Voice semantic recognition model construction method, semantic recognition method, device and equipment
Technical Field
The present application relates to the field of speech semantic recognition technology, and in particular to a speech semantic recognition model construction method, a semantic recognition method, and related devices and equipment.
Background
With the rise of applications such as voice assistants, speech semantic recognition technology is increasingly widespread. Currently, speech semantic recognition generally first converts speech into text with a speech recognition module, and then applies a natural language understanding module to that text to obtain the final semantic result.
In the related art, when a speech semantic recognition model is constructed, a large amount of pre-training data is used to pre-train the speech recognition module and the natural language understanding module, and a large amount of sample data is then used to further train the model built from the two modules until the training conditions are met. This process involves large amounts of pre-training data and sample data that usually must be collected and labeled in advance at high cost, making the construction of the speech semantic recognition model too expensive.
Disclosure of Invention
In order to overcome the problems in the related art at least to a certain extent, the present application provides a speech semantic recognition model construction method, a semantic recognition method, a device, and equipment.
According to a first aspect of the present application, there is provided a speech semantic recognition model construction method, the method comprising:
extracting voice features from the voice sample signal;
randomly selecting feature values from the voice features according to a preset selection rule and performing a masking operation on them;
inputting the masked voice features into a pre-constructed voice semantic recognition model, the model comprising: a coding layer, a first decoding layer, and a second decoding layer;
coding the masked voice features through the coding layer to obtain a coding result;
inputting the coding result into the first decoding layer and, after decoding, generating a first conditional probability that the voice features belong to a pre-configured i-th semantic tag based on the decoding result;
inputting the coding result into the second decoding layer and, after decoding, generating a second conditional probability that the voice features belong to the i-th semantic tag based on the decoding result, wherein i is a positive integer;
and when the voice semantic recognition model is determined to meet a preset requirement according to the first conditional probability and the second conditional probability, determining that construction of the model is complete.
Optionally, determining whether the speech semantic recognition model meets a preset requirement according to the first conditional probability and the second conditional probability specifically includes:
generating a verification value according to the first conditional probability and the second conditional probability;
and determining a difference between the verification value and a reference value; when the difference meets the preset requirement, determining that the voice semantic recognition model meets the preset requirement and that its construction is complete.
Optionally, randomly selecting feature values from the voice features according to a preset selection rule and masking them specifically includes:
generating a spectrogram from the voice features;
randomly selecting a target image region from the spectrogram and masking the feature values within it;
and/or randomly selecting a target time region along the time dimension of the spectrogram and masking the feature values within it;
and/or randomly selecting a target frequency region along the frequency dimension of the spectrogram and masking the feature values within it.
Optionally, before the coding layer encodes the masked voice features to obtain the coding result, the method further includes:
performing a down-sampling operation on the masked voice features.
According to a second aspect of the present application, there is provided a semantic recognition method, the method comprising:
extracting voice features from a voice signal to be recognized;
inputting the voice features into a coding layer of a voice semantic recognition model constructed by the method of the first aspect of the application, and acquiring a coding result;
decoding the coding result in a first decoding layer, and determining, based on the decoding result, a first conditional probability that the voice features take an i-th semantic tag in an n-th dimension, conditioned on the coding result and all previously obtained target semantic tags in dimensions 1 to n-1;
decoding the coding result in a second decoding layer, and determining, based on the decoding result, a second conditional probability that the voice features take the i-th semantic tag in the n-th dimension, conditioned on the coding result and all previously obtained target semantic tags in dimensions 1 to n-1;
determining a tag score for the i-th semantic tag of the voice features in the n-th dimension according to the first conditional probability and the second conditional probability;
and, among the tag scores of all the semantic tags, determining the semantic tag with the largest score as the semantic tag of the voice features in the n-th dimension, wherein i is a positive integer, n is a positive integer greater than 2, and the target semantic tag in the first dimension is obtained directly from the voice features.
According to a third aspect of the present application, there is provided a speech semantic recognition model construction apparatus, including:
the feature extraction module is used for extracting voice features from the voice sample signal;
a masking module, configured to randomly select feature values from the voice features according to a preset selection rule and perform a masking operation on them;
a first input module, configured to input the masked voice features into a pre-constructed voice semantic recognition model, the model comprising: a coding layer, a first decoding layer, and a second decoding layer;
a coding module, configured to code the masked voice features through the coding layer to obtain a coding result;
a first decoding module, configured to input the coding result into the first decoding layer and, after decoding, generate a first conditional probability that the voice features belong to the pre-configured i-th semantic tag based on the decoding result;
a second decoding module, configured to input the coding result into the second decoding layer and, after decoding, generate a second conditional probability that the voice features belong to the i-th semantic tag based on the decoding result, wherein i is a positive integer;
and a determining module, configured to determine that construction of the speech semantic recognition model is complete when the model is determined to meet the preset requirement according to the first and second conditional probabilities.
Optionally, the masking module includes:
a second generation unit, configured to generate a spectrogram from the voice features;
a first masking unit, configured to randomly select a target image region from the spectrogram and mask the feature values within it;
and/or a second masking unit, configured to randomly select a target time region along the time dimension of the spectrogram and mask the feature values within it;
and/or a third masking unit, configured to randomly select a target frequency region along the frequency dimension of the spectrogram and mask the feature values within it.
According to a fourth aspect of the present application, there is provided a speech semantic recognition apparatus, the apparatus comprising:
the characteristic extraction module is used for extracting voice characteristics from the voice signal to be recognized;
an input module, configured to input the speech features into a coding layer of a speech semantic recognition model constructed by the apparatus according to the third aspect of the present application, and obtain a coding result;
a first decoding module, configured to decode the coding result in a first decoding layer and determine, based on the decoding result, a first conditional probability that the voice features take an i-th semantic tag in an n-th dimension, conditioned on the coding result and all previously obtained target semantic tags in dimensions 1 to n-1;
a second decoding module, configured to decode the coding result in a second decoding layer and determine, based on the decoding result, a second conditional probability that the voice features take the i-th semantic tag in the n-th dimension, conditioned on the coding result and all previously obtained target semantic tags in dimensions 1 to n-1;
a probability determination module, configured to determine, according to the first conditional probability and the second conditional probability, a tag score for the i-th semantic tag of the voice features in the n-th dimension;
and a tag determination module, configured to determine, among the tag scores of all the semantic tags, the tag with the largest score as the target semantic tag of the voice features in the n-th dimension, wherein i is a positive integer, n is a positive integer greater than 2, and the target semantic tag in the first dimension is obtained directly from the voice features.
According to a fifth aspect of the present application, there is provided a speech semantic recognition device, comprising:
the speech semantic recognition system comprises a processor and a memory, wherein the processor is used for executing an application program starting program stored in the memory so as to realize the speech semantic recognition model building method of the first aspect of the application or the semantic recognition method of the second aspect of the application.
According to a sixth aspect of the present application, there is provided a computer storage medium storing one or more programs which are executable by the speech semantic recognition device of the fifth aspect to implement the speech semantic recognition model construction method of the first aspect or the semantic recognition method of the second aspect.
The technical solution provided by the present application can have the following beneficial effects. Training a speech semantic recognition model consumes a large number of speech sample signals: the more samples, the more accurate the trained model, but collecting a large sample set takes more work, so higher accuracy requirements mean higher training cost. In the present method, therefore, after voice features are extracted from a speech sample signal, a masking operation is first applied to them, and the masked features are then fed into the pre-constructed speech semantic recognition model to complete the training process.
Thus, although the same speech sample signal is used across training passes, the randomly masked feature values make the features that actually enter the model different each time. In other words, by randomly selecting feature values to mask, the features of one speech sample signal can be turned into many differently masked variants, so the same signal can drive many training passes with little impact on training accuracy. A large collection of speech sample signals is no longer needed before training, which effectively reduces the cost of model training.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flow chart of a speech semantic recognition model construction method according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a method for performing a masking operation on speech features according to another embodiment of the present application;
FIG. 3 is a flow chart illustrating a semantic recognition method according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a speech semantic recognition model building apparatus according to another embodiment of the present application;
FIG. 5 is a schematic structural diagram of a masking module according to another embodiment of the present application;
FIG. 6 is a schematic structural diagram of a speech semantic recognition apparatus according to another embodiment of the present application;
FIG. 7 is a schematic structural diagram of a speech semantic recognition device according to another embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating a method for constructing a speech semantic recognition model according to an embodiment of the present application.
As shown in fig. 1, in this embodiment, the method for constructing a speech semantic recognition model may include:
and step S101, extracting voice features from the voice sample signal.
Specifically, when extracting voice features from the speech sample signal, in order to express its characteristics in more detail, in one specific example the voice features of step S101 may be obtained by first extracting an 80-dimensional log-mel feature from the signal through a log-mel filter bank, then extracting a 3-dimensional pitch feature, and finally normalizing the 80-dimensional log-mel feature and the 3-dimensional pitch feature together.
In a real environment, a speech sample signal recorded by a device is affected by the audio channel of that device's particular microphone, so features of the same phoneme can differ greatly across devices. Therefore, when normalizing the 80-dimensional log-mel feature and the 3-dimensional pitch feature, Cepstral Mean and Variance Normalization (CMVN) may be applied to obtain features with mean 0 and variance 1 as the voice features of step S101.
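As a minimal sketch of this pipeline (librosa and the exact pitch encoding of f0, voiced flag, and delta-f0 are assumptions, since the patent names neither a toolkit nor the pitch layout):

```python
import numpy as np
import librosa

def extract_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=sr)
    # 80-dimensional log-mel filter-bank features, one column per frame.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80,
                                         n_fft=400, hop_length=160)
    log_mel = np.log(mel + 1e-6)                       # (80, T)
    # Stand-in 3-dimensional pitch feature: f0, voiced flag, delta-f0.
    f0, voiced, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr,
                                 frame_length=1024, hop_length=160)
    f0 = np.nan_to_num(f0)                             # NaN -> 0 for unvoiced
    T = min(log_mel.shape[1], f0.shape[0])             # align frame counts
    pitch = np.stack([f0[:T], voiced[:T].astype(float),
                      np.gradient(f0)[:T]])            # (3, T)
    feat = np.concatenate([log_mel[:, :T], pitch])     # (83, T)
    # CMVN: per-dimension zero mean and unit variance over the utterance.
    mean = feat.mean(axis=1, keepdims=True)
    std = feat.std(axis=1, keepdims=True) + 1e-8
    return (feat - mean) / std
```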
Step S102, randomly selecting feature values from the voice features according to a preset selection rule and performing a masking operation on them.
In this step, the masking operation includes two sub-steps: selecting feature values and masking them. Since a voice feature is actually represented as a matrix of feature values, masking the voice feature means masking part of its feature values, and that part can be randomly selected according to a preset selection rule.
For the masking itself, all the selected feature values may be replaced with a fixed value, or the mean of all the selected values may be taken and used as the replacement, so that the contribution of the selected values to the voice features is hidden; this realizes the mask.
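A minimal sketch of these two replacement strategies (function and argument names are illustrative):

```python
from typing import Optional
import numpy as np

def mask_values(feat: np.ndarray, sel: np.ndarray,
                fixed_value: Optional[float] = None) -> np.ndarray:
    out = feat.copy()
    # sel is a boolean matrix marking the randomly selected feature values.
    out[sel] = fixed_value if fixed_value is not None else feat[sel].mean()
    return out
```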
There may be multiple preset selection rules; see FIG. 2, which is a detailed flow diagram of performing the masking operation on voice features as provided in the present application.
As shown in FIG. 2, the specific process of masking the voice features may include:
Step S201, generating a spectrogram from the voice features.
Specifically, voice feature extraction is based on time frames of the speech sample signal, and the feature values of each time frame have corresponding frequencies. So, based on the voice features of step S101, a spectrogram is first generated with time on the abscissa and frequency on the ordinate, and the feature values are placed at the spectrogram positions given by their time frame and corresponding frequency. To support the description of feature-value selection in later steps, denote the maximum value on the time axis of the spectrogram as σ and the maximum value on the frequency axis as μ.
After the spectrogram is generated, a target region is selected on it and the feature values within that region are masked. This may be done in one or more of the following ways; see steps S202 to S204, which describe the different masking operations of step S102. The three operations are not executed in sequence-number order: as noted, any one of S202 to S204 may be chosen, or several of them may be applied on top of each other, configured according to the actual situation, which is not elaborated further here.
Step S202, randomly selecting a target image region from the spectrogram and masking the feature values within it.
In this step, the spectrogram is treated as a picture: an image region is randomly selected from it as the target image region, the feature values inside that region are masked, and the voice features containing the masked values are taken as the masked voice features.
Specifically, to determine the target image region, a time point on the time axis of the spectrogram is chosen, then a width centered on that point; the image region where that width range overlaps the frequency range from 0 to μ is the target image region.
In one case, two values are first chosen: a time step number T and a maximum time stack parameter W, both within the time dimension of the spectrogram (0 to σ), with T greater than 2W. A random number is then drawn from the interval (W, T - W) as the time point, and a random number from (-W, W) as the width. The time point serves as the center and the width extends along the time axis: if the time point is a and the width is b, the time-axis range is (a - b/2, a + b/2). Finally, the image region where the time-axis range (a - b/2, a + b/2) overlaps the frequency-axis range (0, μ) is the target image region, and the feature values it contains are the selected values to be masked.
For masking the feature values, all selected values may be replaced with a fixed value, or with their mean, hiding their contribution to the voice features and thereby realizing the mask.
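A sketch of step S202 under the T and W parameters just described; sorting the endpoints when the width b is negative is an assumption the text does not spell out, and the fill value is the feature mean, one of the two options above:

```python
import numpy as np

def image_region_mask(feat: np.ndarray, T: int, W: int,
                      rng: np.random.Generator) -> np.ndarray:
    assert T > 2 * W
    a = rng.uniform(W, T - W)                  # time point (centre) on the time axis
    b = rng.uniform(-W, W)                     # width of the region
    lo, hi = sorted((a - b / 2, a + b / 2))
    t0 = max(0, int(lo))
    t1 = min(feat.shape[1], int(np.ceil(hi)))
    out = feat.copy()
    out[:, t0:t1] = feat.mean()                # spans the frequency range 0..mu
    return out
```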
Step S203, randomly selecting a target time region based on the time dimension of the spectrogram and masking the feature values within it.
In this step, a target time region is determined based on the time dimension of the spectrogram; all feature values falling within it are the randomly selected values referred to in step S102. After they are masked, the voice features containing them are taken as the masked voice features.
Specifically, a time masking parameter may be preset and denoted R_time. A random number t is drawn from the interval [0, R_time), then a random number t0 from [0, σ - t), and finally the interval [t0, t0 + t) is taken as the target time region. The feature values it contains are the selected values to be masked.
For masking the feature values, refer to the manner described in step S202; it is not repeated here.
Step S204, randomly selecting a target frequency region based on the frequency dimension of the spectrogram and masking the feature values within it.
In this step, a target frequency region is determined based on the frequency dimension of the spectrogram; all feature values falling within it are the randomly selected values referred to in step S102. After they are masked, the voice features containing them are taken as the masked voice features.
Specifically, a frequency masking parameter may be preset and denoted R_freq. A random number f is drawn from the interval [0, R_freq), then a random number f0 from [0, μ - f), and finally the interval [f0, f0 + f) is taken as the target frequency region. The feature values it contains are the selected values to be masked.
For masking the feature values, again refer to step S202; it is not repeated here.
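The time and frequency masks of steps S203 and S204 follow directly from the interval definitions above; a minimal sketch, assuming the feature matrix is laid out as (frequency, time) and using the mean fill from step S202:

```python
import numpy as np

def time_mask(feat: np.ndarray, r_time: int,
              rng: np.random.Generator) -> np.ndarray:
    sigma = feat.shape[1]                      # extent of the time axis
    t = rng.integers(0, r_time)                # t in [0, R_time)
    t0 = rng.integers(0, max(1, sigma - t))    # t0 in [0, sigma - t)
    out = feat.copy()
    out[:, t0:t0 + t] = feat.mean()            # mask the interval [t0, t0 + t)
    return out

def freq_mask(feat: np.ndarray, r_freq: int,
              rng: np.random.Generator) -> np.ndarray:
    mu = feat.shape[0]                         # extent of the frequency axis
    f = rng.integers(0, r_freq)                # f in [0, R_freq)
    f0 = rng.integers(0, max(1, mu - f))       # f0 in [0, mu - f)
    out = feat.copy()
    out[f0:f0 + f, :] = feat.mean()            # mask the interval [f0, f0 + f)
    return out
```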
For the voice features of a single speech sample signal, at least one of the masking operations of step S102 may be performed per training pass, for example 1, 2, or 3 masking operations. When more than one is performed, the same feature-value selection method may be reused or different methods combined: for example, the method of step S202 may be applied several times in a row, or at least one mask from step S202 may be followed by at least one mask from step S203.
Of course, steps S202, S203, and S204 each represent one type of masking operation, and when multiple masking operations are needed they can be combined at random. Because the feature values are selected randomly and the number of masking operations can be preset, even with only an ordinary number of speech sample signals, the masking operation can multiply the number of distinct voice features, effectively augmenting the speech samples.
Step S103, inputting the masked voice features into the pre-constructed voice semantic recognition model, the model comprising: a coding layer, a first decoding layer, and a second decoding layer.
In this embodiment, the pre-constructed speech semantic recognition model is provided with a coding layer for encoding the masked voice features. The coding layer may adopt a neural network structure: for example, any of a bidirectional long short-term memory recurrent network (BLSTM), a recurrent neural network transducer (RNN Transducer), or a self-attention-based Transformer may be selected as the coding layer of the model.
In order to improve recognition accuracy and avoid a single decoding method being biased toward one aspect while neglecting another, in this embodiment multiple decoding layers may decode the coding layer's output: for example, a first and a second decoding layer may decode it simultaneously, or a third decoding layer may be added; the number and types of decoding layers can be determined by the project's accuracy requirements for speech semantic recognition.
Taking the simultaneous first and second decoding layers as the example, the first decoding layer may decode based on Connectionist Temporal Classification (CTC). In conventional sequence classification, the input data and the given tags must be aligned one-to-one in time; CTC instead trains without per-frame alignment, caring not about the prediction at any particular time but only about whether the output as a whole matches the tags, which removes the tedious work of predefining time frames for the tags.
That is, in this embodiment the first decoding layer decodes in the CTC manner and judges agreement with the pre-configured i-th semantic tag from the coding result as a whole. There is thus no need to determine which time frame the i-th tag corresponds to in the coding result, no need to define time frames for the speech tags, and no need to align the given tags with the input frame by frame, which reduces the workload of model training and speeds it up.
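As a minimal sketch of CTC training as described here, PyTorch's standard nn.CTCLoss can stand in for the first decoding layer's criterion; all shapes and sizes below are illustrative, not taken from the patent:

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
# Per-frame log-probabilities over 50 tag classes: (time, batch, classes).
log_probs = torch.randn(120, 4, 50).log_softmax(dim=-1)
targets = torch.randint(1, 50, (4, 8))                   # tag ids, blank excluded
input_lens = torch.full((4,), 120, dtype=torch.long)
target_lens = torch.full((4,), 8, dtype=torch.long)
# No frame-level alignment between inputs and tags is ever supplied.
loss = ctc(log_probs, targets, input_lens, target_lens)
```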
In addition, the second decoding layer may decode based on an attention mechanism, a solution modeled on human attention that, simply put, quickly screens high-value information out of a large amount of information. When the second decoding layer decodes with attention, it can extract the key information from the coding layer's result, reducing the data volume of subsequent processing and effectively accelerating both the training of the speech semantic recognition model and recognition once training is complete. Specifically, a mature variant such as location-based attention or dot-product attention may be adopted.
In one specific example, the first decoding layer uses CTC decoding and the second uses attention-based decoding. Because the two methods emphasize different things, their decoding results differ, and carrying both results into the subsequent steps lets the model account for both emphases, effectively improving the accuracy of speech semantic recognition.
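A minimal sketch of the dot-product variant mentioned above, with illustrative names and shapes (the patent fixes neither):

```python
import torch

def dot_product_attention(query: torch.Tensor,      # decoder state, (B, d)
                          enc_out: torch.Tensor     # encoder output, (B, T, d)
                          ) -> torch.Tensor:
    # Score every encoder frame against the query, then mix frames by weight.
    scores = torch.bmm(enc_out, query.unsqueeze(-1)).squeeze(-1)  # (B, T)
    weights = torch.softmax(scores, dim=-1)
    context = torch.bmm(weights.unsqueeze(1), enc_out).squeeze(1) # (B, d)
    return context
```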
Step S104, coding the masked voice features through the coding layer to obtain the coding result.
In this step, the coding layer converts the input voice features into a dense vector of fixed dimension, that is, a quantity that can be operated on mathematically.
In addition, to better represent the overall characteristics of the voice features and reduce the data volume, the masked voice features may be down-sampled before step S104.
Specifically, the down-sampling operation may reduce the dimensionality of the voice features by a linear transformation, and the reduced features are then input to the coding layer for encoding.
Specifically, in this embodiment the coding layer may include a preprocessing layer and a coding sublayer. The preprocessing layer may be a convolutional neural network in the style of the Visual Geometry Group (VGG) network, which extracts the dimension-reduced voice features more deeply so that local features are better represented; the coding sublayer then converts the deeply extracted features into the dense vector.
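A sketch of a coding layer of this shape: a small VGG-style convolutional preprocessing layer followed by a BLSTM coding sublayer that produces the dense vectors; all channel and layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, feat_dim: int = 83, hidden: int = 256):
        super().__init__()
        self.vgg = nn.Sequential(                 # VGG-style preprocessing layer
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # halves time and frequency
        )
        self.blstm = nn.LSTM(32 * (feat_dim // 2), hidden,
                             num_layers=2, batch_first=True,
                             bidirectional=True)  # coding sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # (B, T, F)
        h = self.vgg(x.unsqueeze(1))              # (B, 32, T/2, F/2)
        b, c, t, f = h.shape
        h = h.transpose(1, 2).reshape(b, t, c * f)
        out, _ = self.blstm(h)                    # dense vectors, (B, T/2, 2*hidden)
        return out
```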
Step S105, inputting the coding result into the first decoding layer, decoding it, and generating, based on the decoding result, the first conditional probability that the voice features belong to the pre-configured i-th semantic tag.
It should be noted that, in the method of this embodiment, the speech sample signal is recognized along several different dimensions, and the number and types of those dimensions are usually determined by the information contained in a class of signals. For example, speech for smart-home voice control typically contains a time, an operation object, and an operation instruction, so that class of speech sample signals can be recognized along the time dimension, the operation-object dimension, and the operation-instruction dimension.
For each dimension, a number of semantic tags are pre-configured in this embodiment, generally all possible semantic tags related to that dimension. The goal is to determine one semantic tag from each dimension's pre-configured set as the target semantic tag of that dimension.
Assuming the speech sample signal has N dimensions and I semantic tags are pre-configured for the n-th dimension, step S105 processes the i-th semantic tag of the n-th dimension to obtain, based on the first decoding layer's result, the first conditional probability corresponding to the i-th tag.
That is, in this step the coding result is input into the first decoding layer and, after decoding, the first conditional probability of the voice features with respect to the i-th semantic tag in a given dimension's pre-configured tag group is generated from the decoding result.
In this embodiment, continuing the example in which the first and second decoding layers are both present, the method further includes:
Step S106, inputting the coding result into the second decoding layer, decoding it, and generating, based on the decoding result, the second conditional probability that the voice features belong to the i-th semantic tag, wherein i is a positive integer.
The second conditional probability is generated with the same meaning of the i-th semantic tag as in step S105: the coding result is input into the second decoding layer, decoded, and the second conditional probability of the voice features with respect to the i-th tag in the given dimension's pre-configured tag group is generated from the decoding result.
In addition, semantic tags across dimensions are related. For example, in "go to the kitchen to eat", once the place dimension has fixed "kitchen" as its target tag, then when predicting the action dimension (say, with the tags "eat", "bathe", and "sleep"), "eat" is clearly more likely than "bathe" or "sleep" given "kitchen". To exploit these cross-dimension relations, the first conditional probability may be conditioned on the first decoding layer's result together with all already-determined target semantic tags, and likewise the second conditional probability on the second decoding result and the determined target tags.
For example, if the signal involves N dimensions and the target tag of the n-th dimension is currently being determined, dimensions 1 to n-1 having already fixed their target tags, then for the first decoding layer the first conditional probability that the n-th dimension's target tag is the i-th tag is conditioned on that layer's decoding result and all target tags of dimensions 1 to n-1; for the second decoding layer, the second conditional probability is conditioned analogously on its decoding result and those same target tags.
It should be noted that a target semantic tag is, for each dimension, the one tag determined from that dimension's own semantic tag group.
Step S107, when the voice semantic recognition model is determined to meet the preset requirement according to the first conditional probability and the second conditional probability, determining that construction of the model is complete.
In this step, whether the speech semantic recognition model meets the preset requirement may be judged in several ways: for example, the obtained first and second conditional probabilities may be verified directly, or verification data may be generated from them and then verified.
Specifically, for the first way, directly verifying the obtained first and second conditional probabilities, reference values of probability type are used for comparison. Here a reference value is a value pre-labeled on a pre-configured semantic tag. Since this embodiment has at least two decoding layers (the first and the second), the model outputs at least two probability values, so each pre-configured tag is labeled with as many reference values as there are decoding layers: with a first and a second decoding layer, each tag is labeled with two reference values (a first reference value for the first decoding layer and a second reference value for the second).
On this basis, if the direct-verification way is adopted, the difference between the first conditional probability and the first reference value (the first difference) and the difference between the second conditional probability and the second reference value (the second difference) are first computed; when both differences meet the preset requirement set for this way, the model is deemed to meet the preset requirement and its construction is complete.
In one specific example, suppose the preset requirement is that both differences be at most 0.02. For the i-th semantic tag, the pre-labeled first reference value is 0.1 and the second is 0.15, while the first conditional probability is 0.08 and the second is 0.14. The first difference is then 0.02 and the second 0.01; both are within 0.02, so the preset requirement is met, the model meets the requirement, and its construction is complete.
For the second way, namely generating verification data from the first and second conditional probabilities and then verifying it, a verification value is generated from the two conditional probabilities when judging whether the model meets the preset requirement.
After the verification value is generated, its difference from a reference value is determined; when the difference meets the preset requirement, the model is deemed to meet the requirement and its construction is complete. Generally the reference value is pre-labeled on the pre-configured semantic tag and is defined to match the verification value. For example, if the verification value is a simple weighted combination of the first and second conditional probabilities, it is necessarily a probability between 0 and 1, and the reference value should be the expected probability; alternatively, the verification value may be computed from the two conditional probabilities by some mathematical method, for example by the following formula:
L = -α·log p_att(Y|X) - (1-α)·log p_ctc(Y|X)
where L is the verification value, α is a parameter, p_ctc(Y|X) is the first conditional probability, and p_att(Y|X) is the second conditional probability.
Such a verification value is not itself a probability but another probability-related quantity whose magnitude need not lie between 0 and 1, so the reference value should be set according to the range the verification value can take.
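A numeric sketch of the formula as reconstructed above; α = 0.5 and the input probabilities are arbitrary illustrative values:

```python
import math

def verification_value(p_att: float, p_ctc: float, alpha: float = 0.5) -> float:
    # L = -alpha * log p_att(Y|X) - (1 - alpha) * log p_ctc(Y|X)
    return -alpha * math.log(p_att) - (1 - alpha) * math.log(p_ctc)

print(round(verification_value(p_att=0.85, p_ctc=0.9), 3))  # ~0.134
```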
When labeling, either only the semantic tag in the group that actually corresponds to the content of the speech sample signal may be labeled, or all the semantic tags may be.
In this embodiment, after the first and second conditional probabilities of all semantic tags in the current dimension's tag group are obtained, whether the model meets the preset requirement is judged.
If only the correctly corresponding tag is labeled, a verification value is generated from its first and second conditional probabilities, the difference between the verification value and the labeled reference value is determined, and when the difference meets the preset requirement (that is, is below a preset threshold), the model meets the requirement and its construction is complete.
If all tags are labeled, a verification value is obtained for each tag and compared with its own reference value; the preset requirement is then that a preset number of tags have differences below the threshold, or that the average of all the differences is below the threshold.
The following walks through steps S105 to S107 with a concrete example:
for example, the voice sample signal is "go to the kitchen for eating", the involved dimensions may at least include a location dimension "kitchen" and an action dimension "eating", and the applicable range of the voice sample signal should be a life range, and therefore, when the semantic tag set is configured in advance, the voice sample signal can be configured according to the location dimension and the action dimension in the aforementioned life range, for example, the semantic tags of the life range such as "kitchen", "bedroom", "living room", "balcony" and the like can be in the location dimension configured semantic tag set for representing the location, and the semantic tags of the life range such as "eating", "sleeping", "washing the face", "brushing the teeth" and the like can be in the action dimension configured semantic tag set for representing the action.
A default dimension order can be fixed by project requirements or user habit, for example place dimension first, then action dimension. The order of tags inside each group can also be customized: the place group might be "1. kitchen", "2. bedroom", "3. living room", "4. balcony", and the action group "1. eat", "2. sleep", "3. wash face", "4. brush teeth".
Then, for the place dimension, step S105 decodes the coding result of the voice sample to obtain the first decoding layer's result. Since the place dimension is the first to determine a target tag and no target tags exist yet, the first conditional probability of the i-th tag in its group is computed conditioned on the decoding result alone. With i = 1, the 1st tag "kitchen" may come out at 0.9; by analogy, the first conditional probabilities of all tags in the group are computed conditioned on the decoding result: "bedroom" 0.3, "living room" 0.5, "balcony" 0.1.
In step S106, the second conditional probabilities of all tags in the place group are obtained in the same order, conditioned on the second decoding result: "kitchen" 0.85, "bedroom" 0.2, "living room" 0.4, "balcony" 0.1.
Using L = -α·log p_att(Y|X) - (1-α)·log p_ctc(Y|X), the first and second conditional probabilities are fused into a verification value per tag: "kitchen" 0.87, "bedroom" 0.24, "living room" 0.46, "balcony" 0.1.
The above values are shown in table 1.
Semantic tag    First conditional probability    Second conditional probability    Verification value
Kitchen         0.9                              0.85                              0.87
Bedroom         0.3                              0.2                               0.24
Living room     0.5                              0.4                               0.46
Balcony         0.1                              0.1                               0.1

TABLE 1
When labeling the tags in the place dimension's group, either only the correct tag or all tags may be labeled; both cases are described here.
If only the correct tag is labeled, i.e. a reference value for "kitchen" of, say, 0.9, and the preset requirement is a difference below 0.02, then the difference between the reference value and the verification value is 0.03 (the final difference may be taken as the absolute value of the subtraction). Since 0.03 exceeds 0.02, the requirement is not met, and the weight and bias parameters of the coding layer, the first decoding layer, and the second decoding layer must be adjusted, for example by back-propagating the difference.
If all semantic tags are marked, the preset requirement is that at least two differences are smaller than 0.02 and/or the average value of the differences is smaller than 0.02, the reference value of "kitchen" is 0.85, the reference value of "bedroom" is 0.2, the reference value of "living room" is 0.4, the reference value of "balcony" is 0.1, and the corresponding differences are 0.02, 0.04, 0.06, 0, wherein the difference of only 1 semantic tag is smaller than 0.02, the average value is 0.03 and larger than 0.02, so that the preset condition is not met, at this time, the weight parameters and deviation parameters related to the coding layer, the first decoding layer and the second decoding layer need to be adjusted, and the specific mode can be an adjustment mode of back propagation by using the differences. The training process then continues.
The place dimension's target tag is then determined from the verification values, specifically by taking the largest verification value. In this example "kitchen" has the largest, so "kitchen" is the target semantic tag of the place dimension.
Next, the action dimension's tag group is processed. Step S105 decodes the coding result to obtain the first decoding layer's result; since the place dimension has already fixed "kitchen" as its target tag, the first conditional probability of the i-th action tag is conditioned on the decoding result together with the target tag "kitchen". With i = 1, the 1st tag "eat" may come out at 0.9; by analogy, the first conditional probabilities of all tags in the action group are computed: "sleep" 0.3, "wash face" 0.5, "brush teeth" 0.1.
In step S106, the second conditional probabilities of all tags in the action group are obtained in the same order, conditioned on the second decoding result: "eat" 0.93, "sleep" 0.2, "wash face" 0.4, "brush teeth" 0.1.
Using the same fusion formula, the verification values are: "eat" 0.91, "sleep" 0.24, "wash face" 0.46, "brush teeth" 0.1.
The above values are shown in table 2.
Semantic tag    First conditional probability    Second conditional probability    Verification value
Eat             0.9                              0.93                              0.91
Sleep           0.3                              0.2                               0.24
Wash face       0.5                              0.4                               0.46
Brush teeth     0.1                              0.1                               0.1

TABLE 2
When labeling the tags in the action dimension's group, again either only the correct tag or all tags may be labeled; both cases are described here.
If only the correct semantic tag is labeled, that is, only the reference value of "eat" is given, suppose the reference value of "eat" is 0.9 and the preset requirement is that the difference be smaller than 0.02. The difference between the reference value and the verification value is then 0.01; as before, the final difference may be taken as the absolute value of the reference value minus the verification value. Since 0.01 is less than 0.02, the preset requirement is met, and the construction of the speech semantic recognition model is complete at this point.
If all semantic tags are labeled, suppose the preset requirement is that at least two differences be smaller than 0.02 and/or that the average difference be smaller than 0.02, and the reference values are 0.9 for "eat", 0.23 for "sleep", 0.44 for "wash face" and 0.1 for "brush teeth". The corresponding differences are 0.01 for "eat", 0.01 for "sleep", 0.02 for "wash face" and 0 for "brush teeth": three of the differences are smaller than 0.02, and the average difference is 0.01, which is also smaller than 0.02. The preset requirement is therefore met, and the construction of the speech semantic recognition model is complete at this point.
At this time, the target semantic tag of the action dimension needs to be determined according to the verification values, and specifically, the maximum verification value can be selected as the target conditional probability of the action dimension. In this example, "eat" has the largest verification value, and is thus the target semantic tag of the action dimension.
The target semantic tags of all dimensions are then output as the voice semantic recognition result, which in this example is "kitchen" and "eat".
Before determining the target semantic tag of any dimension other than the first, the target semantic tags already determined for all previous dimensions may also be input into a language sub-model layer, separate from the first decoding layer and the second decoding layer, to determine a third conditional probability of the ith semantic tag of the dimension currently being decided, on the condition of all those target semantic tags. The first conditional probability, the second conditional probability and the third conditional probability are then fused to obtain tag scores, and after all semantic tags have obtained their respective tag scores, the semantic tag with the maximum tag score is taken as the target semantic tag of that dimension.
Taking the first decoding layer as a CTC decoder and the second decoding layer as an attention decoder as an example, the tag score of the ith semantic tag may be determined by the following formula:
log p(y_n | y_1:n-1, h_1:T') = α·log p_ctc(y_n | y_1:n-1, h_1:T') + (1-α)·log p_att(y_n | y_1:n-1, h_1:T') + β·log p_lm(y_n | y_1:n-1)

wherein y_1:n-1 represents the target semantic tags already determined, y_n represents the ith semantic tag, h_1:T' represents the voice features, and α and β are preset parameter values.
To summarise: after voice features are extracted from a voice sample signal, feature values randomly selected from them according to a preset selection rule are first subjected to the hidden code operation. The masked voice features are then input into a pre-constructed voice semantic recognition model comprising a coding layer, a first decoding layer and a second decoding layer. The coding layer encodes the masked voice features to obtain an encoding result; the encoding result is input into the first decoding layer and, after decoding, a first conditional probability that the voice features belong to a pre-configured ith semantic tag is generated based on the decoding result; the encoding result is likewise input into the second decoding layer and, after decoding, a second conditional probability for the ith semantic tag is generated based on the decoding result, where i is a positive integer. When it is determined from the first conditional probability and the second conditional probability that the model meets the preset requirement, the construction of the voice semantic recognition model is determined to be complete.

According to this method, because randomly selected feature values are hidden before the voice features are input into the pre-constructed model, the voice features with part of their values hidden differ to a certain extent from the original voice features, and because the feature values to be hidden are selected at random under the preset selection rule, the voice features of one voice sample signal can generate many different masked voice features. This effectively reduces the number of original voice sample signals required, lowering the expense of sample collection and labeling and hence the construction cost of the voice semantic recognition model.
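For illustration, one way to realise such a masking ("hidden code") operation on a (time, dimension) feature matrix is sketched below; the SpecAugment-style random time and frequency regions and the mask widths are assumptions, not taken from the patent.

```python
import numpy as np

rng = np.random.default_rng()

def mask_features(feats: np.ndarray,
                  max_time_width: int = 10,
                  max_freq_width: int = 8) -> np.ndarray:
    """Zero out one random time region and one random frequency region."""
    out = feats.copy()
    t_len, f_len = out.shape

    t_w = int(rng.integers(0, max_time_width + 1))   # random region width
    t_0 = int(rng.integers(0, max(1, t_len - t_w)))  # random region start
    out[t_0:t_0 + t_w, :] = 0.0                      # hide the time region

    f_w = int(rng.integers(0, max_freq_width + 1))
    f_0 = int(rng.integers(0, max(1, f_len - f_w)))
    out[:, f_0:f_0 + f_w] = 0.0                      # hide the frequency region
    return out

# One recording can yield many distinct masked variants, which is what
# reduces the number of original samples that must be collected.
variants = [mask_features(np.random.randn(200, 83)) for _ in range(5)]
```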
Referring to fig. 3, fig. 3 is a flowchart illustrating a semantic recognition method according to an embodiment of the present disclosure.
As shown in fig. 3, the semantic recognition method provided in this embodiment may include:
step S301, extracting voice characteristics from the voice signal to be recognized.
Specifically, in order to represent the characteristics of the voice signal to be recognized in more detail, 80-dimensional logarithmic mel features may be extracted from the voice signal to be recognized through a Log-mel filter bank, 3-dimensional pitch features may also be extracted, and normalization may be performed on the 80-dimensional logarithmic mel features and the 3-dimensional pitch features, so as to obtain the voice features of step S301.
In addition, in an actual environment, when a device records the voice signal to be recognized, the signal passes through the audio channel of whatever microphone the device carries, and different devices carry different types of microphones, so the features of the same phoneme can differ greatly. Therefore, when normalizing the 80-dimensional logarithmic mel features and the 3-dimensional pitch features, a Cepstral Mean and Variance Normalization (CMVN) processing mode may be adopted to obtain features with a mean of 0 and a variance of 1 as the voice features of step S301.
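As an illustration, this feature extraction might be sketched as follows, assuming librosa is available; the 3-dimensional pitch features of the text are approximated here by a single f0 track purely for illustration, and CMVN is applied per feature dimension over the time axis.

```python
import numpy as np
import librosa

def extract_features(wav_path: str) -> np.ndarray:
    """80-dim log-mel plus a pitch track, normalised to mean 0 / variance 1."""
    y, sr = librosa.load(wav_path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80,
                                         n_fft=400, hop_length=160)
    logmel = np.log(mel + 1e-10).T                   # (time, 80)
    f0 = librosa.yin(y, fmin=50, fmax=500, sr=sr,
                     frame_length=1024, hop_length=160)
    n = min(logmel.shape[0], f0.shape[0])            # align frame counts
    feats = np.concatenate([logmel[:n], f0[:n, None]], axis=1)
    # CMVN: cepstral mean and variance normalisation over the time axis.
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-10)
```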
Step S302, inputting the speech features into the coding layer of the speech semantic recognition model constructed by the method provided in the above embodiment, and obtaining a coding result.
In this step, the coding layer converts the input speech features into a dense vector of fixed dimensions, so that the speech features are converted into quantities that can be operated by a mathematical method, i.e., the dense vector.
In addition, in order to better represent the overall characteristics of the voice features and to reduce the data volume, the voice features may be down-sampled before being encoded in step S302. Specifically, the down-sampling operation may reduce the dimension of the voice features by means of a linear transformation and output the dimension-reduced voice features, which are then input to the coding layer for encoding.
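A sketch of one possible linear down-sampling follows, assuming PyTorch and a frame-stacking factor of 2; both the factor and the framework are assumptions, as the patent does not fix them.

```python
import torch
import torch.nn as nn

class LinearDownsample(nn.Module):
    """Stack adjacent frame pairs and project back with a linear map."""
    def __init__(self, in_dim: int = 83, out_dim: int = 83):
        super().__init__()
        self.proj = nn.Linear(2 * in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        x = x[:, : t - t % 2, :]            # drop a trailing odd frame
        x = x.reshape(b, t // 2, 2 * d)     # halve the time resolution
        return self.proj(x)                 # dimension-reduced features

reduced = LinearDownsample()(torch.randn(4, 200, 83))  # (4, 100, 83)
```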
Specifically, in this embodiment the coding layer may include a preprocessing layer and a coding sublayer. The preprocessing layer may be a VGG-based convolutional neural network that performs deeper extraction on the dimension-reduced voice features so that local features are better represented, and the coding sublayer converts the deeply extracted voice features into the dense vectors.
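Likewise, a hedged sketch of such a coding layer, with an assumed VGG-style convolutional preprocessing layer and a BiLSTM coding sublayer; the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """VGG-style preprocessing layer + BiLSTM coding sublayer (sketch)."""
    def __init__(self, feat_dim: int = 83, hidden: int = 256):
        super().__init__()
        # Preprocessing layer: a small VGG-like convolution stack that
        # extracts deeper local structure from the (time, feature) map.
        self.vgg = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # halves both the time and feature axes
        )
        # Coding sublayer: converts the deep features into dense vectors.
        self.rnn = nn.LSTM(32 * (feat_dim // 2), hidden,
                           batch_first=True, bidirectional=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.vgg(x.unsqueeze(1))        # (b, 32, time/2, feat/2)
        b, c, t, f = h.shape
        h = h.transpose(1, 2).reshape(b, t, c * f)
        out, _ = self.rnn(h)                # fixed-dimension dense vectors
        return out

dense = Encoder()(torch.randn(2, 100, 83))  # (2, 50, 512)
```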
Step S303, decoding the coding result in the first decoding layer, and determining the ith semantic label of the voice feature in the nth dimension and the first conditional probability between the coding result and all the target semantic labels in the 1 st dimension to the (n-1) th dimension based on the decoding result.
Wherein the pre-configured ith semantic tag refers to the ith semantic tag in the semantic tag group pre-configured for the nth dimension of the voice signal to be recognized. It should be noted that a dimension of the voice signal to be recognized refers to one of a plurality of portions it contains; for example, the voice signal to be recognized may include a time portion, a place portion, an action portion, and a portion for the object of the action. A semantic tag group may be configured in advance for each dimension, and each semantic tag group may include all possible semantic tags for its corresponding dimension.
Step S304, decoding the coding result in a second decoding layer, and determining an ith semantic label of the voice feature in an nth dimension and a second conditional probability between the coding result and all target semantic labels in the 1 st dimension to the n-1 st dimension based on the decoding result.
In order to use the relation between semantic tags of different dimensions for speech semantic recognition, the condition for the conditional probabilities mentioned in this embodiment may be the decoding result together with the target semantic tags of the dimensions already determined. For example, suppose the voice signal to be recognized involves N dimensions and the target semantic tag of the nth dimension is currently being determined, the 1st to (n-1)th dimensions having already received their respective target semantic tags. For the first decoding layer, the first conditional probability that the target semantic tag of the nth dimension is the ith semantic tag is determined on the condition of the decoding result of the first decoding layer and all the determined target semantic tags of the 1st to (n-1)th dimensions; for the second decoding layer, the second conditional probability that the target semantic tag of the nth dimension is the ith semantic tag may be determined likewise, on the condition of the decoding result of the second decoding layer and all the determined target semantic tags of the 1st to (n-1)th dimensions.
It should be noted that the target semantic tag is a semantic tag determined from a semantic tag group corresponding to its own dimension for each dimension.
Taking the first decoding layer in step S303 as an example, the encoding result is input to the first decoding layer for decoding to obtain a decoding result, and for the ith semantic tag in the semantic tag group corresponding to the nth dimension, the first conditional probability of the ith semantic tag is calculated on the condition of the decoding result and all the target semantic tags already obtained for the 1st to (n-1)th dimensions. For the second decoding layer, only the decoding result of the first decoding layer in the calculation condition is replaced with the decoding result of the second decoding layer; the rest is the same, giving the second conditional probability.
Step S305, determining a label score corresponding to the ith semantic label of the voice feature in the nth dimension according to the first conditional probability and the second conditional probability.
It should be noted that, according to the first conditional probability and the second conditional probability, determining the label score corresponding to the ith semantic label of the voice feature in the nth dimension may be obtained by using a weighting method, for example, respective weighting values are set for the first conditional probability and the second conditional probability, and the probabilities are multiplied by the respective weighting values and then summed to obtain the label score.
Step S306, determining the semantic label with the maximum label score as a target semantic label of the voice feature in the nth dimension from the label scores respectively corresponding to all the semantic labels, wherein i is a positive integer, n is a positive integer larger than 2, and the target semantic label in the first dimension is directly obtained according to the voice feature.
After the first conditional probability and the second conditional probability are calculated for all semantic tags in the semantic tag group corresponding to the nth dimension, each semantic tag corresponds to a tag score, and at the moment, the semantic tag with the maximum tag score is determined to be the target semantic tag of the voice feature in the nth dimension.
It should be noted that, for the first dimension of the speech feature, since there is no target semantic tag that has been determined before, for the first dimension, the conditional probability of each semantic tag in the semantic tag group corresponding to the first dimension may be calculated with the decoding result of the speech feature as a condition.
It should be noted that, starting from the second dimension, before the target semantic tag is determined, the target semantic tags of all previously determined dimensions may also be input into a language sub-model layer, separate from the first decoding layer and the second decoding layer, to determine a third conditional probability of the ith semantic tag of the dimension currently being decided, on the condition of all those target semantic tags. The first conditional probability, the second conditional probability and the third conditional probability are then fused to obtain tag scores, and after all semantic tags have obtained their respective tag scores, the semantic tag with the maximum tag score is taken as the target semantic tag of that dimension.
Taking the first decoding layer as a CTC decoder and the second decoding layer as an attention decoder as an example, the tag score of the ith semantic tag may be determined by the following formula:
log p(y_n | y_1:n-1, h_1:T') = α·log p_ctc(y_n | y_1:n-1, h_1:T') + (1-α)·log p_att(y_n | y_1:n-1, h_1:T') + β·log p_lm(y_n | y_1:n-1)

wherein y_1:n-1 represents the target semantic tags already determined, y_n represents the ith semantic tag, h_1:T' represents the voice features, α and β are preset parameter values, p_ctc represents the probability obtained from the first decoding layer, p_att represents the probability obtained from the second decoding layer, and p_lm represents the probability obtained from the language sub-model layer.
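For illustration, a minimal sketch of this score fusion follows; the weights α and β and the candidate probabilities are assumptions, not values from the patent.

```python
import math

ALPHA, BETA = 0.5, 0.3   # preset weights; these values are assumptions

def tag_score(p_ctc: float, p_att: float, p_lm: float) -> float:
    """log p(y_n | y_1:n-1, h_1:T') for one candidate semantic tag."""
    return (ALPHA * math.log(p_ctc)
            + (1 - ALPHA) * math.log(p_att)
            + BETA * math.log(p_lm))

# Illustrative (p_ctc, p_att, p_lm) triples for two candidate tags.
candidates = {"eat": (0.90, 0.93, 0.80), "sleep": (0.30, 0.20, 0.10)}
best = max(candidates, key=lambda tag: tag_score(*candidates[tag]))  # "eat"
```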
It should be noted that, as for the process of probability calculation involved in the present embodiment, reference may be made to the process of probability calculation mentioned in the above embodiments.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a speech semantic recognition model building apparatus according to another embodiment of the present application.
As shown in fig. 4, the apparatus provided in this embodiment may include:
a feature extraction module 401, configured to extract a voice feature from the voice sample signal;
a hidden code module 402, configured to perform hidden code operation on the voice feature;
an input module 403, configured to input the voice features subjected to the hidden code operation into a pre-constructed voice semantic recognition model, where the pre-constructed voice semantic recognition model includes a coding layer, a first decoding layer and a second decoding layer;
an encoding module 404, configured to encode the voice features subjected to the hidden code operation through the coding layer to obtain an encoding result;
a first decoding module 405, configured to input the encoding result to a first decoding layer, and generate, based on the decoding result, a first conditional probability that the speech feature belongs to a preconfigured ith semantic tag after the encoding result is decoded;
the second decoding module 406 is configured to input the encoding result to a second decoding layer, and generate a second conditional probability that the speech feature belongs to the ith semantic tag based on the decoding result after the encoding result is decoded, where i is a positive integer;
the determining module 407 is configured to determine that the speech semantic recognition model is completely constructed when it is determined that the speech semantic recognition model meets the preset requirement according to the first conditional probability and the second conditional probability.
The apparatus of this embodiment may further include a down-sampling module before the encoding module, configured to perform down-sampling on the speech features that have undergone the hidden code operation.
Referring to fig. 5, fig. 5 is a schematic diagram of a specific structure of the hidden code module according to another embodiment of the present application.

As shown in fig. 5, the hidden code module may include:
a generating unit 501, configured to generate a speech feature spectrogram according to a speech feature;
a first hidden code unit 502, configured to randomly select a target image region from the spectrogram and hide the feature values in the target image region, giving the voice features subjected to the hidden code operation;
or a second hidden code unit 503, configured to randomly select a target time region based on the time dimension corresponding to the spectrogram and hide the feature values in the target time region, giving the voice features subjected to the hidden code operation;
or a third hidden code unit 504, configured to randomly select a target frequency region based on the frequency dimension corresponding to the spectrogram and hide the feature values in the target frequency region, giving the voice features subjected to the hidden code operation.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a speech semantic recognition apparatus according to another embodiment of the present application.
As shown in fig. 6, the speech semantic recognition apparatus may include:
a feature extraction module 601, configured to extract a voice feature from a voice signal to be recognized;
an input module 602, configured to input the speech features into an encoding layer of the speech semantic recognition model constructed according to the method of any one of claims 1 to 4, and obtain an encoding result;
a first decoding module 603, configured to decode the coding result in the first decoding layer, and determine, based on the decoding result, an ith semantic tag of the speech feature in an nth dimension, and a first conditional probability between the coding result and all target semantic tags in a 1 st dimension to an n-1 st dimension that are obtained in advance;
a second decoding module 604, configured to decode the coding result in a second decoding layer, and determine, based on the decoding result, an ith semantic tag of the speech feature in an nth dimension, and a second conditional probability between the coding result and all target semantic tags in a 1 st dimension to an n-1 st dimension that are obtained in advance;
a probability determining module 605, configured to determine, according to the first conditional probability and the second conditional probability, a tag score corresponding to an ith semantic tag of the speech feature in an nth dimension;
the tag determining module 606 is configured to determine, from the tag scores corresponding to all the semantic tags, that the semantic tag with the largest tag score is the target semantic tag of the voice feature in the nth dimension, where i is a positive integer, n is a positive integer greater than 2, and the target semantic tag in the first dimension is directly obtained according to the voice feature.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a speech semantic recognition device according to another embodiment of the present application.
As shown in fig. 7, the speech semantic recognition device provided by this embodiment may include:
a processor 701 and a memory 702, the processor being configured to execute a program stored in the memory to implement the speech semantic recognition model construction method or the semantic recognition method provided by the above embodiments.
The method for constructing the speech semantic recognition model can comprise the following steps:
extracting voice features from the voice sample signal;
randomly selecting a characteristic value from the voice characteristics according to a preset selection rule to perform hidden code operation;
inputting the voice features subjected to the hidden code operation into a pre-constructed voice semantic recognition model, wherein the pre-constructed voice semantic recognition model comprises a coding layer, a first decoding layer and a second decoding layer;
coding the voice characteristics subjected to the hidden code operation through the coding layer to obtain a coding result;
inputting the coding result into the first decoding layer, and generating a first conditional probability corresponding to the voice feature belonging to a pre-configured ith semantic tag based on the decoding result after the coding result is decoded;
inputting the coding result into the second decoding layer, decoding the coding result, and generating a second conditional probability that the voice feature belongs to the ith semantic tag based on the decoding result, wherein i is a positive integer;
and when the voice semantic recognition model meets the preset requirement according to the first conditional probability and the second conditional probability, determining that the voice semantic recognition model is constructed.
Optionally, determining whether the speech semantic recognition model meets a preset requirement according to the first conditional probability and the second conditional probability specifically includes:
generating a verification value according to the first conditional probability and the second conditional probability;
and determining a difference value between the verification value and a reference value, and when the difference value meets the preset requirement, determining that the voice semantic recognition model meets the preset requirement, and finishing the construction of the voice semantic recognition model.
Optionally, the randomly selecting a feature value from the speech features according to a preset selection rule to perform a hidden code operation specifically includes:
generating a voice characteristic spectrogram according to the voice characteristics;
randomly selecting a target image area from the spectrogram, and performing hidden coding on a characteristic value in the target image area;
and/or randomly selecting a target time region by taking the time dimension corresponding to the spectrogram as a reference, and performing hidden coding on the characteristic value in the target time region;
and/or randomly selecting a target frequency region by taking the frequency dimension corresponding to the spectrogram as a reference, and performing hidden coding on the characteristic value in the target frequency region.
Optionally, before the encoding layer encodes the speech feature subjected to the hidden code operation and obtains an encoding result, the method further includes:
and performing down-sampling operation on the voice features subjected to the hidden code operation.
The semantic recognition method can comprise the following steps:
extracting voice features from a voice signal to be recognized;
inputting the voice features into a coding layer of a voice semantic recognition model constructed according to the method of any one of claims 1-4, and obtaining a coding result;
decoding the coding result in a first decoding layer, and determining an ith semantic label of the voice feature on an nth dimension and a first conditional probability between the coding result and all target semantic labels on a 1 st dimension to an n-1 st dimension which are acquired in advance based on the decoding result;
decoding the coding result in a second decoding layer, and determining an ith semantic label of the voice feature on an nth dimension and a second conditional probability between the coding result and all target semantic labels on a 1 st dimension to an n-1 st dimension which are obtained in advance based on the decoding result;
determining a label score corresponding to the ith semantic label of the voice feature on the nth dimension according to the first conditional probability and the second conditional probability;
and determining, from the label scores respectively corresponding to all the semantic labels, the semantic label with the maximum label score as the target semantic label of the voice feature in the nth dimension, wherein i is a positive integer, n is a positive integer larger than 2, and the target semantic label in the first dimension is directly obtained according to the voice feature.

In addition, the present application further provides a computer storage medium, where one or more programs are stored, and the one or more programs may be executed by the speech semantic recognition device according to the fifth aspect of the present application, so as to implement the speech semantic recognition model construction method or the semantic recognition method provided by the foregoing embodiments of the present application.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
It is understood that the same or similar parts in the above embodiments may be mutually referred to, and the same or similar parts in other embodiments may be referred to for the content which is not described in detail in some embodiments.
It should be noted that, in the description of the present application, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present application, the meaning of "a plurality" means at least two unless otherwise specified.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present application includes other implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (10)

1. A method for constructing a speech semantic recognition model is characterized by comprising the following steps:
extracting voice features from the voice sample signal;
randomly selecting a characteristic value from the voice characteristics according to a preset selection rule to perform hidden code operation;
inputting the voice features subjected to the hidden code operation into a pre-constructed voice semantic recognition model, wherein the pre-constructed voice semantic recognition model comprises a coding layer, a first decoding layer and a second decoding layer;
coding the voice characteristics subjected to the hidden code operation through the coding layer to obtain a coding result;
inputting the coding result into the first decoding layer, and generating a first conditional probability corresponding to the voice feature belonging to a pre-configured ith semantic tag based on the decoding result after the coding result is decoded;
inputting the coding result into the second decoding layer, decoding the coding result, and generating a second conditional probability that the voice feature belongs to the ith semantic tag based on the decoding result, wherein i is a positive integer;
and when the voice semantic recognition model meets the preset requirement according to the first conditional probability and the second conditional probability, determining that the voice semantic recognition model is constructed.
2. The method according to claim 1, wherein determining whether the speech semantic recognition model meets a preset requirement according to the first conditional probability and the second conditional probability specifically comprises:
generating a verification value according to the first conditional probability and the second conditional probability;
and determining a difference value between the verification value and a reference value, and when the difference value meets the preset requirement, determining that the voice semantic recognition model meets the preset requirement, and finishing the construction of the voice semantic recognition model.
3. The method according to claim 1 or 2, wherein the randomly selecting feature values from the speech features according to a preset selection rule to perform a hidden code operation specifically comprises:
generating a voice characteristic spectrogram according to the voice characteristics;
randomly selecting a target image area from the spectrogram, and performing hidden coding on a characteristic value in the target image area;
and/or randomly selecting a target time region by taking the time dimension corresponding to the spectrogram as a reference, and performing hidden coding on the characteristic value in the target time region;
and/or randomly selecting a target frequency region by taking the frequency dimension corresponding to the spectrogram as a reference, and performing hidden coding on the characteristic value in the target frequency region.
4. The method according to claim 1 or 2, wherein before the encoding layer encodes the speech feature subjected to the steganographic operation and obtains the encoding result, the method further comprises:
and performing down-sampling operation on the voice features subjected to the hidden code operation.
5. A method of semantic recognition, the method comprising:
extracting voice features from a voice signal to be recognized;
inputting the voice features into a coding layer of a voice semantic recognition model constructed according to the method of any one of claims 1-4, and obtaining a coding result;
decoding the coding result in a first decoding layer, and determining an ith semantic label of the voice feature on an nth dimension and a first conditional probability between the coding result and all target semantic labels on a 1 st dimension to an n-1 st dimension which are acquired in advance based on the decoding result;
decoding the coding result in a second decoding layer, and determining an ith semantic label of the voice feature on an nth dimension and a second conditional probability between the coding result and all target semantic labels on a 1 st dimension to an n-1 st dimension which are obtained in advance based on the decoding result;
determining a label score corresponding to the ith semantic label of the voice feature on the nth dimension according to the first conditional probability and the second conditional probability;
and determining, from the label scores respectively corresponding to all the semantic labels, the semantic label with the maximum label score as the target semantic label of the voice feature in the nth dimension, wherein i is a positive integer, n is a positive integer larger than 2, and the target semantic label in the first dimension is directly obtained according to the voice feature.
6. A speech semantic recognition model construction device is characterized by comprising the following steps:
the feature extraction module is used for extracting voice features from the voice sample signal;
the hidden code module is used for randomly selecting a characteristic value from the voice characteristics according to a preset selection rule to perform hidden code operation;
the input module is used for inputting the voice features subjected to the hidden code operation into a pre-constructed voice semantic recognition model, wherein the pre-constructed voice semantic recognition model comprises a coding layer, a first decoding layer and a second decoding layer;
the coding module is used for coding the voice characteristics subjected to the hidden code operation through the coding layer to obtain a coding result;
the first decoding module is used for inputting the coding result into the first decoding layer, and generating a first conditional probability corresponding to the fact that the voice feature belongs to the pre-configured ith semantic tag based on the decoding result after the coding result is decoded;
the second decoding module is used for inputting the coding result into the second decoding layer, and generating a second conditional probability corresponding to the voice feature belonging to the ith semantic tag based on the decoding result after the coding result is decoded, wherein i is a positive integer;
and the determining module is used for determining that the speech semantic recognition model is constructed completely when the speech semantic recognition model meets the preset requirement according to the first conditional probability and the second conditional probability.
7. The apparatus of claim 6, wherein the crypto module comprises:
the generating unit is used for generating a voice characteristic spectrogram according to the voice characteristics;
the first hidden code unit is used for randomly selecting a target image area from the spectrogram and hiding a characteristic value in the target image area;
and/or the second hidden code unit is used for randomly selecting a target time region by taking the time dimension corresponding to the spectrogram as a reference, and hiding the characteristic value in the target time region;
and/or the third hidden code unit is used for randomly selecting a target frequency region by taking the frequency dimension corresponding to the spectrogram as a reference, and hiding the characteristic value in the target frequency region.
8. An apparatus for speech semantic recognition, the apparatus comprising:
the characteristic extraction module is used for extracting voice characteristics from the voice signal to be recognized;
an input module, configured to input the speech features into an encoding layer of the speech semantic recognition model constructed by the apparatus according to claim 6 or 7, and obtain an encoding result;
a first decoding module, configured to decode the coding result in a first decoding layer, and determine, based on the decoding result, an ith semantic tag of the speech feature in an nth dimension, and a first conditional probability between the coding result and all target semantic tags in a 1 st dimension to an n-1 st dimension that are obtained in advance;
a second decoding module, configured to decode the coding result in a second decoding layer, and determine, based on the decoding result, an ith semantic tag of the speech feature in an nth dimension, and a second conditional probability between the coding result and all target semantic tags in a 1 st dimension to an n-1 st dimension that are obtained in advance;
a probability determination module, configured to determine, according to the first conditional probability and the second conditional probability, a tag score corresponding to an ith semantic tag of the speech feature in an nth dimension;
and the tag determining module is used for determining the semantic tag with the maximum tag score as the target semantic tag of the voice feature in the nth dimension from the tag scores respectively corresponding to all the semantic tags, wherein i is a positive integer, n is a positive integer greater than 2, and the target semantic tag in the first dimension is directly obtained according to the voice feature.
9. A speech semantic recognition device, characterized by comprising:
a processor and a memory, wherein the processor is configured to execute a program stored in the memory to implement the speech semantic recognition model construction method according to any one of claims 1 to 4 or the semantic recognition method according to claim 5.
10. A computer storage medium storing one or more programs executable by the speech semantic recognition apparatus according to claim 9 to implement the speech semantic recognition model construction method according to any one of claims 1 to 4 or the semantic recognition method according to claim 5.
CN202010938197.9A 2020-09-08 2020-09-08 Voice semantic recognition model construction method, semantic recognition method, device and equipment Pending CN112017638A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010938197.9A CN112017638A (en) 2020-09-08 2020-09-08 Voice semantic recognition model construction method, semantic recognition method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010938197.9A CN112017638A (en) 2020-09-08 2020-09-08 Voice semantic recognition model construction method, semantic recognition method, device and equipment

Publications (1)

Publication Number Publication Date
CN112017638A true CN112017638A (en) 2020-12-01

Family ID=73521294

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010938197.9A Pending CN112017638A (en) 2020-09-08 2020-09-08 Voice semantic recognition model construction method, semantic recognition method, device and equipment

Country Status (1)

Country Link
CN (1) CN112017638A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113450784A (en) * 2021-06-25 2021-09-28 仿脑科技(深圳)有限公司 Intelligent voice processing method based on common sense understanding and Internet of things system
CN117935787A (en) * 2024-03-22 2024-04-26 摩尔线程智能科技(北京)有限责任公司 Data screening and labeling method and device, electronic equipment and storage medium
CN117935787B (en) * 2024-03-22 2024-05-31 摩尔线程智能科技(北京)有限责任公司 Data screening and labeling method and device, electronic equipment and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103678316A (en) * 2012-08-31 2014-03-26 富士通株式会社 Entity relationship classifying device and entity relationship classifying method
CN109101476A (en) * 2017-06-21 2018-12-28 阿里巴巴集团控股有限公司 A kind of term vector generates, data processing method and device
CN110379414A (en) * 2019-07-22 2019-10-25 出门问问(苏州)信息科技有限公司 Acoustic model enhances training method, device, readable storage medium storing program for executing and calculates equipment
CN110706690A (en) * 2019-09-16 2020-01-17 平安科技(深圳)有限公司 Speech recognition method and device
CN110797018A (en) * 2019-08-28 2020-02-14 腾讯科技(深圳)有限公司 Speech recognition method, speech recognition device, speech recognition medium, and speech recognition apparatus
CN110827801A (en) * 2020-01-09 2020-02-21 成都无糖信息技术有限公司 Automatic voice recognition method and system based on artificial intelligence
CN111199727A (en) * 2020-01-09 2020-05-26 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal and storage medium
CN111370002A (en) * 2020-02-14 2020-07-03 平安科技(深圳)有限公司 Method and device for acquiring voice training sample, computer equipment and storage medium
CN111539211A (en) * 2020-04-17 2020-08-14 中移(杭州)信息技术有限公司 Entity and semantic relation recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination