CN113571052A - Noise extraction and instruction identification method and electronic equipment - Google Patents


Info

Publication number
CN113571052A
Authority
CN
China
Prior art keywords
target
text information
noise
text
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110832253.5A
Other languages
Chinese (zh)
Inventor
米良
黄海荣
李林峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ecarx Hubei Tech Co Ltd
Original Assignee
Hubei Ecarx Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei Ecarx Technology Co Ltd filed Critical Hubei Ecarx Technology Co Ltd
Priority to CN202110832253.5A priority Critical patent/CN113571052A/en
Publication of CN113571052A publication Critical patent/CN113571052A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech to text systems
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention provides a noise extraction and instruction identification method and an electronic device, relating to the technical field of artificial intelligence. The method comprises the following steps: acquiring target text information corresponding to target voice data; inputting the target text information into a pre-trained noise recognition model to obtain the prediction probability of the target text information mapping to each preset noise label, where a preset noise label represents the index position of a predicted noise text, the predicted noise text is a phrase in the target text information, and a phrase is one word or a combination of several consecutive words in the target text information; and determining the predicted noise text corresponding to the preset noise label with the highest prediction probability as the target noise text. Compared with the prior art, the scheme provided by the embodiment of the invention can extract the noise from a text awaiting natural language processing without using a stop word list.

Description

Noise extraction and instruction identification method and electronic equipment
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a noise extraction and instruction identification method and electronic equipment.
Background
Currently, with the continuous development of artificial intelligence algorithms, natural language processing tasks, such as named entity recognition and intent recognition, are in increasing demand. Owing to a user's language habits, the text to be processed as expressed by the user may contain meaningless phrases. For example, the text to be processed is: "I want to go to Xunhui Building without knowing when", where "without knowing when" is a meaningless phrase.
Usually, the meaningless phrases contained in a text are regarded as noise in the text, and in a natural language processing task, noise in the text to be processed may degrade the accuracy of the resulting processing result.
Therefore, in order to improve the accuracy of the processing result of a natural language processing task, the noise in the text to be processed needs to be extracted first; natural language processing is then performed on the noise-free text obtained after extraction, thereby improving the accuracy of the obtained processing result.
In the related art, noise is extracted from a text to be processed as follows: a stop word list containing multiple noises is constructed in advance, and each phrase in the text to be processed is compared against every noise in the stop word list, so as to determine the noise in the text to be processed and then extract it.
However, in the above related art, because the number of noises included in the stop word list is limited, it cannot enumerate all the noises that may appear in a text to be processed, and so, in many cases, the noise in the text to be processed cannot be extracted.
Disclosure of Invention
The embodiment of the invention aims to provide a noise extraction and instruction identification method and electronic equipment, so as to extract noise in a text to be subjected to natural language processing without using a stop word list. The specific technical scheme is as follows:
in a first aspect, an embodiment of the present invention provides a noise extraction method, where the method includes:
acquiring target text information corresponding to the target voice data;
inputting the target text information into a pre-trained noise identification model to obtain the prediction probability of mapping the target text information to each preset noise label; the preset noise label is used for representing an index position of a predicted noise text, the predicted noise text is a word group in the target text information, and the word group is a combination of one word or a plurality of continuous words in the target text information;
and determining the predicted noise text corresponding to the preset noise label with the maximum prediction probability as the target noise text.
Optionally, in a specific implementation manner, each preset noise tag is determined by a preset text length, and a generation manner of each preset noise tag includes:
determining position labels of all bits forming a preset text length;
and taking the position label of any one bit or the position labels of any continuous multiple bits as a preset noise label.
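As a minimal sketch, the two steps above can be written out directly. Representing a label as a 0-based (start, end) index pair, end inclusive, is an illustrative choice, not something the patent specifies:

```python
def generate_noise_labels(preset_len):
    """Enumerate every preset noise label: the position label of any single
    bit, or of any run of consecutive bits, within the preset text length.
    Each label is an illustrative 0-based (start, end) pair, end inclusive."""
    labels = []
    for start in range(preset_len):
        for end in range(start, preset_len):
            labels.append((start, end))
    return labels

labels = generate_noise_labels(4)   # 4 + 3 + 2 + 1 = 10 labels for length 4
```

For a preset text length n, this yields n(n+1)/2 preset noise labels, one per possible phrase position.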
Optionally, in a specific implementation manner, the obtaining of the target text information corresponding to the target voice data includes:
acquiring a voice data text corresponding to target voice data;
if the length of the voice data text is equal to the preset text length, sequentially filling the voice data text into each bit forming the preset text length to obtain the target text information;
if the length of the voice data text is greater than the preset text length, taking the portion of the voice data text that starts from the first word and whose length equals the preset text length, and sequentially filling it into each bit forming the preset text length to obtain the target text information;
if the length of the voice data text is smaller than the preset text length, appending at least one designated character after the last character of the voice data text and sequentially filling each bit forming the preset text length to obtain the target text information; the sum of the length of the voice data text and the length of the at least one designated character equals the preset text length.
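A hedged sketch of the three cases above; the padding character `#` and the preset length of 8 are illustrative assumptions, as the patent fixes neither:

```python
PRESET_LEN = 8
PAD = "#"   # hypothetical designated character; the patent does not specify one

def to_target_text(speech_text, preset_len=PRESET_LEN, pad=PAD):
    """Fit the recognized speech text into the bits of the preset text length:
    equal length  -> fill the text as-is;
    longer        -> keep the text from the first word up to the preset length;
    shorter       -> append designated characters after the last character."""
    chars = list(speech_text)
    if len(chars) >= preset_len:
        return "".join(chars[:preset_len])
    return "".join(chars) + pad * (preset_len - len(chars))
```

Fixing every input to one length is what lets a single model with a fixed label set handle arbitrary utterances.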
Optionally, in a specific implementation manner, the inputting the target text information into a pre-trained noise recognition model, and obtaining the prediction probability that the target text information is mapped to each preset noise label includes:
inputting the target text information into a feature extraction network in a noise identification model to obtain target features of the target text information;
and inputting the target characteristics into a classification network in the noise identification model to obtain the prediction probability of mapping the characteristics of the target text information to each preset noise label.
Optionally, in a specific implementation manner, the feature extraction network includes: the device comprises an input layer, a character embedding layer, a convolution layer, an activation layer, a pooling layer and a fusion layer;
the input layer is used for generating a target array corresponding to the target text information; wherein, each element in the target array is: the index value of each word in the target text information;
the word embedding layer is used for generating a coding matrix corresponding to the target array; wherein, each element in the coding matrix is: a word vector for a word characterized by each index value in the target array;
the convolutional layer is used for respectively extracting the characteristics of the coding matrix by utilizing various convolutional kernels to obtain a plurality of initial characteristic matrixes of the target text information;
the activation layer is used for respectively activating each initial feature matrix by using a preset activation function to obtain a plurality of activation feature matrices of the target text information;
the pooling layer is used for respectively compressing each activated feature matrix by preset dimensionality according to a preset down-sampling mode to obtain a plurality of down-sampling feature matrices of the target text information after the dimensionality is compressed;
and the fusion layer is used for fusing the plurality of down-sampling feature matrixes to obtain a target feature matrix of the target text information as a target feature of the target text information.
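The six layers above describe a text-CNN-style feature extractor. The following is a rough NumPy sketch under assumed sizes: random weights stand in for trained parameters, and the kernel heights, filter count, and dimensions are illustrative, not taken from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 8, 16          # preset text length and embedding size (illustrative)
KERNELS = [2, 3, 4]   # several kernel heights for the convolution layer
F = 4                 # filters per kernel size

EMB = rng.normal(size=(100, D))                       # word-embedding table
WEIGHTS = {k: rng.normal(size=(F, k, D)) for k in KERNELS}

def extract_features(index_array):
    x = EMB[index_array]                              # (L, D) coding matrix
    pooled = []
    for k in KERNELS:
        w = WEIGHTS[k]
        # convolution layer: slide each of the F kernels down the L axis
        conv = np.array([[np.sum(x[i:i + k] * w[f]) for i in range(L - k + 1)]
                         for f in range(F)])          # initial feature matrix
        act = np.maximum(conv, 0.0)                   # activation layer (ReLU)
        pooled.append(act.max(axis=1))                # pooling layer: (F,) per size
    return np.concatenate(pooled)                     # fusion layer: target feature

target_feature = extract_features(np.arange(L) % 100)
```

With these assumed sizes the fused target feature has F * len(KERNELS) = 12 components, one per filter.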
Optionally, in a specific implementation manner, the classification network includes: a fully connected layer and a normalization layer;
the full connection layer is used for calculating an initial probability matrix by using the target characteristic matrix; each element in the initial probability matrix is used for representing that a phrase in the target text information corresponding to each preset noise label is an initial probability value of a target noise text;
the normalization layer is used for normalizing each element in the initial probability matrix to obtain a target probability matrix of the target text information; wherein, each element in the target probability matrix is: and mapping the target text information to the prediction probability of each preset noise label.
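A minimal sketch of the classification network, assuming the normalization layer is a softmax (the patent does not name the normalization function); the fully connected weights are random placeholders for trained parameters:

```python
import numpy as np

def classify(target_feature, num_labels=10, seed=1):
    """Fully connected layer producing one initial probability value per
    preset noise label, followed by softmax normalization."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(num_labels, target_feature.shape[0]))
    b = np.zeros(num_labels)
    logits = W @ target_feature + b        # initial probability matrix
    z = np.exp(logits - logits.max())      # numerically stable softmax
    return z / z.sum()                     # target probability matrix

probs = classify(np.ones(12))
```

After normalization the entries sum to 1, so each entry can be read directly as the prediction probability of one preset noise label.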
Optionally, in a specific implementation manner, the training manner of the noise recognition model includes:
acquiring preset sample text information added with noise labels; wherein the noise label is an index position of the noise text in the sample text information;
for each sample text message, inputting the sample text message into an initial model to be trained, and obtaining the probability of mapping the sample text message to each preset noise label;
if the preset noise label with the maximum probability is matched with the noise label of the sample text information, training the next sample text information;
and if the preset noise label with the maximum probability is not matched with the noise label of the sample text information, adjusting the parameters of the initial model, returning to the step of inputting the sample text information into the initial model to be trained and obtaining the probability of mapping the sample text information to each preset noise label until the initial model converges.
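The training steps above amount to a loop over samples that adjusts parameters only on mismatches and stops once every prediction matches its label. The toy model below is purely illustrative (the real model is the network described earlier), and its method names are invented for this sketch:

```python
class ToyNoiseModel:
    """Illustrative stand-in for the noise recognition model: one weight per
    (text, label) pair, nudged upward on the true label after each miss."""
    def __init__(self, labels):
        self.labels = labels
        self.w = {}

    def predict_probs(self, text):
        return {lab: self.w.get((text, lab), 0.0) for lab in self.labels}

    def update(self, text, true_label):
        self.w[(text, true_label)] = self.w.get((text, true_label), 0.0) + 1.0

def train(model, samples, max_epochs=50):
    for _ in range(max_epochs):
        converged = True
        for text, true_label in samples:
            probs = model.predict_probs(text)
            predicted = max(probs, key=probs.get)   # label with highest probability
            if predicted != true_label:             # mismatch: adjust parameters
                model.update(text, true_label)
                converged = False
        if converged:                               # every prediction matched
            return model
    return model

samples = [("go to building dont know", (3, 4)), ("play music please", (2, 2))]
m = train(ToyNoiseModel([(3, 4), (2, 2), (0, 0)]), samples)
```

In practice the "adjust the parameters" step would be a gradient update against a loss; the match-or-update control flow is what the patent describes.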
Optionally, in a specific implementation manner, the method further includes:
deleting a target noise text in the target text information to obtain text information to be processed;
and according to a preset processing mode, performing natural language processing on the text information to be processed to obtain a processing result related to the text information to be processed.
In a second aspect, an embodiment of the present invention provides an instruction identification method, where the method includes:
determining noise text information in the instruction text information corresponding to the target instruction by using any noise extraction method provided by the first aspect;
deleting the noise text information in the instruction text information to obtain text information to be identified;
inputting the text information to be recognized into a pre-trained intention recognition model to obtain the intention of a target user represented by the text information to be recognized;
inputting the text information to be recognized into a pre-trained named entity recognition model to obtain a target named entity recognition result of the text information to be recognized;
executing the target instruction based on the target user intent and the target named entity recognition result.
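The five steps of the instruction identification method can be stitched together as below; all four callables are hypothetical stand-ins for the noise extraction method of the first aspect, the two trained models, and the instruction executor:

```python
def execute_instruction(instruction_text, extract_noise, recognize_intent,
                        recognize_entities, execute):
    """Sketch of the second-aspect pipeline; the callable interfaces are
    illustrative assumptions, not APIs defined by the patent."""
    noise = extract_noise(instruction_text)            # noise text via first aspect
    to_recognize = (instruction_text.replace(noise, "", 1)
                    if noise else instruction_text)    # delete the noise text
    intent = recognize_intent(to_recognize)            # intent recognition model
    entities = recognize_entities(to_recognize)        # named entity recognition model
    return execute(intent, entities)

result = execute_instruction(
    "navigate to Xunhui Building dont know when",
    lambda t: "dont know when",
    lambda t: "navigation",
    lambda t: {"destination": "Xunhui Building"},
    lambda intent, ents: (intent, ents),
)
```

The design point is that both downstream models see only the de-noised text, which is where the accuracy gain comes from.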
In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor and the communication interface complete communication between the memory and the processor through the communication bus;
a memory for storing a computer program;
a processor configured to implement the steps of any one of the methods provided in the first and second aspects when executing the program stored in the memory.
In a fourth aspect, the present invention provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of any one of the methods provided in the first and second aspects.
In a fifth aspect, the present invention also provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the steps of any one of the methods provided in the first and second aspects.
The embodiment of the invention has the following beneficial effects:
therefore, the noise recognition model can be trained in advance by applying the scheme provided by the embodiment of the invention. After the target voice data is obtained, the target text information corresponding to the target voice data can be obtained first, and then the obtained target text information can be input into the pre-trained noise identification model, so that the prediction probability of the target text information mapping to each preset noise label is obtained. Therefore, the predicted noise text corresponding to the preset noise label with the maximum prediction probability can be determined as the target noise text.
Based on this, by applying the scheme provided by the embodiment of the present invention, each preset noise label corresponds to an index position of one predicted noise text, and each predicted noise text is a word group in the target text information, so that the obtained prediction probability of the target text information mapped to each preset noise label is the prediction probability of each word group in the target text information as the target noise text. Therefore, for target text information corresponding to the target voice data, a pre-trained noise recognition model can be directly utilized to obtain the prediction probability of each phrase in the target text information as the target noise text, and the phrase with the maximum prediction probability is determined as the target noise text. Therefore, when the noise in the text to be processed is extracted, the noise in the text can be determined without the help of a stop word list, and the text after the noise is extracted is obtained, so that the accuracy of the obtained processing result is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; those skilled in the art can obtain other embodiments from these drawings.
Fig. 1 is a schematic flow chart of a noise extraction method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a generation manner of each preset noise tag;
FIG. 3(a) is a schematic diagram of the position labels of the bits forming a preset text length;
fig. 3(b) is a schematic diagram of all preset noise labels corresponding to fig. 3 (a);
FIG. 4(a) is a schematic diagram of a target text message generated using bits forming a preset text length;
fig. 4(b) is a schematic diagram of a predicted noise text corresponding to each preset noise tag on the basis of the target text information shown in fig. 4 (a);
FIG. 5 is a flowchart illustrating an embodiment of S102 in FIG. 1;
fig. 6 is a schematic structural diagram of a noise identification model according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of the convolutional layer of FIG. 6;
FIG. 8 is a diagram of the activation function ReLU;
fig. 9 is a schematic flow chart of another noise extraction method according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of the structure of an intent recognition model;
FIG. 11 is a schematic diagram of a named entity recognition model;
fig. 12 is a schematic flowchart of an instruction identification method according to an embodiment of the present invention;
fig. 13 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived from the embodiments given herein by one of ordinary skill in the art, are within the scope of the invention.
In the related art, noise is extracted from a text to be processed as follows: a stop word list containing multiple noises is constructed in advance, and each phrase in the text to be processed is compared against every noise in the stop word list, so as to determine and then extract the noise in the text. However, because the number of noises included in the stop word list is limited, it cannot enumerate all the noises that may appear in a text to be processed, and so, in many cases, the noise in the text to be processed cannot be extracted.
In order to solve the above technical problem, an embodiment of the present invention provides a noise extraction method.
The noise extraction method may be applied to various types of electronic devices, such as desktop computers, notebook computers, and mobile phones; the embodiments of the present invention place no specific limitation on the device type. These are referred to below simply as electronic devices.
In addition, the noise extraction method can be applied to various scenarios that require natural language processing tasks: for example, a scenario in which intent recognition enables voice control of a device; for another example, a scenario in which named entity recognition adds NER (Named Entity Recognition) tags to a text. Any of these is reasonable.
The noise extraction method provided by the embodiment of the invention may comprise the following steps:
acquiring target text information corresponding to the target voice data;
inputting the target text information into a pre-trained noise identification model to obtain the prediction probability of mapping the target text information to each preset noise label; the preset noise label is used for representing an index position of a predicted noise text, the predicted noise text is a word group in the target text information, and the word group is a combination of one word or a plurality of continuous words in the target text information;
and determining the predicted noise text corresponding to the preset noise label with the maximum prediction probability as the target noise text.
Therefore, the noise recognition model can be trained in advance by applying the scheme provided by the embodiment of the invention. After the target voice data is obtained, the target text information corresponding to the target voice data can be obtained first, and then the obtained target text information can be input into the pre-trained noise identification model, so that the prediction probability of the target text information mapping to each preset noise label is obtained. Therefore, the predicted noise text corresponding to the preset noise label with the maximum prediction probability can be determined as the target noise text.
Based on this, by applying the scheme provided by the embodiment of the present invention, each preset noise label corresponds to an index position of one predicted noise text, and each predicted noise text is a word group in the target text information, so that the obtained prediction probability of the target text information mapped to each preset noise label is the prediction probability of each word group in the target text information as the target noise text. Therefore, for target text information corresponding to the target voice data, a pre-trained noise recognition model can be directly utilized to obtain the prediction probability of each phrase in the target text information as the target noise text, and the phrase with the maximum prediction probability is determined as the target noise text. Therefore, when the noise in the text to be processed is extracted, the noise in the text can be determined without the help of a stop word list, and the text after the noise is extracted is obtained, so that the accuracy of the obtained processing result is improved.
Hereinafter, a noise extraction method according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a noise extraction method according to an embodiment of the present invention, as shown in fig. 1, the method may include the following steps S101 to S103:
s101: acquiring target text information corresponding to the target voice data;
in general, when natural speech processing such as intention recognition and named entity recognition is performed, the data to be processed is the speech data of the user, and when natural speech processing is performed on the speech data of the user, the speech data may be first converted into text information.
For example, in the voice control process of the smart device, the smart device needs to convert the voice command of the user into text information, so as to perform intention recognition and named entity recognition on the text information, and execute the voice command of the user according to the recognition result.
Based on this, when the noise extraction method provided by the embodiment of the present invention is executed, first, target text information corresponding to target speech data is obtained.
The target text information can be acquired in various ways. For example, the target voice data of a user can be collected directly and then converted into the target text information by means such as a speech-to-text algorithm; for another example, after another device collects the user's target voice data and converts it into the target text information, the target text information can be acquired from that device. Any of these is reasonable.
S102: inputting target text information into a pre-trained noise identification model to obtain the prediction probability of mapping the target text information to each preset noise label;
the preset noise label is used for representing an index position of a predicted noise text, the predicted noise text is a word group in the target text information, and the word group is a combination of one word or a plurality of continuous words in the target text information;
in general, the noise in the target text information may be a certain word in the target text information, or a combination of consecutive words in the target text information.
For example, the target text information is: when I want to go to the Xunhui building, what is not known is the noise in the target text information.
In this way, when extracting the noise in the target text information, the possibility that each word group in the target text information can be used as the noise in the target text information can be determined first. Each word group in the target text information may be a word or a combination of consecutive words in the target text information.
For example, the target text information is: when I want to go to Xunhui buildings, what is not known, the phrases in the target text information include: i, I want to go, xu converge a building, etc.
Since the number of each phrase in different target text information is different due to different lengths of different target text information, each preset noise label may be generated in advance in order to obtain all phrases that may be used as noise in the target text information from any one target text information.
The preset noise tags are used for representing index positions of the predicted noise texts, and the predicted noise texts are word groups in the target text information, that is, each preset noise tag corresponds to one word group in the target text information, and the positions of the word groups in the target text information are matched with the index positions of the predicted noise texts represented by the preset noise tags, so that each preset noise tag corresponds to one word group in the target text information.
Thus, for each preset noise label, the prediction probability of the target text information mapping to that preset noise label is the prediction probability that the phrase in the target text information whose position matches the index position of the predicted noise text represented by that label is the target noise text of the target text information.
That is to say, the prediction probability of mapping the target text information to each preset noise label is the prediction probability that a word group in the target text information corresponding to each preset noise label is the target noise text, so that the prediction probability of mapping the target text information to each preset noise label is obtained, that is, the prediction probability that each word group in the target text information is the target noise text can be obtained.
For clarity, the generation of each preset noise label is described in detail below.
Based on the method, after the target text information is obtained, the target text information can be input into a noise identification model trained in advance, and the prediction probability of mapping the target text information to each preset noise label is obtained.
That is, with the noise recognition model, the prediction probability that each word group in the target text information is the target noise text can be determined.
S103: and determining the predicted noise text corresponding to the preset noise label with the maximum prediction probability as the target noise text.
Since the prediction probability of mapping the target text information to each preset noise label is the prediction probability of each word group in the target text information being the target noise text, the larger the prediction probability of mapping the target text information to the preset noise label is for each preset noise label, the more likely the word group in the target text information corresponding to the preset noise label is to be the target noise text.
In this way, the phrase in the target text information corresponding to each preset noise label is the predicted noise text corresponding to that label. Therefore, after the prediction probabilities of the target text information mapping to the preset noise labels are obtained, the preset noise label with the highest prediction probability can be determined, and the predicted noise text corresponding to that label is determined as the target noise text.
That is to say, the predicted noise text corresponding to the preset noise tag with the highest prediction probability is the noise extracted from the target text information, where the predicted noise text corresponding to the preset noise tag with the highest prediction probability is a word group in the target text information, and the position of the word group in the target text information matches with the index position of the predicted noise text corresponding to the preset noise tag with the highest prediction probability.
Optionally, after the target noise text is determined, it can be deleted from the target text information to obtain noise-free target text information to be processed. The noise-free text can then be further subjected to natural language processing to obtain the corresponding processing result.
For example, intention recognition may be performed on target text information to be processed that does not include noise, to obtain a user intention of the target text information; for another example, the named entity recognition may be performed on the target text information to be processed that does not include noise, so as to obtain a named entity recognition result of the target text information.
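A small sketch of S103 together with the optional deletion step. Whitespace-separated English words stand in here for the per-character handling of Chinese text, and the probabilities are invented for illustration:

```python
def pick_target_noise(target_text, span_probs):
    """Determine the predicted noise text whose preset noise label has the
    highest prediction probability, then delete it from the text.
    Spans are 0-based (start, end) word indices, end inclusive."""
    words = target_text.split()
    start, end = max(span_probs, key=span_probs.get)     # label with max probability
    noise = " ".join(words[start:end + 1])               # target noise text
    cleaned = " ".join(words[:start] + words[end + 1:])  # text to be processed
    return noise, cleaned

probs = {(0, 0): 0.05, (2, 3): 0.15, (4, 6): 0.80}       # invented probabilities
noise, cleaned = pick_target_noise("I want to go dont know when", probs)
```

The cleaned text is what would then be handed to intent recognition or named entity recognition.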
Based on this, by applying the scheme provided by the embodiment of the present invention, each preset noise label corresponds to an index position of one predicted noise text, and each predicted noise text is a word group in the target text information, so that the obtained prediction probability of the target text information mapped to each preset noise label is the prediction probability of each word group in the target text information as the target noise text. Therefore, for target text information corresponding to the target voice data, a pre-trained noise recognition model can be directly utilized to obtain the prediction probability of each phrase in the target text information as the target noise text, and the phrase with the maximum prediction probability is determined as the target noise text. Therefore, when the noise in the text to be processed is extracted, the noise in the text can be determined without the help of a stop word list, and the text after the noise is extracted is obtained, so that the accuracy of the obtained processing result is improved.
Optionally, in a specific implementation manner, each preset noise tag is determined by a preset text length; fig. 2 is a schematic flow chart of a generation manner of each preset noise tag, as shown in fig. 2, the generation manner may include the following steps:
S201: determining position labels of all bits forming a preset text length;
S202: taking the position label of any one bit, or the position labels of any consecutive multiple bits, as a preset noise label.
The number of words included in the target text information corresponding to different target speech data differs, because different users have different language habits and language abilities. However, when the pre-trained noise recognition model is used to obtain the prediction probabilities of mapping different target text information to the preset noise labels, the model parameters in the noise recognition model are fixed. Therefore, in order to use the same noise recognition model uniformly for different target text information, the number and structure of the preset noise labels used must be consistent across different target text information.
Thus, the text length may be preset in consideration of the number of words a typical user speaks at one time; for example, since a single utterance rarely exceeds 70 words, the preset text length may be set to 70.
After the preset text length is obtained, the position labels of the bits forming the preset text length can be determined, and then the position label of any bit or the position labels of any continuous multiple bits can be used as a preset noise label.
That is to say, after the preset text length is obtained, a position label may be set for each position in the preset text length, so that each position label may be used as a preset noise label, and the position labels of each group of multiple continuous positions may also form a preset noise label.
Optionally, the position label of the first bit among the bits forming the preset text length may be set to 1; then, following the arrangement order of the bits, the position label of each bit from the second bit onward is the position label of the previous bit plus 1.
That is, starting from the first bit, the bits forming the preset text length may be labeled in order with consecutive natural numbers beginning at 1.
For example, assuming that the preset text length is 70, as shown in fig. 3(a), the numbers 1 to 70 in the figure are the position labels of the bits forming the preset text length. The position label of any one bit can thus be used as a preset noise label, for example position label 1, position label 2, position label 3, and so on; the position labels of any run of consecutive bits may also form a preset noise label, for example preset noise label 1-2 composed of position labels 1 and 2, preset noise label 1-3 composed of position labels 1, 2 and 3, and so on.
In this way, after all the position labels are respectively determined as the preset noise labels, and all the combinations of the position labels of a plurality of consecutive bits are respectively determined as the preset noise labels, all the preset noise labels as shown in fig. 3(b) can be obtained.
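The label-generation steps S201 and S202 can be sketched as follows (a hedged illustration; the patent does not prescribe a data structure, so each span of position labels is represented here as a `(start, end)` pair):

```python
def generate_preset_noise_labels(preset_text_length):
    """Enumerate every single position label and every run of consecutive
    position labels over positions 1..preset_text_length (steps S201-S202).
    Each (start, end) pair stands for one preset noise label."""
    return [(start, end)
            for start in range(1, preset_text_length + 1)
            for end in range(start, preset_text_length + 1)]

labels = generate_preset_noise_labels(70)
# len(labels) == 70 * 71 // 2 == 2485, consistent with the [1,2485]
# full connection layer output mentioned later in the text.
```

Single-bit labels are the pairs with `start == end`; multi-bit labels such as 1-3 appear as `(1, 3)`.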
There are various possible relationships between the length of the voice data text corresponding to the target voice data and the preset text length. Considering the accuracy of the output of the noise recognition model, the length of the target text information input into the model must equal the preset text length, so when acquiring the target text information, it may be necessary to delete text from, or add text to, the voice data text corresponding to the target voice data.
Based on this, optionally, in a specific implementation manner, on the basis of the specific implementation manner shown in fig. 2, in the step S101, acquiring the target text information corresponding to the target voice data may include the following steps 11 to 14:
step 11: acquiring a voice data text corresponding to target voice data;
step 12: if the length of the voice data text is equal to the preset text length, sequentially filling the voice data text into each bit forming the preset text length to obtain target text information;
step 13: if the length of the voice data text is larger than the preset text length, acquiring text information which starts from a first word and has the length equal to the preset text length in the voice data text, and sequentially filling the text information to form each bit of the preset text length to obtain target text information;
step 14: if the length of the voice data text is smaller than the preset text length, adding at least one designated character after the last character of the voice data text, and filling each position forming the preset text length in sequence to obtain target text information;
and the sum of the length of the voice data text and the length of the at least one designated character is a preset text length.
In this specific implementation manner, the target voice data may be obtained first, and then converted into text data by means of a speech-to-text algorithm or the like, so as to obtain the voice data text corresponding to the target voice data. The method for processing the voice data text can then be determined according to the relationship between the length of the voice data text and the preset text length, so as to obtain target text information whose length is the preset text length.
If the length of the voice data text is equal to the preset text length, it may be determined that each word in the voice data text corresponds to one of the bits forming the preset text length, and the first bit to the last bit of the bits forming the preset text length may be sequentially filled with each word in the voice data text from the first word in the voice data text, so that a combination of the bits forming the preset text length filled with the words is the target text information.
If the length of the voice data text is greater than the preset text length, it may be determined that some words in the voice data text cannot be filled in each bit forming the preset text length, and thus, each word whose length exceeds the preset text length in the voice data text may be discarded, that is, text information in the voice data text, which starts from a first word and whose length is equal to the preset text length, may be obtained, and then, each word in the obtained text information is sequentially filled in first to last bits in each bit forming the preset text length, and thus, a combination of each bit forming the preset text length filled in words is the target text information.
For example, when the length of the voice data text is N and the preset text length is P (P < N), then the P +1 th to nth words in the voice data text may be discarded, that is, the 1 st to pth words in the voice data text are obtained. And then, sequentially filling the 1 st character to the P th character in the acquired voice text information into the first position to the P th position of each position forming the preset text length, wherein the combination of the P positions of the filled characters is the target text information.
If the length of the voice data text is smaller than the preset text length, it can be determined that some of the bits forming the preset text length cannot be filled by the words in the voice data text. Therefore, according to the difference between the length of the voice data text and the preset text length, at least one designated character can be added after the last word of the voice data text, so that the sum of the length of the voice data text and the number of added designated characters equals the preset text length. In this way, starting from the first word of the voice data text, the first to last bits of the bits forming the preset text length can be filled in sequence with each word of the voice data text and the added designated characters, and the combination of the filled bits is the target text information.
For example, when the length of the voice data text is M and the preset text length is P (P > M), P-M designated characters may be added after the mth word of the voice data text, and then, starting from the 1 st word of the voice text information, the M words of the voice data text and the added P-M designated characters are sequentially filled in from the first bit to the pth bit of each bit forming the preset text length, and then the combination of the P bits of the filled-in words is the target text information.
Optionally, the designated character may be: null characters.
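Steps 12 to 14 can be sketched as one truncate-or-pad routine (an illustrative sketch; the preset length of 70 and the empty-string stand-in for the null character are assumptions taken from the examples above):

```python
PRESET_TEXT_LENGTH = 70  # assumed value, per the example above
PAD_CHAR = ""            # "null character" stand-in; the real value is model-specific

def to_target_text(speech_text, preset_length=PRESET_TEXT_LENGTH, pad=PAD_CHAR):
    """Fill the bits forming the preset text length: truncate the text when
    it is too long (step 13) or append designated characters when it is too
    short (step 14), so the result always has preset_length slots."""
    slots = list(speech_text)[:preset_length]       # steps 12-13
    slots += [pad] * (preset_length - len(slots))   # step 14
    return slots

target = to_target_text("hello")  # 5 characters + 65 designated characters
```

The equal-length case (step 12) falls out naturally: the slice is a no-op and no padding is appended.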
For example, assuming that the preset text length is 70, taking the speech data text "i want to remit to the building and do not know what" as an example, the target text information may be determined as shown in fig. 4(a), and the predicted noise text corresponding to each preset noise label is shown in fig. 4 (b).
Optionally, in a specific implementation manner, as shown in fig. 5, the step S102 of inputting the target text information into a pre-trained noise recognition model to obtain the prediction probability of the target text information mapped to each preset noise label may include the following steps S1021 to S1022:
S1021: inputting the target text information into a feature extraction network in the noise recognition model to obtain target features of the target text information;
S1022: inputting the target features into a classification network in the noise recognition model to obtain the prediction probability of mapping the target text information to each preset noise label.
In this specific implementation manner, the noise recognition model may include a feature extraction network and a classification network. When the target text information is input into the noise recognition model, it is first input into the feature extraction network, which extracts the target features of the target text information and outputs them. The target features of the target text information are then input into the classification network, which uses them to obtain the prediction probability of mapping the target text information to each preset noise label.
Optionally, in a specific implementation manner, on the basis of the specific implementation manner shown in fig. 5, the feature extraction network may include: input layer, word embedding layer, convolution layer, activation layer, pooling layer and fusion layer.
The input layer is used for generating a target array corresponding to the target text information; wherein, each element in the target array is: the index value of each word in the target text information;
the word embedding layer is used for generating a coding matrix corresponding to the target array; wherein, each element in the coding matrix is: a word vector for the word represented by each index value in the target array;
the convolution layer is used for respectively extracting the characteristics of the coding matrix by utilizing various convolution kernels to obtain a plurality of initial characteristic matrixes of the target text information;
the activation layer is used for respectively activating each initial feature matrix by utilizing a preset activation function to obtain a plurality of activation feature matrices of the target text information;
the pooling layer is used for respectively compressing each activated feature matrix by preset dimensionality according to a preset down-sampling mode to obtain a plurality of down-sampling feature matrices of the target text information after the dimensionality is compressed;
and the fusion layer is used for fusing the plurality of down-sampling feature matrixes to obtain a target feature matrix of the target text information as the target feature of the target text information.
Optionally, in a specific implementation manner, on the basis of the specific implementation manner shown in fig. 5, the classification network may include: a fully connected layer and a normalization layer;
the full connection layer is used for calculating an initial probability matrix by utilizing the target characteristic matrix; each element in the initial probability matrix is used for representing a phrase in the target text information corresponding to each preset noise label as an initial probability value of the target noise text;
the normalization layer is used for normalizing each element in the initial probability matrix to obtain a target probability matrix of the target text information; wherein, each element in the target probability matrix is: and mapping the target text information to the prediction probability of each preset noise label.
Based on this, optionally, in a specific implementation manner, as shown in fig. 6, the noise identification model may include: the device comprises an input layer, a character embedding layer, a convolution layer, an activation layer, a pooling layer, a fusion layer, a full-connection layer and a normalization layer; the input layer, the character embedding layer, the convolution layer, the activation layer, the pooling layer and the fusion layer form a feature extraction network of the noise recognition model, and the full connection layer and the normalization layer form a classification network of the noise recognition model. In this specific implementation, specifically:
(1) an input layer:
after the target text information is acquired, the target text information can be input into the pre-trained noise recognition model, that is, the target text information is input into an input layer in the noise recognition model.
In the above noise recognition model, an index value for each word is set in advance in an input layer. Optionally, the format of the index value of each word may be a one-hot format.
In this way, after receiving the target text information, the input layer may first determine, in sequence starting from the first word of the target text information, the index value in one-hot format of each word. The input layer may then generate a target array corresponding to the target text information. Optionally, each index value in the generated target array may be an integer value.
Wherein, each element in the target array is: the index value of each word in the target text information, and the number of the included index values is equal to the number of the words included in the target text information.
Optionally, when the input layer generates the target array corresponding to the target text information, the number relationship between the length of the target text information and the preset text length may be determined first.
When the length of the target text information is equal to the preset text length, the input layer can directly determine an index value of each character in the target text information in a one-hot format, and further generate a target array corresponding to the target text information;
when the length of the target text information is greater than the preset text length, the input layer may discard each word of which the length exceeds the preset text length in the target text information, and further, the input layer may determine an index value of one-hot format of each remaining word in the target text information, and further, generate a target array corresponding to the target text information;
when the length of the target text information is smaller than the preset text length, after determining the index value of each word in the target text information in the one-hot format, the input layer may add at least one preset character after determining the index value of the last word in the one-hot format, so that the sum of the length of the target text information and the added at least one preset character is the preset text length, and thus, an array formed by the index value of each word in the one-hot format and the added at least one preset character in the target text information is the target array corresponding to the target text information.
In this way, the number of elements included in the obtained target array is the preset text length.
After the target array is generated, the input layer may output the target array, thereby inputting the target array to the word embedding layer.
For example, when the preset text length is 70, the input layer may output a target array including 70 elements.
(2) Word embedding layer:
Word embedding means that each word is represented by multidimensional data. For example, each word is represented by a one-dimensional array including a plurality of elements, where each element is a number; illustratively, each word may be represented by a one-dimensional array including 32 elements, that is, by 32 numbers.
Since the index value of each word characterizes one word, and each word corresponds to one word vector, each index value corresponds to one word vector. The word embedding layer may therefore determine the word vectors of the words characterized by the index values in the target array, and generate, based on the determined word vectors, the encoding matrix corresponding to the target array; each element in the encoding matrix is the word vector of the word characterized by the corresponding index value in the target array. The number of elements included in the word vector corresponding to each index value is a preset number.
Further, after the code matrix is generated, the word embedding layer may output the code matrix as an output, and input the code matrix to the convolutional layer.
Illustratively, when the preset text length is 70 and each word is represented by a one-dimensional array including 32 elements, the word embedding layer may output an encoding matrix with dimension [70,32].
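A minimal sketch of the word-embedding lookup (the table below holds random placeholder vectors; real word vectors come from training, and the 32-element dimension and 100-entry vocabulary size are assumptions for illustration):

```python
import random

random.seed(0)
EMBED_DIM = 32    # elements per word vector, per the example above
VOCAB_SIZE = 100  # hypothetical index-table size

# Placeholder embedding table: one 32-element vector per index value.
embedding_table = [[random.uniform(-1, 1) for _ in range(EMBED_DIM)]
                   for _ in range(VOCAB_SIZE)]

def embed(target_array):
    """Map each index value in the target array to its word vector,
    producing the encoding matrix of dimension [len(target_array), 32]."""
    return [embedding_table[i] for i in target_array]

encoding_matrix = embed([3, 17, 42])  # a toy 3-word target array
```

For a full-length target array of 70 index values, the same lookup yields the [70,32] encoding matrix described above.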
(3) Convolutional layer:
the essence of the convolutional layer is to perform feature extraction on input data by using a preset kernel function, namely a convolutional kernel, so as to obtain extracted features. Here, the convolution calculation can be understood as a multiply-accumulate process. For example, as shown in fig. 7, feature extraction is performed on the input data on the left side by using the middle convolution kernel to obtain the feature of the output on the right side, and the calculation formula in fig. 7 can be a convolution calculation mode in the feature extraction process.
In the above noise identification model, the convolutional layer functions to amplify and extract features in the input coding matrix. When the convolutional layer performs feature extraction, several continuous words may be analyzed as a whole, for example, 3 continuous words may be used as a whole, 4 continuous words may be used as a whole, and 5 continuous words may be used as a whole.
When a plurality of continuous words can form words or phrases, the plurality of continuous words can be taken as a whole to be subjected to feature extraction; when a plurality of continuous characters are all single words, feature extraction can be performed according to the context of each character in the plurality of continuous characters.
Specifically, a plurality of convolution kernels may be set according to the number of words included in a plurality of consecutive words analyzed as a whole, so that each convolution kernel is used to perform feature extraction on the input encoding matrix, respectively, to obtain a plurality of initial feature matrices of the target text information.
Further, after the plurality of initial feature matrices are generated, the convolutional layer may output the plurality of initial feature matrices and input the plurality of initial feature matrices to the active layer.
Illustratively, when the input encoding matrix is a matrix with a dimension of [70,32], three convolution kernels including convolution kernels [3,32], [4,32] and [5,32] exist, and the number of each convolution kernel is 128, 128 matrices with a dimension of [68,1], 128 matrices with a dimension of [67,1] and 128 matrices with a dimension of [66,1] can be obtained. Furthermore, 128 matrixes corresponding to each convolution kernel are combined to obtain three matrixes with the dimensionalities of [68,128], [67,128] and [66,128], and the three matrixes with the dimensionalities of [68,128], [67,128] and [66,128] are three initial feature matrixes of the target text information. The convolutional layer can output the initial feature matrix with three dimensions of [68,128], [67,128], and [66,128], respectively.
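The output dimensions in the example follow from a valid (no-padding) convolution over the sequence axis; as a quick check:

```python
def conv_output_length(sequence_length, kernel_size):
    """Length of a valid 1-D convolution along the sequence axis: each
    kernel position covers kernel_size consecutive words, so exactly
    sequence_length - kernel_size + 1 positions fit."""
    return sequence_length - kernel_size + 1

# Kernel heights 3, 4 and 5 over a length-70 encoding matrix:
output_lengths = {k: conv_output_length(70, k) for k in (3, 4, 5)}
```

The resulting lengths 68, 67 and 66 match the [68,1], [67,1] and [66,1] matrices in the example above.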
(4) Activation layer
Layers such as the convolution layer and the full connection layer in the noise recognition model cannot by themselves bring nonlinear characteristics to the model. The essence of the noise recognition model is to transform input data into desired output data, and a model without nonlinear characteristics cannot realize this transformation; therefore, an activation function in the activation layer is needed to bring nonlinear characteristics to the noise recognition model.
Based on the method, the activation layer can respectively activate each initial feature matrix by using a preset activation function to obtain a plurality of activation feature matrices of the target text information. And the activation layer may not change the dimension of each initial feature matrix for activation. That is, the plurality of initial feature matrices of the target text information have the same dimension as the plurality of activated feature matrices of the target text information.
Optionally, the preset activation function may be ReLU (Rectified Linear Unit). Fig. 8 is a schematic diagram of ReLU, in which the horizontal axis represents the function input and the vertical axis represents the function output. Moreover, activating each initial feature matrix with ReLU does not change its dimensions.
Further, after obtaining the plurality of activation feature matrices, the activation layer may output the plurality of activation feature matrices and input the plurality of activation feature matrices to the pooling layer.
Illustratively, when the target text information has three activation feature matrices with dimensions [68,128], [67,128], and [66,128], respectively, the activation layer may output the activation feature matrices with three dimensions [68,128], [67,128], and [66,128], respectively.
(5) Pooling layer
The purpose of the pooling layer is: insignificant features among the features extracted by the convolutional layer are ignored and the means used by the pooling layer is "downsampling".
Based on this, in the noise recognition model, the purpose of the pooling layer is to: and compressing each activation characteristic matrix by using a preset pooling mode, namely a down-sampling mode, according to preset dimensionality, so as to ignore certain unimportant characteristics of the target text information represented by each activation characteristic matrix.
Alternatively, the pooling may be an average pooling or a maximum pooling.
Wherein, the average pooling means: and calculating the average value of all data in the original data with the preset size, and replacing the original data with the calculated average value, thereby achieving the purposes of reducing the dimensionality of the original data and retaining the data characteristics of the original data. For example, if the preset size is 4 × 4, 16 pieces of data exist in the original data, and the average pooling becomes 1 piece of data, so that the original data can be considered to be reduced by 16 times.
The maximum pooling means that: and replacing the original data with the maximum data in the original data with preset size. For example, if the preset size is 4 × 4, and there are 16 pieces of data in the original data, the maximum value among the 16 pieces of data is determined, and the 16 pieces of data are replaced with the maximum value, that is, the maximum pooling is changed to 1 piece of data, and thus, it can be considered that the original data is reduced by 16 times.
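The two pooling modes can be sketched on a single window of values (a toy window; in the model, each window is a pooling-size slice of an activation feature matrix):

```python
def max_pool(window):
    """Maximum pooling: replace the window with its largest value."""
    return max(window)

def average_pool(window):
    """Average pooling: replace the window with the mean of its values."""
    return sum(window) / len(window)

window = [0.1, 0.9, 0.3, 0.2]  # a hypothetical 4-element pooling region
```

Either function reduces the whole window to one number, which is how a [68,128] activation feature matrix pooled with size 68 per channel becomes [1,128].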
Further, after obtaining the plurality of down-sampled feature matrices, the pooling layer may output the plurality of down-sampled feature matrices and input the plurality of down-sampled feature matrices to the fusion layer.
Illustratively, when the target text information has three activated feature matrices with dimensions [68,128], [67,128], and [66,128], respectively, and assuming that the pooling sizes used for the three are 68, 67, and 66, respectively, the pooling layer may output three down-sampled feature matrices each with dimension [1,128].
(6) Fusion layer
And fusing the plurality of downsampling feature matrices of the target text information to obtain a target feature matrix of the target text information, wherein the dimensionality of the target feature matrix of the target text information is [1, K ], and K is determined according to the dimensionality of the plurality of downsampling feature matrices of the target text information. Further, the obtained target feature matrix is used as an output, and the target feature matrix is input to the full link layer.
Illustratively, when the target text information has three down-sampled feature matrices each with dimension [1,128], the fusion layer may output a target feature matrix with dimension [1,384].
(7) Full connection layer:
The full connection layer is used for projecting the target feature matrix of the target text information onto the dimension of the preset noise labels. The full connection layer includes a preset weight matrix with dimensions [K, M], where K is the same as the K in the dimension [1, K] of the target feature matrix of the target text information, and M is the number of preset noise labels.
Assuming the preset text length is T, then M = T(T+1)/2.
That is, the full link layer may calculate an initial probability matrix using the target feature matrix; and each element in the initial probability matrix is used for representing a phrase in the target text information corresponding to each preset noise label as an initial probability value of the target noise text.
The full link layer may calculate the initial probability matrix using the following formula.
Y=X*W+B;
Wherein, Y is the initial probability matrix, with dimension [1, M]; X is the target feature matrix of the target text information, with dimension [1, K]; W is the pre-trained weight matrix, with dimension [K, M]; B is the pre-trained bias array, a one-dimensional array including M elements.
Each element in the initial probability matrix can be used to represent the initial probability value that the phrase in the target text information corresponding to each preset noise label is the target noise text. That is to say, each element in the calculated initial probability matrix corresponds to one preset noise label, and therefore to the predicted noise text for that label in the target text information; thus, the element can represent the possibility that the corresponding phrase in the target text information is the target noise text.
After obtaining the initial probability matrix, the fully-connected layer may take the initial probability matrix as an output, and thus input the initial probability matrix to the normalization layer.
For example, when the preset text length is 70 and the dimension of the target feature matrix of the target text information is [1,384], the full connection layer may output an initial probability matrix with dimension [1,2485].
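The formula Y = X*W + B can be sketched at toy scale (K = 2 and M = 2 here purely for illustration; the patent's example uses K = 384 and M = 2485, and the real W and B come from training):

```python
def fully_connected(x, w, b):
    """Compute Y = X*W + B for a [1,K] row vector x, a [K,M] weight
    matrix w and an M-element bias array b; returns the [1,M] row Y."""
    K, M = len(w), len(w[0])
    return [sum(x[k] * w[k][m] for k in range(K)) + b[m] for m in range(M)]

x = [1.0, 2.0]                  # toy target feature matrix, K = 2
w = [[0.5, -1.0], [0.25, 0.0]]  # toy weight matrix, [K, M] = [2, 2]
b = [0.1, 0.2]                  # toy bias array, M = 2
y = fully_connected(x, w, b)    # [1.1, -0.8]
```

Each output element is one row-by-column multiply-accumulate plus a bias term, which is why the result has one element per preset noise label.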
(8) Normalization layer
The normalization layer is used for normalizing the element values of the elements in the initial probability matrix through equal-scale reduction, so that the elements in the initial probability matrix are converted into the probability in a percentage form, a target probability matrix of the target text information is obtained, and the sum of the elements in the target probability matrix is 1.
Wherein, each element in the target probability matrix is: and mapping the target text information to the prediction probability of each preset noise label.
When the output of the full connection layer is the initial probability matrix with dimension [1, M], each element in the initial probability matrix can be reduced in equal proportion into a probability value in percentage form, so as to obtain the target probability matrix of the target text information. The elements in the target probability matrix of the target text information may be denoted C0, C1, C2, ..., CM-1, respectively, where CX is the prediction probability of mapping the target text information to the X-th preset noise label.
Optionally, the normalization layer may use a softmax (normalization) function to normalize each element in the initial probability matrix of the target text information, so as to obtain the target probability matrix of the target text information.
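A minimal softmax sketch (the standard max-shifted form; the input scores below are hypothetical initial probability values):

```python
import math

def softmax(initial_scores):
    """Normalize raw scores into probabilities that sum to 1, as the
    normalization layer does to the initial probability matrix."""
    shift = max(initial_scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in initial_scores]
    total = sum(exps)
    return [e / total for e in exps]

target_probabilities = softmax([2.0, 1.0, 0.1])
```

The largest input score yields the largest output probability, so the argmax over the target probability matrix is unchanged by the normalization.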
Optionally, in a specific implementation manner, on the basis of the above specific implementation manners, the training manner of the noise recognition model includes the following steps 21 to 24:
step 21: acquiring preset sample text information added with noise labels, wherein the noise labels are index positions of the noise texts in the sample text information;
step 22: for each piece of sample text information, inputting the sample text information into an initial model to be trained, and obtaining the probability that the sample text information is mapped to each preset noise label; if the preset noise label with the maximum probability matches the noise label of the sample text information, executing step 23; if it does not match, executing step 24;
step 23: proceeding to train with the next piece of sample text information;
step 24: the parameters of the initial model are adjusted and the process returns to step 22 until the initial model converges.
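The control flow of steps 21 to 24 can be sketched with a toy scorer standing in for the initial model (the per-label score table and the +1 update below are illustrative placeholders, not the patent's actual parameter adjustment):

```python
def train(samples, num_labels):
    # samples: list of (sample_id, noise_label) pairs -- step 21's labelled data
    scores = {}  # (sample_id, label) -> score; toy stand-in for model parameters
    for sid, true_label in samples:
        while True:
            row = [scores.get((sid, lab), 0.0) for lab in range(num_labels)]
            pred = row.index(max(row))          # step 22: label with maximum probability
            if pred == true_label:              # step 23: matched, move to next sample
                break
            # step 24: adjust "parameters" and return to step 22
            scores[(sid, true_label)] = scores.get((sid, true_label), 0.0) + 1.0
    return scores

model = train([(0, 2), (1, 0)], num_labels=3)
```

The loop structure mirrors the patent's convergence criterion: a sample is done when the argmax label equals its annotated noise label.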
The noise identification model may be trained by any type of electronic device, such as a laptop, a desktop computer, or a tablet computer; the embodiment of the present invention is not specifically limited in this respect. Such a device is hereinafter referred to as the training device. The training device and the electronic device that executes the noise extraction method provided by the embodiment of the present invention may be the same electronic device or different electronic devices. The electronic device that executes the noise extraction method provided by the embodiment of the present invention is simply referred to as the executing device.
When the training device and the executing device are the same device, the noise recognition model can be trained in that electronic device, and the obtained noise recognition model is then used on the same device to implement the noise extraction method provided by the embodiment of the present invention; when they are not the same electronic device, the training device may, after training, send the obtained noise recognition model to the executing device. Thus, after obtaining the noise identification model, the executing device may use it to implement the noise extraction method provided by the embodiment of the present invention.
The training device may first obtain preset sample text information added with a noise label, where the noise label of each sample text information is: the index position of the noise text in the sample text information.
The sample text information may be a sentence, or may be a word group or phrase composed of a plurality of words; both are reasonable. The sample text information may also be acquired in various ways: for example, it may be obtained directly from the local storage space, or from another, non-local storage space. All of these are reasonable.
In addition, in the embodiment of the present invention, in order to ensure the accuracy of the noise recognition model obtained by training, a large amount of sample text information may be used in the training process. Accordingly, a plurality of pieces of sample text information may be acquired. The number of pieces of sample text information may be set according to the requirements of the practical application, and is not specifically limited in the present invention. Moreover, the sample text information may include only one type, namely sentences, word groups or phrases, or may include at least two of these types. All of these are reasonable.
After the sample text information is obtained, the noise text in each sample text information can be further determined by methods such as manual identification, and the index position of the noise text in the sample text information is used as the noise label of the sample text information where the noise text is located, so that the noise label is added to each sample text information.
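Under the span-label scheme of the earlier embodiments, "the index position of the noise text" can be read as the index of the noise span in a fixed enumeration of all contiguous spans. A hypothetical labelling helper (the enumeration order and function names are assumptions, not specified by the patent):

```python
def span_labels(n):
    # every contiguous span [i, j) of a length-n text, in a fixed order
    return [(i, j) for i in range(n) for j in range(i + 1, n + 1)]

def noise_label(sample_text, noise_text):
    # noise label = index of the noise span among the preset span labels
    start = sample_text.find(noise_text)
    if start < 0:
        raise ValueError("noise text not found in sample")
    return span_labels(len(sample_text)).index((start, start + len(noise_text)))
```

For a 4-character sample there are 4 * 5 / 2 = 10 span labels, matching the counting rule used for the [1, 2485] example above.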
Thus, for each piece of sample text information, the sample text information can be input into the initial model to be trained, so as to obtain the probability that the sample text information is mapped to each preset noise label. Further, the preset noise label with the maximum probability can be determined from the obtained probabilities; that is, the predicted value of the index position of the noise text in the sample text information is obtained.
For each piece of sample text information, the noise label of the sample text information can be regarded as the true value of the index position of the noise text in the sample text information.
Therefore, for each piece of sample text information, whether the initial model converges can be determined according to whether the predicted value matches the true value of the index position of the noise text, and it can further be determined whether training can be stopped, so that the noise identification model is obtained.
Based on this, for each piece of sample text information, when the preset noise label with the maximum probability matches the noise label of the sample text information, it can be determined that the initial model converges for that sample text information, and training can proceed with the next piece of sample text information.
The next piece of sample text information is trained in the same manner as described above, which is not repeated here. When the initial model converges for all pieces of sample text information, it can be determined that training of the noise recognition model is finished, and training can be stopped to obtain the trained noise recognition model.
When the preset noise label with the maximum probability does not match the noise label of the sample text information, it can be determined that the initial model has not converged and needs further training. In this case, the parameters of the initial model are adjusted and the process returns to step 22: the sample text information is input into the initial model with the adjusted parameters, and the probability that the sample text information is mapped to each preset noise label is obtained again.
And circularly executing the process until the initial model converges, so as to obtain the noise identification model.
Generally, the purpose of noise extraction of target text information is to perform natural language processing on the target text information after noise removal to improve the accuracy of the obtained processing result, that is, noise extraction of the target text information is a preprocessing process for performing natural language processing on the target text information.
Based on this, optionally, in a specific implementation manner, as shown in fig. 9, the noise extraction method provided in the embodiment of the present invention may further include the following steps S104 to S105:
S104: deleting a target noise text in the target text information to obtain text information to be processed;
S105: performing, according to a preset processing mode, natural language processing on the text information to be processed to obtain a processing result related to the text information to be processed.
In this specific implementation manner, a processing manner of performing natural language processing on the target text information, for example, named entity recognition, intention recognition, and the like, may be set in advance according to requirements in practical applications.
Therefore, after the target noise text in the target text information is determined, the target noise text in the target text information can be deleted, the noise-removed target text information is obtained, and the text information to be processed is obtained. Furthermore, the natural language processing can be performed on the obtained text information to be processed according to a preset processing mode, so that a processing result about the text information to be processed is obtained. Wherein the obtained processing result can be used as the processing result of the target text information.
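The preprocessing of steps S104 to S105 amounts to deleting the identified span before handing the text to downstream natural language processing. A minimal sketch (function names are illustrative; the NLP step is a caller-supplied callable standing in for the preset processing mode):

```python
def preprocess(target_text, target_noise_text):
    # S104: delete the first occurrence of the target noise text
    i = target_text.find(target_noise_text)
    if i < 0:
        return target_text
    return target_text[:i] + target_text[i + len(target_noise_text):]

def process(target_text, target_noise_text, nlp):
    # S105: run the preset natural-language-processing step on the cleaned text
    return nlp(preprocess(target_text, target_noise_text))
```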
In this way, when the natural language processing is performed, the utilized text information to be processed is the target text information with the noise removed, so that the interference of the noise in the target text information on the natural language processing process can be reduced, and the accuracy of the obtained processing result of the target text information is improved.
Optionally, in a specific implementation manner, the step S105 may include the following step 31:
step 31: and inputting the text information to be processed into a pre-trained intention recognition model to obtain the user intention represented by the text information to be processed.
As shown in fig. 10, the intention recognition model includes: the device comprises an input layer, a character embedding layer, a convolution layer, a pooling layer, a fusion layer, a full-connection layer and an output layer.
In this specific implementation manner, the text information to be processed may be input into a pre-trained intent recognition model, so that the user intent represented by the text information to be processed is obtained through recognition of the text to be processed by the intent recognition model.
Optionally, in a specific implementation manner, the step S105 may include the following step 32:
step 32: and inputting the text information to be processed into a pre-trained named entity recognition model to obtain a named entity recognition result of the text information to be processed.
As shown in fig. 11, the named entity recognition model includes: the system comprises an input layer, a word embedding layer, a bidirectional long-short term memory network (LSTM) layer, a full connection layer, a Conditional Random Field (CRF) layer and an output layer.
In this specific implementation manner, the text information to be processed may be input into a pre-trained named entity recognition model, so that the named entity recognition result of the text information to be processed is obtained by recognizing the text information to be processed through the named entity recognition model.
Corresponding to the noise extraction method provided by the embodiment of the invention, the embodiment of the invention provides an instruction identification method.
Fig. 12 is a schematic flowchart of an instruction recognition method according to an embodiment of the present invention. As shown in fig. 12, the instruction recognition method may include the following steps S1201 to S1205:
S1201: determining a noise text in the instruction text information corresponding to the target instruction by using any noise extraction method provided by the embodiment of the present invention;
S1202: deleting the determined noise text in the instruction text information to obtain text information to be recognized;
S1203: inputting the text information to be recognized into a pre-trained intention recognition model to obtain the target user intention represented by the text information to be recognized;
S1204: inputting the text information to be recognized into a pre-trained named entity recognition model to obtain a target named entity recognition result of the text information to be recognized;
S1205: executing the target instruction based on the target user intention and the target named entity recognition result.
The user can generally control the smart device through various control instructions, for example, control the smart home device through voice. Therefore, in the control process, the intelligent device needs to perform intention identification and named entity identification on the control instruction sent by the user, so that the control instruction sent by the user is executed according to the obtained identification result.
In this way, after the target instruction is detected, the instruction text information corresponding to the target instruction may be obtained first, and further, in order to avoid an influence of noise existing in the instruction text information on an accuracy of a recognition result of the instruction text information, a noise text in the instruction text information may be determined by further using any one of the noise extraction methods provided in the embodiments of the present invention.
Then, the determined noise text in the instruction text information is deleted to obtain the text information to be recognized. The text information to be recognized can thus be input into the pre-trained intention recognition model to obtain the target user intention it represents, and into the pre-trained named entity recognition model to obtain its target named entity recognition result.
After obtaining the target user intention and the target named entity recognition result, the detected target instruction can be executed based on the target user intention and the target named entity recognition result.
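The S1201-S1205 pipeline can be sketched end to end, with the noise extractor and the two recognition models passed in as callables (all stand-ins; in the patent these are the trained networks described above):

```python
def recognize_and_execute(instruction_text, extract_noise, intent_model,
                          ner_model, execute):
    noise = extract_noise(instruction_text)            # S1201: find the noise text
    i = instruction_text.find(noise) if noise else -1
    to_recognize = (instruction_text[:i] + instruction_text[i + len(noise):]
                    if i >= 0 else instruction_text)   # S1202: delete noise text
    intent = intent_model(to_recognize)                # S1203: target user intent
    entities = ner_model(to_recognize)                 # S1204: named entity results
    return execute(intent, entities)                   # S1205: execute the instruction
```

A usage example with trivial lambdas: `recognize_and_execute("uh play music", lambda t: "uh ", lambda t: ("play", t), lambda t: [t.split()[-1]], lambda i, e: (i[0], e[0]))`.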
Based on this, by applying the scheme provided by the embodiment of the present invention, when the target instruction is executed, the noise in the target instruction can be removed by any of the noise extraction methods provided by the above embodiments of the present invention, so that the accuracy of identifying the target instruction is improved, and the detected target instruction can be executed more accurately.
Corresponding to the noise extraction method and the instruction identification method provided by the above embodiments of the present invention, an embodiment of the present invention further provides an electronic device, as shown in fig. 13, including a processor 1301, a communication interface 1302, a memory 1303 and a communication bus 1304, where the processor 1301, the communication interface 1302 and the memory 1303 complete mutual communication via the communication bus 1304,
a memory 1303 for storing a computer program;
the processor 1301 is configured to, when executing the program stored in the memory 1303, implement the steps of any noise extraction method provided in the foregoing embodiment of the present invention, and/or implement the instruction identification method provided in the foregoing embodiment of the present invention.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industrial Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but also Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components.
In a further embodiment of the present invention, a computer-readable storage medium is further provided, in which a computer program is stored, and the computer program is executed by a processor to perform the steps of any one of the noise extraction methods provided in the above embodiments of the present invention, and/or the instruction identification method provided in the above embodiments of the present invention.
In a further embodiment provided by the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the steps of any of the noise extraction methods provided in the above-described embodiments of the present invention, and/or the instruction identification method provided in the above-described embodiments of the present invention.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that incorporates one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the electronic device embodiment, the computer-readable storage medium and the computer program product, since they are substantially similar to the method embodiment, the description is relatively simple, and the relevant points can be referred to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method of noise extraction, the method comprising:
acquiring target text information corresponding to the target voice data;
inputting the target text information into a pre-trained noise identification model to obtain the prediction probability of mapping the target text information to each preset noise label; the preset noise label is used for representing an index position of a predicted noise text, the predicted noise text is a word group in the target text information, and the word group is a combination of one word or a plurality of continuous words in the target text information;
and determining the predicted noise text corresponding to the preset noise label with the maximum prediction probability as the target noise text.
2. The method according to claim 1, wherein each preset noise label is determined by a preset text length, and the generating manner of each preset noise label comprises:
determining position labels of all bits forming a preset text length;
and taking the position label of any one bit or the position labels of any continuous multiple bits as a preset noise label.
3. The method according to claim 2, wherein the obtaining target text information corresponding to the target voice data comprises:
acquiring a voice data text corresponding to target voice data;
if the length of the voice data text is equal to the preset text length, sequentially filling the voice data text into each position forming the preset text length to obtain the target text information;
if the length of the voice data text is larger than the preset text length, acquiring text information which starts from a first word and has a length equal to the preset text length in the voice data text, and sequentially filling each bit forming the preset text length to obtain the target text information;
if the length of the voice data text is smaller than the preset text length, adding at least one designated character after the last character of the voice data text, and filling each position forming the preset text length in sequence to obtain the target text information; and the sum of the length of the voice data text and the length of the at least one designated character is the preset text length.
4. The method according to any one of claims 1-3, wherein the inputting the target text information into a pre-trained noise recognition model, and the obtaining the prediction probability of the target text information mapping to each preset noise label comprises:
inputting the target text information into a feature extraction network in a noise identification model to obtain target features of the target text information;
and inputting the target characteristics into a classification network in the noise identification model to obtain the prediction probability of mapping the characteristics of the target text information to each preset noise label.
5. The method of claim 4, wherein the feature extraction network comprises: the device comprises an input layer, a character embedding layer, a convolution layer, an activation layer, a pooling layer and a fusion layer;
the input layer is used for generating a target array corresponding to the target text information; wherein, each element in the target array is: the index value of each word in the target text information;
the word embedding layer is used for generating a coding matrix corresponding to the target array; wherein, each element in the coding matrix is: a word vector for a word characterized by each index value in the target array;
the convolutional layer is used for respectively extracting the characteristics of the coding matrix by utilizing various convolutional kernels to obtain a plurality of initial characteristic matrixes of the target text information;
the activation layer is used for respectively activating each initial feature matrix by using a preset activation function to obtain a plurality of activation feature matrices of the target text information;
the pooling layer is used for respectively compressing each activated feature matrix by preset dimensionality according to a preset down-sampling mode to obtain a plurality of down-sampling feature matrices of the target text information after the dimensionality is compressed;
and the fusion layer is used for fusing the plurality of down-sampling feature matrixes to obtain a target feature matrix of the target text information as a target feature of the target text information.
6. The method of claim 5, wherein the classification network comprises: a fully connected layer and a normalization layer;
the full connection layer is used for calculating an initial probability matrix by using the target characteristic matrix; each element in the initial probability matrix is used for representing that a phrase in the target text information corresponding to each preset noise label is an initial probability value of a target noise text;
the normalization layer is used for normalizing each element in the initial probability matrix to obtain a target probability matrix of the target text information; wherein each element in the target probability matrix is the prediction probability that the target text information is mapped to the corresponding preset noise label.
7. The method of claim 4, wherein the training of the noise recognition model comprises:
acquiring preset sample text information added with noise labels; wherein the noise label is an index position of the noise text in the sample text information;
for each piece of sample text information, inputting the sample text information into an initial model to be trained, and obtaining the probability that the sample text information is mapped to each preset noise label;
if the preset noise label with the maximum probability is matched with the noise label of the sample text information, training the next sample text information;
and if the preset noise label with the maximum probability is not matched with the noise label of the sample text information, adjusting the parameters of the initial model, returning to the step of inputting the sample text information into the initial model to be trained and obtaining the probability of mapping the sample text information to each preset noise label until the initial model converges.
8. The method according to any one of claims 1-4, further comprising:
deleting a target noise text in the target text information to obtain text information to be processed;
and according to a preset processing mode, performing natural language processing on the text information to be processed to obtain a processing result related to the text information to be processed.
9. An instruction recognition method, the method comprising:
determining target noise text information in the instruction text information corresponding to the target instruction by using the noise extraction method of any one of claims 1 to 8;
deleting the target noise text information in the instruction text information to obtain text information to be recognized;
inputting the text information to be recognized into a pre-trained intention recognition model to obtain the intention of a target user represented by the text information to be recognized;
inputting the text information to be recognized into a pre-trained named entity recognition model to obtain a target named entity recognition result of the text information to be recognized;
executing the target instruction based on the target user intent and the target named entity recognition result.
10. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1-9 when executing a program stored in the memory.
CN202110832253.5A 2021-07-22 2021-07-22 Noise extraction and instruction identification method and electronic equipment Pending CN113571052A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110832253.5A CN113571052A (en) 2021-07-22 2021-07-22 Noise extraction and instruction identification method and electronic equipment

Publications (1)

Publication Number Publication Date
CN113571052A true CN113571052A (en) 2021-10-29

Family

ID=78166433

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110832253.5A Pending CN113571052A (en) 2021-07-22 2021-07-22 Noise extraction and instruction identification method and electronic equipment

Country Status (1)

Country Link
CN (1) CN113571052A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516247A (en) * 2019-08-27 2019-11-29 湖北亿咖通科技有限公司 Name entity recognition method neural network based and computer storage medium
US20200175304A1 (en) * 2018-11-30 2020-06-04 Tata Consultancy Services Limited Method and system for information extraction from document images using conversational interface and database querying
CN111460807A (en) * 2020-03-13 2020-07-28 平安科技(深圳)有限公司 Sequence labeling method and device, computer equipment and storage medium
CN111694946A (en) * 2020-05-27 2020-09-22 平安银行股份有限公司 Text keyword visual display method and device and computer equipment
CN111967264A (en) * 2020-08-26 2020-11-20 湖北亿咖通科技有限公司 Named entity identification method
CN112270196A (en) * 2020-12-14 2021-01-26 完美世界(北京)软件科技发展有限公司 Entity relationship identification method and device and electronic equipment
CN112580329A (en) * 2019-09-30 2021-03-30 北京国双科技有限公司 Text noise data identification method and device, computer equipment and storage medium
WO2021068329A1 (en) * 2019-10-10 2021-04-15 平安科技(深圳)有限公司 Chinese named-entity recognition method, device, and computer-readable storage medium
CN112749531A (en) * 2021-01-13 2021-05-04 北京声智科技有限公司 Text processing method and device, computer equipment and computer readable storage medium
US20210201196A1 (en) * 2019-12-27 2021-07-01 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for training machine reading comprehension model, and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822020A (en) * 2021-11-22 2021-12-21 Hubei Ecarx Technology Co., Ltd. Text processing method, text processing apparatus, storage medium, and program product
CN113822020B (en) * 2021-11-22 2022-07-08 Ecarx (Hubei) Tech Co., Ltd. Text processing method, text processing device, and storage medium
CN114861650A (en) * 2022-04-13 2022-08-05 Dazhen (Hangzhou) Technology Co., Ltd. Method and device for cleaning noise data, storage medium, and electronic equipment
CN114861650B (en) * 2022-04-13 2024-04-26 Dazhen (Hangzhou) Technology Co., Ltd. Noise data cleaning method and device, storage medium, and electronic equipment

Similar Documents

Publication Publication Date Title
JP6955580B2 (en) Automatic document summary extraction method, apparatus, computer device and storage medium
CN108959246B (en) Answer selection method and device based on improved attention mechanism and electronic equipment
CN110765265B (en) Information classification extraction method and device, computer equipment and storage medium
CN110298019B (en) Named entity recognition method, device, equipment and computer readable storage medium
CN109190120B (en) Neural network training method and device and named entity identification method and device
CN112732911B (en) Semantic recognition-based speaking recommendation method, device, equipment and storage medium
CN112613308B (en) User intention recognition method, device, terminal equipment and storage medium
CN111221944B (en) Text intention recognition method, device, equipment and storage medium
CN110598206A (en) Text semantic recognition method and device, computer equipment and storage medium
CN110704588A (en) Multi-round dialogue semantic analysis method and system based on long short-term memory (LSTM) network
CN111967264B (en) Named entity identification method
CN110569500A (en) Text semantic recognition method and device, computer equipment and storage medium
CN109902303B (en) Entity identification method and related equipment
CN111985228B (en) Text keyword extraction method, text keyword extraction device, computer equipment and storage medium
CN109344242B (en) Dialogue question-answering method, device, equipment and storage medium
CN111583911B (en) Speech recognition method, device, terminal and medium based on label smoothing
CN111274797A (en) Intention recognition method, device and equipment for terminal and storage medium
CN113571052A (en) Noise extraction and instruction identification method and electronic equipment
CN112417855A (en) Text intention recognition method and device and related equipment
CN112528029A (en) Text classification model processing method and device, computer equipment and storage medium
CN111611807A (en) Keyword extraction method and device based on neural network and electronic equipment
CN112002310B (en) Domain language model construction method, device, computer equipment and storage medium
CN110598210A (en) Entity recognition model training method, entity recognition device, entity recognition equipment and medium
CN112364650A (en) Entity relationship joint extraction method, terminal and storage medium
CN115730597A (en) Multi-level semantic intention recognition method and related equipment thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220329

Address after: 430051, No. b1336, Chuanggu startup area, Taizihu Cultural Digital Creative Industry Park, No. 18 Shenlong Avenue, Wuhan Economic and Technological Development Zone, Hubei Province

Applicant after: Ecarx (Hubei) Tech Co., Ltd.

Address before: 430056, Building B (qdxx-f7b), Building No. 7, Qiedixiexin Science and Technology Innovation Park, South Taizihu Innovation Valley, Wuhan Economic and Technological Development Zone, Hubei Province

Applicant before: Hubei Ecarx Technology Co., Ltd.