CN112802452A

CN112802452A - Junk instruction identification method and device

Info

Publication number: CN112802452A
Application number: CN202011521158.5A
Authority: CN
Inventors: 胡晓慧; 孟振南; 雷欣; 李志飞
Original assignee: Go Out And Ask Wuhan Information Technology Co ltd
Current assignee: Volkswagen China Investment Co Ltd; Mobvoi Innovation Technology Co Ltd
Priority date: 2020-12-21
Filing date: 2020-12-21
Publication date: 2021-05-14

Abstract

Disclosed are a method and device for identifying junk instructions. The method includes: acquiring audio information; converting the audio information into text information; extracting audio features of the audio information to generate an audio feature set; obtaining a feature vector of the text information by using a pre-trained text model; and inputting the audio feature set and the feature vector into a deep neural network classifier, and determining, according to the output of the deep neural network, whether the audio information is a junk instruction.

Description

Junk instruction identification method and device

Technical Field

The application relates to the technical field of natural language processing, in particular to a junk instruction identification method and device.

Background

At present, most smart devices have a speech recognition function, which operates in one of two states: a wake-free state and a wake-up state. The main difference between them is that, in the wake-up state, the user must first speak a wake word to wake the smart device and only then speak a command. Speech received after the device has been woken can be treated as a valid instruction, so the device directly identifies the function requested by the instruction (such as checking the weather or playing music). In the wake-free state, by contrast, a single wake-up is enough to sustain a continuous conversation, and the user does not need to repeat the wake word for every turn, which gives a better user experience.

However, in the wake-free state the smart device must itself recognize whether the received audio is an instruction directed at it, filter out invalid interference, and only then respond. How to identify reliably whether received audio is a junk instruction is therefore a problem that needs to be solved.

Disclosure of Invention

To solve the above problem, the present invention provides a junk instruction identification method and apparatus that can identify with high quality whether received audio is a junk instruction, thereby improving the accuracy of audio recognition by a smart device in the wake-free state and improving user experience.

To achieve the above object, in a first aspect, an embodiment of the present invention provides a junk instruction identification method, the method including:

acquiring audio information;

converting the audio information into text information;

extracting audio features of the audio information to generate an audio feature set;

acquiring a feature vector of the text information by using a pre-trained text model;

and inputting the audio feature set and the feature vector into a deep neural network classifier, and determining, according to the output of the deep neural network, whether the audio information is a junk instruction.
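The five steps above can be sketched as a single function. This is a minimal illustration only: the patent does not specify interfaces, so the callables `asr`, `extract_audio_features`, `text_model` and `classifier`, the convention that the classifier returns a junk probability, and the 0.5 decision threshold are all assumptions.

```python
from typing import Callable, Sequence

def identify_junk_instruction(
    audio: Sequence[float],
    asr: Callable[[Sequence[float]], str],
    extract_audio_features: Callable[[Sequence[float], str], list[float]],
    text_model: Callable[[str], list[float]],
    classifier: Callable[[list[float]], float],
    threshold: float = 0.5,  # assumed decision threshold
) -> bool:
    """Return True if the audio is judged to be a junk instruction (illustrative sketch)."""
    # Convert the acquired audio information into text information.
    text = asr(audio)
    # Extract audio features; the transcript is passed in because the
    # "voice text features" presumably depend on it (assumption).
    audio_features = extract_audio_features(audio, text)
    # Obtain the text feature vector from a pre-trained text model.
    text_vector = text_model(text)
    # Feed both feature groups to the classifier as one flat input.
    junk_probability = classifier(audio_features + text_vector)
    return junk_probability >= threshold
```

Injecting each component as a callable keeps the sketch independent of any particular ASR engine, text model, or neural network library.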

Preferably, after the audio information is acquired, the method further includes: if the audio information cannot be converted into text information, determining that the audio information is a junk instruction and discarding the audio information.

Preferably, after the audio feature set and the feature vector are input into the deep neural network classifier and it is determined, according to the output of the deep neural network, whether the audio information is a junk instruction, the method further includes: if the audio information is not a junk instruction, performing natural language understanding on the text information and executing an action corresponding to the audio information; and if the audio information is a junk instruction, discarding the audio information.

Preferably, inputting the audio feature set and the feature vector into the deep neural network classifier and determining, according to the output of the deep neural network, whether the audio information is a junk instruction includes: combining the audio feature set and the feature vector into a one-dimensional feature, inputting the one-dimensional feature into the deep neural network classifier, and determining, according to the output of the deep neural network, whether the audio information is a junk instruction.

Preferably, the audio features include: voice audio features, voice text features, and voice duration.

In a second aspect, an embodiment of the present invention provides a junk instruction identification apparatus, the apparatus including:

a first acquisition unit configured to acquire audio information;

a conversion unit configured to convert the audio information into text information;

a generating unit configured to extract audio features of the audio information to generate an audio feature set;

a second acquisition unit configured to obtain a feature vector of the text information by using a pre-trained text model;

and a determining unit configured to input the audio feature set and the feature vector into a deep neural network classifier and determine, according to the output of the deep neural network, whether the audio information is a junk instruction.

Preferably, the apparatus further includes a discarding unit configured to determine that the audio information is a junk instruction and discard the audio information if the audio information cannot be converted into text information.

Preferably, the apparatus further includes: an execution unit configured to perform natural language understanding on the text information and execute the action corresponding to the audio information if the audio information is not a junk instruction; and a discarding unit configured to discard the audio information if the audio information is a junk instruction.

Preferably, the determining unit is specifically configured to combine the audio feature set and the feature vector into a one-dimensional feature, input the one-dimensional feature into the deep neural network classifier, and determine, according to the output of the deep neural network, whether the audio information is a junk instruction.

In a third aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program, the computer program being used to execute the junk instruction identification method according to the first aspect.

In a fourth aspect, an embodiment of the present invention provides an electronic device, including:

a processor;

a memory for storing the processor-executable instructions;

the processor being configured to read the executable instructions from the memory and execute the instructions to implement the junk instruction identification method according to the first aspect.

With the junk instruction identification method and apparatus provided by the invention, the audio features of the received audio information are combined with the features of the corresponding text information, and both are used together as the input of a deep neural network classifier that performs the identification. Whether the received audio is a junk instruction can therefore be identified with high quality, so that in the wake-free state the smart device can effectively filter out invalid content, accurately identify user instructions, and provide a better user experience.

Drawings

The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.

Fig. 1 is a schematic flowchart of a junk instruction identification method according to an exemplary embodiment of the present application;

Fig. 2 is a block diagram of a junk instruction identification apparatus according to an exemplary embodiment of the present application;

Fig. 3 is a block diagram of another junk instruction identification apparatus according to an exemplary embodiment of the present application;

Fig. 4 is a block diagram of an electronic device according to an exemplary embodiment of the present application.

Detailed Description

Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein.

Fig. 1 is a schematic flowchart of a junk instruction identification method according to an embodiment of the present application. The method can be applied to an electronic device and, as shown in Fig. 1, includes the following steps:

Step 101, acquiring audio information.

In one example, the junk instruction identification method is applied while the electronic device is in the wake-free state. In this scenario, the acquired audio information may include background sound or speech, and the speech may be either a valid instruction or the user's casual chat.

Step 102, converting the audio information into text information.

Specifically, the audio information may be recognized by an Automatic Speech Recognition (ASR) module in the electronic device and converted into text information.

Not all audio information can be recognized and converted into text information; noisy background sound is one example. If the received audio information cannot be converted into text information, it can be regarded as a junk instruction. On this basis, the method may further include:

if the audio information cannot be converted into text information, determining that the audio information is a junk instruction and discarding the audio information.
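A sketch of this early-rejection gate, assuming a hypothetical `asr` callable that returns `None` (or an empty string) when nothing can be recognized. Treating a recognizer exception or a whitespace-only transcript as "cannot be converted" is an assumption made for this sketch, not something the description states.

```python
from typing import Callable, Optional, Sequence

def transcribe_or_discard(
    audio: Sequence[float],
    asr: Callable[[Sequence[float]], Optional[str]],
) -> Optional[str]:
    """Return the transcript, or None if the audio should be discarded as junk."""
    try:
        text = asr(audio)
    except Exception:
        # Assumption: a recognizer failure counts as "cannot be converted".
        return None
    # Assumption: no transcript, or only whitespace, also counts as unconvertible.
    if text is None or not text.strip():
        return None
    return text
```

A `None` result lets the caller drop the audio before any feature extraction or classification is spent on it.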

Step 103, extracting audio features of the audio information to generate an audio feature set.

The audio features include, but are not limited to, voice audio features, voice text features, and voice duration. The extracted audio features are then accumulated and combined to obtain the audio feature set.

Step 104, obtaining a feature vector of the text information by using the pre-trained text model.

It should be noted that the feature vector of the text information can be obtained using existing techniques, which are not described in detail here.

Step 105, inputting the audio feature set and the feature vector into a deep neural network classifier, and determining, according to the output of the deep neural network, whether the audio information is a junk instruction.
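The description does not define the individual audio features, so the sketch below picks illustrative stand-ins: RMS energy and zero-crossing rate as hypothetical "voice audio features", non-whitespace character count and characters per second as hypothetical "voice text features" (characters rather than words, since Chinese text is not space-delimited), and duration in seconds. The values are then concatenated into one flat vector, which is one possible reading of "accumulated and combined".

```python
import math
from typing import Sequence

def extract_audio_feature_set(
    samples: Sequence[float], sample_rate: int, transcript: str
) -> list[float]:
    """Build a flat audio feature set of length m = 5 (illustrative choices only)."""
    n = len(samples)
    duration = n / sample_rate  # voice duration in seconds
    # Hypothetical "voice audio features": RMS energy and zero-crossing rate.
    rms = math.sqrt(sum(s * s for s in samples) / n) if n else 0.0
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0))
    zcr = crossings / (n - 1) if n > 1 else 0.0
    # Hypothetical "voice text features": character count and speaking rate.
    chars = sum(1 for c in transcript if not c.isspace())
    rate = chars / duration if duration > 0 else 0.0
    # Combine everything into one one-dimensional feature.
    return [rms, zcr, float(chars), rate, duration]
```

For example, four samples `[1.0, -1.0, 1.0, -1.0]` at 4 Hz with the transcript `"play music"` give `[1.0, 1.0, 9.0, 9.0, 1.0]`.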
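Since the text model is left to prior art, the following is only a stand-in: it mean-pools pre-trained token embeddings, supplied as a hypothetical `embeddings` lookup table, into a fixed-length vector of length n. A real system would more likely run a pre-trained sentence encoder; tokenization is also assumed to happen upstream.

```python
from typing import Mapping, Sequence

def text_feature_vector(
    tokens: Sequence[str],
    embeddings: Mapping[str, Sequence[float]],
    dim: int,
) -> list[float]:
    """Mean-pool pre-trained token embeddings into one vector of length n = dim."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim  # no known tokens: fall back to a zero vector (assumption)
    # Average each dimension over all known tokens.
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
```

Whatever model is used, the property the later splicing step relies on is that n stays fixed regardless of utterance length.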

Specifically, inputting the audio feature set and the feature vector into the deep neural network classifier and determining, according to its output, whether the audio information is a junk instruction includes:

combining the audio feature set and the feature vector into a one-dimensional feature, inputting the one-dimensional feature into the deep neural network classifier, and determining, according to the output of the deep neural network, whether the audio information is a junk instruction.

In a specific example, the audio feature set is a one-dimensional feature 1 of length m, and the feature vector of the text information is a one-dimensional feature 2 of length n. Feature 1 and feature 2 are spliced into a one-dimensional feature of length (m + n), which is used as the input of the deep neural network classifier, and whether the audio information is a junk instruction is determined according to the output of the deep neural network.
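The splicing of feature 1 (length m) and feature 2 (length n) and a forward pass through a small classifier can be sketched as follows. The network shape (one hidden ReLU layer, sigmoid output read as a junk probability) and the weight arguments are assumptions; the description specifies neither the architecture of the deep neural network nor how it is trained.

```python
import math
from typing import Sequence

def splice(feature_1: Sequence[float], feature_2: Sequence[float]) -> list[float]:
    """Splice feature 1 (length m) and feature 2 (length n) into one vector of length m + n."""
    return list(feature_1) + list(feature_2)

def _sigmoid(z: float) -> float:
    # Numerically stable logistic function (avoids overflow for large |z|).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def junk_probability(
    x: Sequence[float],
    w1: Sequence[Sequence[float]],  # hidden-layer weights, one row per hidden unit
    b1: Sequence[float],            # hidden-layer biases
    w2: Sequence[float],            # output weights, one per hidden unit
    b2: float,                      # output bias
) -> float:
    """Forward pass of an assumed one-hidden-layer network; returns P(junk)."""
    if any(len(row) != len(x) for row in w1):
        raise ValueError("each hidden-layer row must have length m + n")
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b) for row, b in zip(w1, b1)]
    return _sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)
```

The length check matters because `zip` would otherwise silently truncate a mismatched input instead of failing.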

In one example, the method may further comprise:

if the audio information is not a junk instruction, performing natural language understanding on the text information and executing the action corresponding to the audio information;

if the audio information is a junk instruction, discarding the audio information.
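The branch after classification can be sketched as below. The `nlu` callable (text to intent name) and the `actions` table (intent name to handler) are hypothetical placeholders for whatever natural language understanding and execution components the device provides; returning `None` for an unknown intent is likewise an assumption.

```python
from typing import Callable, Mapping, Optional

def handle_decision(
    text: str,
    is_junk: bool,
    nlu: Callable[[str], str],
    actions: Mapping[str, Callable[[str], str]],
) -> Optional[str]:
    """Execute the action for a valid instruction, or discard a junk one (returns None)."""
    if is_junk:
        return None  # discard: no NLU, no action
    intent = nlu(text)  # natural language understanding on the text information
    handler = actions.get(intent)
    if handler is None:
        return None  # no matching action (assumption: ignore silently)
    return handler(text)  # execute the action corresponding to the audio
```

Running NLU only after the junk check means the more expensive understanding step is skipped for discarded audio.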

With the junk instruction identification method provided by the embodiment of the invention, the audio features of the received audio information are combined with the features of the corresponding text information, and both are used together as the input of a deep neural network classifier that performs the identification. Whether the received audio is a junk instruction can therefore be identified with high quality, so that in the wake-free state the smart device can effectively filter out invalid content, accurately identify user instructions, and provide a better user experience.

An embodiment of the present invention provides a junk instruction identification apparatus, whose structure is shown in Fig. 2. The apparatus can be applied to an electronic device. As shown in Fig. 2, the junk instruction identification apparatus includes:

a first acquisition unit 201 for acquiring audio information;

a conversion unit 202, configured to convert the audio information into text information;

a generating unit 203, configured to extract audio features of the audio information to generate an audio feature set;

a second obtaining unit 204, configured to obtain a feature vector of the text information by using a pre-trained text model;

a determining unit 205, configured to input the audio feature set and the feature vector into a deep neural network classifier and determine, according to the output of the deep neural network, whether the audio information is a junk instruction.

Preferably, as shown in Fig. 3, the apparatus further includes a discarding unit 206, configured to determine that the audio information is a junk instruction and discard the audio information if the audio information cannot be converted into text information.

Preferably, as shown in Fig. 3, the apparatus further includes: an execution unit 207, configured to perform natural language understanding on the text information and execute the action corresponding to the audio information if the audio information is not a junk instruction; and the discarding unit 206, configured to discard the audio information if the audio information is a junk instruction.

Preferably, the determining unit 205 is specifically configured to combine the audio feature set and the feature vector into a one-dimensional feature, input the one-dimensional feature into the deep neural network classifier, and determine, according to the output of the deep neural network, whether the audio information is a junk instruction.

With the junk instruction identification apparatus provided by the invention, the audio features of the received audio information are combined with the features of the corresponding text information, and both are used together as the input of a deep neural network classifier that performs the identification. Whether the received audio is a junk instruction can therefore be identified with high quality, so that in the wake-free state the smart device can effectively filter out invalid content, accurately identify user instructions, and provide a better user experience.

Next, an electronic device 11 according to an embodiment of the present application is described with reference to Fig. 4.

As shown in fig. 4, the electronic device 11 includes one or more processors 111 and memory 112.

The processor 111 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 11 to perform desired functions.

Memory 112 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 111 to implement the junk instruction identification methods of the various embodiments of the application described above and/or other desired functions. Various contents such as an input signal, a signal component, and a noise component may also be stored in the computer-readable storage medium.

In one example, the electronic device 11 may further include: an input device 113 and an output device 114, which are interconnected by a bus system and/or other form of connection mechanism (not shown).

The input device 113 may include, for example, a keyboard, a mouse, and the like.

The output device 114 may output various information including the determined distance information, direction information, and the like to the outside. The output devices 114 may include, for example, a display, speakers, a printer, and a communication network and remote output devices connected thereto, among others.

Of course, for the sake of simplicity, only some of the components of the electronic device 11 relevant to the present application are shown in fig. 4, and components such as buses, input/output interfaces, and the like are omitted. In addition, the electronic device 11 may include any other suitable components, depending on the particular application.

Exemplary computer program product and computer-readable storage medium

In addition to the above-described methods and apparatus, embodiments of the present application may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps of the junk instruction identification method according to the various embodiments of the present application described in the "exemplary methods" section of this specification.

The program code of the computer program product for performing the operations of embodiments of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ and conventional procedural programming languages such as the "C" programming language. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.

Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps of the junk instruction identification method according to the various embodiments of the present application described in the "exemplary methods" section of this specification.

The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The foregoing describes the general principles of the present application in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present application are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present application. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the foregoing disclosure is not intended to be exhaustive or to limit the disclosure to the precise details disclosed.

The block diagrams of devices, apparatuses, and systems referred to in this application are given only as illustrative examples and are not intended to require or imply that they must be connected, arranged, or configured in the manner shown. As those skilled in the art will appreciate, these devices, apparatuses, and systems may be connected, arranged, or configured in any manner. Words such as "including," "comprising," and "having" are open-ended terms that mean "including, but not limited to," and may be used interchangeably with that phrase. The word "or" as used herein means, and may be used interchangeably with, the phrase "and/or," unless the context clearly dictates otherwise. The phrase "such as" as used herein means, and may be used interchangeably with, the phrase "such as but not limited to".

It should also be noted that in the devices, apparatuses, and methods of the present application, the components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application.

The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims

1. A junk instruction identification method, characterized in that the method comprises:

acquiring audio information;

converting the audio information into text information;

extracting audio features of the audio information to generate an audio feature set;

obtaining a feature vector of the text information by using a pre-trained text model;

and inputting the audio feature set and the feature vector into a deep neural network classifier, and determining, according to the output of the deep neural network, whether the audio information is a junk instruction.

2. The method according to claim 1, wherein after the acquiring audio information, the method further comprises:

If the audio information cannot be converted into text information, it is determined that the audio information is a junk instruction, and the audio information is discarded.

3. The method according to claim 1, wherein, after inputting the audio feature set and the feature vector into the deep neural network classifier and determining, according to the output of the deep neural network, whether the audio information is a junk instruction, the method further comprises:

if the audio information is not a junk instruction, performing natural language understanding on the text information and executing an action corresponding to the audio information;

If the audio information is a junk instruction, the audio information is discarded.

4. The method according to claim 1, wherein inputting the audio feature set and the feature vector into the deep neural network classifier and determining, according to the output of the deep neural network, whether the audio information is a junk instruction comprises:

The audio feature set and the feature vector are synthesized into a one-dimensional feature, and the one-dimensional feature is used as an input to a deep neural network classifier, and whether the audio information is a junk instruction is determined according to the output of the deep neural network.

5. The method according to claim 1, wherein the audio features comprise: voice audio features, voice text features, and voice duration.

6. A junk instruction identification device, wherein the device comprises:

a first acquisition unit for acquiring audio information;

a conversion unit for converting the audio information into text information;

a generating unit for extracting audio features of the audio information to generate an audio feature set;

a second acquisition unit for obtaining a feature vector of the text information by using a pre-trained text model;

and a determining unit for inputting the audio feature set and the feature vector into a deep neural network classifier and determining, according to the output of the deep neural network, whether the audio information is a junk instruction.

7. The apparatus of claim 6, wherein the apparatus further comprises:

a discarding unit, configured to determine that the audio information is a junk instruction if the audio information cannot be converted into text information, and discard the audio information.

8. The apparatus according to claim 6, wherein the apparatus further comprises:

an execution unit, configured to perform natural language understanding on the text information if the audio information is not a junk instruction, and execute an action corresponding to the audio information;

a discarding unit, configured to discard the audio information if the audio information is a junk instruction.

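9. The apparatus according to claim 1, wherein the determining unit is specifically configured to:

combine the audio feature set and the feature vector into a one-dimensional feature, input the one-dimensional feature into the deep neural network classifier, and determine, according to the output of the deep neural network, whether the audio information is a junk instruction.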

10. The device according to claim 1, wherein the audio features comprise: voice audio features, voice text features, and voice duration.

11. A computer-readable storage medium, wherein the storage medium stores a computer program, and the computer program is used to execute the junk instruction identification method according to any one of claims 1 to 5.

12. An electronic device comprising:

a processor;

a memory for storing the processor-executable instructions;

wherein the processor is configured to read the executable instructions from the memory and execute the instructions to implement the junk instruction identification method according to any one of claims 1 to 5.