CN118283339A - Display device, server and voice instruction recognition method - Google Patents


Publication number
CN118283339A
CN118283339A
Authority
CN
China
Prior art keywords
audio
signal
display device
characteristic
characteristic sound
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410522155.5A
Other languages
Chinese (zh)
Inventor
崔保磊
张大钊
李含珍
杜永花
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense Visual Technology Co Ltd
Original Assignee
Hisense Visual Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense Visual Technology Co Ltd filed Critical Hisense Visual Technology Co Ltd
Priority to CN202410522155.5A priority Critical patent/CN118283339A/en
Publication of CN118283339A publication Critical patent/CN118283339A/en
Pending legal-status Critical Current

Landscapes

  • User Interface Of Digital Computer (AREA)

Abstract

The application provides a display device, a server, and a voice command recognition method. In response to a voice recognition command, the method can acquire first audio through a sound collector and perform beam suppression on the first audio to obtain second audio. Characteristic sound detection is then performed on the second audio through the main detection channel: if a characteristic sound signal is detected in the second audio, instruction recognition is performed on that signal to obtain a control instruction; if no characteristic sound signal is detected in the second audio, characteristic sound detection is performed on the first audio through the branch detection channel to obtain the control instruction. A target user interface is then generated based on the control instruction. By separately detecting both the original audio input by the user and the audio after beam suppression, the application allows the display device, when it has mistakenly suppressed the human voice beam in the audio, to carry out voice interaction according to the recognition result of the original audio, improving the accuracy of voice interaction.

Description

Display device, server and voice instruction recognition method
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a display device, a server, and a speech instruction recognition method.
Background
Voice control is a man-machine interaction technique whereby a user can interact with a display device or system via voice commands without the use of physical buttons, touch screens, or other conventional input means. The display device may convert the user's voice input into computer-understandable commands or instructions to perform the corresponding operations.
The display device can eliminate noise in a voice command based on sound source localization through noise suppression techniques, improving the accuracy with which the display device recognizes the command. However, when a user controls a television through voice commands, the human voice may be unable to reach the microphone array directly in scenarios such as a large-screen television, a wide living room, an unfixed microphone orientation, or differing house layouts. Incoming reflected sound then interferes with sound source localization, and the noise suppression technique may misinterpret the human voice as noise and cancel it, so that the voice command is not accurately recognized.
Disclosure of Invention
In order to improve accuracy of voice instruction recognition, some embodiments of the present application provide a display device, a server, and a voice instruction recognition method.
In a first aspect, some embodiments of the present application provide a display device comprising a display configured to display a user interface, a sound collector configured to obtain audio data, and a controller configured to:
Responding to a voice recognition instruction, acquiring first audio through the sound collector, and performing beam suppression on the first audio to obtain second audio;
Performing characteristic sound detection on the second audio according to the main path detection channel;
If the characteristic sound signal is detected in the second audio, executing instruction identification according to the characteristic sound signal of the second audio to obtain a control instruction;
If the characteristic sound signal is not detected in the second audio, executing characteristic sound detection on the first audio through a branch detection channel, and executing instruction identification according to the characteristic sound signal of the first audio when the characteristic sound signal is detected in the first audio to obtain a control instruction;
And generating a target user interface according to the control instruction, and controlling the display to display the target user interface.
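The main-path/branch-path fallback described in the first aspect can be sketched as follows. This is a minimal illustration, not the disclosed implementation; the function names and callables are assumptions standing in for the controller's components:

```python
def recognize_voice_command(first_audio, beam_suppress, detect_feature_sound,
                            recognize_instruction):
    """Try the beam-suppressed audio first; fall back to the raw audio."""
    second_audio = beam_suppress(first_audio)       # beam suppression
    signal = detect_feature_sound(second_audio)     # main detection channel
    if signal is None:
        # The human voice may have been mistakenly suppressed as noise:
        # retry on the original audio via the branch detection channel.
        signal = detect_feature_sound(first_audio)
    if signal is None:
        return None                                 # no characteristic sound at all
    return recognize_instruction(signal)            # instruction recognition
```

The branch channel is consulted only when the main channel yields nothing, so the noise-robust beam-suppressed path remains the default.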
In some embodiments, the controller is further configured to:
Recording the detection duration of the second audio;
If the detection time length is greater than or equal to a first time length threshold, generating a timeout flag, wherein the timeout flag is used for representing timeout of the second audio detection;
And according to the timeout mark, performing characteristic sound detection on the first audio through a branch detection channel.
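The timeout guard above can be sketched as follows; the first duration threshold value is an assumed placeholder, not taken from the patent:

```python
import time

# Assumed placeholder for the first duration threshold.
FIRST_DURATION_THRESHOLD = 3.0  # seconds

def detect_with_timeout(detect, second_audio, threshold=FIRST_DURATION_THRESHOLD):
    """Run main-channel detection and report whether it exceeded the threshold."""
    start = time.monotonic()
    signal = detect(second_audio)           # characteristic sound detection
    elapsed = time.monotonic() - start      # recorded detection duration
    timeout_flag = elapsed >= threshold     # timeout flag: second-audio detection timed out
    return signal, timeout_flag
```

A caller would route to the branch detection channel whenever `timeout_flag` is set, even if the main channel has not finished producing a result.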
In some embodiments, if a characteristic acoustic signal is detected in the second audio, the controller is further configured to:
detecting a start time point and a pause time point of a characteristic acoustic signal of the second audio;
Recording the pause time of the characteristic sound signal according to the pause time point;
Marking the pause time point as an end time point when the pause time length is greater than or equal to a second time length threshold;
generating a signal to be identified according to the characteristic acoustic signal between the starting time point and the ending time point, and executing instruction identification on the signal to be identified to obtain a control instruction.
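The start/pause/end-point marking above can be sketched on a stream of voice-activity frames; the second duration threshold and the frame representation are illustrative assumptions:

```python
# Assumed placeholder: silence this long ends the utterance.
SECOND_DURATION_THRESHOLD = 0.8  # seconds

def segment_utterance(frames, pause_threshold=SECOND_DURATION_THRESHOLD):
    """Return (start_time, end_time) of the first utterance, or None.

    frames: iterable of (timestamp_seconds, is_speech) pairs.
    """
    start = None
    pause_start = None
    for t, is_speech in frames:
        if is_speech:
            if start is None:
                start = t                  # start time point
            pause_start = None             # speech resumed; clear the pause
        elif start is not None:
            if pause_start is None:
                pause_start = t            # pause time point
            elif t - pause_start >= pause_threshold:
                return start, pause_start  # pause long enough: mark end point
    return None
```

The characteristic sound between the returned start and end points would then form the signal to be recognized.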
In some embodiments, the display device further comprises a communication device configured to communicate with a server, and the controller, in performing instruction recognition on the signal to be recognized, is configured to:
Sending a first signal conversion request and the signal to be identified to the server; the first signal conversion request is generated when the characteristic sound signal is detected in the second audio, and the first signal conversion request is used for indicating the server to execute instruction recognition on the signal to be recognized;
And receiving a control instruction of the server for executing instruction identification feedback on the signal to be identified in response to the first signal conversion request.
In some embodiments, the controller performs beam suppression on the first audio configured to:
Performing noise reduction processing on the first audio to obtain noise reduction audio;
Performing audio analysis on the noise reduction audio to obtain the sound source direction of the noise reduction audio;
and performing beam suppression on the noise reduction audio based on the sound source direction to obtain second audio.
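The three steps above (noise reduction, sound source direction, beam suppression) can be sketched for a toy two-microphone case. This is a simplified stand-in under stated assumptions (DC-offset removal as "noise reduction", cross-correlation lag as "direction", delay-and-sum as "suppression"); real systems use far more robust DSP:

```python
import numpy as np

def beam_suppress(first_audio):
    """Toy two-channel beam suppression: denoise, locate, delay-and-sum."""
    left, right = first_audio  # two 1-D sample arrays
    # 1. Crude noise reduction: remove the DC offset from each channel.
    left = left - left.mean()
    right = right - right.mean()
    # 2. Sound source direction: estimate the inter-channel delay by
    #    cross-correlation (the sign of the lag indicates the direction).
    corr = np.correlate(left, right, mode="full")
    delay = int(corr.argmax()) - (len(right) - 1)
    # 3. Delay-and-sum: align the channels on the source and average,
    #    enhancing the source direction and attenuating other directions.
    aligned = np.roll(right, delay)
    return (left + aligned) / 2.0
```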
In some embodiments, the controller performs a characteristic sound detection on the second audio according to a main detection channel, configured to:
performing a spectral analysis on the second audio to obtain an audio feature and a spectral signal;
Extracting target features from the audio features through a feature sound classifier, wherein the target features are used for representing the audio features corresponding to the feature sounds;
And extracting a characteristic sound signal from the frequency spectrum signal according to the target characteristic.
In some embodiments, the controller performs instruction recognition from the characteristic acoustic signal of the second audio, configured to:
Generating a signal text from the characteristic acoustic signal of the second audio according to a speech recognition model;
Inputting the signal text into a language processing model to obtain semantic information, wherein the semantic information is used for representing the classification probability of the signal text on semantic feature labels;
and generating a control instruction according to the semantic information and the semantic feature label.
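The text-to-instruction step above can be sketched with keyword matching. The semantic feature labels and keywords are illustrative assumptions; the patent describes a speech recognition model followed by a language processing model, not keyword matching:

```python
# Hypothetical semantic feature labels and keyword sets.
SEMANTIC_LABELS = {
    "open application": ["open", "launch", "start"],
    "play music": ["play", "music", "song"],
    "search weather": ["weather", "forecast"],
}

def recognize_instruction(signal_text):
    """Map signal text to the semantic label with the highest score."""
    words = signal_text.lower().split()
    if not words:
        return None
    scores = {
        label: sum(word in keywords for word in words) / len(words)
        for label, keywords in SEMANTIC_LABELS.items()
    }
    label = max(scores, key=scores.get)  # crude stand-in for classification probability
    return {"instruction": label, "confidence": scores[label]}
```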
In some embodiments, when no characteristic acoustic signal is detected in the first audio, the controller is further configured to:
Generating prompt information, wherein the prompt information is used for representing that no characteristic sound signal is detected in the first audio and the second audio;
Creating an information popup window according to the prompt information;
and controlling the display to display the information popup window in the user interface.
In a second aspect, some embodiments of the present application provide a server comprising a communication module and an identification module, the communication module configured to be communicatively coupled to a display device; the identification module is configured to:
Acquiring a characteristic sound signal sent by the display device; the characteristic sound signal is obtained by the display device performing characteristic sound detection on second audio through a main detection channel, or, if no characteristic sound signal is detected in the second audio, by the display device performing characteristic sound detection on first audio through a branch detection channel; the first audio is acquired by the display device through a sound collector in response to a voice recognition instruction, and the second audio is obtained by performing beam suppression on the first audio;
Executing instruction identification on the characteristic sound signals to obtain control instructions;
And sending the control instruction to the display device so that the display device can generate and display a target user interface in response to the control instruction.
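A hypothetical server-side handler mirroring this flow might receive the characteristic sound signal, run instruction recognition, and return the control instruction. The JSON field names are assumptions for illustration only:

```python
import json

def handle_signal_conversion_request(request_body, recognize):
    """Parse the request, run instruction recognition, return the result."""
    payload = json.loads(request_body)
    signal = payload["signal"]                       # characteristic sound signal
    instruction = recognize(signal)                  # instruction recognition
    return json.dumps({"control_instruction": instruction})
```

The display device would send such a request when its main (or branch) channel detects a characteristic sound, and generate the target user interface from the returned control instruction.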
In a third aspect, some embodiments of the present application provide a voice instruction recognition method, which is applied to the display device in the first aspect, and the method includes:
Responding to a voice recognition instruction, acquiring first audio through the sound collector, and performing beam suppression on the first audio to obtain second audio;
Performing characteristic sound detection on the second audio according to the main path detection channel;
If the characteristic sound signal is detected in the second audio, executing instruction identification according to the characteristic sound signal of the second audio to obtain a control instruction;
If the characteristic sound signal is not detected in the second audio, executing characteristic sound detection on the first audio through a branch detection channel, and executing instruction identification according to the characteristic sound signal of the first audio when the characteristic sound signal is detected in the first audio to obtain a control instruction;
And generating a target user interface according to the control instruction, and controlling the display to display the target user interface.
As can be seen from the above technical solutions, the present application provides a display device, a server, and a voice command recognition method. In response to a voice recognition command, the method can acquire first audio through a sound collector and perform beam suppression on the first audio to obtain second audio. Characteristic sound detection is then performed on the second audio through the main detection channel: if a characteristic sound signal is detected in the second audio, instruction recognition is performed on that signal to obtain a control instruction; if no characteristic sound signal is detected in the second audio, characteristic sound detection is performed on the first audio through the branch detection channel to obtain the control instruction, and a target user interface is generated based on the control instruction. By separately detecting the original audio input by the user and the audio after beam suppression, the display device can, when it has mistakenly suppressed the human voice beam in the audio, carry out voice interaction according to the recognition result of the original audio, improving the accuracy of voice interaction.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of an operation scenario between a display device and a control device according to some embodiments of the present application;
fig. 2 is a schematic diagram of a hardware configuration of a display device according to some embodiments of the present application;
FIG. 3 is a schematic software configuration of a display device according to some embodiments of the present application;
FIG. 4 is a flowchart of a first embodiment of a display device performing voice command recognition according to some embodiments of the present application;
FIG. 5 is a schematic diagram of detecting a characteristic acoustic signal according to a detection channel according to some embodiments of the present application;
FIG. 6 is a flow chart of acquiring a characteristic acoustic signal according to some embodiments of the application;
FIG. 7 is a flow chart of acquiring a target feature according to some embodiments of the application;
FIG. 8 is a flow chart of generating signal text from a second audio in accordance with some embodiments of the application;
FIG. 9 is a flowchart of a second embodiment of a display device for performing voice command recognition according to some embodiments of the present application;
FIG. 10 is a flow chart of timeout detection according to some embodiments of the present application;
FIG. 11 is a flowchart of acquiring a signal to be identified according to a pause duration according to some embodiments of the present application;
FIG. 12 is a flow chart of instruction recognition by a server according to some embodiments of the application.
Detailed Description
Reference will now be made in detail to the embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The embodiments described in the examples below do not represent all embodiments consistent with the application; rather, they are merely examples of systems and methods consistent with aspects of the application as set forth in the claims.
It should be noted that the brief description of the terminology in the present application is for the purpose of facilitating understanding of the embodiments described below only and is not intended to limit the embodiments of the present application. Unless otherwise indicated, these terms should be construed in their ordinary and customary meaning.
The terms first, second, third and the like in the description, in the claims, and in the above-described figures are used for distinguishing between the same or similar objects or entities and not necessarily for describing a particular sequential or chronological order, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements explicitly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The term "module" refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware or/and software code that is capable of performing the function associated with that element.
In the embodiment of the present application, the display device 200 generally refers to a device having a screen display and a data processing capability. For example, display device 200 includes, but is not limited to, a smart television, a mobile terminal, a computer, a monitor, an advertising screen, a wearable device, a virtual reality device, an augmented reality device, and the like.
Fig. 1 is a schematic diagram of an operation scenario between a display device and a control device according to some embodiments of the present application. As shown in fig. 1, a user may operate the display device 200 through a touch operation, the mobile terminal 300, and the control device 100. Wherein the control device 100 is configured to receive an operation instruction input by a user, and convert the operation instruction into a control instruction recognizable and responsive by the display device 200. For example, the control device 100 may be a remote control, a stylus, a handle, or the like.
The mobile terminal 300 may serve as a control device for man-machine interaction between a user and the display device 200. The mobile terminal 300 may also serve as a communication device that establishes a communication connection with the display device 200 for data interaction. In some embodiments, the mobile terminal 300 and the display device 200 may each install a software application, implementing connection and communication through a network communication protocol for one-to-one control operation and data communication. The audio/video content displayed on the mobile terminal 300 can also be transmitted to the display device 200 to realize a synchronous display function.
In some embodiments, the mobile terminal 300 or other electronic device may also simulate the functions of the control device 100 by running an application program that controls the display device 200.
As also shown in fig. 1, the display device 200 is also in data communication with the server 400 via a variety of communication means. The display device 200 may be allowed to establish communication connections via a local area network (LAN), a wireless local area network (WLAN), and other networks.
The display device 200 may provide a broadcast-receiving television function and may additionally provide a smart network television function with computing support, including but not limited to network TV, smart TV, Internet Protocol TV (IPTV), and the like.
Fig. 2 is a block diagram of a hardware configuration of the display device 200 of fig. 1 according to some embodiments of the present application.
In some embodiments, the display apparatus 200 may include at least one of a modem 210, a communication device 220, a detector 230, a device interface 240, a controller 250, a display 260, an audio output device 270, a memory, a power supply, a user input interface.
In some embodiments, the detector 230 is used to collect signals of the external environment or of interaction with the outside. For example, the detector 230 may include a light receiver, a sensor for capturing the intensity of ambient light; an image collector, such as a camera, which may be used to collect external environment scenes, user attributes, or user interaction gestures; or a sound collector, such as a microphone, for receiving external sounds.
In some embodiments, display 260 includes display functionality for presenting pictures, and a drive component that drives the display of images. The display 260 is used for receiving and displaying image signals output from the controller 250. For example, the display 260 may be used to display video content, image content, and components of menu manipulation interfaces, user manipulation UI interfaces, and the like.
In some embodiments, the communication apparatus 220 is a component for communicating with an external device or server 400 according to various communication protocol types. The display apparatus 200 may be provided with a plurality of communication devices 220 according to the supported communication manner. For example, when the display apparatus 200 supports wireless network communication, the display apparatus 200 may be provided with a communication device 220 including a WiFi function. When the display apparatus 200 supports bluetooth connection communication, the display apparatus 200 needs to be provided with a communication device 220 including a bluetooth function.
The communication means 220 may communicatively connect the display device 200 with an external device or the server 400 by means of a wireless or wired connection. Wherein the wired connection may connect the display device 200 with an external device through a data line, an interface, etc. The wireless connection may then connect the display device 200 with an external device through a wireless signal or a wireless network. The display device 200 may directly establish a connection with an external device, or may indirectly establish a connection through a gateway, a route, a connection device, or the like.
In some embodiments, the controller 250 may include at least one of a central processor, a video processor, an audio processor, a graphic processor, a power supply processor, first to nth interfaces for input/output, and the controller 250 controls the operation of the display device and responds to the user's operation through various software control programs stored on the memory. The controller 250 controls the overall operation of the display apparatus 200.
In some embodiments, the controller 250 and the modem 210 may be located in separate devices, i.e., the modem 210 may also be located in an external device to the main device in which the controller 250 is located, such as an external set-top box or the like.
In some embodiments, a user may input a user command through a graphical user interface (Graphical User Interface, GUI) displayed on the display 260, and the user input interface receives the user input command through the GUI.
In some embodiments, audio output device 270 may be a speaker local to display device 200 or an audio output device external to display device 200. For an external audio output device of the display device 200, the display device 200 may also be provided with an external audio output terminal, and the audio output device may be connected to the display device 200 through the external audio output terminal to output sound of the display device 200.
In some embodiments, user input interface 280 may be used to receive instructions from user input.
To support user interaction, in some embodiments, the display device 200 may run an operating system. The operating system is a computer program for managing and controlling hardware resources and software resources in the display device 200. The operating system may control the display device to provide a user interface, either directly or by running an application that provides the user interface. The operating system also allows a user to interact with the display device 200.
It should be noted that the operating system may be a native operating system based on a specific operating platform, a third-party operating system deeply customized on the basis of a specific operating platform, or an independent operating system specially developed for a display device.
The operating system may be divided into different modules or tiers depending on the functionality implemented. For example, as shown in FIG. 3, in some embodiments the system is divided into four layers, from top to bottom: an application layer (simply "application layer"), an application framework (Application Framework) layer (simply "framework layer"), a system library layer, and a kernel layer.
In some embodiments, the application layer is used to provide services and interfaces for applications so that the display device 200 can run applications and interact with users based on the applications. At least one application program can be run in the application program layer, and the application programs can be a Window (Window) program, a system setting program or a clock program of an operating system; or may be an application developed by a third party developer. In particular implementations, the application packages in the application layer are not limited to the above examples.
The framework layer provides an application programming interface (Application Programming Interface, API) and a programming framework for the applications. The application framework layer includes a number of predefined functions and acts as a processing center that schedules the actions of the applications in the application layer. Through the API interface, an application program can access resources in the system and obtain system services during execution.
It should be noted that the above examples are merely a simple division of functions of an operating system, and do not limit the specific form of the operating system of the display device 200 in the embodiment of the present application, and the number of levels and specific types of levels included in the operating system may be expressed in other forms according to factors such as the functions of the display device and the type of the operating system.
In some embodiments, the display device 200 may further have a voice interaction function, that is, the user may send a voice command to the display device 200 to control the display device 200 to execute an interaction effect corresponding to the voice command. The voice command may include various types, such as a voice start command, a voice play command, a voice pause command, a voice switch command, a voice wake command, etc., according to the interactive contents to be executed.
The display device 200 may integrate voice applications to facilitate performing voice recognition, semantic understanding, etc. of various types of voice instructions. The voice application may establish a data connection with the sound collector of the display device 200, and the display device 200 may collect, in real time, voice instructions issued by the user, such as voice instructions of "open application", "play music", or "search for weather", through the sound collector. And performs voice recognition and semantic understanding on the voice command through the voice application, and converts the voice command into a control command by comparing data and programs in the memory according to the recognition result, and controls the display device 200 to perform interaction corresponding to the control command.
In some embodiments, the display device 200 may eliminate noise in the voice instructions according to sound source localization by a beam suppression technique, to improve the accuracy with which the display device 200 recognizes voice instructions. The beam suppression technique enhances the sound signal in the sound source localization direction and suppresses sound signals in other directions, reducing noise interference in the voice command.
However, when the display device 200 is placed in different room layouts or its placement orientation is not fixed, the human voice may be unable to reach the sound collector of the display device 200 directly. For example, when a user sends a voice command to the display device 200 indoors, the voice that cannot propagate directly may instead be reflected to the sound collector by objects such as furniture and walls. The display apparatus 200 then cannot accurately judge the sound source position of the voice command from the reflected sound, and may mistakenly reject the reflected voice as noise through the beam suppression technique, so that the voice command is not accurately recognized and a false recognition result is output.
In order to improve the accuracy of voice instruction recognition, some embodiments of the present application provide a display device 200. The display device 200 includes a display 260 used to display a user interface, and a sound collector used to obtain audio data input by the user. The controller is configured to perform a voice instruction recognition method.
Fig. 4 is a flowchart of a display device according to an embodiment of the present application for performing voice command recognition. Referring to fig. 4, the method includes:
S100: and responding to the voice recognition instruction, acquiring first audio through the sound collector, and performing beam suppression on the first audio to obtain second audio.
The display device 200 may control the sound collector to be turned on in response to the voice recognition instruction to acquire audio data of an area where the display device 200 is located through the sound collector and convert the audio data into a digital signal, i.e., a first audio. The display device 200 may also directly use the audio data acquired by the sound collector as the first audio. In this embodiment, the voice recognition instruction is an instruction in an audio format, and therefore, the display device 200 can also take the voice recognition instruction as the first audio.
After the first audio is acquired, the controller 250 may perform beam suppression on the first audio by the audio processor, resulting in the second audio. Beam suppression is an audio processing technique in which the controller 250 can acquire the sound source direction from the first audio and enhance the audio signal from the sound source direction, and at the same time, the controller 250 can suppress interference of noise signals from directions other than the sound source direction, thereby reducing the influence of noise or other audio interference factors on the voice recognition process of the display apparatus 200.
In some embodiments, the display device 200 may further include an adjusting component for adjusting the orientation of the sound collector, thereby changing its receiving direction. While the display apparatus 200 performs beam suppression on the first audio, the controller 250 may perform audio parsing on the first audio to acquire spatial distribution characteristics of the first audio and calculate a reception adjustment angle of the sound collector according to those characteristics. The controller 250 may then control the movement or rotation of the adjusting component according to the adjustment angle, so as to adjust the receiving direction of the sound collector and improve the reception of the audio signal from the sound source direction.
The voice recognition instruction may also be used by the user to wake up the display device 200. In some embodiments, the controller 250 may detect the operating state of the display device 200 in response to the voice recognition instruction, where the operating state includes a working state and a standby state. When the display device 200 is in the standby state, the sound collector is in the off mode, and the controller 250 may switch the display device 200 from the standby state to the working state according to the voice recognition instruction, thereby starting the sound collector to acquire audio data of the area where the display device 200 is located.
S200: and executing characteristic sound detection on the second audio according to the main detection channel.
The characteristic sound may include a human voice signal, a sound event, or a voice signal containing a specific instruction word. Taking the characteristic sound as a human voice signal as an example, after performing beam suppression on the first audio to obtain the second audio, the controller 250 may detect the voice signal in the second audio through a detection channel to extract the characteristic sound signal, facilitating subsequent conversion of the voice signal into a control instruction and controlling the display device 200 to complete the instruction interaction process.
Referring to fig. 5, the detection channels may include a main detection channel and a branch detection channel, where the priority of the main detection channel may be greater than or equal to that of the branch detection channel. Because the second audio is audio data on which beam suppression has been performed, it is less affected by environmental noise than the first audio. The human voice signal in the second audio may therefore be extracted through the main detection channel, while corresponding audio detection operations are performed on other audio data through the branch detection channel, improving the efficiency of parallel audio processing by the display device 200.
In performing the characteristic sound detection on the second audio through the main detection channel, as shown in fig. 6, the controller 250 may perform spectral analysis on the second audio, for example by a Fourier transform (Fourier Transform, FT) or a short-time Fourier transform (Short-Time Fourier Transform, STFT), converting the time-domain signal of the second audio into a spectral signal and obtaining a spectral representation from which the characteristic sound signal can be extracted. In this process, the controller 250 may also obtain spectral characteristic parameters such as spectral amplitude, frequency, and phase, improving the accuracy of characteristic sound signal extraction based on these parameters.
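As an illustration of converting the time-domain second audio into a spectral signal and reading off spectral characteristic parameters, a minimal STFT can be written directly with numpy. The frame length, hop size, and window choice are illustrative, not values fixed by the patent.

```python
import numpy as np

def stft(signal: np.ndarray, frame_len: int = 256, hop: int = 128) -> np.ndarray:
    """Short-time Fourier transform: windowed frames -> per-frame spectra."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)   # shape (n_frames, frame_len // 2 + 1)

# Spectral characteristic parameters per frame: amplitude, phase, dominant frequency.
fs = 16000
t = np.arange(fs) / fs
second_audio = np.sin(2 * np.pi * 440 * t)      # 1 s test tone at 440 Hz
spec = stft(second_audio)
amplitude = np.abs(spec)
phase = np.angle(spec)
dominant_bin = amplitude.mean(axis=0).argmax()
dominant_freq = dominant_bin * fs / 256          # bin spacing = fs / frame_len
```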
While performing the spectral analysis, the controller 250 may also extract audio features of the second audio. The audio features characterize the acoustic properties of the second audio and may include mel-frequency cepstral coefficients (MFCCs) as well as time-domain and frequency-domain features.
The controller 250 may input the audio features into a characteristic sound classifier to extract target features from them. The target feature represents the audio feature corresponding to the characteristic sound; taking the characteristic sound signal as a human voice signal as an example, the target feature is the audio feature corresponding to the human voice signal.
The characteristic sound classifier may be a classifier based on machine learning or deep learning, such as a support vector machine (SVM) or a neural network. Correspondingly, the controller 250 may acquire a large amount of audio data with classification labels as training data and train the characteristic sound classifier on target spectrum parameters, giving the classifier the ability to extract target features from audio features. After the audio features are input into the characteristic sound classifier, the classifier extracts the audio features that conform to the target spectrum parameters as target features.
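The patent leaves the classifier open (an SVM, a neural network, and so on). As a self-contained stand-in, the sketch below trains a nearest-centroid classifier on labeled audio-feature vectors; the feature dimensionality, class labels, and synthetic data are invented purely for illustration.

```python
import numpy as np

class CentroidClassifier:
    """Minimal stand-in for the characteristic sound classifier: labeled
    training features -> per-class centroids; predict by nearest centroid."""
    def fit(self, features: np.ndarray, labels: np.ndarray):
        self.classes_ = np.unique(labels)
        self.centroids_ = np.stack([features[labels == c].mean(axis=0)
                                    for c in self.classes_])
        return self

    def predict(self, features: np.ndarray) -> np.ndarray:
        dists = np.linalg.norm(features[:, None, :] - self.centroids_, axis=2)
        return self.classes_[dists.argmin(axis=1)]

# Labeled audio features: class 1 = human voice, class 0 = background noise.
rng = np.random.default_rng(1)
voice = rng.normal(loc=2.0, scale=0.5, size=(50, 13))    # e.g. 13 MFCCs
noise = rng.normal(loc=-2.0, scale=0.5, size=(50, 13))
X = np.vstack([voice, noise])
y = np.array([1] * 50 + [0] * 50)
clf = CentroidClassifier().fit(X, y)
```

A production system would use a trained SVM or neural network in the same fit/predict role.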
In some embodiments, as shown in fig. 7, the characteristic sound classifier may also screen the audio features while extracting the target features, for example by calculating a correlation coefficient between each audio feature and the target spectrum parameter and filtering the audio features against a preset correlation threshold. The characteristic sound classifier may label the audio features whose correlation coefficient is greater than or equal to the threshold and summarize the labeled audio features to obtain the target features.
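The correlation-based screening step might look like the following sketch, where audio features are kept only if their Pearson correlation with the target spectrum parameter meets the preset threshold. The threshold value and sample data are illustrative assumptions.

```python
import numpy as np

def screen_features(audio_feats: np.ndarray, target_param: np.ndarray,
                    threshold: float = 0.8) -> np.ndarray:
    """Keep (label) only feature vectors whose correlation coefficient with
    the target spectrum parameter meets the preset correlation threshold."""
    kept = []
    for feat in audio_feats:
        r = np.corrcoef(feat, target_param)[0, 1]
        if r >= threshold:
            kept.append(feat)
    return np.array(kept)

target = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
feats = np.array([
    [1.1, 2.0, 2.9, 4.2, 5.1],   # strongly correlated -> labeled and kept
    [5.0, 1.0, 4.0, 2.0, 3.0],   # weakly correlated  -> filtered out
])
target_features = screen_features(feats, target)
```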
After obtaining the target features, the controller 250 may locate the position of the characteristic sound signal in the spectral signal according to the screened target features by a feature matching algorithm, and extract the characteristic sound signal corresponding to the target features from the spectral signal according to the localization result. The controller 250 may also apply filtering techniques during extraction to filter out noise and other interference signals, improving the audio quality and stability of the characteristic sound signal.
S300: and if the characteristic sound signal is detected in the second audio, executing instruction identification according to the characteristic sound signal of the second audio to obtain a control instruction.
When the controller 250 detects the characteristic sound signal in the second audio, indicating that beam suppression did not suppress the characteristic sound signal in the first audio, the controller 250 may perform instruction recognition on the detected characteristic sound signal.
The memory of the display device 200 may store an instruction library containing all control instructions to which the display device 200 can respond. The controller 250 may determine the control instruction from the instruction library based on the characteristic sound signal.
In some embodiments, the memory may further store a speech recognition model for recognizing text information in the characteristic acoustic signal and a language processing model for performing semantic understanding on the text information to obtain semantic information.
FIG. 8 is a flowchart of generating signal text from a characteristic sound signal according to an embodiment of the present application. Referring to fig. 8, when the characteristic sound signal is detected in the second audio, the controller 250 may input it into the speech recognition model, which performs text conversion on the characteristic sound signal of the second audio, converting the audio-format signal into signal text, i.e., a textual representation of the characteristic sound signal. For example, the controller 250 inputs the characteristic sound signal "query today's weather information" in audio format into the speech recognition model, which performs speech recognition on it and outputs "query today's weather information" in text format, i.e., the signal text.
After acquiring the signal text, the controller 250 may input it into the language processing model, which performs semantic understanding on the signal text to obtain semantic information. The semantic information characterizes the classification probability of the signal text with respect to semantic feature labels. The controller 250 may also set a decision threshold to determine the classification result: when the classification probability is greater than or equal to the decision threshold, the controller 250 may determine that the signal text corresponds to the control instruction associated with that semantic feature label, and the control instruction can then be determined from the semantic information and the semantic feature label.
After determining the semantic feature label, the controller 250 may also look up the control instruction in the instruction library based on the label. When all classification probabilities are smaller than the decision threshold, the controller 250 may have the language processing model output the highest classification probability and acquire the corresponding semantic feature label, so as to determine the control instruction from the semantic information and that label. The controller 250 may further set a minimum decision threshold: if all classification probabilities fall below the minimum decision threshold, the controller 250 may cancel the output of semantic information by the language processing model, reducing cases where the generated control instruction does not match the characteristic sound signal of the second audio.
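The two-threshold decision logic described above can be sketched as follows. The threshold values and label names are illustrative assumptions; the patent does not fix them.

```python
def decide_semantic_label(probs, decision_threshold=0.7, minimum_threshold=0.3):
    """Map semantic-label classification probabilities to a decision.

    - a probability at or above the decision threshold is accepted directly;
    - if every probability is below the minimum decision threshold, output is
      cancelled (None) to avoid a mismatched control instruction;
    - otherwise fall back to the highest-probability label."""
    label, p = max(probs.items(), key=lambda kv: kv[1])
    if p >= decision_threshold:
        return label
    if all(v < minimum_threshold for v in probs.values()):
        return None
    return label
```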
In some embodiments, the characteristic sound signal may include at least two different voice commands, for example "play music first, then open the browser". In this case, the controller 250 cannot find a corresponding control instruction in the instruction library according to the semantic feature label. When no corresponding control instruction is found, the controller 250 may generate a new control instruction according to the semantic information and the semantic feature label, and store the new control instruction in the instruction library.
S400: if the characteristic sound signal is not detected in the second audio, the characteristic sound detection is carried out on the first audio through a branch detection channel, and when the characteristic sound signal is detected in the first audio, instruction identification is carried out according to the characteristic sound signal of the first audio, so that a control instruction is obtained.
If the characteristic sound signal is not detected in the second audio, this indicates that, while performing beam suppression on the first audio, the controller 250 mistakenly determined the characteristic sound signal to be noise and suppressed it. As a result, the controller 250 does not detect the characteristic sound signal in the second audio, and the display device 200 fails to recognize the second audio.
For this reason, when the characteristic sound signal is not detected in the second audio, the controller 250 may perform characteristic sound detection on the first audio through the branch detection channel. The flow of this characteristic sound detection follows the detection flow described in step S200 and is not repeated in this embodiment.
In some embodiments, in order to improve the accuracy of characteristic sound detection on the first audio, the controller 250 may perform noise reduction on the first audio after acquiring it. During noise reduction, the controller 250 may perform noise analysis on the first audio to obtain an analysis result. The analysis result may include noise types such as background noise, e.g., noise emitted by a cooling device inside the display device 200, and environmental noise, e.g., echo generated in the area where the display device 200 is located or attenuation of the sound signal with distance.
The controller 250 may determine a noise reduction manner of the first audio according to the noise type in the analysis result, for example, perform noise reduction processing on the first audio by a noise reduction algorithm, audio filtering, deep learning noise reduction, and the like, to obtain noise reduction audio.
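One classical realization of the noise-reduction step is spectral subtraction: estimate the noise magnitude spectrum from a noise-only segment, subtract it from the noisy spectrum, and resynthesize using the noisy phase. The single-frame sketch below is illustrative (a practical implementation would process the audio frame by frame); the signals and noise level are invented for the example.

```python
import numpy as np

def spectral_subtraction(noisy: np.ndarray, noise_profile: np.ndarray) -> np.ndarray:
    """Subtract an estimated noise magnitude spectrum from the noisy audio,
    keep the noisy phase, and transform back to the time domain."""
    spec = np.fft.rfft(noisy)
    noise_mag = np.abs(np.fft.rfft(noise_profile))
    clean_mag = np.maximum(np.abs(spec) - noise_mag, 0.0)   # floor at zero
    return np.fft.irfft(clean_mag * np.exp(1j * np.angle(spec)), n=len(noisy))

rng = np.random.default_rng(2)
n = 1024
t = np.arange(n)
voice = np.sin(2 * np.pi * 8 * t / n)            # tone exactly on FFT bin 8
noisy_first_audio = voice + 0.3 * rng.standard_normal(n)
noise_estimate = 0.3 * rng.standard_normal(n)    # separate noise-only recording
denoised = spectral_subtraction(noisy_first_audio, noise_estimate)
```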
In some embodiments, the controller 250 may obtain the sound source direction of the noise-reduced audio. To this end, the controller 250 may perform sound source localization on the noise-reduced audio through an audio parsing technique, obtaining a localization result that may include the directions of the respective sound sources in the noise-reduced audio. The controller 250 may separate the different sound sources in the localization result, obtain an independent audio signal for each source, and aggregate the source directions of the independent signals to determine the sound source direction of the noise-reduced audio, so as to perform beam suppression on the noise-reduced audio based on that direction and obtain the second audio.
Since the main detection channel has already completed characteristic sound detection on the second audio, the controller 250 may also continue to perform characteristic sound detection on the first audio through the main detection channel after the second-audio detection finishes, saving detection pipeline power consumption of the display device 200.
S500: and generating a target user interface according to the control instruction, and controlling the display to display the target user interface.
After performing instruction recognition according to the characteristic sound signal to obtain the control instruction, the controller 250 may start the corresponding function program according to the control instruction. For example, when the control instruction is a "query weather" instruction obtained by instruction recognition on the second audio "query today's weather information", the controller 250 may extract the keywords "query" and "weather", control the display device 200 to run the application program corresponding to querying weather information based on those keywords, generate the target user interface from the running application program, and control the display 260 to display it. The target user interface may include real-time weather information, completing the voice recognition display for the first audio.
It should be understood that, while the target user interface is generated according to the control instruction, associated procedures may be executed synchronously according to the interaction effect of the control instruction. For example, when the target interface is generated according to the "query weather" control instruction, the display device 200 may acquire real-time information and positioning information, and execute the "query weather" control instruction based on that time and geographic information.
In some embodiments, the controller 250 may also generate feedback speech or feedback information when the display 260 displays the target user interface, where the controller 250 may control the audio output device 270 to play the feedback speech and display the feedback information in the target user interface. For example, when the display device 200 responds to the "query weather" control instruction and displays the target user interface, the controller 250 may generate the feedback voice "current weather query completed" and play it through the audio output device 270, prompting the user that the control instruction has been executed and improving the convenience and accuracy of interaction.
The control instructions may further include instructions for adjusting the volume, switching channels, or switching the operating state of the display device 200, so as to implement control interaction with the display device 200 according to the control instructions. In some embodiments, the control instruction may include a search instruction, so that the corresponding search content is identified through the characteristic sound signal and the display device 200 is controlled to search the corresponding media data, for example music, video, or news data, according to the search content.
Fig. 9 is a flowchart of a second embodiment of instruction recognition performed by the display device according to an embodiment of the present application. Referring to fig. 9, the controller 250 may also perform characteristic sound detection on the first audio through the branch detection channel in parallel with performing characteristic sound detection on the second audio through the main detection channel, i.e., execute S200 and S400 simultaneously. Since the main detection channel adds the beam suppression process compared with the branch detection channel, if the characteristic sound signal is detected in the first audio, the branch detection channel outputs faster than the main detection channel, and the controller 250 may then determine the source of the characteristic sound signal.
When the controller 250 acquires a control instruction, it may detect the instruction's source in time order. If the first detected control instruction originates from the main detection channel, the target user interface may be generated directly from it, and the detection of the first audio by the branch detection channel is cancelled. If the first detected control instruction originates from the branch detection channel, the controller 250 may save that instruction. In this process, if the controller 250 subsequently detects a control instruction output by the main detection channel, the saved instruction is discarded and the target user interface is generated based on the main detection channel's control instruction.
If, after the control instruction of the branch detection channel has been saved, the main detection channel does not recognize the characteristic sound signal in the second audio, the main detection channel cannot output a valid control instruction. The controller 250 may therefore generate the target user interface directly from the branch detection channel's control instruction and control the display 260 to display it. By performing characteristic sound recognition on the first audio in parallel, the time otherwise spent recognizing the first audio only after no characteristic sound signal is found in the second audio is saved, improving speech recognition efficiency.
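The priority arbitration between the two channels can be sketched as follows: main-channel results win, and a branch result arriving first is held until the main channel either outputs its own instruction or fails. The event representation and names are illustrative assumptions.

```python
def arbitrate(events):
    """Arbitrate control instructions from the two detection channels.

    events: list of (channel, instruction_or_None) in arrival order, where
    channel is 'main' or 'branch' and None means the channel failed to
    recognize a characteristic sound signal.
    """
    saved_branch = None
    for channel, instruction in events:
        if channel == "main":
            if instruction is not None:
                return instruction     # main channel wins; saved branch result is dropped
            return saved_branch        # main channel failed: fall back to branch result
        saved_branch = instruction     # branch arrived first: hold its instruction
    return saved_branch
```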
It should be noted that beam suppression may suppress noise outside the sound source direction in the first audio. Where the noise is insufficient to affect characteristic sound recognition, the instruction recognition results output by the controller 250 for the characteristic sound of the first audio and that of the second audio may be the same. If the noise strongly affects the characteristic sound of the first audio, the two recognition results may differ, and the correspondingly output control instructions may also differ. In this embodiment, the controller 250 may perform a comparative audio analysis of the first audio and the second audio and derive a noise value from the comparison result, the noise value being used to determine the degree to which noise affects audio recognition.
When the controller 250 performs characteristic sound detection on the second audio through the main detection channel for too long, it may determine that no characteristic sound signal is detected in the second audio, in order to improve detection efficiency. To this end, in some embodiments, as shown in fig. 10, the controller 250 may set a first duration threshold and record the detection duration of the second audio while detecting it through the main detection channel. The controller 250 may compare the detection duration of the second audio against the first duration threshold to obtain its detection timeout state. If the detection duration is greater than or equal to the first duration threshold, the controller 250 may generate a timeout flag, which characterizes that the main detection channel's detection of the second audio has timed out; according to the timeout flag, the controller 250 determines that no characteristic sound signal is detected in the second audio and performs characteristic sound detection on the first audio through the branch detection channel.
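The timeout judgment for a detection channel reduces to comparing the recorded detection duration with the first duration threshold, as in the sketch below. The threshold value and the three-state return convention are illustrative assumptions, not values specified by the patent.

```python
def detection_state(detection_duration: float, signal_found: bool,
                    first_duration_threshold: float = 5.0) -> str:
    """Classify one detection channel's outcome for an audio stream.

    'timeout' -> the duration reached the first duration threshold, so a
                 timeout flag is raised and the audio is treated as having
                 no characteristic sound signal;
    'ok'      -> a characteristic sound signal was found within the threshold;
    'pending' -> still detecting."""
    if detection_duration >= first_duration_threshold:
        return "timeout"
    return "ok" if signal_found else "pending"
```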
The controller 250 may also determine, by the same judgment process, whether characteristic sound detection on the first audio through the branch detection channel has timed out. If the detection duration of the first audio through the branch detection channel is greater than or equal to the first duration threshold, the controller 250 may generate a detection failure flag, which characterizes that no characteristic sound signal is detected in the first audio.
In some embodiments, when no characteristic sound signal is detected in the first audio, the controller 250 may generate a prompt message indicating to the user that no characteristic sound signal is detected in either the first audio or the second audio. The prompt may include content such as "voice detection failed" or "voice content not recognized", prompting the user to perform subsequent voice recognition actions based on the prompt.
The prompt information may be in various forms, such as a prompt voice or a prompt text, and the controller 250 may control the corresponding hardware component of the display device 200 to output the prompt information, for example, play the prompt voice through the audio output device 270, or control the display 260 to display the prompt text.
When the prompt information is prompt text, the controller 250 may create an information popup based on the prompt text. The popup may include at least one prompt option, for example an option to re-perform voice recognition or an option to cancel voice recognition, and the controller 250 may control the display 260 to display the popup in the user interface.
The display device 200 may also respond to user selection instructions fed back through the information popup. For example, when the user clicks the prompt option to re-perform voice recognition, the controller 250 may generate an option instruction based on that option and, in response to it, control the display 260 to display a voice recognition interface.
In some embodiments, as shown in fig. 11, when the characteristic sound signal is detected in the second audio, the controller 250 may also detect the start time point and the pause time point of the characteristic sound signal of the second audio. When the characteristic sound signal pauses, the display device 200 cannot determine whether the signal has ended. For this reason, the controller 250 may record the pause duration of the characteristic sound signal from the pause time point. When the pause duration is less than a second duration threshold, the characteristic sound signal of the second audio has not ended, and if the characteristic sound signal is acquired again within that time, the controller 250 may refresh the pause duration, resetting it to 0.
When the pause duration is greater than or equal to the second duration threshold, the display device 200 may determine that the user has completed one voice instruction input. The controller 250 may determine that the characteristic sound signal has ended and mark the pause time point as the end time point, so that the user's characteristic sound signal within one voice instruction input serves as the instruction recognition object. The controller 250 may generate a signal to be recognized from the characteristic sound signal between the start time point and the end time point, and perform instruction recognition on it to obtain a control instruction.
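Endpointing by pause duration can be sketched over a sequence of frame-level voice-activity decisions, as below. The frame duration and the second duration threshold are illustrative values, not fixed by the patent.

```python
def endpoint(frames, frame_dur=0.1, pause_threshold=0.8):
    """frames: booleans, True where a characteristic sound is present.

    Returns (start_idx, end_idx) of the signal to be recognized: the
    utterance ends once the accumulated pause duration reaches the second
    duration threshold, and a sound frame arriving earlier resets the
    pause duration to zero."""
    start = end = None
    pause = 0.0
    for i, voiced in enumerate(frames):
        if voiced:
            if start is None:
                start = i             # first voiced frame: start time point
            end = i
            pause = 0.0               # signal resumed: refresh the pause duration
        elif start is not None:
            pause += frame_dur
            if pause >= pause_threshold:
                break                 # pause long enough: end time point fixed
    return start, end
```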
It should be appreciated that the characteristic acoustic signals input by the user may be continuous or intermittent, and thus, the display device 200 may obtain all the characteristic acoustic signals in one voice command input according to the start time point and the end time point, thereby improving accuracy of command recognition.
In some embodiments, to save power consumption of the display device 200, the process of performing instruction recognition on the signal to be recognized may be performed by the server 400. For this, the display device 200 may establish a communication connection with the server 400 through a communication device. FIG. 12 is a flowchart of instruction recognition performed by the server according to an embodiment of the application. Referring to fig. 12, upon acquiring the signal to be recognized, the controller 250 may send a first signal conversion request to the server 400 and synchronously upload the signal to be recognized. The first signal conversion request is generated when the characteristic sound signal is detected in the second audio and instructs the server 400 to perform instruction recognition on the signal to be recognized.
The server 400 may be a cloud server, after the server 400 responds to the first signal conversion request and receives the signal to be identified, the server 400 may perform instruction identification on the signal to be identified at the cloud, generate a control instruction corresponding to the signal to be identified, and feed the control instruction back to the display device 200, so that the power consumption of the display device 200 is saved in a manner of identifying the signal to be identified at the cloud by the server 400.
Some embodiments of the present application provide a server 400 that may perform the instruction recognition procedure on a user's characteristic sound signal. The server 400 includes a communication module and a recognition module, where the communication module may establish a communication connection with the communication device 220 of the display device 200 to implement data transmission between the server 400 and the display device 200. The recognition module is configured to perform instruction recognition on the characteristic sound signal uploaded by the display device 200, and is specifically configured to:
Acquiring a characteristic sound signal transmitted by the display device 200; the characteristic sound signal is obtained by the display device 200 performing characteristic sound detection on the second audio through the main detection channel, and if the characteristic sound signal is not detected in the second audio, the characteristic sound signal is obtained by the display device 200 performing characteristic sound detection on the first audio through the branch detection channel; the first audio is obtained by the display equipment through a sound collector in response to a voice recognition instruction, and the second audio is obtained by performing beam suppression on the first audio;
Executing instruction identification on the characteristic sound signals to obtain control instructions;
The control instructions are sent to the display device 200 to cause the display device 200 to generate and display a target user interface in response to the control instructions.
As can be seen from the above technical solution, the server 400 provided by the present application may perform instruction recognition on the characteristic acoustic signal by receiving the characteristic acoustic signal sent by the display device 200, and send the control instruction obtained by recognition back to the display device 200, so that the display device 200 performs a corresponding interaction effect according to the control instruction, and displays a corresponding interaction interface, thereby saving power consumption of the display device 200 in performing instruction recognition on the characteristic acoustic signal, and improving speech recognition efficiency.
Some embodiments of the present application further provide a voice instruction recognition method, where the method includes:
S100: responding to a voice recognition instruction, acquiring first audio through the sound collector, and performing beam suppression on the first audio to obtain second audio;
S200: performing characteristic sound detection on the second audio according to the main detection channel;
S300: if the characteristic sound signal is detected in the second audio, executing instruction identification according to the characteristic sound signal of the second audio to obtain a control instruction;
S400: if the characteristic sound signal is not detected in the second audio, executing characteristic sound detection on the first audio through a branch detection channel, and executing instruction identification according to the characteristic sound signal of the first audio when the characteristic sound signal is detected in the first audio to obtain a control instruction;
S500: and generating a target user interface according to the control instruction, and controlling the display to display the target user interface.
According to the technical scheme, the method can respond to the voice recognition instruction, acquire the first audio through the sound collector, and execute beam suppression on the first audio to obtain the second audio. And then, executing characteristic sound detection on the second audio according to the main detection channel, if the characteristic sound signal is detected in the second audio, executing instruction identification according to the characteristic sound signal to obtain a control instruction, and if the characteristic sound signal is not detected in the second audio, executing the characteristic sound detection on the first audio through the branch detection channel to obtain the control instruction, and generating a target user interface based on the control instruction. According to the application, through respectively detecting the original audio input by the user and the audio after the beam suppression is executed, when the display device mistakenly suppresses the human voice signal beam in the audio, the voice interaction of the display device is carried out according to the recognition result of the original audio, and the accuracy of the voice interaction is improved.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. The illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. A display device, characterized by comprising:
a display configured to display a user interface;
a sound collector configured to obtain audio data; and
a controller configured to:
in response to a voice recognition instruction, acquire first audio through the sound collector, and perform beam suppression on the first audio to obtain second audio;
perform characteristic sound detection on the second audio through a main detection channel;
if a characteristic sound signal is detected in the second audio, perform instruction recognition according to the characteristic sound signal of the second audio to obtain a control instruction;
if no characteristic sound signal is detected in the second audio, perform characteristic sound detection on the first audio through a branch detection channel, and, when a characteristic sound signal is detected in the first audio, perform instruction recognition according to the characteristic sound signal of the first audio to obtain a control instruction; and
generate a target user interface according to the control instruction, and control the display to display the target user interface.
2. The display device of claim 1, wherein the controller is further configured to:
record a detection duration of the second audio;
if the detection duration is greater than or equal to a first duration threshold, generate a timeout flag, wherein the timeout flag indicates that detection on the second audio has timed out; and
according to the timeout flag, perform characteristic sound detection on the first audio through the branch detection channel.
3. The display device of claim 1, wherein, if a characteristic sound signal is detected in the second audio, the controller is further configured to:
detect a start time point and a pause time point of the characteristic sound signal of the second audio;
record a pause duration of the characteristic sound signal according to the pause time point;
mark the pause time point as an end time point when the pause duration is greater than or equal to a second duration threshold; and
generate a signal to be recognized from the characteristic sound signal between the start time point and the end time point, and perform instruction recognition on the signal to be recognized to obtain a control instruction.
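As a rough sketch of the endpointing logic in this claim (frame-based, with all names hypothetical and not taken from the patent), the start, pause, and end time points can be tracked over per-frame voice-activity decisions:

```python
def segment_characteristic_signal(frames, pause_frames):
    """Endpoint detection sketch: `frames` is a list of booleans (True =
    characteristic sound present in that frame); a run of `pause_frames`
    silent frames after speech marks the end point."""
    start = end = None
    silence = 0
    for i, voiced in enumerate(frames):
        if voiced:
            if start is None:
                start = i                 # start time point
            silence = 0
        elif start is not None:
            silence += 1                  # pause duration, counted in frames
            if silence >= pause_frames:
                end = i - silence + 1     # pause time point becomes end point
                break
    if start is None:
        return None                       # no characteristic sound at all
    if end is None:
        end = len(frames)                 # audio ended before the pause threshold
    return (start, end)                   # signal to recognize lies in [start, end)
```

The frame span `[start, end)` is what would then be cut out as the signal to be recognized.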
4. The display device according to claim 3, further comprising a communicator configured to be communicatively connected to a server, wherein, in performing instruction recognition on the signal to be recognized, the controller is configured to:
send a first signal conversion request and the signal to be recognized to the server, wherein the first signal conversion request is generated when the characteristic sound signal is detected in the second audio, and instructs the server to perform instruction recognition on the signal to be recognized; and
receive the control instruction fed back by the server after performing instruction recognition on the signal to be recognized in response to the first signal conversion request.
5. The display device of claim 1, wherein, in performing beam suppression on the first audio, the controller is configured to:
perform noise reduction processing on the first audio to obtain noise-reduced audio;
perform audio analysis on the noise-reduced audio to obtain a sound source direction of the noise-reduced audio; and
perform beam suppression on the noise-reduced audio based on the sound source direction to obtain the second audio.
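A heavily simplified, single-channel toy version of this three-stage pipeline might look as follows — real beam suppression requires a multi-microphone front end, and `noise_floor`, `direction_energy`, and `interference_dir` are illustrative inputs, not names from the patent:

```python
def beam_suppress(first_audio, noise_floor, direction_energy, interference_dir):
    """Toy sketch of the claim-5 pipeline. `first_audio` is a list of
    samples, `noise_floor` a scalar noise estimate, `direction_energy`
    maps candidate directions (degrees) to beamformed energy as measured
    by an assumed microphone-array front end."""
    # 1) noise reduction: zero out samples at or below the noise floor
    denoised = [s if abs(s) > noise_floor else 0.0 for s in first_audio]
    # 2) audio analysis: take the loudest direction as the sound source
    source_dir = max(direction_energy, key=direction_energy.get)
    # 3) beam suppression: attenuate when the source coincides with a known
    #    interference direction (e.g. the display's own loudspeakers);
    #    a real system would steer a spatial null instead of a flat gain
    gain = 0.1 if source_dir == interference_dir else 1.0
    return [s * gain for s in denoised], source_dir
```

The failure mode the patent guards against is visible here: if the user's voice happens to arrive from `interference_dir`, step 3 wrongly attenuates it, which is why the branch channel re-examines the unsuppressed first audio.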
6. The display device of claim 1, wherein, in performing characteristic sound detection on the second audio through the main detection channel, the controller is configured to:
perform spectral analysis on the second audio to obtain audio features and a spectral signal;
extract target features from the audio features through a characteristic sound classifier, wherein the target features represent the audio features corresponding to the characteristic sound; and
extract the characteristic sound signal from the spectral signal according to the target features.
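A minimal sketch of this detector, under the assumption that the "spectral signal" is a map from frequency bins to magnitudes and the classifier is reduced to a per-bin predicate (`classify_band` is a hypothetical stand-in for the trained characteristic sound classifier):

```python
def extract_characteristic_signal(spectrum, classify_band):
    """Toy main-channel detector: `spectrum` maps frequency bins (Hz) to
    magnitudes; `classify_band(freq, mag)` returns True for bins the
    classifier attributes to the characteristic sound."""
    # target features: the bins the classifier flags as characteristic sound
    target_bins = {f for f, mag in spectrum.items() if classify_band(f, mag)}
    # extract the characteristic sound signal from the spectral signal
    characteristic = {f: spectrum[f] for f in target_bins}
    return characteristic or None   # None signals "no characteristic sound detected"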
7. The display device of claim 1, wherein, in performing instruction recognition according to the characteristic sound signal of the second audio, the controller is configured to:
generate a signal text from the characteristic sound signal of the second audio according to a speech recognition model;
input the signal text into a language processing model to obtain semantic information, wherein the semantic information represents classification probabilities of the signal text over semantic feature labels; and
generate a control instruction according to the semantic information and the semantic feature labels.
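This two-stage recognition (speech-to-text, then semantic classification) can be sketched as below; the model callables and the label-to-instruction table are hypothetical placeholders, not APIs defined by the patent:

```python
def recognize_instruction(signal, asr_model, nlu_model, label_actions):
    """Toy two-stage recognizer: `asr_model` turns the characteristic
    sound signal into text, `nlu_model` returns a probability per
    semantic feature label, and `label_actions` maps labels to control
    instructions."""
    text = asr_model(signal)              # speech recognition model -> signal text
    probs = nlu_model(text)               # semantic information (label -> probability)
    label = max(probs, key=probs.get)     # most probable semantic feature label
    return label_actions.get(label)       # control instruction (None if unmapped)
```

Taking the argmax over label probabilities is one simple policy; a production system might also apply a confidence threshold before acting.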
8. The display device of claim 1, wherein, when no characteristic sound signal is detected in the first audio, the controller is further configured to:
generate prompt information indicating that no characteristic sound signal was detected in either the first audio or the second audio;
create an information popup according to the prompt information; and
control the display to display the information popup in the user interface.
9. A server, comprising:
a communication module configured to be communicatively connected with a display device; and
a recognition module configured to:
acquire a characteristic sound signal sent by the display device, wherein the characteristic sound signal is obtained by the display device performing characteristic sound detection on second audio through a main detection channel, or, if no characteristic sound signal is detected in the second audio, by the display device performing characteristic sound detection on first audio through a branch detection channel; the first audio is acquired by the display device through a sound collector in response to a voice recognition instruction, and the second audio is obtained by performing beam suppression on the first audio;
perform instruction recognition on the characteristic sound signal to obtain a control instruction; and
send the control instruction to the display device, so that the display device generates and displays a target user interface in response to the control instruction.
10. A voice instruction recognition method, applied to the display device of any one of claims 1-8, the display device comprising a display configured to display a user interface, a sound collector configured to obtain audio data, and a controller; the method comprising:
in response to a voice recognition instruction, acquiring first audio through the sound collector, and performing beam suppression on the first audio to obtain second audio;
performing characteristic sound detection on the second audio through a main detection channel;
if a characteristic sound signal is detected in the second audio, performing instruction recognition according to the characteristic sound signal of the second audio to obtain a control instruction;
if no characteristic sound signal is detected in the second audio, performing characteristic sound detection on the first audio through a branch detection channel, and, when a characteristic sound signal is detected in the first audio, performing instruction recognition according to the characteristic sound signal of the first audio to obtain a control instruction; and
generating a target user interface according to the control instruction, and controlling the display to display the target user interface.
CN202410522155.5A 2024-04-28 2024-04-28 Display device, server and voice instruction recognition method Pending CN118283339A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410522155.5A CN118283339A (en) 2024-04-28 2024-04-28 Display device, server and voice instruction recognition method

Publications (1)

Publication Number Publication Date
CN118283339A true CN118283339A (en) 2024-07-02

Family

ID=91638303



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination