CN117809646A - Display device, server, voice processing method, and storage medium


Info

Publication number
CN117809646A
Authority
CN
China
Prior art keywords
voice command
voice
instruction
command
integrity
Legal status
Pending
Application number
CN202311104802.2A
Other languages
Chinese (zh)
Inventor
巩家旭
刘天娇
Current Assignee
Hisense Visual Technology Co Ltd
Original Assignee
Hisense Visual Technology Co Ltd
Application filed by Hisense Visual Technology Co Ltd
Priority to CN202311104802.2A
Publication of CN117809646A

Landscapes

  • User Interface Of Digital Computer (AREA)

Abstract

Embodiments of the invention disclose a display device, a server, a voice processing method, and a storage medium, relating to the technical field of voice interaction. The display device comprises a controller configured to: when the pause duration of a voice command exceeds a pause threshold, determine a tail-point continuity detection result for the voice command from the audio signal power of the voice command; when the tail-point continuity detection result indicates that the tail point of the voice command is a non-stop tail point, send an integrity detection request to a server, so that the server detects the semantic integrity of the voice command with a pre-trained semantic integrity model, performs semantic analysis on the voice command if it is complete, and determines a feedback instruction from the analysis result; and receive the feedback instruction sent by the server and control the display to show the target interface corresponding to the feedback instruction. Applying this scheme improves the accuracy of voice-command integrity detection.

Description

Display device, server, voice processing method, and storage medium
Technical Field
The present invention relates to the field of voice interaction technologies, and in particular, to a display device, a server, a voice processing method, and a storage medium.
Background
With the development of voice processing and computer technology, voice interaction has become an important mode of human-machine interaction. For example, a user can enter voice commands through the voice functions provided by a display device, so that the display device executes the actions they indicate, such as opening an application or powering the device on or off.
In practice, however, because speaking habits such as speech rate differ from person to person, a display device receiving a voice command sometimes treats the input as finished during a pause, misrecognizing the user's intent, and sometimes keeps waiting for further input after the user has actually finished speaking, delaying feedback. As a result, the integrity detection of voice commands is inaccurate, the response is slow, and the user experience is poor.
Disclosure of Invention
Embodiments of the invention provide a display device, a server, a voice processing method, and a storage medium, which address the low accuracy of voice-command integrity detection in the prior art.
In order to achieve the above purpose, the embodiment of the present invention adopts the following technical scheme:
according to an aspect of an embodiment of the present invention, there is provided a display apparatus including: a display configured to display a user interface; a receiver configured to receive a voice instruction input by a user; a controller coupled to the display and the receiver, respectively, the controller configured to: detecting the pause time of the voice command when the voice command is received; determining a tail point continuity detection result of the voice instruction according to the audio signal power of the voice instruction under the condition that the pause duration is greater than a pause threshold value, wherein the pause threshold value is determined according to the acoustic characteristic parameters of the user; sending an integrity detection request to a server under the condition that the tail point continuity detection result indicates that the tail point of the voice command is a non-stop tail point, so that the server adopts a pre-trained semantic integrity model to detect the semantic integrity of the voice command, carrying out semantic analysis on the voice command under the condition that the voice command is a complete voice command, and determining a feedback command of the voice command based on the analysis result; and receiving the feedback instruction sent by the server, and controlling the display to display a target interface corresponding to the feedback instruction.
In some embodiments, the pause threshold comprises a first threshold and a second threshold, the first threshold being less than or equal to the second threshold. The controller is configured to: determine that the integrity detection result indicates an incomplete voice command when the pause duration is less than the first threshold; and determine that the integrity detection result indicates a complete voice command when the pause duration is greater than the second threshold.
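The two-threshold pre-check described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function name, millisecond units, and string return values are all assumptions:

```python
def precheck_integrity(pause_ms: float, first_ms: float, second_ms: float) -> str:
    """Pre-check command integrity from pause duration alone.

    Below the first threshold the command is judged incomplete; above the
    second it is judged complete; in between the result is undetermined and
    the tail-point continuity check takes over.
    """
    if pause_ms < first_ms:
        return "incomplete"
    if pause_ms > second_ms:
        return "complete"
    return "undetermined"
```

A pause of 400 ms with thresholds of 200 ms and 600 ms would thus fall into the undetermined band and be passed on to the tail-point check.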
In some embodiments, the controller is further configured to: before receiving the voice command, receive a wake-up command input by the user; extract a voiceprint feature vector from the wake-up command, match it against a preset voiceprint feature vector library, and determine the acoustic characteristic parameters corresponding to the wake-up command, the acoustic characteristic parameters comprising at least one of a speech rate parameter, a voice intensity parameter, and a tone parameter; determine the first threshold based on the acoustic characteristic parameters corresponding to the wake-up command; and adjust the first threshold according to the acoustic characteristic parameter library corresponding to the user to obtain the second threshold.
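One plausible reading of this embodiment is sketched below: match the wake-up command's voiceprint vector against a stored library by cosine similarity, then derive per-user pause thresholds from the matched speaker's acoustic parameters. The similarity measure, the library layout, and the threshold formula are all illustrative assumptions, not the patent's method:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def match_voiceprint(wake_vec, library):
    """Return (user_id, acoustic_params) for the stored voiceprint closest
    to the wake-up command's feature vector."""
    best = max(library.items(),
               key=lambda kv: cosine_similarity(wake_vec, kv[1]["vector"]))
    return best[0], best[1]["params"]

def pause_thresholds(params, base_ms=300.0):
    # A slower speaker (lower speech_rate) gets a longer first threshold;
    # the second threshold widens it by a per-user scale factor.
    first = base_ms / max(params["speech_rate"], 1e-6)
    second = first * params.get("pause_scale", 1.5)
    return first, second
```

In use, the wake-up phrase selects the speaker profile once, and both thresholds then apply to every subsequent voice command in the session.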
In some embodiments, the controller is configured to: calculate, from the audio signal power of the voice command, the audio signal power increment between the last two adjacent audio signal frames; determine that the tail-point continuity detection result indicates a discontinuous tail point when the power increment is greater than a first increment threshold; and determine that the tail-point continuity detection result indicates a continuous tail point when the power increment is less than a second increment threshold, the second increment threshold being less than or equal to the first increment threshold.
In some embodiments, the controller is configured to: determine that the integrity detection result indicates an incomplete voice command when the tail-point continuity detection result indicates that the tail point of the voice command is discontinuous; and perform the step of sending an integrity detection request to the server when the tail-point continuity detection result indicates that the tail point of the voice command is continuous.
In some embodiments, the controller is further configured to: when the integrity detection result indicates that the voice command is incomplete, receive a continuation of the voice command and determine the integrity detection result of the voice command from the continuation.
According to another aspect of the embodiments of the present invention, there is provided a server comprising a controller configured to: receive an integrity detection request sent by a display device; in response to the integrity detection request, detect the semantic integrity of the voice command with a pre-trained semantic integrity model; when the voice command is complete, perform semantic analysis on it and determine a feedback instruction from the analysis result; and send the feedback instruction to the display device, so that the display device controls its display to show the target interface corresponding to the feedback instruction. The integrity detection request is generated when the display device, while receiving the voice command, detects that the pause duration of the voice command is greater than a pause threshold and determines from the audio signal power of the voice command that the tail-point continuity detection result indicates a non-stop tail point, the pause threshold being determined from acoustic characteristic parameters of the user.
In some embodiments, the controller is further configured to: match the voice command against a complete-voice-command library and determine that the integrity detection result indicates a complete voice command when it matches any entry in that library; and/or match the voice command against an incomplete-voice-command library and determine that the integrity detection result indicates an incomplete voice command when it matches any entry in that library.
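This library lookup amounts to a fast path in front of the model. A minimal sketch, with assumed names and an exact-match policy that the patent does not specify:

```python
def library_precheck(command, complete_lib, incomplete_lib):
    """Fast-path integrity check against two instruction libraries before
    falling back to the semantic integrity model."""
    if command in complete_lib:
        return "complete"
    if command in incomplete_lib:
        return "incomplete"
    return None  # undecided: defer to the model
```

Sets give O(1) membership tests, so this check adds negligible latency before the model is invoked.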
In some embodiments, the controller is further configured to: training the initial semantic integrity model by using a training data set to generate the pre-trained semantic integrity model.
In some embodiments, the controller is further configured to: acquire a set of real voice commands and segment the high-frequency voice commands in the set to obtain a plurality of sub-commands; analyze each sub-command with a semantic recognition model and determine its semantic recognition result; when the semantic recognition result of a sub-command differs from that of the high-frequency voice command, label the sub-command with a first class, the first class indicating an incomplete voice command; or, when the semantic recognition result of a sub-command is the same as that of the high-frequency voice command, label the sub-command with a second class, the second class indicating a complete voice command.
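The labeling procedure above can be sketched by truncating a high-frequency command into prefixes and comparing each prefix's recognized intent against the full command's. Prefix segmentation and the numeric labels are assumptions here; the patent does not fix the segmentation scheme, and `recognize` stands in for the semantic recognition model:

```python
def label_sub_commands(high_freq_cmd, recognize):
    """Generate labeled training pairs from one high-frequency command.

    Label 1 = same intent as the full command (complete, "second class");
    label 0 = different intent (incomplete, "first class").
    """
    full_intent = recognize(high_freq_cmd)
    return [(high_freq_cmd[:i],
             1 if recognize(high_freq_cmd[:i]) == full_intent else 0)
            for i in range(1, len(high_freq_cmd))]
```

Running this over the high-frequency commands of a real command set yields the labeled data set used to train the semantic integrity model.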
According to still another aspect of the embodiments of the present invention, there is provided a voice processing method applied to a display device, the method comprising: detecting the pause duration of a voice command while receiving the voice command input by a user; when the pause duration is greater than a pause threshold, determining a tail-point continuity detection result for the voice command from the audio signal power of the voice command, the pause threshold being determined from acoustic characteristic parameters of the user; when the tail-point continuity detection result indicates that the tail point of the voice command is a non-stop tail point, sending an integrity detection request to a server, so that the server detects the semantic integrity of the voice command with a pre-trained semantic integrity model, performs semantic analysis on the voice command if it is complete, and determines a feedback instruction for the voice command from the analysis result; and receiving the feedback instruction sent by the server and controlling the display to show the target interface corresponding to the feedback instruction.
According to still another aspect of the embodiments of the present invention, there is provided a voice processing method applied to a server, the method comprising: receiving an integrity detection request sent by a display device; in response to the integrity detection request, detecting the semantic integrity of the voice command with a pre-trained semantic integrity model; when the voice command is complete, performing semantic analysis on it and determining a feedback instruction from the analysis result; and sending the feedback instruction to the display device, so that the display device controls its display to show the target interface corresponding to the feedback instruction. The integrity detection request is generated when the display device, while receiving the voice command, detects that the pause duration of the voice command is greater than a pause threshold and determines from the audio signal power of the voice command that the tail-point continuity detection result indicates a non-stop tail point, the pause threshold being determined from acoustic characteristic parameters of the user.
According to yet another aspect of the embodiments of the present invention, there is provided a computer-readable storage medium having stored therein at least one executable instruction which, when executed by a processor, implements the operations of the voice processing method described above.
According to the display device, server, voice processing method, and storage medium provided by the embodiments of the invention, the display device can detect the pause duration of a voice command while receiving it; when the pause duration exceeds the pause threshold, determine the tail-point continuity detection result from the audio signal power of the command; and when that result indicates a non-stop tail point, send an integrity detection request to the server, so that the server detects the semantic integrity of the command with a pre-trained semantic integrity model, performs semantic analysis when the command is complete, and determines a feedback instruction from the analysis result. The display device then receives the feedback instruction sent by the server and controls the display to show the corresponding target interface.
With this scheme, the display device can detect pause durations while receiving a voice command, and when a pause exceeds the pause threshold it can determine the tail-point continuity of the command from the audio signal power, achieving a preliminary judgment of command integrity.
Drawings
Fig. 1 shows an interaction schematic diagram of a display device and a control device according to an embodiment of the present invention;
fig. 2 shows a block diagram of a configuration of a control device in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a voice interaction scenario provided by an embodiment of the present invention;
FIG. 4 is a flowchart of a speech recognition method according to an embodiment of the present invention;
FIG. 5 shows a sub-flowchart of a speech recognition method provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram illustrating processing of a speech signal according to an embodiment of the present invention;
fig. 7 shows a schematic diagram of sound spectrum of a voice command according to an embodiment of the present invention;
FIG. 8 is a sub-flowchart of another speech recognition method according to an embodiment of the present invention;
fig. 9 is a schematic diagram of a tail point continuity detecting method according to an embodiment of the present invention;
fig. 10 shows a spectrum example of audio signal power provided by an embodiment of the present invention;
FIG. 11 is a flowchart of another speech recognition method according to an embodiment of the present invention;
FIG. 12 is a flow chart of yet another speech recognition method provided by an embodiment of the present invention;
FIG. 13 is a schematic diagram of an instruction database according to an embodiment of the present invention;
FIG. 14 is a flow chart of yet another speech recognition method provided by an embodiment of the present invention;
FIG. 15 is a schematic diagram of a method for labeling training data sets according to an embodiment of the present invention;
FIG. 16 is a schematic diagram of a speech processing method according to an embodiment of the present invention;
FIG. 17 is a flow chart illustrating yet another speech recognition method provided by an embodiment of the present invention;
FIG. 18 is a flow chart illustrating yet another speech recognition method provided by an embodiment of the present invention;
fig. 19 is a schematic diagram illustrating a stage of a speech recognition method according to an embodiment of the present invention.
Detailed Description
To make the objects and embodiments of the present invention clearer, exemplary embodiments of the invention are described in detail below with reference to the accompanying drawings. It will be apparent that the described embodiments are only some, not all, of the embodiments of the invention.
It should be noted that the brief description of the terminology in the present invention is for the purpose of facilitating understanding of the embodiments described below only and is not intended to limit the embodiments of the present invention. Unless otherwise indicated, these terms should be construed in their ordinary and customary meaning.
The terms "first", "second", "third", and the like in the description, the claims, and the above figures are used to distinguish similar objects or entities and do not necessarily describe a particular order or sequence unless otherwise indicated. It is to be understood that terms so used are interchangeable under appropriate circumstances.
The terms "comprises," "comprising," and any variations thereof herein are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements explicitly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
During voice interaction, to determine whether a user has finished entering a voice command, that is, finished speaking, a fixed time threshold T is generally used to judge the input state: when the user's speaking pause is longer than T, the command is considered fully entered; otherwise, the command is considered still in progress and continues to be received.
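The fixed-threshold baseline amounts to a single comparison, sketched here for contrast with the adaptive scheme that follows (the 800 ms value is an assumption for illustration, not a value from the patent):

```python
def input_finished(pause_ms, fixed_t_ms=800.0):
    # Baseline: one threshold T for every speaker, regardless of speech habits.
    return pause_ms > fixed_t_ms
```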
However, a single fixed threshold T is one-size-fits-all, while each person's speaking habits differ. A command is easily treated as fully entered while the user is merely pausing, so reception stops before the user has finished speaking; conversely, when the user has finished but the pause has not yet reached T, the device keeps waiting, making it feel very slow to respond. The user experience is therefore poor.
In view of one or more of the foregoing problems, embodiments of the present invention provide a display device, a server, a voice processing method, and a storage medium. Fig. 1 shows an interaction schematic diagram of a display device and a control device according to an embodiment of the present invention. As shown in fig. 1, a user may operate the display device 200 through the mobile terminal 300 or the control device 100. The control device 100 may be a remote controller; the remote controller and the display device 200 may communicate through an infrared protocol or a Bluetooth protocol, or the remote controller may control the display device 200 through other wireless or wired means.
The user may input user instructions through keys on a remote controller, voice input, a control panel, etc., to control the display device 200. For example, the user may control the display device 200 to switch a displayed page through up-down keys on the remote controller, control the video played by the display device 200 to play or pause through play pause keys, and input a voice command through voice input keys to control the display device 200 to perform a corresponding operation.
In some embodiments, the user may also control the display device 200 using a mobile terminal, tablet, computer, notebook, and other smart device. For example, a user may control the display device 200 through an application installed on the smart device that, by configuration, may provide the user with various controls in an intuitive user interface on a screen associated with the smart device.
In some embodiments, the mobile terminal 300 may implement connection communication with a software application installed on the display device 200 through a network communication protocol for the purpose of one-to-one control operation and data communication. For example, it may be realized that a control instruction protocol is established between the mobile terminal 300 and the display device 200, a remote control keyboard is synchronized to the mobile terminal 300, a function of controlling the display device 200 is realized by controlling a user interface on the mobile terminal 300, or a function of transmitting contents displayed on the mobile terminal 300 to the display device 200 to realize synchronous display is also realized.
As shown in fig. 1, the display device 200 and the server 400 may exchange data in a variety of communication manners; the display device 200 may be communicatively connected via a local area network (Local Area Network, LAN), a wireless local area network (Wireless Local Area Network, WLAN), or other networks. The server 400 may provide various contents and interactions to the display device 200. For example, the display device 200 receives software program updates by sending and receiving messages and electronic program guide (Electrical Program Guide, EPG) interactions, or accesses a remotely stored digital media library. The server 400 may be one cluster or multiple clusters, and may include one or more types of servers.
The display device 200 may be a liquid crystal display, an Organic Light-Emitting Diode (OLED) display, a projection display device, a smart terminal, such as a mobile phone, a tablet computer, a smart television, a laser projection device, an electronic desktop (electronic table), etc. The specific display device type, size, resolution, etc. are not limited.
Fig. 2 shows a block diagram of a configuration of the control device 100 in an exemplary embodiment of the present invention, and as shown in fig. 2, the control device 100 includes a controller 110, a communication interface 130, a user input/output interface 140, a memory, and a power supply. The control apparatus 100 may receive an operation instruction input by a user and convert the operation instruction into an instruction recognizable and responsive to the display device 200, and may interact with the display device 200.
Taking a display device as an example of a television, fig. 3 shows a hardware configuration block diagram of a display device 200 according to an embodiment of the present invention. As shown in fig. 3, the display device 200 includes: a modem 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, and at least one of a memory, a power supply, a user interface, a receiver.
The modem 210 may receive broadcast television signals through a wired or wireless reception manner and demodulate an audio/video signal, such as an EPG data signal, from a plurality of wireless or wired broadcast television signals. The detector 230 may be used to collect signals of the external environment or interaction with the outside.
In some embodiments, the frequency point demodulated by the modem 210 is controlled by the controller 250, and the controller 250 may issue a control signal according to the user selection, so that the modem responds to the television signal frequency selected by the user and modulates and demodulates the television signal carried by the frequency.
The broadcast television signal may be classified into a terrestrial broadcast signal, a cable broadcast signal, a satellite broadcast signal, an internet broadcast signal, or the like according to different broadcasting systems of the television signal. Or may be differentiated into digital modulation signals, analog modulation signals, etc., depending on the type of modulation. And further, the signals are classified into digital signals, analog signals and the like according to different signal types.
In some embodiments, the controller 250 and the modem 210 may be located in separate devices, i.e., the modem 210 may also be located in an external device to the main device in which the controller 250 is located, such as an external set-top box or the like.
In some embodiments, communicator 220 may be a component for communicating with external devices or external servers according to various communication protocol types. For example: the communicator may include at least one of a Wifi chip, a bluetooth communication protocol chip, a wired ethernet communication protocol chip, or other network communication protocol chip or a near field communication protocol chip, and an infrared receiver.
In some embodiments, the detector 230 may be used to collect signals of or interact with the external environment, may include an optical receiver and a temperature sensor, etc.
The light receiver is a sensor for acquiring ambient light intensity, so that display parameters and the like can be adjusted adaptively according to it. The temperature sensor can sense the ambient temperature so that the display device 200 adaptively adjusts the display color temperature of the image, for example toward a cooler tone when the ambient temperature is high, or toward a warmer tone when the ambient temperature is low.
In some embodiments, the detector 230 may further include an image collector, such as a camera, a video camera, etc., which may be used to collect external environmental scenes, collect attributes of a user or interact with a user, adaptively change display parameters, and recognize a user gesture to realize an interaction function with the user.
In some embodiments, the detector 230 may also include a sound collector, such as a microphone, that may be used to receive the user's voice. For example, it may receive a voice signal containing a user's control command for the display device 200, or collect environmental sound to recognize the type of environmental scene, so that the display device 200 can adapt to ambient noise.
In some embodiments, external device interface 240 may include, but is not limited to, the following: any one or more interfaces such as a high-definition multimedia interface (High Definition Multimedia Interface, HDMI), an analog or data high-definition component input interface, a composite video input interface, a universal serial bus (Universal Serial Bus, USB) input interface, an RGB port, or the like may be used, or the interfaces may form a composite input/output interface.
As shown in fig. 3, the controller 250 may include at least one of a central processor, a video processor, an audio processor, a graphics processor, a random access memory (Random Access Memory, RAM), a read-only memory (Read-Only Memory, ROM), and first through nth interfaces for input/output, with a communication bus connecting the various components.
In some embodiments, the controller 250 may control the operation of the display device and respond to the user's operations through various software control programs stored on an external memory. For example, a user may input a user command through a graphical user interface (Graphic User Interface, GUI) displayed on the display 260, the user input interface receiving the user input command through the graphical user interface, or the user may input the user command by inputting a specific sound or gesture, the user input interface recognizing the sound or gesture through the sensor.
A "user interface" is a media interface for interaction and exchange of information between an application or operating system and a user that enables conversion between an internal form of information and a user-acceptable form. A commonly used presentation form of a user interface is a graphical user interface, which refers to a user interface related to computer operations that is displayed in a graphical manner. The control can comprise at least one of an icon, a button, a menu, a tab, a text box, a dialog box, a status bar, a navigation bar, a Widget (short for Widget) and other visual interface elements.
In some embodiments, RAM may be used to store temporary data for the operating system or other on-the-fly programs; ROM may be used to store instructions for various system starts, for example, may be used to store instructions for a basic input output system, referred to as a basic input output system (Basic Input Output System, BIOS) start. ROM can be used to complete the power-on self-test of the system, the initialization of each functional module in the system, the driving program of the basic input/output of the system and the booting of the operating system.
In some embodiments, upon receipt of a power-on signal, the display device 200 begins to boot, and the central processor runs the system boot instructions in the ROM and copies the temporary data of the operating system stored in memory into the RAM in order to start or run the operating system. When the operating system has finished starting, the central processor copies the temporary data of various application programs in memory into the RAM, so that the various application programs can then be started or run.
In some embodiments, the central processor may be configured to execute operating system and application instructions stored in memory, and to execute various applications, data, and content in accordance with various interactive instructions received from external inputs, to ultimately display and play various audio-visual content.
In some example embodiments, the central processor may include a plurality of processors. The plurality of processors may include one main processor and one or more sub-processors. The main processor performs some operations of the display apparatus 200 in the pre-power-up mode and/or displays pictures in the normal mode. The one or more sub-processors handle operations in the standby mode and the like.
In some embodiments, the video processor may be configured to receive an external video signal and perform video processing in accordance with the standard codec protocol of the input signal, such as decompression, decoding, scaling, noise reduction, frame rate conversion, resolution conversion, transparency setting, and image composition, to obtain a signal that can be displayed or played directly on the display device 200.
In some embodiments, the video processor may include a demultiplexing module, a video decoding module, an image compositing module, a frame rate conversion module, a display formatting module, and the like.
The demultiplexing module is used for demultiplexing an input audio/video data stream, such as an input Moving Picture Experts Group-2 (MPEG-2) stream, into video signals, audio signals, and the like; the video decoding module is used for processing the demultiplexed video signal, including decoding, scaling, transparency setting, and the like.
The image synthesis module, such as an image synthesizer, is used for superimposing and mixing the GUI signal, input by the user or generated by a graphics generator, with the scaled video image, so as to generate an image signal for display. The frame rate conversion module is configured to convert the input video frame rate, for example, converting a 60Hz frame rate into a 120Hz or 240Hz frame rate, typically by frame interpolation. The display formatting module is used for converting the received frame-rate-converted video signal into a video output signal conforming to the display format, such as an RGB data signal.
In some embodiments, the audio processor may be configured to receive an external audio signal, decompress and decode the audio signal according to a standard codec protocol of the input signal, and perform noise reduction, digital-to-analog conversion, and amplification processes to obtain a sound signal that may be played in a speaker.
In some embodiments, the video processor may comprise one or more chips, and the audio processor may likewise comprise one or more chips. Alternatively, the video processor and the audio processor may be combined in a single chip, or may be integrated with the controller in one or more chips.
In some embodiments, the interface for input/output may be used for audio output, that is, for receiving the sound signal output by the audio processor under the control of the controller 250 and outputting it to a sound-producing device such as a speaker. Besides the speaker carried by the display device 200 itself, the sound signal may also be output to an external sound output terminal of an external device, such as an external sound interface or an earphone interface. The audio output may also include a near field communication module in the communication interface, for example a Bluetooth module, for outputting sound through a speaker connected via Bluetooth.
In some embodiments, the graphics processor may be used to generate various graphical objects, such as icons, operation menus, and graphics displayed in response to user input instructions. The graphics processor may include an operator, which receives the various interactive instructions input by the user, performs the corresponding operations, and displays the various objects according to their display attributes; and a renderer, which renders the various objects obtained by the operator, the rendered objects being used for display on the display.
In some embodiments, the graphics processor and the video processor may be integrated or separately configured. The integrated configuration may perform processing of graphics signals output to the display, while the separate configuration may perform different functions respectively, for example a graphics processing unit (Graphics Processing Unit, GPU) plus frame rate conversion (Frame Rate Conversion, FRC) architecture.
The display 260 may be at least one of a liquid crystal display, an OLED display, a touch display, and a projection display, and may also be a projection device and a projection screen.
In some embodiments, the display 260 may be used to display a user interface, such as a display interface that may be used to display an application. The display interface is any page in the application program, for example, the display interface may be an interface corresponding to a certain channel in the application program.
In some embodiments, the display 260 may be used to receive audio and video signals output by the audio processor and video processor, display video content and images, play audio of the video content, and display components of a menu manipulation interface.
In some embodiments, the display 260 may be used to present a user-operated UI interface generated in the display device 200 and used to control the display device 200.
In some embodiments, the display device 200 may establish control signal and data signal transmission and reception between the communicator 220 and the external control device 100 or the content providing device.
In some embodiments, the memory may include storage of various software modules for driving the display device 200. Such as: various software modules stored in the first memory, including: at least one of a base module, a detection module, a communication module, a display control module, a browser module, various service modules, and the like.
The base module is a bottom software module for communicating signals between the various hardware in the display device 200 and sending processing and control signals to the upper module. The detection module is used for collecting various information from various sensors or user input interfaces and carrying out digital-to-analog conversion and analysis management.
For example, the voice recognition module may include a voice parsing module and a voice command database module. The display control module can be used for controlling the display to display image content, and for playing multimedia image content, UI interfaces, and other information. The communication module can be used for control and data communication with external devices. The browser module can be used for performing data communication with browsing servers. The service module is used for providing various services and various application programs. Meanwhile, the memory may also receive external data and user data, and store images of various items in various user interfaces, visual effect patterns of the focus object, and the like.
In some embodiments, the user interface may be used to receive signals from the control device 100, such as an infrared control signal transmitted by an infrared remote controller.
The power supply source may supply power to the display device 200 through power input from an external power source under the control of the controller 250.
In some embodiments, the display device 200 may receive user input operations through the communicator 220. For example, when communicator 220 is a touch component, the touch component may together with display 260 form a touch screen. On the touch screen, a user can input different control instructions through touch operation, for example, the user can input touch instructions such as clicking, sliding, long pressing, double clicking and the like, and different touch instructions can represent different control functions.
To distinguish the different touch actions, the touch assembly may generate different electrical signals when the user inputs different touch actions, and transmit the generated electrical signals to the controller 250. The controller 250 may perform feature extraction on the received electrical signal to determine the control function to be performed by the user based on the extracted features.
For example, when a user enters a click touch action at any program icon location in the display interface of an application, the touch assembly 220 will sense the touch action and thereby generate an electrical signal. After receiving the electrical signal, the controller 250 may determine the duration of the level corresponding to the touch action in the electrical signal, and recognize that the user has input a click command when the duration is less than a preset time threshold. The controller 250 then extracts the location features from the electrical signal to determine the touch location. When the touch location is within the application icon display range, it is determined that the user has input a click touch instruction at the application icon location. Accordingly, in the current scene, the click touch instruction is used to run the application program corresponding to the icon or to switch the display interface; that is, the controller 250 may start the application program or switch the display interface within the application.
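The click-recognition logic just described (level duration below a preset threshold plus a position check against the icon bounds) can be sketched as follows. All names and the 300ms threshold are illustrative assumptions, not values from the patent:

```python
from dataclasses import dataclass

CLICK_MAX_DURATION_MS = 300  # assumed preset time threshold for a click


@dataclass
class TouchEvent:
    duration_ms: float  # duration of the level in the electrical signal
    x: float
    y: float


@dataclass
class IconBounds:
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def is_click_on_icon(event: TouchEvent, icon: IconBounds) -> bool:
    """True when the touch is short enough to be a click and lands on the icon."""
    return event.duration_ms < CLICK_MAX_DURATION_MS and icon.contains(event.x, event.y)


print(is_click_on_icon(TouchEvent(120, 50, 50), IconBounds(0, 0, 100, 100)))  # True
print(is_click_on_icon(TouchEvent(800, 50, 50), IconBounds(0, 0, 100, 100)))  # False
```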
In some embodiments, the display device 200 may be controlled by voice instructions. For example, on a touch screen, a user may input a voice command through a touch operation, such as the user may trigger a voice input operation on the display 260 through a voice trigger gesture, and a receiver may be used to receive the voice command input by the user after triggering the voice input operation.
In some embodiments, the communicator 220 may also be an external control component, such as a mouse, remote control, or the like, which may establish a communication connection with a display device. When the user performs different control operations on the external control component, the external control component may generate different control signals in response to the control operations of the user and transmit the generated control signals to the controller 250. The controller 250 may perform feature extraction on the received control signal to determine a control function to be performed by the user according to the extracted features.
For example, when a user clicks the left mouse button at any position in the display interface of an application program through the external control component, the external control component may sense the control action and generate a control signal. After receiving the control signal, the controller 250 may determine from the control signal the dwell time of the action at that position, and recognize that the user has input a click command through the external control component when the dwell time is less than a preset time threshold. In the current scene, the click instruction is used to open the corresponding interface of the application program, to switch the display interface within the application program, or to control the position of the mouse cursor.
For another example, when a user presses a voice key on the remote control, the remote control may initiate a voice entry function, the voice command entered by the user is received by the receiver, and during the user entry of the voice command, the remote control may synchronize the voice command to the display 260, at which time the display 260 may display a voice entry identification to indicate that the user is entering the voice command.
In some embodiments, the communicator 220 may also be a control component coupled to the display 260; for example, in a desktop computer scenario, it may be a keyboard coupled to the display. The user can input different control instructions, such as clicking and switching operation instructions, through the keyboard.
Illustratively, the user may input a click command, a switch command, etc. through the corresponding shortcut key. For example, the user may trigger a sliding operation by selecting the "Tab" key and the direction key, i.e., when the user selects the "Tab" key and the direction key on the keyboard at the same time, the controller 250 may receive the key signal, determine that the user triggers an operation of performing a switching operation in a direction corresponding to the direction key, and then, the controller 250 may control the display interface to perform a page turning or scrolling to display a corresponding content page.
Correspondingly, the user can also input voice instructions through corresponding shortcut keys. For example, when the user selects the "Ctrl" key and the "V" key, the controller 250 may receive a key signal, determine that the user triggers a voice input operation, and then the receiver may receive a voice command input by the user and control the display 260 to perform a corresponding operation, such as displaying an interface indicated by the voice command, etc., according to the voice command.
Fig. 3 is a schematic diagram of a voice interaction scenario provided in an embodiment of the present invention. As shown in fig. 3, when a user starts speaking into the far-field MIC (Microphone) 310 and inputs a voice command, the audio buffer of the far-field MIC 310 may receive and temporarily store the voice command input by the user. The cloud server 320 may receive the voice command input by the user in real time and recognize it using streaming voice recognition technology to obtain a voice text. In this process, the cloud server 320 does not need to wait for the user to finish speaking before beginning recognition, but gradually outputs the recognition result of the voice command as the user speaks. Meanwhile, the cloud server 320 may send the recognized voice text to the display device 200, so that the display device 200 can synchronously display the voice text of the voice command; in this way, the user can learn the voice receiving state of the display device 200 when performing voice control, and the user's waiting time is reduced.
When the user finishes speaking, i.e., completes the input of a voice command, there may be a period of time during which the far-field MIC 310 receives no voice input. The cloud server 320 may perform semantic recognition on the received voice command using natural language processing technology, determine a feedback instruction for the voice command, and return the feedback instruction to the display device 200, so that, upon receiving the feedback instruction, the display device 200 may perform the corresponding operation, for example displaying a target interface or setting a corresponding control item.
In order to improve the efficiency and accuracy of speech recognition, fig. 4 shows a flowchart of a speech processing method according to an embodiment of the present invention, where the speech processing method may be applied to the display device 200 shown in fig. 1, and the display device 200 may include a display, a receiver, and a controller coupled to the display and the receiver, respectively.
Wherein the display may be configured to display a user interface, the receiver may be configured to receive voice instructions input by a user, and the controller may analyze the voice instructions and control the display device to perform corresponding operations.
According to the voice processing method applied to the display device, when receiving a voice command input by the user, the display device can detect the pause duration of the voice command; when the pause duration is greater than the pause threshold, determine the tail point continuity detection result of the voice command according to the audio signal power of the voice command; and, when the tail point continuity detection result indicates that the tail point of the voice command is a non-stop tail point, send an integrity detection request to the server, so that the server detects the semantic integrity of the voice command using a pre-trained semantic integrity model, performs semantic analysis on the voice command when the voice command is a complete voice command, and determines the feedback instruction of the voice command based on the analysis result. The display device can then receive the feedback instruction sent by the server and control the display to display the target interface corresponding to the feedback instruction.
By the aid of the scheme, the display equipment can detect the pause time of the voice command in the process of receiving the voice command, and the display equipment can determine the tail point continuity detection result of the voice command according to the audio signal power of the voice command under the condition that the pause time is greater than the pause threshold value, so that preliminary judgment of the voice command integrity is achieved.
For a detailed description of the voice processing method provided by the embodiment of the present invention, referring to fig. 4, the controller may be configured to perform the following steps S410 to S440:
step S410: and detecting the pause time of the voice command when the voice command is received.
The voice command may be sound data acquired by the display device. When a user inputs sound data through a voice input function of the display device or an external control component of the display device, such as a remote controller, a microphone and the like, the controller can receive the sound data to obtain a voice command.
In some embodiments, the voice instructions may also be voice data acquired by other means. For example, when the user selects a default voice command provided by the trigger display device, the voice command is the default voice command. For another example, the voice command may be voice data recorded in advance by the user, voice data downloaded in advance from a network, or the like.
During the input of the voice command, the controller can continuously receive the voice command; at the same time, the controller can detect whether a pause in the user's speech occurs in the received voice command and, when such a pause occurs, its duration.
For example, the controller may analyze and process the received voice command. The controller may first preprocess the voice command, including removing noise and enhancing the voice signal; then divide the preprocessed voice signal into small audio frames, for example into a plurality of audio frames of a fixed length of 10ms each; and perform feature extraction on each audio frame, including extracting acoustic feature parameters such as the fundamental frequency, spectral envelope, and formants of the frame. After feature extraction is completed, pauses and their durations can be detected by analyzing changes in the feature parameters: for example, by calculating the energy difference or zero-crossing-rate difference between the last two adjacent frames, considering a pause to have occurred when the energy or zero-crossing rate falls below a preset threshold, and calculating the pause duration from the number of paused frames and the audio sampling rate.
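The framing-and-energy step above can be sketched as follows. This is a minimal illustration, assuming 10ms frames and an arbitrary energy threshold; it measures the trailing run of low-energy frames rather than performing full acoustic feature extraction:

```python
import numpy as np


def detect_pause_ms(signal: np.ndarray, sample_rate: int,
                    frame_ms: int = 10, energy_threshold: float = 1e-4) -> int:
    """Return the duration (ms) of the trailing run of low-energy frames."""
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(signal) // frame_len
    pause_frames = 0
    # Walk backwards from the most recent frame, counting silent frames.
    for i in range(n_frames - 1, -1, -1):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = float(np.mean(frame ** 2))  # short-time energy of the frame
        if energy < energy_threshold:
            pause_frames += 1
        else:
            break
    return pause_frames * frame_ms


# 0.5 s of a 200 Hz tone (stand-in for speech) followed by 0.3 s of silence, 16 kHz
rate = 16000
voiced = 0.1 * np.sin(2 * np.pi * 200 * np.arange(rate // 2) / rate)
silence = np.zeros(rate * 3 // 10)
print(detect_pause_ms(np.concatenate([voiced, silence]), rate))  # 300
```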
By detecting the pause duration of the voice command, the input state of the voice command can be monitored and the user's speaking rhythm determined, providing support for subsequently deciding whether the input of the voice command has been completed.
Step S420: and under the condition that the pause duration is longer than the pause threshold value, determining a tail point continuity detection result of the voice instruction according to the audio signal power of the voice instruction.
Wherein the pause threshold is determined based on acoustic feature parameters of the user. The audio signal power of a voice command refers to the energy transmitted by the voice signal per unit time.
To improve accuracy of integrity detection of voice instructions, in some embodiments, the stall threshold may include a first threshold and a second threshold, and the first threshold is less than or equal to the second threshold, and the controller may perform the following method:
and under the condition that the pause duration is smaller than the first threshold value, determining that the integrity detection result indicates that the voice instruction is an incomplete voice instruction.
And under the condition that the pause duration is greater than the second threshold value, determining that the integrity detection result indicates that the voice command is a complete voice command.
The first threshold is the time taken for the user to input one word when inputting a voice command, and generally ranges from 200ms to 300ms, for example 200ms or 250ms. When a pause occurs while the user is inputting a voice command and its duration is less than the first threshold, it is highly likely that the user has not finished inputting the voice command, so the integrity detection result of the voice command can be directly determined to indicate that the voice command is an incomplete voice command.
The second threshold is the silent waiting time used to detect whether the user has completed voice input. When a pause is detected while the user is inputting a voice command and its duration is greater than the second threshold, it can be determined that the user has finished inputting the voice command; therefore, in order to reduce the user's waiting time, the integrity detection result of the voice command can be directly determined to indicate that the voice command is a complete voice command.
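The two thresholds imply a three-way decision: below the first threshold the command is treated as incomplete, above the second it is treated as complete, and in between further checks (tail-point continuity and server-side semantic integrity) are needed. A hedged sketch, with illustrative threshold values:

```python
FIRST_THRESHOLD_MS = 250   # approx. time to input one word (200-300 ms per the text)
SECOND_THRESHOLD_MS = 600  # assumed silent waiting time; must be >= first threshold


def classify_pause(pause_ms: float) -> str:
    if pause_ms < FIRST_THRESHOLD_MS:
        return "incomplete"           # user almost certainly still speaking
    if pause_ms > SECOND_THRESHOLD_MS:
        return "complete"             # user has finished speaking
    return "needs-further-detection"  # fall through to tail-point / semantic checks


print(classify_pause(100))  # incomplete
print(classify_pause(700))  # complete
print(classify_pause(400))  # needs-further-detection
```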
Since the speech signal is an analog signal that varies with the waveform of the sound, in the above method both the first threshold and the second threshold can be determined based on acoustic feature parameters of the user, such as intonation, fundamental frequency, sound intensity, spectral envelope, formants, tones, and phonemes. The first threshold may be determined by extracting the acoustic feature parameters of the voice command, determining the gender and age of the user based on these parameters, and determining the first threshold based on the time taken by users of the same gender and age group to input one word. For example, the average time taken by users of the same gender and age group to input one word may be determined as the first threshold.
The second threshold can be determined by extracting the acoustic feature parameters of the voice command, determining the user's speech rate, volume, and the like from these parameters, and thereby determining the silent waiting time needed for the user to complete voice input, i.e., the second threshold. Since the first threshold is the time taken for the user to input one word, and the second threshold is the silent waiting time for the user to complete voice input, the second threshold is generally greater than or equal to the first threshold.
On display devices such as smart phones, televisions, and smart speakers, a user can wake up the voice control function of the display device by inputting a wake-up instruction. In order to improve the detection efficiency of the voice command, the controller may analyze the vocal characteristics of the user during the stage in which the user inputs the wake-up instruction, so as to determine detection parameters such as the first threshold and the second threshold.
Thus, for example, referring to FIG. 5, the controller may also perform the following method:
step S510: and before receiving the voice instruction, receiving a wake-up instruction input by a user.
When the user wakes up the voice control function of the display device by inputting the wake-up instruction, the controller may receive the wake-up instruction input by the user. The display device does not respond to the user's voice until the user inputs a wake-up instruction.
Step S520: and extracting the voiceprint feature vector of the wake-up instruction, matching the voiceprint feature vector with the voiceprint feature vector of a preset voiceprint feature vector library, and determining acoustic feature parameters corresponding to the wake-up instruction.
The acoustic feature parameter may include at least one of a speech rate parameter, a sound intensity parameter, and a tone parameter. The speech rate parameter refers to how quickly the user speaks when inputting a voice command, for example the number of sounds or syllables produced by the user per unit time. The sound intensity parameter refers to an energy parameter of the sound, identifying its loudness or energy. The tone parameter is a feature parameter describing the pitch of the sound. The preset voiceprint feature vector library is generated from historical wake-up instructions and historical voice commands of a plurality of users, and its voiceprint feature vectors have a mapping relationship with acoustic feature parameters, such as speech rate parameters, sound intensity parameters, and tone parameters.
When a user inputs a wake-up instruction, the controller can analyze it, for example by extracting the voiceprint feature vector corresponding to the wake-up instruction, matching that vector against the preset voiceprint feature vector library to determine the matched feature vector, and then determining the acoustic feature parameters corresponding to the wake-up instruction according to the mapping relationship between the matched feature vector and the acoustic feature parameters.
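One plausible way to implement the matching step is nearest-neighbor lookup by cosine similarity over the preset vector library. The library contents, vector dimensionality, and parameter names below are invented for illustration; the patent does not specify the similarity measure:

```python
import numpy as np

# Hypothetical preset library: voiceprint feature vector -> acoustic parameters.
LIBRARY = [
    (np.array([0.9, 0.1, 0.3]), {"speech_rate_ms_per_word": 240, "intensity": 0.7}),
    (np.array([0.2, 0.8, 0.5]), {"speech_rate_ms_per_word": 300, "intensity": 0.5}),
]


def match_voiceprint(vec: np.ndarray) -> dict:
    """Return the acoustic parameters mapped to the closest library vector."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    best_vec, best_params = max(LIBRARY, key=lambda entry: cosine(vec, entry[0]))
    return best_params


# A wake-word voiceprint close to the first library entry.
print(match_voiceprint(np.array([0.85, 0.15, 0.25])))
# {'speech_rate_ms_per_word': 240, 'intensity': 0.7}
```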
Step S530: and determining a first threshold based on the acoustic characteristic parameters corresponding to the wake-up instruction.
After determining the acoustic feature parameters corresponding to the wake-up instruction, such as the speech rate parameter, sound intensity parameter, and tone parameter, the controller may analyze the user's vocal characteristics accordingly to determine the first threshold. For example, the controller may calculate the average time taken by the user to input one word based on the speech rate parameter, so as to determine the first threshold.
Step S540: and adjusting the first threshold according to the acoustic characteristic parameter library corresponding to the user to obtain a second threshold.
The acoustic feature parameter library corresponding to the user may be an acoustic feature parameter library extracted according to a historical voice instruction and/or a historical wake instruction of the user, for example, a speech speed parameter sequence, a sound intensity parameter sequence, a tone parameter sequence, and the like of the user may be determined according to the historical voice instruction of the user, so as to obtain the acoustic feature parameter library corresponding to the user.
For example, referring to fig. 6, by collecting voice instructions input by a plurality of similar users at historical moments, historical voice instructions of the plurality of users are obtained, the historical voice instructions of the users are uploaded to a cloud server, mel characteristic analysis is performed by the cloud server, voiceprint feature vectors of the historical voice instructions are extracted, and the voiceprint feature vectors are matched with a preset feature vector database, so that acoustic feature parameters corresponding to the historical voice instructions, such as speech speed parameters, voice intensity parameters and the like, are obtained.
Then, when the user speaks the wake-up word, the controller can acquire the wake-up instruction, extract the voiceprint feature vector of the wake-up instruction, and determine the speech speed parameter of the user according to the voiceprint feature vector. Finally, the controller may adjust the first threshold according to the speech rate parameter of the user to determine the second threshold.
For example, the controller may, according to the acoustic feature parameter library corresponding to the user, round the first threshold up to the nearest value in the user's speech rate parameter sequence, and determine that value as the second threshold. For instance, assuming the first threshold is 245ms and the speech rate parameter sequence is {230, 240, 260, 270}, rounding 245ms up within the sequence yields 260ms, i.e., the second threshold is 260ms.
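The "round up within the sequence" step amounts to finding the smallest sequence value that is at least the first threshold. A sketch using the numbers from the example above (the fallback when no sequence value qualifies is an assumption):

```python
import bisect


def second_threshold(first_threshold: int, rate_sequence: list[int]) -> int:
    """Smallest value in the sorted sequence >= first_threshold."""
    i = bisect.bisect_left(rate_sequence, first_threshold)
    if i == len(rate_sequence):     # no value >= threshold: assumed fallback
        return first_threshold
    return rate_sequence[i]


print(second_threshold(245, [230, 240, 260, 270]))  # 260
```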
By the above method, the controller can determine the threshold parameters for detecting the integrity of the voice command during the stage in which the user inputs the wake-up instruction; when subsequently judging the relationship between the pause duration and the pause threshold, the controller can directly judge, according to these threshold parameters, whether the pause duration is greater than the pause threshold. Because the threshold parameters are determined from the acoustic feature parameters of the user, they can adapt to the user's voice characteristics, improving the accuracy of detecting the integrity of the voice command.
The inventors found that, during voice interaction, the acoustic characteristics of a voice command whose input breaks off at a break point often differ from those of its other parts: a voice command input normally by a user is relatively coherent, whereas the input pauses temporarily when the user enters a thinking state or suddenly forgets a word. Fig. 7 shows a spectrogram of a voice command provided by an embodiment of the invention. As shown in fig. 7, when a user normally inputs the voice command "I want to see today's news", the spectral data of the voice command is relatively consistent, i.e., the intervals between words are relatively uniform and short; whereas when the user inputs "I want to see" and then, after a pause, inputs "today's news", the spectral data between "I want to see" and "today's news" shows a blank for a period of time, i.e., the spectral data corresponding to the voice command is discontinuous.
Based on this, during the user's input of the voice command, if the pause duration is greater than the pause threshold, the user may have finished inputting the voice command. To determine whether the user has actually finished, the controller may analyze the tail point of the voice command according to the audio signal power of the voice command, for example by detecting the tail point continuity of the voice command, i.e., determining whether the power change at the tail point of the voice command over a period of time is continuous or discontinuous, so as to obtain the tail point continuity detection result of the voice command. The controller may determine the audio signal power of the voice command by calculating the amplitude of the voice signal, and determine, from the magnitude of the change in the audio signal power, whether the power change at the tail point over the following period of time is continuous or discontinuous.
Illustratively, referring to FIG. 8, the controller may determine the tail point continuity detection result of the voice command by:
step S810: and calculating the audio signal power increment between the last two adjacent audio signal frames according to the audio signal power of the voice command.
The audio signal frames of the voice command may be divided and determined according to a preset frame number, for example, every 5 signal frames may be divided into a group to obtain a plurality of audio signal frames.
During reception of a voice command, the controller continuously receives the command, and when the detected pause duration exceeds the pause threshold, it begins to calculate the audio signal power change at the tail point of the command, for example the audio signal power increment of the last two audio signal frames. For example, referring to fig. 9, the controller may initiate power detection using a dual-threshold method, i.e. set two thresholds that define a target interval and detect the audio signal power of the voice command within that interval. When the audio signal energy of the voice command is determined to have attenuated into the tail end, an audio signal power envelope calculation is triggered: the total power of each audio signal frame is calculated in the frequency domain, the maximum and minimum audio signal power values of the last two audio signal frames are determined, and the difference between the maximum and minimum values is taken as the audio signal power increment between the two frames.
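The envelope and increment calculation of step S810 can be sketched as follows. This is a minimal illustration assuming frames arrive as NumPy sample arrays; the function names (`frame_power`, `tail_power_increment`) are hypothetical, not part of the invention, and the time-domain sum of squares is used in place of the frequency-domain total power (by Parseval's theorem the two are proportional, which suffices for comparing adjacent frames):

```python
import numpy as np

def frame_power(frame: np.ndarray) -> float:
    # Total power of one audio frame: sum of squared samples.
    return float(np.sum(frame.astype(np.float64) ** 2))

def tail_power_increment(frames: list) -> float:
    # Power increment between the last two adjacent frames:
    # the difference between the larger and the smaller frame power.
    p_prev = frame_power(frames[-2])
    p_last = frame_power(frames[-1])
    return max(p_prev, p_last) - min(p_prev, p_last)
```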
Step S820: and determining that the tail point continuity detection result indicates that the tail point of the voice command is discontinuous under the condition that the power increment of the audio signal is larger than a first increment threshold value.
The first increment threshold may be customized by an operator according to a requirement, or may be adaptively set according to a magnitude of a power change of an audio signal of a user, or the like.
When the audio signal power increment is greater than the first increment threshold, the amplitude of the power change between the last two audio signal frames is large, so the tail point continuity detection result of the voice command can be determined to indicate that the tail point is discontinuous, and the user may be in a normal speaking state. As shown in fig. 9, if the audio signal power increment between the last two adjacent audio signal frames is greater than the first increment threshold, the power at the tail point of the voice command drops steeply; the tail point continuity detection result then indicates that the tail point of the voice command is discontinuous, i.e. the tail point is a non-stop tail point.
Step S830: and under the condition that the power increment of the audio signal is smaller than a second increment threshold value, determining that the tail point continuity detection result indicates the tail point continuity of the voice command.
Wherein the second delta threshold is less than or equal to the first delta threshold. Correspondingly, the second increment threshold can be customized by an operator according to the requirement, and can be adaptively set according to the amplitude of the audio signal power change of the user and the like.
When the audio signal power increment is smaller than the second increment threshold, the amplitude of the power change between the last two audio signal frames is small, so the tail point continuity detection result can be determined to indicate that the tail point of the voice command is continuous, and the user may be in a waiting or thinking state. As shown in fig. 9, if the audio signal power increment between the last two adjacent audio signal frames is smaller than the second increment threshold, the power at the tail point of the voice command decreases gradually; the tail point continuity detection result then indicates that the tail point of the voice command is continuous, i.e. the tail point is a long-tail pause tail point.
By the above method, the controller can determine the tail point continuity of a voice command from the change in its audio signal power and thereby analyze the integrity of the command; the process consumes very little time, so the detection speed is high.
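Taken together, steps S810 to S830 amount to a small classifier over the power increment. A hedged sketch follows; the parameter names are illustrative, the threshold values would be tuned as described in the embodiments above, and the behavior between the two thresholds is left open by the embodiment, so it is returned here as undetermined:

```python
def tail_point_continuity(power_increment: float,
                          first_threshold: float,
                          second_threshold: float) -> str:
    # Requires second_threshold <= first_threshold (step S830).
    if power_increment > first_threshold:
        # Steep power drop: tail point discontinuous (non-stop tail point).
        return "discontinuous"
    if power_increment < second_threshold:
        # Gradual power decay: tail point continuous (long-tail pause).
        return "continuous"
    # Between the two thresholds the embodiment leaves the result open.
    return "undetermined"
```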
In some embodiments, the controller may perform the following method:
And determining that the integrity detection result indicates that the voice command is an incomplete voice command under the condition that the tail point continuity detection result indicates that the tail point of the voice command is discontinuous.
In the case where it is determined that the end point continuity detection result indicates that the end point of the voice instruction is continuous, the step of transmitting the integrity detection request to the server is performed.
When the tail point continuity detection result of the voice command indicates that the tail point is discontinuous, the audio signal power at the tail point drops abruptly, the tail point is a non-stop tail point, and the integrity detection result indicates that the voice command is an incomplete voice command. In contrast, when the tail point continuity detection result indicates that the tail point is continuous, the audio signal power at the tail point decreases slowly, the tail point is a long-tail pause tail point, and the user may have entered a "thinking" stage. In this case, to further determine whether input of the voice command has finished, the controller may execute step S430 to send an integrity detection request to the server, so that the server detects the integrity of the voice command.
Fig. 10 shows a spectrum example of audio signal power provided by the embodiment of the present invention. As shown in fig. 10, when a user speaks the voice command "i want to see the catwalk world cup", the command presents two different spectrum envelopes (the first example and the second example) depending on the user's speaking speed, the intervals between words, and so on; when the user speaks the word "see", the audio signal power of "see" decreases slowly in the first example and abruptly in the second example. According to the method shown in steps S810 to S830, the tail of "see" can be determined to be a long-tail pause tail point in the first example and a non-stop tail point in the second example.
By this method, the controller can determine whether a voice command is complete from its tail point continuity detection result. Once the command is determined to be an incomplete voice command, it can simply continue to be received without semantic parsing by the server; since this detection process consumes little time, the detection efficiency of voice commands can be improved.
Step S430: and sending an integrity detection request to the server under the condition that the tail point continuity detection result indicates that the tail point of the voice command is a non-stop tail point, so that the server adopts a pre-trained semantic integrity model to detect the semantic integrity of the voice command, and carrying out semantic analysis on the voice command under the condition that the voice command is a complete voice command, and determining a feedback command of the voice command based on an analysis result.
The integrity detection request is a request for detecting the semantic integrity of the voice command, and may include information such as identification of the voice command and a request timestamp. The semantic integrity model is a model for detecting whether a voice instruction is semantically complete and reasonable, and can detect grammar correctness, logic consistency, information accuracy, context consistency and the like of the voice instruction. Semantic parsing is the conversion of voice instructions into a form that can be interpreted by a machine.
The tail points of a voice command can be divided into pause tail points and non-pause tail points according to tail point continuity. When using voice commands, people often leave short pauses between different commands; these include, for example, the long-tail pause tail point. If the tail point of a voice command is a pause tail point, the tail point belongs to a pause in the normal voice input process, and input of the command may not have finished. A non-stop tail point refers to a position in the voice command that does not belong to such a pause; if the tail point of a voice command is a non-stop tail point, input of the command may have finished.
When the tail point continuity detection result indicates that the tail point of the voice command is a non-stop tail point, in order to further determine the integrity of the voice command, the controller may send an integrity detection request to the server, so that the server detects the integrity of the voice command by adopting a pre-trained semantic integrity model, determines whether the semantics of the voice command are complete, and when the voice command is determined to be the complete voice command, may determine an analysis result and a feedback command corresponding to the analysis result based on the semantic analysis of the voice command.
By the method, when the display equipment cannot determine whether the voice command is input, a detection request is sent to the server, so that the integrity of the voice command is detected through the server, and the detection accuracy of the voice command can be improved.
In some embodiments, the display device may further perform the following method:
and when the integrity detection result is determined to indicate that the voice command is an incomplete voice command, receiving a continuation command of the voice command, and determining the integrity detection result of the voice command according to the continuation command.
Wherein the continuation instruction is a subsequent instruction to the received voice instruction. Since the integrity detection result indicates that the voice command is an incomplete voice command, which indicates that the user may not have completed inputting the voice command, the controller may continue to receive a continuation command of the voice command and detect whether the voice command is complete again according to the received continuation command and the voice command.
Step S440: and receiving a feedback instruction sent by the server, and controlling the display to display a target interface corresponding to the feedback instruction.
When receiving the feedback instruction sent by the server, the controller can control the display device to execute the feedback instruction so as to display a target interface corresponding to the feedback instruction on the display. For example, when receiving a feedback instruction sent by the server, the controller may determine an operation indicated by the feedback instruction, control the display to open or switch to a target interface corresponding to the feedback instruction, and display corresponding content in the target interface.
According to the voice processing method applied to the display device, the display device can detect the pause time of the voice command in the process of receiving the voice command, and under the condition that the pause time is larger than the pause threshold value, the tail point continuity detection result of the voice command is determined according to the audio signal power of the voice command, so that preliminary judgment of the voice command integrity is realized.
Fig. 11 is a flowchart illustrating another voice processing method according to an embodiment of the present invention, which may be applied to the server 400 shown in fig. 1, and the server 400 may include a controller.
The controller can analyze the voice command and control the display device to execute corresponding operation. For example, the controller may receive the voice command, perform text conversion on the voice command, and return the obtained voice text to the display device, so that the voice text corresponding to the voice command is synchronously displayed.
According to the voice processing method applied to the server, when receiving the integrity detection request sent by the display device, the server can respond to the integrity detection request, detect the semantic integrity of the voice command by adopting a pre-trained semantic integrity model, and under the condition that the voice command is a complete voice command, perform semantic analysis on the voice command, determine a feedback command of the voice command based on the analysis result, and send the feedback command to the display device so that the display device controls the display to display a target interface corresponding to the feedback command.
By applying the scheme, the man-machine interaction function based on voice can be realized, the semantic integrity of a voice instruction is detected by adopting a pre-trained semantic integrity model, and the accuracy and reliability of voice detection can be improved.
As shown in fig. 11, the controller is configured to perform the following steps S1110 to S1140:
step S1110: and receiving an integrity detection request sent by the display equipment.
The integrity detection request is a request for detecting the semantic integrity of the voice command, and may include information such as identification of the voice command and a request timestamp.
Step S1120: in response to the integrity detection request, the semantic integrity of the voice instruction is detected by adopting a pre-trained semantic integrity model.
The integrity detection request is generated by the display device when, during reception of a voice command, the detected pause duration of the command exceeds the pause threshold and the tail point continuity detection result, determined according to the audio signal power of the command, indicates that the tail point of the voice command is a non-pause tail point. The pause threshold is determined according to acoustic characteristic parameters of the user.
The semantic integrity model refers to a classification model for detecting whether the semantic information of a voice text is complete or ambiguous, and may be any semantic analysis model, such as a recurrent neural network detection model or a GPT (Generative Pre-trained Transformer) detection model.
When the display device detects that the pause duration of the voice command exceeds the pause threshold, it can determine the tail point continuity detection result of the command according to its audio signal power. When that result indicates that the tail point of the voice command is a non-pause tail point, the voice command is "suspected" to be complete, and to determine whether input has actually finished, the display device can send an integrity detection request to the server.
When the server receives the integrity detection request, a pre-trained semantic integrity model can be adopted to detect the semantic integrity of the voice instruction. Because the semantic integrity model is trained by a large amount of voice text data, the semantic integrity of the voice instruction can be determined by utilizing the pre-trained semantic integrity model to carry out semantic integrity detection on the voice text, and higher detection accuracy can be provided.
In some embodiments, referring to fig. 12, the controller of the server may also perform any one or more of the following methods:
step S1210: and matching the voice command with the voice command in the complete voice command library, and determining that the voice command is the complete voice command according to the integrity detection result under the condition that the voice command is matched with any voice command in the complete voice command library.
Step S1220: and matching the voice command with the voice command in the incomplete voice command library, and determining that the voice command is the incomplete voice command according to the integrity detection result under the condition that the voice command is matched with any voice command in the incomplete voice command library.
The complete voice command library is a command library formed from complete voice commands, and the incomplete voice command library is a command library formed from incomplete voice commands. Both libraries can be built from the user's historical voice commands: the historical commands are collected and, according to the integrity of each, the complete ones are placed in the complete voice command library and the incomplete ones in the incomplete voice command library. Fig. 13 is a schematic diagram of a voice command library according to an embodiment of the present invention. As shown in fig. 13, the complete voice command library may include an interface word database, such as "turn off", "turn off the tv", "return", etc., and the incomplete voice command library may include "i want to see", "please play", etc.
In order to determine whether the semantics of the voice command are complete, the server may match the voice command with the voice commands in the complete voice command library, and in the case that it is determined that the voice command matches any voice command in the complete voice command library, it is determined that the integrity detection result of the voice command indicates that the voice command is a complete voice command. Meanwhile, the server can also match the voice command with the voice command in the incomplete voice command library, and under the condition that the voice command is matched with any voice command in the incomplete voice command library, the integrity detection result of the voice command is determined to indicate that the voice command is the incomplete voice command.
By the method, the voice command can be quickly matched by using the command library, and whether the voice command is complete or not can be determined. When the integrity of the voice command is determined through the command library, the voice command does not need to be detected through a pre-trained semantic integrity model, so that the detection efficiency of the voice command can be improved.
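A minimal sketch of the two-library lookup follows. The library contents are the examples from fig. 13; exact string matching is an assumption made here for brevity (a production system might normalize or fuzzy-match), and the function name is hypothetical:

```python
COMPLETE_LIBRARY = {"turn off", "turn off the tv", "return"}
INCOMPLETE_LIBRARY = {"i want to see", "please play"}

def library_integrity_check(voice_text: str):
    # Returns "complete", "incomplete", or None when neither library
    # matches; on None, the semantic integrity model is consulted.
    text = voice_text.strip().lower()
    if text in COMPLETE_LIBRARY:
        return "complete"
    if text in INCOMPLETE_LIBRARY:
        return "incomplete"
    return None
```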
In some embodiments, the controller of the server may also perform the following method:
training the initial semantic integrity model by using a training data set to generate the pre-trained semantic integrity model.
Wherein the initial semantic integrity model is an untrained semantic integrity model. The training dataset is a dataset for training an initial semantic integrity model.
In order to improve the performance of the semantic integrity model, the server may train the initial semantic integrity model with the training data set in advance, and generate the pre-trained semantic integrity model once the detection performance of the initial model is good enough, e.g. its detection accuracy exceeds an accuracy threshold.
In some embodiments, referring to fig. 14, the controller of the server may also perform the following method:
step S1410: and acquiring a real voice instruction set, and segmenting the high-frequency voice instruction in the real voice instruction set to obtain a plurality of sub-voice instructions.
The real voice instruction set can be a real voice instruction data set of the user acquired through the display equipment side.
According to the user use frequency of each voice instruction in the real voice instruction set, the high-frequency voice instructions can be screened out. The server may then split the high frequency voice command, for example, as shown with reference to fig. 15, the high frequency voice command "i want to watch the movie of Liu Xiaoming" may be split into "i want to watch" and "Liu Xiaoming movies".
Step S1420: and analyzing each sub-voice instruction by adopting a semantic recognition model, and determining a semantic recognition result corresponding to each sub-voice instruction.
The semantic recognition model is a natural language processing model, can effectively recognize the meaning contained in a voice instruction, and can be used for deeply knowing sentences and the content in the sentences and analyzing the real intention of a user.
And analyzing each sub-voice instruction by utilizing the semantic recognition model, and determining the user intention corresponding to each sub-voice instruction, namely a semantic recognition result.
Step S1430: when the semantic recognition result corresponding to any one of the sub-voice instructions is different from the semantic recognition result of the high-frequency voice instruction, marking any one of the sub-voice instructions as a first category, wherein the first category is used for indicating that any one of the sub-voice instructions is an incomplete voice instruction.
When the semantic recognition result corresponding to any one of the sub-voice instructions is different from the semantic recognition result of the high-frequency voice instruction, the meaning that the any one of the sub-voice instructions does not include an explicit intention direction compared with the high-frequency voice instruction is described, so that the any one of the sub-voice instructions can be marked as the first category.
As shown in fig. 15, the semantic recognition result of the sub-voice instruction "i want to see" is different from the semantic recognition result of "i want to see the movie of Liu Xiaoming", so the sub-voice instruction "i want to see" is labeled as the first category.
Step S1440: when the semantic recognition result corresponding to any one of the sub-voice commands is the same as the semantic recognition result of the high-frequency voice command, marking any one of the sub-voice commands as a second class, wherein the second class is used for indicating that any one of the sub-voice commands is a complete voice command.
When the semantic recognition result corresponding to any one of the sub-voice instructions is the same as the semantic recognition result of the high-frequency voice instruction, the meaning of the any one of the sub-voice instructions is clear compared with the high-frequency voice instruction, so that the any one of the sub-voice instructions can be marked as a second category.
As shown in fig. 15, the semantic recognition result of the sub-voice instruction "movie Liu Xiaoming" is the same as the semantic recognition result of "movie i want to see Liu Xiaoming", so the sub-voice instruction "movie Liu Xiaoming" is labeled as the second category.
Further, in some embodiments, as shown in fig. 15, after labeling of each sub-voice command is completed, sub-voice commands with a higher frequency of occurrence, for example, a frequency of occurrence greater than 500 times/week, may be used as the training data set according to the frequency of occurrence of each sub-voice command.
By the method, the training data set can be generated according to the real instruction data set, and the labels of the voice instructions in the training data set are set, so that the effectiveness of the training data set and the accuracy of the semantic integrity model can be improved.
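The labeling rule of steps S1410 to S1440 can be sketched as follows, assuming `recognize_intent` is some semantic recognition model exposed as a callable; it is a placeholder, not a real API:

```python
def label_sub_instructions(full_command: str,
                           sub_commands: list,
                           recognize_intent) -> list:
    # A sub-instruction whose recognized intent matches that of the full
    # high-frequency command is labeled class 2 (complete, step S1440);
    # one whose intent differs is labeled class 1 (incomplete, S1430).
    full_intent = recognize_intent(full_command)
    return [
        (sub, 2 if recognize_intent(sub) == full_intent else 1)
        for sub in sub_commands
    ]
```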
Step S1130: under the condition that the voice command is a complete voice command, carrying out semantic analysis on the voice command, and determining a feedback command of the voice command based on an analysis result.
For example, as shown in fig. 16, when the voice command is determined to be a complete voice command, it is explained that the voice command has a complete semantic meaning, so the server may perform semantic parsing on the voice command, determine a parsing result of the voice command, and determine a feedback command of the voice command according to the parsing result.
In contrast, when it is determined that the voice command is an incomplete voice command, the server may send the integrity detection result to the display device, so that the display device may continue to receive a continuation command of the voice command input by the user until the voice command reception is completed, and continue to execute step S1130.
Step S1140: and sending a feedback instruction to the display equipment so that the display equipment controls the display to display a target interface corresponding to the feedback instruction.
The target interface is an interface indicated by the feedback instruction.
After determining the feedback instruction of the voice instruction, the server may send the feedback instruction to the display device, so that the display device may execute the feedback instruction, for example, display a target interface corresponding to the feedback instruction.
In some embodiments, as shown in fig. 17, when the server detects the integrity of the voice command, in order to increase the detection speed, the following method may be performed:
step S1710: and matching the voice command with the voice command in the complete voice command library, determining that the matching is successful when the voice command is determined to be matched with any voice command in the complete voice command library, executing step S1750, and determining that the matching is failed when each voice command in the complete voice command library is determined to be not matched with the voice command, executing step S1720.
Step S1720: match the voice command against the voice commands in the incomplete voice command library; when the voice command matches any voice command in the incomplete voice command library, the matching is successful and step S1750 is executed, and when no voice command in the incomplete voice command library matches the voice command, the matching fails and step S1730 is executed.
Step S1730: and detecting the semantic integrity of the voice command by adopting a pre-trained semantic integrity model, and determining the integrity detection result of the voice command. If the integrity detection result of the voice command indicates that the voice command is a complete voice command, step S1740 is performed. If the integrity detection result of the voice command indicates that the voice command is an incomplete voice command, a waiting command is sent to the display device, so that the display device can continuously receive the voice command.
If the integrity detection result of the voice command indicates that the voice command is an incomplete voice command, the user may not have finished speaking, so the server may execute step S1750 to send a waiting instruction to the display device. The display device then waits for the user to continue inputting the voice command until the voice signal ends, e.g. until no further voice input is received from the user for a long period, after which the steps shown in fig. 17 are executed again.
Step S1740: and determining a feedback instruction of the voice instruction, and sending the feedback instruction to the display equipment so that the display equipment controls the display to display a target interface corresponding to the feedback instruction.
Step S1750: and sending a waiting instruction to the display device so that the display device waits for the user to continuously input the voice instruction until the voice signal is ended.
In summary, according to the voice processing method applied to the server provided in this embodiment, when the display device determines that the voice command may be a complete voice command, a pre-trained semantic integrity model is adopted to perform semantic integrity detection on the voice command, and when the voice command is the complete voice command, a feedback command of the voice command is determined based on a semantic analysis result of the voice command, and the feedback command is returned to the display device, so that the display device can execute an operation indicated by the feedback command, a man-machine interaction function based on voice can be realized, and by detecting the semantic integrity of the voice command by adopting the pre-trained semantic integrity model, the accuracy and reliability of voice detection can be improved.
Fig. 18 shows a flowchart of yet another voice processing method according to an embodiment of the present invention, as shown in fig. 18, may include the following steps:
step S1801: the display device receives a voice instruction input by a user.
The voice command input by the user is received by the display device, and simultaneously the voice command can be synchronously sent to the server, so that the server can recognize the voice text of the voice command and return the voice text to the display device, so that the display device synchronously displays the voice text of the voice command, and the user can conveniently learn the voice receiving state of the display device.
Step S1802: the display device performs signal processing on the voice command and determines the pause time of the voice command.
For example, the display device may preprocess the received voice command, such as removing noise and enhancing the voice signal, and then perform feature extraction on the preprocessed signal, including extracting acoustic characteristic parameters such as the fundamental frequency, spectral envelope, and formants. The energy difference or zero-crossing-rate difference between the last two adjacent frames of the voice signal is then calculated; when the energy or zero-crossing rate falls below a preset threshold, a pause is considered to have occurred, and the pause duration is calculated from the number of paused frames and the sampling rate of the audio frames.
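The energy-based half of that pause-duration calculation might look like the following minimal sketch, assuming fixed-length frames of raw samples; the function name, frame length, and threshold are illustrative assumptions:

```python
import numpy as np

def pause_duration_seconds(frames, sample_rate: int,
                           frame_len: int,
                           energy_threshold: float) -> float:
    # Count trailing frames whose energy falls below the threshold
    # (treated as paused) and convert the count to seconds.
    n_pause = 0
    for frame in reversed(frames):
        if np.sum(frame.astype(np.float64) ** 2) < energy_threshold:
            n_pause += 1
        else:
            break
    return n_pause * frame_len / sample_rate
```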
Step S1803: the display device determines whether the pause duration of the voice instruction is less than a pause threshold. If yes, step S1804 is performed, and if not, step S1805 is performed.
When the display device determines that the pause duration of the voice command is less than the pause threshold, the user has very likely not finished inputting the voice command, so step S1804 may be executed, in which the integrity detection result of the voice command is determined to indicate that the voice command is an incomplete voice command.
Step S1804: the display device determines that the integrity detection result of the voice instruction indicates that the voice instruction is an incomplete voice instruction.
When the voice command is determined to be incomplete, the display device continues to receive the user's voice input and keeps sending it to the server for speech recognition; the server recognizes the corresponding voice text and returns it, so the display device can display the text synchronously. At this stage, however, the server does not yet perform integrity detection on the voice command.
Step S1805: the display device determines the tail point continuity detection result of the voice command according to the audio signal power of the voice command.
For example, the display device may compute the audio signal power increment between the last two adjacent audio frames from the audio signal power of the voice command. If the power increment is greater than a first increment threshold, the tail point continuity detection result indicates that the tail point of the voice command is discontinuous, i.e. a non-stop tail point; if the power increment is less than a second increment threshold, the result indicates that the tail point is continuous, i.e. a stop tail point. The second increment threshold is less than or equal to the first increment threshold.
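The two-threshold tail-point check can be sketched as below. The thresholds, the return convention, and the handling of the in-between band (neither threshold crossed) are illustrative assumptions; the patent does not specify what happens between the two thresholds.

```python
from typing import List, Optional

def tail_point_is_non_stop(frame_powers: List[float],
                           first_threshold: float,
                           second_threshold: float) -> Optional[bool]:
    """Classify the tail point of an utterance from per-frame powers.

    Returns True (non-stop tail point) when the power increment between
    the last two frames exceeds `first_threshold`, False (stop tail
    point) when it falls below `second_threshold`, and None when it lies
    between the two (undecided). Requires second_threshold <= first_threshold.
    """
    increment = frame_powers[-1] - frame_powers[-2]
    if increment > first_threshold:
        return True   # power still rising: the user likely has not stopped
    if increment < second_threshold:
        return False  # power settled: likely a genuine stop
    return None       # between the thresholds: no decision at this stage
```

With this convention, only a True result (non-stop tail point) triggers the integrity detection request of step S1806.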
Step S1806: when the tail point continuity detection result indicates that the tail point of the voice command is a non-stop tail point, the display device sends an integrity detection request to the server, so that the server detects the semantic integrity of the voice command with a pre-trained semantic integrity model and determines the integrity detection result of the voice command.
A non-stop tail point means the voice command is only suspected to be complete. To further confirm its integrity, the display device may send an integrity detection request to the server, which then detects the integrity of the voice command with the pre-trained semantic integrity model and determines the integrity detection result.
The integrity detection result may indicate that the voice command is any one of a complete voice command, an incomplete voice command, or a fuzzy voice command.
Step S1807: when the integrity detection result indicates that the voice command is a complete voice command, the server performs semantic analysis on the voice command and determines a feedback instruction for the command based on the analysis result.
When the voice command is determined to be a complete voice command, the server can perform semantic analysis on it with a semantic recognition model to determine the user's operation intention, and then generate a feedback instruction for the voice command from the analysis result.
Step S1808: the server sends a feedback instruction to the display device to enable the display device to control the display to execute the operation indicated by the feedback instruction.
For example, the feedback instruction may instruct the display device to display a corresponding target interface, to set a control parameter such as volume or brightness, or to perform an action such as playing a song or carrying out a voice interaction.
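On the device side, handling such a feedback instruction amounts to a dispatch on its type. The instruction fields ("action", "target_interface", etc.) and the device interface below are illustrative assumptions, not the patent's actual protocol.

```python
def handle_feedback(instruction: dict, tv) -> None:
    """Dispatch a server feedback instruction to the matching device action.

    `instruction` is assumed to carry an "action" field plus parameters;
    `tv` stands in for the display device's control interface.
    """
    action = instruction.get("action")
    if action == "show_interface":
        tv.show(instruction["target_interface"])
    elif action == "set_parameter":          # e.g. volume, brightness
        tv.set_parameter(instruction["name"], instruction["value"])
    elif action == "play_media":
        tv.play(instruction["media_id"])
    else:
        tv.show("voice_interaction")         # fall back to a dialog screen
```

A server reply such as `{"action": "set_parameter", "name": "volume", "value": 30}` would then result in a single volume-setting call on the device.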
In fact, the voice processing method provided by this embodiment realizes voice command integrity detection in three stages, as shown in fig. 19. In the first stage, the display device performs a preliminary integrity check using a pause threshold determined from the user's voiceprint feature parameters. If the command is only suspected to be complete, the display device detects the tail point continuity of the command in the second stage. If the command is still only suspected to be complete according to that result, the server performs semantic integrity detection with the pre-trained semantic integrity model in the third stage. If the command is judged incomplete at any stage, the subsequent stages need not be executed. This improves both the efficiency and the accuracy of integrity detection, and the detection chain as a whole improves the user's voice interaction experience with the display device.
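The three-stage cascade can be sketched as a short-circuiting pipeline. The stage functions and the string verdicts below are placeholders standing in for the pause check, tail-point check, and server-side semantic model described above; they are assumptions for illustration only.

```python
def detect_integrity(command, stages):
    """Run the staged integrity check, stopping at the first rejection.

    Every stage except the last is a cheap on-device predicate that may
    reject the command as incomplete; only the final (server-side
    semantic) stage returns the actual verdict string.
    """
    *local_stages, final_stage = stages
    for stage in local_stages:
        if not stage(command):       # failed a local check: stop early,
            return "incomplete"      # later (more expensive) stages skipped
    return final_stage(command)      # "complete" or "incomplete"
```

The early-exit structure is what yields the efficiency gain: the expensive server round-trip of stage three only happens for commands that survive both on-device checks.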
An embodiment of the present invention provides a computer readable storage medium storing at least one executable instruction that, when executed by a processor, implements a speech processing method in any of the above-described method embodiments.
In this embodiment, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. In addition, embodiments of the present invention are not directed to any particular programming language.
In the description provided herein, numerous specific details are set forth. It will be appreciated, however, that embodiments of the invention may be practiced without these specific details. Similarly, in the above description of exemplary embodiments, various features are sometimes grouped together in a single embodiment, figure, or description thereof to streamline the disclosure and aid understanding of one or more of the inventive aspects. The claims following this detailed description are hereby expressly incorporated into it, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the apparatus of an embodiment may be adaptively changed and arranged in one or more apparatuses different from that embodiment. The modules, units, or components of the embodiments may be combined into one module, unit, or component, and they may likewise be divided into a plurality of sub-modules, sub-units, or sub-components, except where at least some of such features and/or processes or elements are mutually exclusive.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not denote any order; these words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specifically stated.

Claims (13)

1. A display device, the display device comprising:
a display configured to display a user interface;
a receiver configured to receive a voice instruction input by a user;
a controller coupled to the display and the receiver, respectively, the controller configured to:
detecting the pause time of the voice command when the voice command is received;
determining a tail point continuity detection result of the voice instruction according to the audio signal power of the voice instruction under the condition that the pause duration is greater than a pause threshold value, wherein the pause threshold value is determined according to the acoustic characteristic parameters of the user;
sending an integrity detection request to a server under the condition that the tail point continuity detection result indicates that the tail point of the voice command is a non-stop tail point, so that the server adopts a pre-trained semantic integrity model to detect the semantic integrity of the voice command, carrying out semantic analysis on the voice command under the condition that the voice command is a complete voice command, and determining a feedback command of the voice command based on the analysis result;
and receiving the feedback instruction sent by the server, and controlling the display to display a target interface corresponding to the feedback instruction.
2. The display device of claim 1, wherein the pause threshold comprises a first threshold and a second threshold, the first threshold being less than or equal to the second threshold; the controller is configured to:
determining that the integrity detection result indicates that the voice instruction is an incomplete voice instruction under the condition that the pause duration is smaller than the first threshold value;
and under the condition that the pause duration is greater than the second threshold value, determining that the integrity detection result indicates that the voice instruction is a complete voice instruction.
3. The display device of claim 2, wherein the controller is further configured to:
before receiving the voice instruction, receiving a wake-up instruction input by the user;
extracting a voiceprint feature vector of the wake-up instruction, matching the voiceprint feature vector with a voiceprint feature vector of a preset voiceprint feature vector library, and determining an acoustic feature parameter corresponding to the wake-up instruction, wherein the acoustic feature parameter comprises at least one of a speech rate parameter, a voice intensity parameter and a tone parameter;
determining the first threshold based on acoustic characteristic parameters corresponding to the wake-up instruction;
and adjusting the first threshold according to an acoustic characteristic parameter library corresponding to the user to obtain the second threshold.
4. The display device of claim 1, wherein the controller is configured to:
according to the audio signal power of the voice command, calculating the audio signal power increment between the last two adjacent audio signal frames;
determining that the tail point continuity detection result indicates that the tail point of the voice command is discontinuous under the condition that the audio signal power increment is larger than a first increment threshold;
and under the condition that the power increment of the audio signal is smaller than a second increment threshold, determining that the tail point continuity detection result indicates the tail point continuity of the voice command, wherein the second increment threshold is smaller than or equal to the first increment threshold.
5. The display device of claim 4, wherein the controller is configured to:
determining that the integrity detection result indicates that the voice instruction is an incomplete voice instruction under the condition that the tail point continuity detection result indicates that the tail point of the voice instruction is discontinuous;
and executing the step of sending an integrity detection request to a server under the condition that the tail point continuity detection result indicates that the tail points of the voice command are continuous.
6. The display device of any one of claims 1-5, wherein the controller is further configured to:
and when the integrity detection result is determined to indicate that the voice command is an incomplete voice command, receiving a continuation command of the voice command, and determining the integrity detection result of the voice command according to the continuation command.
7. A server, the server comprising:
a controller configured to:
receiving an integrity detection request sent by a display device;
responding to the integrity detection request, and detecting the semantic integrity of the voice instruction by adopting a pre-trained semantic integrity model;
under the condition that the voice command is a complete voice command, carrying out semantic analysis on the voice command, and determining a feedback command of the voice command based on an analysis result;
sending the feedback instruction to the display device, so that the display device controls a display to display a target interface corresponding to the feedback instruction;
wherein the integrity detection request is generated when the display device, upon receiving the voice command, detects that the pause duration of the voice command is greater than a pause threshold and determines, according to the audio signal power of the voice command, that the tail point continuity detection result indicates that the tail point of the voice command is a non-stop tail point, the pause threshold being determined according to acoustic characteristic parameters of the user.
8. The server of claim 7, wherein the controller is further configured to:
matching the voice command with voice commands in a complete voice command library, and determining, when the voice command matches any voice command in the complete voice command library, that the integrity detection result indicates that the voice command is a complete voice command; and/or,
matching the voice command with voice commands in an incomplete voice command library, and determining, when the voice command matches any voice command in the incomplete voice command library, that the integrity detection result indicates that the voice command is an incomplete voice command.
9. The server of claim 7, wherein the controller is further configured to:
training the initial semantic integrity model by using a training data set to generate the pre-trained semantic integrity model.
10. The server of claim 9, wherein the controller is further configured to:
acquiring a real voice instruction set, and segmenting a high-frequency voice instruction in the real voice instruction set to obtain a plurality of sub-voice instructions;
analyzing each sub-voice instruction by adopting a semantic recognition model, and determining a semantic recognition result corresponding to each sub-voice instruction;
when the semantic recognition result corresponding to any one sub-voice instruction is different from the semantic recognition result of the high-frequency voice instruction, marking the any one sub-voice instruction as a first class, wherein the first class is used for indicating the any one sub-voice instruction as an incomplete voice instruction; or,
when the semantic recognition result corresponding to any one sub-voice instruction is the same as the semantic recognition result of the high-frequency voice instruction, marking the any one sub-voice instruction as a second class, wherein the second class is used for indicating that the any one sub-voice instruction is a complete voice instruction.
11. A method of speech processing, for application to a display device, the method comprising:
detecting the pause time of a voice instruction when receiving the voice instruction input by a user;
determining a tail point continuity detection result of the voice instruction according to the audio signal power of the voice instruction under the condition that the pause duration is greater than a pause threshold value, wherein the pause threshold value is determined according to the acoustic characteristic parameters of the user;
sending an integrity detection request to a server under the condition that the tail point continuity detection result indicates that the tail point of the voice command is a non-stop tail point, so that the server adopts a pre-trained semantic integrity model to detect the semantic integrity of the voice command, carries out semantic analysis on the voice command under the condition that the voice command is a complete voice command, and determines a feedback instruction of the voice command based on the analysis result;
and receiving the feedback instruction sent by the server, and controlling the display to display a target interface corresponding to the feedback instruction.
12. A method of speech processing, for application to a server, the method comprising:
receiving an integrity detection request sent by a display device;
responding to the integrity detection request, and detecting the semantic integrity of the voice instruction by adopting a pre-trained semantic integrity model;
under the condition that the voice command is a complete voice command, carrying out semantic analysis on the voice command, and determining a feedback command of the voice command based on an analysis result;
sending the feedback instruction to the display device, so that the display device controls a display to display a target interface corresponding to the feedback instruction;
wherein the integrity detection request is generated when the display device, upon receiving the voice command, detects that the pause duration of the voice command is greater than a pause threshold and determines, according to the audio signal power of the voice command, that the tail point continuity detection result indicates that the tail point of the voice command is a non-stop tail point, the pause threshold being determined according to acoustic characteristic parameters of the user.
13. A computer readable storage medium, characterized in that at least one executable instruction is stored in the storage medium, which when executed by a processor, implements the operations of the speech processing method according to claim 11 or 12.
CN202311104802.2A 2023-08-30 2023-08-30 Display device, server, voice processing method, and storage medium Pending CN117809646A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311104802.2A CN117809646A (en) 2023-08-30 2023-08-30 Display device, server, voice processing method, and storage medium

Publications (1)

Publication Number Publication Date
CN117809646A true CN117809646A (en) 2024-04-02

Family

ID=90426236



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination