CN117896564A - Display equipment, voice instruction-based audio recognition method and device - Google Patents


Info

Publication number
CN117896564A
Authority
CN
China
Prior art keywords
audio
time range
information
time
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311864331.5A
Other languages
Chinese (zh)
Inventor
Guo Xubing (郭绪兵)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense Visual Technology Co Ltd
Original Assignee
Hisense Visual Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense Visual Technology Co Ltd filed Critical Hisense Visual Technology Co Ltd
Priority to CN202311864331.5A
Publication of CN117896564A
Legal status: Pending


Abstract

Embodiments of the invention relate to the technical field of intelligent terminals and disclose a display device and a voice-instruction-based audio recognition method and apparatus. The display device comprises a controller configured to: while a multimedia resource is playing, determine a target time range in response to a voice instruction; clip the audio data corresponding to the multimedia resource according to the target time range to obtain an audio clip to be recognized; recognize the audio clip to obtain an audio recognition result; and control the display to present a recognition result interface according to the audio recognition result. Applying this technical scheme improves both the convenience of searching for and recognizing the audio information of multimedia resources and the accuracy of the audio recognition result.

Description

Display equipment, voice instruction-based audio recognition method and device
Technical Field
The invention relates to the technical field of intelligent terminals, and in particular to a display device and a voice-instruction-based audio recognition method and apparatus.
Background
When watching a film or TV series on a display device (e.g., a television, mobile phone, or computer), a user may want to know what background music is playing. Typically, the user performs a text search through a web page on another display device, for example entering and searching: "What is the background music playing while the leads stand at the bow of the ship in Titanic?" Alternatively, the user can identify the background music playing at that moment through a music recognition application on another display device; to do so, the user must open the music recognition application on the other device, adjust the playback progress of the film to the position with the background music, and finally wait for the recognition result.
The above methods of searching for background music are cumbersome and result in a poor user experience. Methods of searching for and identifying background music therefore need further improvement.
Disclosure of Invention
Embodiments of the invention provide a display device and a voice-instruction-based audio recognition method and apparatus, which improve the convenience of searching for and recognizing the audio information of multimedia resources and improve the accuracy of the audio recognition result.
According to one aspect of the embodiments of the present invention, there is provided a display device comprising: a display configured to display a user interface; a communicator configured to receive a voice instruction input by a user, the voice instruction being used to search for audio information corresponding to a multimedia resource; and a controller coupled to the display and the communicator, respectively, and configured to: while the multimedia resource is playing, determine a target time range in response to the voice instruction; clip the audio data corresponding to the multimedia resource according to the target time range to obtain an audio clip to be recognized; recognize the audio clip to obtain an audio recognition result; and control the display to present a recognition result interface according to the audio recognition result.
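The controller's four-step flow (determine a time range, clip the audio, recognize it, display the result) can be sketched as follows. This is only a hypothetical illustration of the sequence of steps; the `Player`, `recognize`, and `show` names are assumptions, not from the patent.

```python
from dataclasses import dataclass

@dataclass
class Player:
    position: float  # current playing time of the audio data, in seconds
    duration: float  # total duration of the audio data, in seconds

def handle_voice_command(offset, player, recognize, show):
    """Sketch of the controller's response to a voice instruction.

    offset: seconds relative to the current playing time, as parsed from
    the instruction (negative = in the past). recognize and show are
    injected callables standing in for the recognition back end and the
    recognition result interface.
    """
    # 1. Determine the target time range (clamped to the audio's extent).
    point = max(0.0, min(player.position + offset, player.duration))
    lo, hi = sorted((point, player.position))
    # 2. Clip the audio data to that range (represented here by the range itself).
    clip = (lo, hi)
    # 3. Recognize the clip, then 4. display the recognition result.
    result = recognize(clip)
    show(result)
    return result
```

Injecting the recognizer and the display as callables keeps the sketch testable without real audio or UI components.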
In some embodiments, the controller is specifically configured to: process the voice instruction to determine the time information it contains; and determine the target time range according to the type of the time information, where the type of the time information includes at least one of a relative time type, an absolute time type, and a fuzzy time type, the minimum of the target time range is greater than 0, and its maximum is less than or equal to the total duration of the audio data.
In some embodiments, the controller is specifically configured to: if the time information is of the relative time type, obtain the current playing time of the audio data and the total duration of the audio data, and determine the target time range from the current playing time and the time information. Either the minimum of the target time range is greater than 0 and its maximum is less than or equal to the current playing time, or the minimum is greater than the current playing time and the maximum is less than or equal to the total duration of the audio data.
In some embodiments, the controller is specifically configured to: if the time information is of the absolute time type and is itself a time range, determine the target time range from that time range; if the time information is of the absolute time type and is a time point, determine the target time range from the time point and a preset adjustment value.
In some embodiments, the controller is specifically configured to: if the time information is of the fuzzy time type, obtain the current playing time of the audio data, and determine the target time range from the current playing time and a preset adjustment value.
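The three branches above (relative, absolute, and fuzzy time types) can be sketched in a single function. This is an illustrative reading of the claims, not the patent's implementation; the dictionary representation, function names, and default adjustment value are all assumptions.

```python
def determine_target_range(time_info, current_pos, total_duration, adjust=15.0):
    """Map parsed time information from a voice instruction to a
    (start, end) clipping window in seconds.

    time_info examples (hypothetical representation):
      {"type": "relative", "offset": -30}      # "the music 30 seconds ago"
      {"type": "absolute", "range": (60, 90)}  # "between 1:00 and 1:30"
      {"type": "absolute", "point": 75}        # "at 1:15"
      {"type": "fuzzy"}                        # "that song just now"
    """
    kind = time_info["type"]
    if kind == "relative":
        # Offset from the current playing time; the range spans from the
        # referenced point to the current position (or vice versa).
        point = current_pos + time_info["offset"]
        lo, hi = sorted((point, current_pos))
    elif kind == "absolute" and "range" in time_info:
        lo, hi = time_info["range"]
    elif kind == "absolute":
        # A single time point, widened by a preset adjustment value.
        lo, hi = time_info["point"] - adjust, time_info["point"] + adjust
    else:  # fuzzy: centre the window on the current playing time
        lo, hi = current_pos - adjust, current_pos + adjust
    # Clamp so the minimum is not below 0 and the maximum does not
    # exceed the total duration of the audio data.
    return max(lo, 0.0), min(hi, total_duration)
```

The final clamp enforces the constraint stated in the claims: the minimum of the target time range stays above 0 and the maximum never exceeds the total duration of the audio data.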
In some embodiments, the controller is specifically configured to: perform audio separation on the audio clip to be recognized to obtain a first audio clip within it, and recognize the first audio clip to obtain the audio recognition result.
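As one deliberately simple stand-in for the audio separation step, the sketch below isolates accompaniment from a stereo clip by centre-channel cancellation: content mixed equally into both channels (typically vocals) cancels when the channels are subtracted. The patent does not specify a separation algorithm; a production system would more likely use a learned source-separation model.

```python
import numpy as np

def separate_accompaniment(stereo):
    """Crude background-music isolation from a stereo clip.

    stereo: float array of shape (n_samples, 2). Centre-mixed content is
    identical in both channels and cancels under subtraction, leaving
    side content such as accompaniment as the "first audio clip".
    """
    left, right = stereo[:, 0], stereo[:, 1]
    return left - right  # mono signal with centre content removed
```

Centre-channel cancellation is a classic trick; it fails when the music itself is centred, which is why this is only a sketch of the separation interface rather than a recommended method.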
In some embodiments, the controller is specifically configured to: search a first preset database for at least one second audio clip matching the first audio clip; if at least one second audio clip is found, determine the degree of association between each of the second audio clips and the multimedia resource; and take the audio information corresponding to the second audio clip most strongly associated with the multimedia resource as the audio recognition result.
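The match-and-rank step above can be sketched as follows. The association score here is a toy count of shared metadata fields, since the patent does not fix a formula; all field names are assumptions.

```python
def pick_best_match(second_clips, resource_info):
    """Among matched second audio clips, return the one whose metadata is
    most strongly associated with the multimedia resource being played.
    Returns None when no second clip was found."""
    def association(clip):
        # Toy degree of association: number of metadata fields the
        # candidate shares with the resource (e.g. source film, artist).
        return sum(1 for key in ("source", "artist", "album")
                   if clip.get(key) is not None
                   and clip.get(key) == resource_info.get(key))
    if not second_clips:
        return None
    return max(second_clips, key=association)
```

Ranking by association with the resource, rather than returning the first acoustic match, is what lets the scheme prefer the version of a song actually used in the film over covers or re-releases.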
In some embodiments, the controller is further configured to: search a second preset database according to the resource information of the multimedia resource and the target time range, the second preset database comprising a plurality of preset correspondences among resource information, audio time ranges, and audio information. If the second preset database contains target resource information identical to the resource information of the multimedia resource, and the audio time range corresponding to the target resource information at least partially overlaps the target time range, the audio information corresponding to the target resource information is taken as the audio recognition result. The controller is specifically configured to: if the target resource information does not exist in the second preset database, or the audio time range corresponding to it does not overlap the target time range, clip the audio data according to the target time range to obtain the audio clip to be recognized.
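The second-preset-database lookup amounts to a cache keyed by resource information with interval-overlap matching on the time range. A minimal sketch follows; the names and data layout are assumptions.

```python
def lookup_cached_result(db, resource_id, target_range):
    """Return cached audio information for a resource whose stored audio
    time range at least partially overlaps the target time range, else
    None (the caller then falls back to clipping and recognizing).

    db maps resource information to a list of
    (audio_start, audio_end, audio_info) entries, in seconds.
    """
    lo, hi = target_range
    for start, end, info in db.get(resource_id, []):
        if start <= hi and end >= lo:  # intervals overlap at least partially
            return info
    return None
```

Checking the cache before clipping avoids re-running recognition for a song that a previous query against the same resource and a nearby time range already identified.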
In some embodiments, the controller is further configured to: in response to a play instruction that the user inputs on the recognition result interface via the audio play control corresponding to a target audio name, play the audio data corresponding to the target audio name. The audio recognition result comprises at least one audio name and an audio play control for each audio name, the at least one audio name including the target audio name.
According to another aspect of the embodiments of the present invention, there is provided a voice-instruction-based audio recognition method applied to a display device, the method comprising: receiving a voice instruction input by a user, the voice instruction being used to search for audio information corresponding to a multimedia resource; while the multimedia resource is playing, determining a target time range in response to the voice instruction; clipping the audio data corresponding to the multimedia resource according to the target time range to obtain an audio clip to be recognized; recognizing the audio clip to obtain an audio recognition result; and controlling the display to present a recognition result interface according to the audio recognition result.
According to yet another aspect of the embodiments of the present invention, there is provided a voice-instruction-based audio recognition apparatus configured in a display device, the apparatus comprising: a receiving module for receiving a voice instruction input by a user, the voice instruction being used to search for audio information corresponding to a multimedia resource; a determining module for determining a target time range in response to the voice instruction while the multimedia resource is playing; a clipping module for clipping the audio data corresponding to the multimedia resource according to the target time range to obtain an audio clip to be recognized; a recognition module for recognizing the audio clip to obtain an audio recognition result; and a control module for controlling the display to present a recognition result interface according to the audio recognition result.
According to yet another aspect of the embodiments of the present invention, there is provided a computer-readable storage medium storing at least one executable instruction which, when executed on a display device, causes the display device to perform the operations of the voice-instruction-based audio recognition method described above.
With the display device and the voice-instruction-based audio recognition method and apparatus provided by the embodiments of the invention, the display device can respond to a voice instruction while a multimedia resource is playing, determine a target time range, and clip the audio data corresponding to the multimedia resource according to that range to obtain an audio clip to be recognized. The clip is then recognized to obtain an audio recognition result, and the display is finally controlled to present a recognition result interface according to that result.
Compared with prior-art schemes in which audio information must be searched for by text or through a music recognition application on another device, the present scheme lets the user search for audio information simply by speaking a voice instruction to the display device, without cumbersome operations. In addition, the scheme can accurately extract the audio clip to be recognized from the multimedia resource being played according to the user's voice instruction, and recognize that clip to obtain an audio recognition result. Applying the scheme therefore improves both the convenience of searching for and recognizing the audio information of multimedia resources and the accuracy of the audio recognition result.
Drawings
Fig. 1 shows an interaction schematic diagram of a display device and a control device according to an embodiment of the present invention;
fig. 2 shows a block diagram of a configuration of a control device in an embodiment of the present invention;
fig. 3 is a block diagram showing a hardware configuration of a display device according to an embodiment of the present invention;
FIG. 4 is a flowchart of an audio recognition method based on voice instructions according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an audio recognition method based on voice commands according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of another voice command-based audio recognition method according to an embodiment of the present invention;
FIG. 7 shows a flowchart for determining a target time range provided by an embodiment of the present invention;
FIG. 8 is a flowchart of determining an audio recognition result according to an embodiment of the present invention;
FIG. 9 is a flowchart of another method for determining an audio recognition result according to an embodiment of the present invention;
fig. 10 is a schematic diagram showing an audio recognition result according to an embodiment of the present invention;
FIG. 11 is a schematic diagram showing another audio recognition result according to an embodiment of the present invention;
FIG. 12A is a schematic diagram showing still another audio recognition result according to an embodiment of the present invention;
FIG. 12B is a schematic diagram showing still another audio recognition result according to an embodiment of the present invention;
fig. 13 is a schematic structural diagram of an audio recognition device based on voice command according to an embodiment of the present invention.
Detailed Description
For clarity and ease of implementation of the present application, the following describes exemplary implementations of the present application clearly and completely with reference to the accompanying drawings, in which those implementations are illustrated. Evidently, the described implementations are only some, not all, of the examples of the present application.
It should be noted that the brief descriptions of terms in the present application are provided only for convenience in understanding the embodiments described below and are not intended to limit those embodiments. Unless otherwise indicated, these terms are to be construed according to their ordinary and customary meaning.
The terms "first," "second," "third," and the like in the description, the claims, and the above figures are used to distinguish between similar objects or entities and do not necessarily limit a particular order or sequence, unless otherwise indicated. It is to be understood that terms so used are interchangeable under appropriate circumstances.
The terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus comprising a list of elements is not necessarily limited to those elements explicitly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
When a user wants to know the audio information (e.g., background music information) in a multimedia resource played on a display device (such as a television, mobile phone, or computer), the user can search for the audio information by keywords on another display device, for example entering in a web page: "What is the background music playing while the leads stand at the bow of the ship in Titanic?" However, the user may be unable to come up with the relevant keywords: the current scene may be difficult to describe, or the user may not know the name of the multimedia resource being played. In that case it is difficult to search for the audio information by keywords, and there is no guarantee that a corresponding search result exists on the web.
Alternatively, the user may identify the background music playing at that moment through a music recognition application on another display device. However, the user then needs to open the music recognition application on the other device, adjust the playback progress of the film to the position with background music, and finally wait for the recognition result. The whole search process is cumbersome, which greatly reduces search efficiency and degrades the user experience.
In view of one or more of the foregoing problems, embodiments of the present invention provide a display device and a voice-instruction-based audio recognition method that can be applied to the display device. Fig. 1 shows an interaction schematic diagram of a display device and a control device according to an embodiment of the present invention. As shown in Fig. 1, a user may operate the display device 200 through the mobile terminal 300 or the control device 100. The control device 100 may be a remote controller, which may communicate with the display device 200 through an infrared protocol or a Bluetooth protocol, or may control the display device 200 wirelessly or via another wired connection.
The user may input user instructions through keys on a remote controller, voice input, a control panel, etc., to control the display device 200. For example, the user may control the display device 200 to switch a displayed page through up-down keys on the remote controller, control the video played by the display device 200 to play or pause through play pause keys, and input a voice command through voice input keys to control the display device 200 to perform a corresponding operation.
In some embodiments, the user may also control the display device 200 using a mobile terminal, tablet, computer, notebook, and other smart device. For example, a user may control the display device 200 through an application installed on the smart device that, by configuration, may provide the user with various controls in an intuitive user interface on a screen associated with the smart device.
In some embodiments, the mobile terminal 300 may establish connection and communication with a software application installed on the display device 200 through a network communication protocol, enabling one-to-one control operation and data communication. For example, a control instruction protocol may be established between the mobile terminal 300 and the display device 200, a remote-control keypad may be synchronized to the mobile terminal 300, the display device 200 may be controlled through a user interface on the mobile terminal 300, or content displayed on the mobile terminal 300 may be transmitted to the display device 200 for synchronized display.
As shown in Fig. 1, the display device 200 and the server 400 may exchange data in a variety of communication manners; the display device 200 may be communicatively connected via a local area network (LAN), a wireless local area network (WLAN), or other networks. The server 400 may provide various contents and interactions to the display device 200. For example, the display device 200 may receive software program updates by sending and receiving messages and interacting with an electronic program guide (EPG), or access a remotely stored digital media library. The server 400 may be one cluster or multiple clusters, and may include one or more types of servers.
The display device 200 may be a liquid crystal display, an Organic Light-Emitting Diode (OLED) display, a projection display device, a smart terminal, such as a mobile phone, a tablet computer, a smart television, a laser projection device, an electronic desktop (electronic table), etc. The specific display device type, size, resolution, etc. are not limited.
Fig. 2 shows a block diagram of the configuration of the control device 100 in an exemplary embodiment of the present invention. As shown in Fig. 2, the control device 100 includes a controller 110, a communication interface 130, a user input/output interface 140, a memory, and a power supply. The control device 100 may receive an operation instruction input by a user, convert it into an instruction that the display device 200 can recognize and respond to, and interact with the display device 200.
Taking a display device as an example of a television, fig. 3 shows a hardware configuration block diagram of a display device 200 according to an embodiment of the present invention. As shown in fig. 3, the display device 200 includes: a modem 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, and at least one of a memory, a power supply, and a user interface.
The modem 210 may receive broadcast television signals in a wired or wireless manner and demodulate audio/video signals, as well as EPG data signals, from a plurality of wireless or wired broadcast television signals. The detector 230 may be used to collect signals from the external environment or signals of interaction with the outside.
In some embodiments, the frequency point demodulated by the modem 210 is controlled by the controller 250: the controller 250 may issue a control signal according to the user's selection so that the modem responds to the television signal frequency selected by the user and modulates and demodulates the television signal carried on that frequency.
Broadcast television signals may be classified by broadcasting system into terrestrial broadcast signals, cable broadcast signals, satellite broadcast signals, Internet broadcast signals, and the like; by modulation type into digital modulation signals, analog modulation signals, and the like; and by signal type into digital signals, analog signals, and the like.
In some embodiments, the controller 250 and the modem 210 may be located in separate devices, i.e., the modem 210 may also be located in an external device to the main device in which the controller 250 is located, such as an external set-top box or the like.
In some embodiments, communicator 220 may be a component for communicating with external devices or external servers according to various communication protocol types. For example: the communicator may include at least one of a Wifi chip, a bluetooth communication protocol chip, a wired ethernet communication protocol chip, or other network communication protocol chip or a near field communication protocol chip, and an infrared receiver.
In some embodiments, the detector 230 may be used to collect signals of or interact with the external environment, may include an optical receiver and a temperature sensor, etc.
The light receiver is a sensor for acquiring ambient light intensity, according to which display parameters and the like can be adaptively adjusted. The temperature sensor may be used to sense the ambient temperature so that the display device 200 can adaptively adjust the display color temperature of the image: for example, when the ambient temperature is high, the display device 200 may shift the displayed image toward a colder color temperature, and when the ambient temperature is low, toward a warmer one.
In some embodiments, the detector 230 may further include an image collector, such as a camera, a video camera, etc., which may be used to collect external environmental scenes, collect attributes of a user or interact with a user, adaptively change display parameters, and recognize a user gesture to realize an interaction function with the user.
In some embodiments, the detector 230 may also include a sound collector, such as a microphone, which may be used to receive the user's voice, for example a voice signal containing a control instruction for controlling the display device 200, or to collect environmental sounds for recognizing the type of environmental scene, so that the display device 200 can adapt to the ambient noise.
In some embodiments, external device interface 240 may include, but is not limited to, the following: any one or more interfaces such as a high-definition multimedia interface (High Definition Multimedia Interface, HDMI), an analog or data high-definition component input interface, a composite video input interface, a universal serial bus (Universal Serial Bus, USB) input interface, an RGB port, or the like may be used, or the interfaces may form a composite input/output interface.
As shown in Fig. 3, the controller 250 may include at least one of a central processor, a video processor, an audio processor, a graphics processor, a random access memory (RAM), a read-only memory (ROM), and first through nth interfaces for input/output, with a communication bus connecting the various components.
In some embodiments, the controller 250 may control the operation of the display device and respond to user operations through various software control programs stored in memory. For example, a user may input a user command through a graphical user interface (GUI) displayed on the display 260, with the user input interface receiving the command through the GUI; or the user may input a command via a specific sound or gesture, with the user input interface recognizing the sound or gesture through a sensor to receive the command.
A "user interface" is a medium interface for interaction and information exchange between an application or operating system and a user; it converts between an internal form of information and a form acceptable to the user. A commonly used presentation form of a user interface is the graphical user interface (GUI), a user interface displayed graphically and related to computer operations. Its controls may include visual interface elements such as icons, buttons, menus, tabs, text boxes, dialog boxes, status bars, navigation bars, and widgets.
In some embodiments, RAM may be used to store temporary data for the operating system or other running programs, and ROM may be used to store instructions for various system starts. For example, ROM may store the startup instructions of the basic input/output system (BIOS), which completes the power-on self-test of the system, the initialization of each functional module in the system, the loading of the system's basic input/output drivers, and the booting of the operating system.
In some embodiments, upon receipt of the power-on signal, the display device 200 begins to start up, and the central processor runs the system boot instructions in ROM, copying the temporary data of the operating system stored in memory into RAM so that the operating system can be started and run. After the operating system has started, the central processor copies the temporary data of the various application programs in memory into RAM so that those applications can be started and run.
In some embodiments, the central processor may be configured to execute operating system and application instructions stored in memory, and to execute various applications, data, and content in accordance with various interactive instructions received from external inputs, to ultimately display and play various audio-visual content.
In some exemplary embodiments, the central processor may include a plurality of processors, for example one main processor and one or more sub-processors: the main processor performs some operations of the display device 200 in the pre-power-up mode and/or displays pictures in normal mode, while the sub-processors perform operations in standby mode and the like.
In some embodiments, the video processor may be configured to receive an external video signal and perform video processing according to the standard codec protocol of the input signal, such as decompression, decoding, scaling, noise reduction, frame rate conversion, resolution conversion, transparency setting, and image composition, to obtain a signal that can be displayed or played directly on the display device 200.
In some embodiments, the video processor may include a demultiplexing module, a video decoding module, an image compositing module, a frame rate conversion module, a display formatting module, and the like.
The demultiplexing module demultiplexes the input audio/video data stream, such as an MPEG-2 (Moving Picture Experts Group-2) stream, into a video signal, an audio signal, and the like; the video decoding module processes the demultiplexed video signal, including decoding, scaling, transparency setting, and so on.
The image composition module, such as an image compositor, superimposes and mixes the output of the graphics generator with the scaled video image according to a GUI signal input by the user or generated by the graphics generator, producing an image signal for display. The frame rate conversion module converts the input video frame rate, for example from 60 Hz to 120 Hz or 240 Hz, typically by frame interpolation. The display formatting module converts the received frame-rate-converted video into a video output signal conforming to the display format, for example an RGB data signal.
In some embodiments, the audio processor may be configured to receive an external audio signal, decompress and decode the audio signal according to a standard codec protocol of the input signal, and perform noise reduction, digital-to-analog conversion, and amplification processes to obtain a sound signal that may be played in a speaker.
In some embodiments, the video processor may comprise one or more chips, and the audio processor may also comprise one or more chips. Alternatively, the video processor and the audio processor may be combined in a single chip, or integrated with the controller in one or more chips.
In some embodiments, the input/output interface may be used for audio output, that is, receiving the sound signal output by the audio processor under the control of the controller 250 and outputting it to an external sound device such as a speaker. Besides the speaker carried by the display device 200 itself, the sound signal may also be output to an external sound output terminal, for example an external sound interface or an earphone interface. The audio output may further be realized through a near-field communication module in the communication interface, for example a Bluetooth module that outputs sound to a speaker connected via Bluetooth.
In some embodiments, the graphics processor may be used to generate various graphical objects, such as icons, operation menus, and graphics displayed in response to user input instructions. The graphics processor may include an operator, which receives the various interactive instructions input by the user, performs the corresponding operations, and displays the various objects according to their display attributes; and a renderer, which renders the objects produced by the operator, the rendered objects being displayed on the display.
In some embodiments, the graphics processor and the video processor may be integrated or configured separately. When integrated, they may jointly process the graphics signal output to the display; when configured separately, they may perform different functions, for example a graphics processor (Graphics Processing Unit, GPU) + frame rate conversion (Frame Rate Conversion, FRC) architecture.
The display 260 may be at least one of a liquid crystal display, an OLED display, a touch display, and a projection display, and may also be a projection device and a projection screen.
In some embodiments, the display 260 may be used to display a user interface, such as may be used to display a corresponding interface of a display device, e.g., the display interface may be a channel search interface in a display device, or may also be a display interface of some application program, etc.
In some embodiments, the display 260 may be used to receive audio and video signals output by the audio processor and video processor, display video content and images, play audio of the video content, and display components of a menu manipulation interface.
In some embodiments, the display 260 may be used to present a user-operated UI interface generated in the display device 200 and used to control the display device 200.
In some embodiments, the display device 200 may establish the transmission and reception of control signals and data signals with the control device 100 or a content providing apparatus through the communicator 220.
In some embodiments, the memory may include storage of various software modules for driving the display device 200. Such as: various software modules stored in the first memory, including: at least one of a basic module, a detection module, a communication module, a display control module, a browser module, various service modules and the like.
The base module is a bottom software module for communicating signals between the various hardware in the display device 200 and sending processing and control signals to the upper module. The detection module is used for collecting various information from various sensors or user input interfaces and carrying out digital-to-analog conversion and analysis management.
The display control module can be used for controlling the display to display image content, and can be used for playing multimedia image content, UI interfaces, and other information. The communication module can be used for control and data communication with external devices. The browser module can be used for performing data communication with browsing servers. The service module is used for providing various services and various application programs. Meanwhile, the memory may also store received external data and user data, images of various items in various user interfaces, visual effect patterns of the focus object, and the like.
In some embodiments, the user interface may be used to receive signals from the control device 100, such as an infrared control signal transmitted by an infrared remote controller.
Under the control of the controller 250, the power supply may provide power to the display device 200 with power input from an external power source.
In some embodiments, the display device 200 may receive a query instruction input by a user through the communicator 220. For example, when communicator 220 is a touch component, the touch component may together with display 260 form a touch screen. On the touch screen, a user can input different control instructions through touch operation, for example, the user can input touch instructions such as clicking, sliding, long pressing, double clicking and the like, and different touch instructions can represent different control functions.
To implement the different touch actions, the touch assembly may generate different electrical signals when the user inputs the different touch actions, and transmit the generated electrical signals to the controller 250. The controller 250 may perform feature extraction on the received electrical signal to determine a control function to be performed by the user based on the extracted features.
For example, when a user inputs a click touch action at a search location in the display interface, the touch component will sense the touch action to generate an electrical signal. After receiving the electrical signal, the controller 250 may determine the duration of the level corresponding to the touch action in the electrical signal, and recognize that the user inputs the click command when the duration is less than the preset time threshold. The controller 250 then extracts the location features generated by the electrical signals to determine the touch location. When the touch position is within the search position range, it is determined that the user has input a click touch instruction at the search position. Then, the controller 250 may start a media search function and receive a search instruction input by the user, such as a search keyword, a voice search instruction, etc.
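The click-recognition logic described above (a level-duration threshold plus a position check) can be sketched as follows. The function name, return labels, and the rectangular search region are illustrative assumptions, not part of the embodiment.

```python
def classify_touch(duration_ms, position, search_region, long_press_ms=500):
    """Classify a touch event: a touch shorter than the threshold is a
    click; a click whose position falls inside the search region is
    treated as triggering the media search function."""
    x, y = position
    (x0, y0), (x1, y1) = search_region  # region as two corner points
    in_region = x0 <= x <= x1 and y0 <= y <= y1
    if duration_ms < long_press_ms:
        return "start_search" if in_region else "click"
    return "long_press"
```

A click at (50, 50) inside a search region spanning (0, 0)–(100, 100) would start the search, while the same click outside the region remains an ordinary click.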
In some embodiments, the user may trigger the query operation through a specific gesture operation on the touch screen, for example, when the user performs two continuous double-click operations on the display interface, the controller 250 may determine an interval time between two continuous double-clicks, and when the interval time is less than a preset time threshold, recognize that the user inputs the continuous double-click operation, and determine that the user triggers the media resource search operation.
In some embodiments, a user may enter voice instructions on a touch screen via a touch operation, such as a user may trigger a voice query operation on display 260 via a voice-triggered gesture.
In some embodiments, the communicator 220 may also be an external control component, such as a mouse, remote control, or the like, which may establish a communication connection with a display device. When the user performs different control operations on the external control component, the external control component may generate different control signals in response to the control operations of the user and transmit the generated control signals to the controller 250. The controller 250 may perform feature extraction on the received control signal to determine a control function to be performed by the user according to the extracted features.
For example, when a user clicks the left mouse button at any position in the channel display interface through the external control component, the external control component senses the control action and generates a control signal. After receiving the control signal, the controller 250 may determine the stay time of the action at that position according to the control signal, and when the stay time is less than the preset time threshold, recognize that the user has input a click instruction through the external control component. In the current scene, the click instruction is used to trigger the input function of the query instruction or to switch the media resource page.
For another example, when the user presses a voice key on the remote control, the remote control may initiate a voice entry function, and during the process of the user entering a voice command, the remote control may synchronize the voice command to the display 260, at which time the display 260 may display a voice entry identifier to indicate that the user is entering a voice command.
In some embodiments, the control component connected through the communicator 220 may also be a keyboard coupled to the display 260, for example the keyboard of a desktop computer. The user can input different control instructions through the keyboard, such as media information switching instructions, query instructions, and the like.
Illustratively, the user may input a click command, a voice command, etc. through the corresponding shortcut key. For example, the user may trigger the sliding operation by selecting the "Tab" key and the direction key, that is, when the user selects the "Tab" key and the direction key on the keyboard at the same time, the controller 250 may receive the key signal, determine that the user triggers the operation of performing the switching operation in the direction corresponding to the direction key, and then, the controller 250 may control to turn or scroll the display interface in the media presentation page to display the corresponding media options.
Correspondingly, the user can also input voice instructions through corresponding shortcut keys. For example, when the user selects the "Ctrl" key and the "V" key, the controller 250 may receive a key signal to determine that the user triggers a voice search operation, and then the controller 250 may receive a voice command input by the user and control the display 260 to perform a corresponding operation, such as displaying a query result page corresponding to the voice command, according to the voice command.
In order to facilitate the detailed description of the voice command based audio recognition method provided by the embodiment of the present invention, fig. 4 shows a flowchart of a voice command based audio recognition method provided by the embodiment of the present invention, and the method may be applied to the display device 200 shown in fig. 1.
Among other things, the display device 200 may include a display 260, a communicator 220, and a controller 250 coupled to the display 260 and the communicator 220, respectively.
In some embodiments, the display 260 may be used to display a user interface, which is the interactive interface between the user and the display device 200; through control operations such as touch operations and gesture operations, the user can send instructions to the display device 200 to accomplish a given task. A well-designed interactive interface allows the user to complete such tasks easily.
In some embodiments, the communicator 220 may be configured to receive a voice instruction input by a user, and the controller 250 may analyze the voice instruction and control the display device 200 to perform a corresponding operation, such as searching for audio information corresponding to a multimedia resource.
According to the voice instruction-based audio recognition method provided by the embodiment of the invention, the display equipment can respond to the voice instruction in the process of playing the multimedia resource, determine the target time range, and intercept the audio data corresponding to the multimedia resource according to the target time range to obtain the audio fragment to be recognized. And then, the audio fragment to be identified is identified, an audio identification result is obtained, and finally, the display is controlled to display an identification result interface according to the audio identification result.
By applying the technical scheme of the invention, the search of the audio information can be realized by inputting the voice command into the display equipment; the audio clips to be identified can be accurately extracted from the played multimedia resources according to the voice command of the user, and the audio clips to be identified are identified, so that an audio identification result is obtained. Therefore, the convenience in searching and identifying the audio information of the multimedia resource can be improved, and the accuracy of the audio identification result can be improved.
As shown in fig. 4, the controller 250 is configured to perform the following steps S410 to S440:
s410: in the process of playing the multimedia resource, a target time range is determined in response to the voice command.
Referring to fig. 5, a scene graph of a voice command based audio recognition method is shown.
As shown in fig. 5, a user may view/play multimedia assets (e.g., audio-video, pictures, text, etc.) through a user interface 501 of a display device. In this embodiment, the multimedia resource is audio/video data, for example: movies, television shows, etc.
If a piece of background music appears in the multimedia resource when the multimedia resource is played, and the user wants to know what the background music is (i.e. the information of the background music), the user can input a voice command to the display device to trigger the operation of the display device to search for the information of the background music.
The voice command may be sound data acquired by the display device. When a user inputs sound data through a voice input function of the display device or an external control component of the display device, such as a remote controller, a microphone and the like, the controller can receive the sound data to obtain a voice command.
In some embodiments, the voice instructions may also be voice data acquired by other means. For example, when the user selects a default voice command provided by the trigger display device, the voice command is the default voice command. For another example, the voice command may be voice data recorded in advance by the user, voice data downloaded in advance from a network, or the like.
Illustratively, the voice command input by the user may be: "search for current background music", "what the background music is for the first 20 seconds", "what the background music is for 10 minutes to 12 minutes", and so on.
Referring to fig. 6, there is shown a scene graph of another voice instruction based audio recognition method.
In some embodiments, after the display device responds to the voice instruction, a prompt 601 may be presented in the user interface 501, for example: "searching for current background music", "background music 20 seconds before searching", and the like to prompt the user that the display device has successfully responded to the voice instruction, and that the search for background music is being performed in accordance with the voice instruction.
In some embodiments, referring to fig. 7, the controller 250 may determine the target time range by:
S710: and processing the voice command to determine time information in the voice command.
First, the voice command input by the user can be subjected to processing such as voice recognition, text processing, and semantic understanding, so as to determine the user's need. For example, if keywords such as "background music" or "BGM" (background music) are included in the voice command, it may be determined that the user's need is to search for background music. Further, semantic slot extraction can be performed on the voice command, processing the text and time-related words of the command to obtain the time information (time slots) in the voice command.
By way of example, the time information expressed by the user may include the following types. One is the relative time type, i.e., the time information given by the user is a rough range based on the current playing time, for example: "the first 20 seconds", "the 10 seconds thereafter", "the first 30 seconds", and so on. The second is the absolute time type, i.e., the time information given by the user is an explicit time range or time point, for example: "the 10th to 16th seconds", "the 2nd minute 53rd second", and so on.
Thus, considering the various types of time information, a text regular-expression matching method may be used in this embodiment to extract the time information; the regular expression (Regular Expression, regex) templates take the form `\d+(…)`, i.e., one or more digits followed by a unit word. The regular expressions corresponding to the different types of time information are shown in Table 1:

TABLE 1

Time                                  Regular expression (regex)   Remarks
Absolute hours (absolute_hour)        \d+(…)                       hours
Absolute minutes (absolute_minute)    \d+(…)                       minutes
Absolute seconds (absolute_second)    \d+(…)                       seconds
Relative minutes (relative_minute)    \d+(…)[前后]                  minutes (前 = before, 后 = after)
Relative seconds (relative_second)    \d+(…)[前后]                  seconds (前 = before, 后 = after)
For example, after semantic slot extraction is performed on the voice command "what is the background music of the first 20 seconds", the obtained time information is "the first 20 seconds" (i.e., −20 seconds); after semantic slot extraction is performed on the voice command "search the background music of the 10th to 16th seconds", the obtained time information is "the 10th second" and "the 16th second".
In addition, the types of time information expressed by the user may also include a fuzzy time type, i.e., the user gives no explicit time range or time point (such as a specific minute or second), only fuzzy time information, for example: "now", "current", "after this", and so on. In summary, the types of time information in the voice instruction may include: the relative time type, the absolute time type, the fuzzy time type, etc.
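The semantic-slot extraction step can be sketched with English-language analogues of the regex templates. The patterns, slot names, and the relative/absolute distinction below are illustrative assumptions; the actual embodiment matches Chinese unit words.

```python
import re

# Illustrative English analogues of the patent's unit regexes;
# the slot names mirror Table 1, the exact expressions are assumptions.
SLOT_PATTERNS = {
    # e.g. "first 20 seconds" -> a range relative to the current time
    "relative_second": re.compile(r"(?:first|last|previous|next)\s+(\d+)\s*seconds?"),
    # e.g. "10th", "16th" -> explicit positions within the audio
    "absolute_second": re.compile(r"(\d+)(?:st|nd|rd|th)"),
}

def extract_time_slots(text):
    """Return (slot_type, value_in_seconds) pairs found in a command."""
    rel = SLOT_PATTERNS["relative_second"].search(text)
    if rel:
        return [("relative_second", int(rel.group(1)))]
    return [("absolute_second", int(m.group(1)))
            for m in SLOT_PATTERNS["absolute_second"].finditer(text)]
```

Applied to the two example commands above, the first yields one relative slot of 20 seconds and the second yields the two absolute slots 10 and 16.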
S720: and determining a target time range according to the type of the time information.
Wherein the minimum value of the target time range is not less than 0 and the maximum value is less than or equal to the total duration of the audio data.
The manner of determining the target time range will be described below taking the above-described relative time type, absolute time type, and fuzzy time type as examples.
Taking the type of time information as the relative time type as an example, if the voice command is: "what is the background music of the first 30 seconds", the obtained time information is "the first 30 seconds" (i.e., -30 seconds) after the above semantic slot extraction processing is performed thereon. In this case, the current playing time of the multimedia asset (i.e., the current playing time of the audio data corresponding to the multimedia asset) and the total duration of the multimedia asset (i.e., the total duration of the audio data corresponding to the multimedia asset) need to be acquired. For example, the current play time of the audio data is "35 minutes 20 seconds", that is, "2120 seconds"; the total duration of the audio data is "2 hours 40 minutes", that is, "9600 seconds".
Next, a target time range may be determined from the current playing time and the time information (2120 s − 30 s = 2090 s), namely: [2090 s, 2120 s].
It will be appreciated that after the target time frame is obtained, it may be necessary to further process the obtained target time frame to be within a valid range of values.
For example, if the time information is "the first 50 seconds" and the current playing time of the audio data is "30 seconds", the obtained target time range is [−20 s, 30 s]; since [−20 s, 0 s] is an invalid time range, the valid target time range should be: [0 s, 30 s]. If the time information is "the last 50 seconds" and the current playing time of the audio data is "9560 seconds", the obtained target time range is [9560 s, 9610 s]; since the total duration of the audio data is "9600 seconds", [9600 s, 9610 s] is an invalid time range, so the valid target time range should be: [9560 s, 9600 s].
Based on the above analysis, the minimum value of the finally obtained target time range should be not less than 0, and the maximum value should be less than or equal to the current playing time (corresponding to the case where the time information is "the previous XX seconds"); alternatively, the minimum value of the target time range should be not less than the current playing time, and the maximum value should be less than or equal to the total duration (corresponding to the case where the time information is "XX seconds after").
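The validity check described above reduces to clamping the candidate range into [0, total duration]. A minimal sketch (the function name is illustrative; it also repairs a reversed range, as in the "18th to 10th seconds" example discussed further below):

```python
def clamp_range(start, end, total):
    """Clamp a candidate [start, end] range into the valid portion of the
    audio: nothing before 0 s and nothing past the total duration."""
    lo = max(0, min(start, end))        # also repairs reversed user input
    hi = min(total, max(start, end))
    return lo, hi
```

Using the figures from the example: a candidate [−20 s, 30 s] clamps to [0 s, 30 s], and [9560 s, 9610 s] clamps to [9560 s, 9600 s] for a 9600-second asset.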
Taking the type of the time information as the absolute time type as an example, the absolute time type may include two cases: one in which the time information is a time range, and another in which the time information is a time point.
First, taking the case where the time information is a time range: if the voice command is "search the background music from the 10th to the 16th second", after the semantic slot extraction processing, the obtained time information is "the 10th second" and "the 16th second", and the target time range determined from this time range is: [10 s, 16 s].
However, the voice command input by the user may contain an incorrect expression; for example, the voice command may be: "search the background music of the 18th to 10th seconds". Therefore, it is also necessary to determine the maximum and minimum values of the target time range, where the minimum value should be not less than 0 and the maximum value should be less than or equal to the total duration of the audio data. Thus, the target time range corresponding to this voice command is: [10 s, 18 s].
Next, taking the case where the time information is a time point: if the voice command is "search the background music of the 2nd minute 53rd second", after the semantic slot extraction processing, the obtained time information is "the 2nd minute 53rd second", that is, "the 173rd second". However, a target time range cannot be determined from the time point "the 173rd second" alone; therefore, an adjustment value may be set in advance, and the target time range may be determined according to the preset adjustment value and the time point.
Assuming that the preset adjustment value is 5 seconds, the time point 173 th second is used as the center to extend left and right for 5 seconds, and the obtained target time range is: [168s,178s ].
Likewise, the minimum value of the target time range should be not less than 0 and the maximum value should be less than or equal to the total duration of the audio data.
Taking the type of the time information as the fuzzy time type as an example: for time information of the fuzzy time type, the target time range can be determined according to the preset adjustment value and the current playing time of the audio data. Likewise, the minimum value of the target time range should be not less than 0 and the maximum value should be less than or equal to the total duration of the audio data.
For example, assume that the current playing time of the audio data is "2120 seconds" and the preset adjustment value is 10 seconds. If the voice command is "what is the current background music", the range can be extended to the left and right of the current playing time by the preset adjustment value, i.e., the target time range is determined as: [2110 s, 2130 s]. If the voice command is "search the background music played just before", the range can be traced back from the current playing time by the preset adjustment value, i.e., the target time range is determined as: [2110 s, 2120 s]. If the voice command is "search the background music after this", the range can be extended forward from the current playing time by the preset adjustment value, i.e., the target time range is determined as: [2120 s, 2130 s].
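The cases above can be resolved into a target range with one small function. This is a hedged sketch: the type labels and the `adjust` parameter (playing the role of the preset adjustment value) are illustrative names, not terms from the embodiment.

```python
def resolve_target_range(kind, total, current, value=None, adjust=10):
    """Map the embodiment's time-information types to a target range,
    clamped to the valid portion of the audio."""
    if kind == "relative_before":    # e.g. "the first 30 seconds"
        start, end = current - value, current
    elif kind == "relative_after":   # e.g. "the 10 seconds thereafter"
        start, end = current, current + value
    elif kind == "absolute_point":   # e.g. "the 173rd second"
        start, end = value - adjust, value + adjust
    elif kind == "fuzzy_now":        # e.g. "the current background music"
        start, end = current - adjust, current + adjust
    else:
        raise ValueError(kind)
    return max(0, start), min(total, end)
```

With a total duration of 9600 s and a current playing time of 2120 s, "the first 30 seconds" resolves to [2090 s, 2120 s], the time point 173 s with a 5-second adjustment to [168 s, 178 s], and "current" to [2110 s, 2130 s], matching the worked figures above.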
S420: and intercepting the audio data corresponding to the multimedia resources according to the target time range to obtain the audio fragment to be identified.
It will be appreciated that multimedia assets are typically loaded in the form of streaming media, and that the audio and video streams of the same multimedia asset are separate. Therefore, when the audio data corresponding to the multimedia resource is intercepted according to the target time range, the intercepted audio fragment (i.e. the audio fragment to be identified) can be directly obtained.
For example, assume that the determined target time range is [120 s, 130 s]; the audio fragment within this range is intercepted from the audio data corresponding to the multimedia resource, yielding the audio fragment to be identified.
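Because the audio stream is separate, the interception amounts to slicing decoded samples by the target time range. A minimal mono-PCM sketch (a production implementation would seek within the streamed audio rather than decode everything; the function name is illustrative):

```python
def cut_pcm(samples, sample_rate, start_s, end_s):
    """Cut the [start_s, end_s] window out of a mono PCM sample list by
    converting the time range to sample indices."""
    lo = int(start_s * sample_rate)
    hi = int(end_s * sample_rate)
    return samples[lo:hi]
```

At a 44100 Hz sample rate, the range [120 s, 130 s] would select samples 5 292 000 through 5 733 000.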
S430: and identifying the audio fragment to be identified to obtain an audio identification result.
The audio piece to be identified can then be identified. It will be appreciated that in addition to background music, audio data is often interspersed with noise, ambient noise, and other interference factors. Therefore, it is first necessary to perform an effective audio separation process on the audio piece to be recognized.
The basic principle of audio separation is to transform background sounds, human sounds, environmental sounds, noise, etc. mixed together in a time domain space into a frequency domain space using a fast fourier transform (Fast Fourier Transform, FFT) or a wavelet transform (Wavelet Transform), separate signal quantities having different spectral characteristics from each other on an f-k spectrum (frequency-wavenumber spectrum), and then extract the frequency domain signal characteristics by a short-time fourier transform (Short Time Fourier Transform, STFT) or a wavelet packet transform (Wavelet Packet Transform, WPT) or the like to obtain effective spectrum information.
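The time-to-frequency transform step can be illustrated with a direct DFT over one frame, repeated over hops to form a minimal STFT. This is an O(N²), unwindowed sketch for illustration only; real implementations use an FFT with a window function.

```python
import cmath

def dft_magnitudes(frame):
    """Magnitude spectrum of one frame via a direct DFT."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))) for k in range(n)]

def stft(samples, frame_len, hop):
    """Minimal short-time Fourier transform: magnitude spectra of
    successive (unwindowed) frames taken every `hop` samples."""
    return [dft_magnitudes(samples[i:i + frame_len])
            for i in range(0, len(samples) - frame_len + 1, hop)]
```

For a pure cosine spanning one period per 8-sample frame, the magnitude peaks at bin 1 with value N/2, which is the kind of spectral feature the separation step operates on.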
In this embodiment, an independent component analysis (Independent Component Analysis, ICA) method in a blind source separation (Blind Source Separation, BSS) algorithm based on a convolutive mixing model may be employed to perform audio separation on the audio clip to be identified, so as to extract the valid spectrum set of the background music (i.e., the first audio clip) from the audio clip to be identified.
It will be appreciated that the core of independent component analysis is to recover independent components from mixed data: it focuses on filtering and dimensionality reduction, identifying the individual source signals within the mixture, so that unneeded sources can be discarded and the needed ones retained.
Thus, after the audio frequency separation is carried out on the audio frequency fragments to be identified by adopting independent component analysis, a frequency spectrum set of the first audio frequency fragments only containing background music can be obtained.
In addition to the independent component analysis method, automatic speech recognition (Automatic Speech Recognition, ASR) may be used to identify human voices in the audio piece to be identified, thereby removing noisy human voice parts from the audio piece to be identified, resulting in a spectral set of the first audio piece containing only background music.
It should be noted that, the method for performing audio separation on the audio clip to be identified is merely an example, and the method for performing audio separation in the present embodiment is not limited.
In some embodiments, referring to fig. 8, the controller 250 may determine the audio recognition result by:
s810: at least one second audio segment matching the first audio segment is found in a first preset database.
After the first audio segment is obtained, a second audio segment matched with the first audio segment can be searched in a database (such as a first preset database) storing a plurality of audio segments.
S820: and if at least one second audio fragment is found, determining the association degree of each second audio fragment in the at least one second audio fragment with the multimedia resource.
S830: and determining the audio information corresponding to the second audio fragment with the highest association degree of the multimedia resource as an audio identification result.
It will be appreciated that, due to the similarity between the melodies of the partial music, there may be a plurality of second audio pieces matching the first audio pieces, i.e. the audio recognition result may include audio information of the plurality of second audio pieces.
Therefore, the second audio clips in the audio recognition result can be screened and sequenced according to the association degree between the multimedia resource and the second audio clips, so that the optimal audio recognition result is determined and recommended to the user.
For example, the multimedia resource may be classified into types, a plurality of audio dimensions may be set correspondingly, and a weight of the audio dimension corresponding to each type may be set, so as to determine the association degree between the multimedia resource and the second audio clip according to the information.
It will be appreciated that the background music in a movie-type multimedia asset is usually close to the year in which the movie was released. For example, the multimedia resource "Titanic" was released in 1997, and its background music is likely to be music produced in the same year. Thus, even if a plurality of second audio pieces are found, the second audio piece from the same year, i.e., "My Heart Will Go On", is the optimal audio recognition result in the time/year dimension.
For a short video type multimedia resource, the background music is most likely to be the current popular music, so that the popular second audio piece in the plurality of second audio pieces can be determined as the optimal audio recognition result.
For a cartoon type multimedia resource, the background music is most likely to be the music of the same region, for example, for a domestic cartoon, a second audio fragment of China can be recommended preferentially; for japanese animation, the second audio clip belonging to japan may be preferentially recommended.
Based on the above analysis, an exemplary type partitioning of multimedia resources and setting of audio dimensions and weights are shown in table 2:
TABLE 2

                   Movie   Japanese cartoon   Domestic cartoon   Game commentary   Short video
Year close           5            3                  3                  1               4
Popular music        2            2                  2                  2               5
Same region          3            5                  5                  3               1
For example, multimedia resources may be divided into types of movies, japanese cartoons, domestic cartoons, game narratives, and the like; the set audio dimensions may include close year (i.e., whether the difference between the year of the second audio piece and the year of the multimedia asset is less than a certain threshold), popular music (i.e., whether the play amount of the second audio piece is greater than a certain threshold), the same region (i.e., whether the second audio piece is the same region to which the multimedia asset belongs), and the like.
Meanwhile, corresponding weights can be given to the audio dimensions corresponding to each type. For example, the weight value is set from small to large to a value of 1 to a value of 5.
For example, for a "movie" type, the weight of "year close" may be set to 5, the weight of "popular music" may be set to 2, and the weight of "region same" may be set to 3.
For the "japanese animation" and "domestic animation" types, the weight of "year close" may be set to 3, the weight of "popular music" may be set to 2, and the weight of "region same" may be set to 5.
For the "game narrative" type, the weight of "year close" may be set to 1, the weight of "popular music" may be set to 2, and the weight of "region same" may be set to 3.
For the "short video" type, the weight of "year close" may be set to 4, the weight of "popular music" may be set to 5, and the weight of "region same" may be set to 1.
Thus, when determining the optimal audio recognition result from the plurality of second audio segments, the association degree between each second audio segment and the multimedia resource may be computed according to Table 2.
For example, for a multimedia resource a of the movie type, suppose that 2 second audio segments are obtained after identifying and searching the first audio segment: a second audio segment a and a second audio segment b. If the year of the second audio segment a is the same as that of the multimedia resource a, the second audio segment a is not popular music, and the second audio segment a belongs to the same region as the multimedia resource a, the association degree between the second audio segment a and the multimedia resource a is: 5×1+2×0+3×1=8. If the year of the second audio segment b differs from that of the multimedia resource a, the second audio segment b is not popular music, and the second audio segment b belongs to the same region as the multimedia resource a, the association degree between the second audio segment b and the multimedia resource a is: 5×0+2×0+3×1=3. Of the two, the second audio segment a has the higher association degree with the multimedia resource a, so the second audio segment a can be determined as the audio recognition result for the first audio segment of the multimedia resource a.
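The weighted scoring of Table 2 and the worked example above can be sketched as follows. The weight table mirrors Table 2, but the dictionary keys, the boolean feature flags, and the function name are illustrative assumptions rather than the patent's actual implementation:

```python
# Weights per resource type, following Table 2 (illustrative sketch).
WEIGHTS = {
    "movie":            {"year_close": 5, "popular": 2, "same_region": 3},
    "japanese_cartoon": {"year_close": 3, "popular": 2, "same_region": 5},
    "domestic_cartoon": {"year_close": 3, "popular": 2, "same_region": 5},
    "game_commentary":  {"year_close": 1, "popular": 2, "same_region": 3},
    "short_video":      {"year_close": 4, "popular": 5, "same_region": 1},
}

def association_degree(resource_type: str, features: dict) -> int:
    """Sum the weights of the audio dimensions that the candidate segment satisfies."""
    weights = WEIGHTS[resource_type]
    return sum(weights[dim] for dim, satisfied in features.items() if satisfied)

# The worked example: segment a matches year and region but is not popular music.
score_a = association_degree("movie", {"year_close": True, "popular": False, "same_region": True})
score_b = association_degree("movie", {"year_close": False, "popular": False, "same_region": True})
print(score_a, score_b)  # 8 3
```

The second audio segment with the highest score would then be selected as the audio recognition result.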
It should be noted that, the above manner of determining the association degree between each second audio segment and the multimedia resource is merely an example, and the present embodiment is not limited thereto.
In the scheme, the time information in the voice command input by the user can be accurately extracted to obtain the target time range, so that the audio data corresponding to the multimedia resource is intercepted based on the target time range to obtain the required audio fragment to be identified. Further, background music in the audio clips to be identified is extracted through independent component analysis and other methods, and a first audio clip with interference noise removed is obtained. After the first audio fragment is identified and matched with a plurality of audio fragments in a first preset database, under the condition that a plurality of identification results (namely a plurality of second audio fragments) are obtained, the association degree between each second audio fragment and the multimedia resource is determined, so that the optimal audio identification result can be determined from the plurality of second audio fragments.
Therefore, the voice recognition function of the display equipment can be triggered by the voice command, so that convenience in searching and recognizing the audio information of the multimedia resource is improved, and the accuracy and recognition efficiency of the audio recognition can be improved.
In some embodiments, referring to fig. 9, the controller may also perform the following method:
s910: searching in a second preset database according to the resource information of the multimedia resource and the target time range.
S920: if the second preset database comprises the same target resource information as the resource information of the multimedia resource, and the audio time range corresponding to the target resource information is at least partially overlapped with the target time range, determining the audio information corresponding to the target resource information as an audio recognition result.
The second preset database comprises a plurality of preset resource information, audio time ranges and corresponding relations among the audio information.
It will be appreciated that audio recognition and searching are generally slow; in particular, for long audio clips with large data volumes, performing the series of recognition and search operations is time-consuming.
By contrast, a simple text search is far more efficient than an audio recognition search, so a database storing historical audio recognition results (i.e., the second preset database) can be established in advance. Thus, before processing the audio data, it can first be checked whether an audio recognition result matching the target time range can be found in the second preset database.
For example, if a user searched for "the background music at about 55 seconds" while watching "Titanic", the information related to that audio recognition result may be stored in the second preset database, thereby generating an index record. The information related to the audio recognition result may include: the name of the multimedia resource (i.e., the preset resource information), the target time range (i.e., the audio time range), the normalization range (the range obtained by normalizing the audio time range), the audio information of the second audio clip in the audio recognition result, and the like.
An exemplary index record in the second preset database is shown in table 3:
TABLE 3
Illustratively, suppose the user inputs the following voice instruction while watching "Titanic": "search for the background music from 50 seconds to 60 seconds". Time information is extracted from the voice instruction to obtain the target time range [50s,60s]. Then, the target time range [50s,60s] and the resource information of the multimedia resource, i.e. "Titanic", can be matched against the index records in the second preset database, so as to find at least one piece of target resource information identical to the resource information among the preset resource information. As shown in Table 3, the qualifying index records include entries 1 and 2.
If at least one piece of target resource information is found, it can be further determined whether the audio time range or the normalization range corresponding to the target resource information at least partially overlaps with the target time range [50s,60s]. If so, the audio information corresponding to the target resource information can be determined as the audio recognition result. As shown in Table 3, the normalization range corresponding to the 1st index record is [50s,70s], which contains the target time range [50s,60s]; therefore, the audio information corresponding to the target resource information in the 1st index record, i.e. "My Heart Will Go On", can be determined as the audio recognition result, and the audio separation and audio recognition operations are not required.
By leveraging the advantages of a big-data platform, the more such searches are performed, the more index records are generated in the second preset database, and the more accurate the obtained audio recognition results become. Of course, index records for some well-known multimedia resources and their corresponding audio information, audio time ranges, etc. can also be created manually.
Through the scheme, the audio recognition result can be determined directly based on the time information in the voice command and the second preset database, so that the process of audio separation and audio recognition is omitted, and the efficiency of audio recognition is further improved.
S930: if the target resource information does not exist in the second preset database, or the audio time range corresponding to the target resource information is not overlapped with the target time range, the audio data is intercepted according to the target time range, and the audio fragment to be identified is obtained.
If the index record corresponding to the target time range is not found in the second preset database, that is, the target resource information does not exist in the second preset database, or the audio time range corresponding to the target resource information does not overlap with the target time range, operations such as audio separation and audio identification can be continuously performed.
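Steps S910 through S930 amount to a cache lookup keyed on resource information, with an interval-overlap test against the stored time ranges. The sketch below is one possible reading of the scheme; the record layout and function names are assumptions for illustration only:

```python
# Hypothetical index records: (resource name, (start, end) audio time range, audio info).
INDEX = [
    ("Titanic", (50, 70), "My Heart Will Go On"),
]

def overlaps(a: tuple, b: tuple) -> bool:
    """True when the closed intervals [a0, a1] and [b0, b1] intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

def lookup(resource_name: str, target_range: tuple):
    """S920: return cached audio info when a record matches the resource and
    its time range at least partially overlaps the target range.
    S930: return None so the caller falls through to audio separation/recognition."""
    for name, time_range, audio_info in INDEX:
        if name == resource_name and overlaps(time_range, target_range):
            return audio_info
    return None

print(lookup("Titanic", (50, 60)))    # hits the cached record
print(lookup("Titanic", (200, 210)))  # None: no overlap, run full recognition
```

A hit skips the expensive audio separation and recognition steps entirely, which is the efficiency gain the scheme describes.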
S440: and controlling the display to display the recognition result interface according to the audio recognition result.
Reference is made to a schematic diagram of displaying the audio recognition result shown in fig. 10.
In some embodiments, after the above audio recognition result is determined, it may be displayed in the user interface 501 of the display device. As shown in fig. 10, a recognition result interface 1001 may be displayed in the user interface 501. The recognition result interface 1001 displays the relevant audio information of the audio recognition result (i.e., the second audio clip), for example: the cover picture, audio name, and singer of the second audio clip; it may also display the audio play control 1002 of the second audio clip, and so on. Thus, the user can view the audio information of the second audio clip in the recognition result interface 1001 and can play the second audio clip through the audio play control 1002.
Referring to another schematic diagram of displaying the audio recognition result shown in fig. 11.
In some embodiments, in addition to recommending only the optimal audio recognition result as shown in fig. 10, the plurality of second audio segments may be sorted by their association degree with the multimedia resource and recommended in descending order of association degree.
As shown in fig. 11, for example, assume that the first audio segment of the multimedia resource B matches 3 second audio segments: a second audio segment c (audio name: "audio A"), a second audio segment d (audio name: "audio C"), and a second audio segment e (audio name: "audio B"), and that the association degrees of the 3 second audio segments with the multimedia resource B, from high to low, are: the second audio segment c, the second audio segment e, and the second audio segment d. The relevant information of the 3 second audio clips and the corresponding audio play controls 1002 may be displayed in this order in the recognition result interface 1001. Thus, the user can view the audio information of the plurality of second audio clips in the recognition result interface 1001 and can play them through the audio play controls 1002.
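The descending-order recommendation described above reduces to a simple sort on the association degree; the candidate list below reuses the fig. 11 example, with assumed (name, score) pairs:

```python
# (audio name, association degree with multimedia resource B) - assumed values.
candidates = [("audio C", 3), ("audio A", 8), ("audio B", 5)]

# Sort by association degree, highest first, as in the fig. 11 recommendation list.
ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
print([name for name, _ in ranked])  # ['audio A', 'audio B', 'audio C']
```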
In some embodiments, in addition to presenting the audio recognition results to the user, video data associated with the optimal audio recognition result may be recommended to the user. As shown in fig. 11, information of some video data related to the second audio clip c ("audio A") (e.g., the video names "video 1", "video 2", and "video 3") and the corresponding video play controls 1003 may be displayed in the recognition result interface 1001.
The video data may include: the music video (MV) of the second audio clip, concert videos, cover-version videos, related movies, and the like.
Similarly, the video data may be sorted by its association degree with the second audio piece and displayed in descending order of association degree. Thus, the user can view information of the video data related to the second audio clip in the recognition result interface 1001 and can play the video data through the video play controls 1003.
Referring to fig. 12A and 12B, still another schematic diagram showing the audio recognition result is shown.
In some embodiments, as shown in fig. 12A, the user may play the corresponding second audio clip through the audio play control 1002 in the recognition result interface 1001. That is, the user may input a play instruction to the audio play control 1002 corresponding to the target audio name (e.g., audio A) to the display device, and thus, the display device plays the audio data corresponding to the target audio name in response to the play instruction.
Similarly, as shown in fig. 12B, the user can play the corresponding video data through the video play control 1003 in the recognition result interface 1001. That is, the user can input a play instruction to the video play control 1003 corresponding to the target video name (e.g., video 1) to the display device, and thus the display device plays the video data corresponding to the target video name in response to the play instruction.
The play instruction may be input by the user through voice, for example: "play audio A" or "play video 1"; it may also be input through a control device (such as a remote control, keyboard, or mouse), through a gesture, or in other ways, which is not limited in this embodiment.
In addition, the user can perform further operations on the audio data and the video data, such as adding them to favorites, sharing them to social software, or pushing them to the user's terminal device.
Note that, the above-described manner of displaying the audio recognition result is merely an example, and the present embodiment is not limited thereto.
Through the scheme, the audio identification result can be displayed in the user interface, so that a user can view the audio identification result, and the user can play audio data in the audio identification result in the user interface. In addition, the video data related to the audio recognition result can be displayed together in the user interface, and the user can play the video data in the user interface, so that the richer recognition result can be recommended to the user.
In summary, according to the voice instruction-based audio recognition method provided in the embodiment, the search for audio information may be implemented based on the voice instruction input by the user; and the time information in the voice command input by the user can be accurately extracted to obtain a target time range, so that the audio data corresponding to the multimedia resource is intercepted based on the target time range to obtain the required audio fragment to be identified. Further, background music in the audio clips to be identified is extracted through independent component analysis and other methods, and a first audio clip with interference noise removed is obtained. After the first audio fragment is identified and matched with a plurality of audio fragments in a first preset database, under the condition that a plurality of identification results (namely a plurality of second audio fragments) are obtained, the association degree between each second audio fragment and the multimedia resource is determined, so that the optimal audio identification result can be determined from the plurality of second audio fragments. Therefore, the convenience in searching and identifying the audio information of the multimedia resource can be improved, and the accuracy of the audio identification result can be improved.
In addition, the audio recognition method based on the voice command can directly determine the audio recognition result based on the time information in the voice command and the second preset database, so that the process of audio separation and audio recognition is omitted, and the efficiency of audio recognition is further improved.
In addition, the audio recognition method based on the voice command can display the audio recognition result and the video data related to the audio recognition result in the user interface, and the user can play the audio data and the video data by inputting the related command in the user interface, so that richer recognition results can be recommended for the user, and user experience is improved.
The embodiment of the present invention further provides a voice instruction-based audio recognition apparatus, referring to fig. 13, the voice instruction-based audio recognition apparatus 1300 may be applied to a display device, and the voice instruction-based audio recognition apparatus 1300 may include: a receiving module 1310, a determining module 1320, an intercepting module 1330, an identifying module 1340, and a control module 1350.
A receiving module 1310 for: receiving a voice instruction input by a user; the voice command is used for searching the audio information corresponding to the multimedia resource.
A determining module 1320, configured to: in the process of playing the multimedia resource, a target time range is determined in response to the voice command.
An interception module 1330 for: and intercepting the audio data corresponding to the multimedia resources according to the target time range to obtain the audio fragment to be identified.
An identification module 1340 for: and identifying the audio fragment to be identified to obtain an audio identification result.
A control module 1350 for: and controlling the display to display the recognition result interface according to the audio recognition result.
In some embodiments, the determining module 1320 is specifically configured to: processing the voice command and determining time information in the voice command; determining a target time range according to the type of the time information; wherein the type of the time information includes at least one of a relative time type, an absolute time type, and a fuzzy time type, and the minimum value of the target time range is greater than 0 and the maximum value is less than or equal to the total duration of the audio data.
In some embodiments, the determining module 1320 is specifically configured to: if the type of the time information is the relative time type, acquiring the current playing time of the audio data and the total duration of the audio data; determining a target time range according to the current playing time and time information of the audio data; the minimum value of the target time range is larger than 0, and the maximum value is smaller than or equal to the current playing time; or, the minimum value of the target time range is larger than the current playing time, and the maximum value is smaller than or equal to the total duration of the audio data.
In some embodiments, the determining module 1320 is specifically configured to: if the type of the time information is an absolute time type and the time information is a time range, determining a target time range according to the time range; if the type of the time information is an absolute time type and the time information is a time point, determining a target time range according to the time point and a preset adjustment value.
In some embodiments, the determining module 1320 is specifically configured to: if the type of the time information is a fuzzy time type, acquiring the current playing time of the audio data; and determining a target time range according to the current playing time of the audio data and a preset adjusting value.
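The three time-information types handled by the determining module 1320 can be sketched as a single dispatch function. The parameter names, the tuple-vs-scalar encoding of the time information, and the default adjustment value are all assumptions for illustration; the patent does not fix these details:

```python
def target_time_range(info_type, time_info, current_play_time, total_duration, adjust=5):
    """Return a (start, end) target time range in seconds, clamped to [0, total_duration]."""
    if info_type == "relative":
        # e.g. "the music from N seconds ago": a window ending at the current play time
        start, end = current_play_time - time_info, current_play_time
    elif info_type == "absolute":
        if isinstance(time_info, tuple):   # explicit range, e.g. "from 50s to 60s"
            start, end = time_info
        else:                              # single point, e.g. "at about 55s"
            start, end = time_info - adjust, time_info + adjust
    else:                                  # fuzzy, e.g. "this background music"
        start, end = current_play_time - adjust, current_play_time + adjust
    return max(start, 0), min(end, total_duration)

print(target_time_range("absolute", 55, 100, 7200))  # (50, 60)
```

The clamping on the last line enforces the constraint stated above: the range stays within the bounds of the audio data.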
In some embodiments, the identification module 1340 is specifically configured to: performing audio separation processing on the audio fragments to be identified to obtain a first audio fragment in the audio fragments to be identified; and identifying the first audio fragment to obtain an audio identification result.
In some embodiments, the identification module 1340 is specifically configured to: searching at least one second audio fragment matched with the first audio fragment in a first preset database; if at least one second audio fragment is found, determining the association degree of each second audio fragment in the at least one second audio fragment with the multimedia resource; and determining the audio information corresponding to the second audio fragment with the highest association degree of the multimedia resource as an audio identification result.
As shown in fig. 13, the voice instruction based audio recognition apparatus 1300 may further include: the lookup module 1360.
In some embodiments, the lookup module 1360 is to: searching in a second preset database according to the resource information of the multimedia resource and the target time range; the second preset database comprises a plurality of corresponding relations among preset resource information, audio time ranges and audio information; the determining module 1320 is further configured to: if the second preset database comprises target resource information which is the same as the resource information of the multimedia resource, and the audio time range corresponding to the target resource information is at least partially overlapped with the target time range, determining the audio information corresponding to the target resource information as an audio recognition result; the identification module 1340 is specifically configured to: if the target resource information does not exist in the second preset database, or the audio time range corresponding to the target resource information is not overlapped with the target time range, the audio data is intercepted according to the target time range, and the audio fragment to be identified is obtained.
As shown in fig. 13, the voice instruction based audio recognition apparatus 1300 may further include: a play module 1370.
In some embodiments, the play module 1370 is for: responding to a playing instruction input by a user on an audio playing control corresponding to the target audio name in the recognition result interface, and playing audio data corresponding to the target audio name; the audio identification result comprises at least one audio name and an audio playing control corresponding to each audio name, and the at least one audio name comprises a target audio name.
Correspondingly, the specific details of each part of the above voice-instruction-based audio recognition apparatus have already been described in detail in the display device embodiments; for details not disclosed here, reference may be made to those embodiments, which are not repeated.
An embodiment of the present invention provides a computer readable storage medium storing at least one executable instruction that, when executed on a display device/voice instruction based audio recognition apparatus, causes the display device/voice instruction based audio recognition apparatus to perform the voice instruction based audio recognition method in any of the above method embodiments.
The executable instructions may in particular be used to cause the display device/voice instruction based audio recognition apparatus to perform the voice instruction based audio recognition method described above.
In this embodiment, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. In addition, embodiments of the present invention are not directed to any particular programming language.
In the description provided herein, numerous specific details are set forth. It will be appreciated, however, that embodiments of the invention may be practiced without such specific details. Similarly, in the above description of exemplary embodiments of the invention, various features of embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. Wherein the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the apparatus of an embodiment may be adaptively changed and disposed in one or more apparatuses different from that embodiment. The modules, units, or components of the embodiments may be combined into one module, unit, or component, and furthermore may be divided into a plurality of sub-modules, sub-units, or sub-components, except where at least some of such features and/or processes or elements are mutually exclusive.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. do not denote any order. These words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specifically stated.

Claims (12)

1. A display device, characterized by comprising:
a display configured to display a user interface;
a communicator configured to receive a voice instruction input by a user; the voice instruction is used for searching audio information corresponding to the multimedia resource;
A controller coupled with the display and the communicator, respectively, and configured to:
in the process of playing the multimedia resource, responding to the voice instruction, and determining a target time range;
intercepting the audio data corresponding to the multimedia resources according to the target time range to obtain an audio fragment to be identified;
identifying the audio fragment to be identified to obtain an audio identification result;
and controlling the display to display a recognition result interface according to the audio recognition result.
2. The display device of claim 1, wherein the controller is specifically configured to:
processing the voice command and determining time information in the voice command;
determining the target time range according to the type of the time information; wherein the type of the time information includes at least one of a relative time type, an absolute time type, and a fuzzy time type, and the minimum value of the target time range is greater than 0 and the maximum value is less than or equal to the total duration of the audio data.
3. The display device of claim 2, wherein the controller is specifically configured to:
If the type of the time information is the relative time type, acquiring the current playing time of the audio data and the total duration of the audio data;
determining the target time range according to the current playing time of the audio data and the time information; the minimum value of the target time range is larger than 0, and the maximum value is smaller than or equal to the current playing time; or, the minimum value of the target time range is larger than the current playing time, and the maximum value is smaller than or equal to the total duration of the audio data.
4. The display device of claim 2, wherein the controller is specifically configured to:
if the type of the time information is the absolute time type and the time information is a time range, determining the target time range according to the time range;
if the type of the time information is an absolute time type and the time information is a time point, determining the target time range according to the time point and a preset adjustment value.
5. The display device of claim 2, wherein the controller is specifically configured to:
if the type of the time information is the fuzzy time type, acquiring the current playing time of the audio data;
And determining the target time range according to the current playing time of the audio data and a preset adjusting value.
6. The display device of any one of claims 1-5, wherein the controller is specifically configured to:
performing audio separation processing on the audio fragments to be identified to obtain a first audio fragment in the audio fragments to be identified;
and identifying the first audio fragment to obtain the audio identification result.
7. The display device of claim 6, wherein the controller is specifically configured to:
searching at least one second audio fragment matched with the first audio fragment in a first preset database;
if the at least one second audio fragment is found, determining the association degree between each second audio fragment in the at least one second audio fragment and the multimedia resource;
and determining the audio information corresponding to the second audio fragment with the highest association degree of the multimedia resource as the audio recognition result.
8. The display device of any one of claims 1-5, wherein the controller is further configured to:
searching in a second preset database according to the resource information of the multimedia resource and the target time range; the second preset database comprises a plurality of preset resource information, audio time ranges and corresponding relations among the audio information;
If the second preset database comprises target resource information which is the same as the resource information of the multimedia resource, and the audio time range corresponding to the target resource information is at least partially overlapped with the target time range, determining the audio information corresponding to the target resource information as the audio recognition result;
the controller is specifically configured to:
and if the target resource information does not exist in the second preset database or the audio time range corresponding to the target resource information is not overlapped with the target time range, intercepting the audio data according to the target time range to obtain the audio fragment to be identified.
9. The display device of any one of claims 1-5, wherein the controller is further configured to:
responding to a playing instruction input by a user to an audio playing control corresponding to a target audio name on the identification result interface, and playing audio data corresponding to the target audio name; the audio identification result comprises at least one audio name and an audio playing control corresponding to each audio name, and the at least one audio name comprises the target audio name.
10. An audio recognition method based on voice instructions, which is applied to a display device, comprises the following steps:
receiving a voice instruction input by a user; wherein the voice instruction is used for searching for audio information corresponding to a multimedia resource;
in the process of playing the multimedia resource, determining a target time range in response to the voice instruction;
intercepting the audio data corresponding to the multimedia resource according to the target time range to obtain an audio fragment to be identified;
recognizing the audio fragment to be identified to obtain an audio recognition result; and
controlling a display to display a recognition result interface according to the audio recognition result.
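The intercepting step above clips the portion of the audio data that falls inside the target time range. A minimal sketch of that clipping, assuming raw PCM samples and a known sample rate (both hypothetical details not specified in the claims):

```python
def clip_audio(samples, sample_rate, start_s, end_s):
    """Extract the samples falling inside [start_s, end_s] seconds,
    i.e. the audio fragment to be identified."""
    start = int(start_s * sample_rate)
    end = int(end_s * sample_rate)
    return samples[start:end]

# 10 seconds of dummy mono audio at 16 kHz.
samples = list(range(16000 * 10))

# Target time range of 2.0 s to 5.0 s yields a 3-second fragment.
fragment = clip_audio(samples, 16000, 2.0, 5.0)
print(len(fragment))
```

The resulting fragment would then be passed to a recognition service to produce the audio recognition result.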
11. An audio recognition apparatus based on voice instructions, configured on a display device, the apparatus comprising:
a receiving module configured to: receive a voice instruction input by a user; wherein the voice instruction is used for searching for audio information corresponding to a multimedia resource;
a determining module configured to: determine a target time range in response to the voice instruction in the process of playing the multimedia resource;
an intercepting module configured to: intercept the audio data corresponding to the multimedia resource according to the target time range to obtain an audio fragment to be identified;
a recognition module configured to: recognize the audio fragment to be identified to obtain an audio recognition result; and
a control module configured to: control a display to display a recognition result interface according to the audio recognition result.
12. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed, implements the voice-instruction-based audio recognition method of claim 10.
CN202311864331.5A 2023-12-29 2023-12-29 Display equipment, voice instruction-based audio recognition method and device Pending CN117896564A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311864331.5A CN117896564A (en) 2023-12-29 2023-12-29 Display equipment, voice instruction-based audio recognition method and device


Publications (1)

Publication Number Publication Date
CN117896564A true CN117896564A (en) 2024-04-16

Family

ID=90645284

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311864331.5A Pending CN117896564A (en) 2023-12-29 2023-12-29 Display equipment, voice instruction-based audio recognition method and device

Country Status (1)

Country Link
CN (1) CN117896564A (en)


Legal Events

Date Code Title Description
PB01 Publication